In my last post I talked about the adage “Jack of all trades, master of none” so I decided to use a different adage for the theme of this post. This time I am going to talk about the adage that “History repeats itself”.
In the late 80’s and early 90’s, DB2 for zOS and its parallel sysplex configuration ruled the high-volume, highly available transaction processing market. (It still does today because of the inherent performance and extreme availability of the servers.) To gain more customers in the transaction processing space, Oracle decided to build a shared-disk, scale-out configuration on distributed servers (i.e. x86 and RISC/SPARC-based servers).
On the zOS platform DB2’s parallel sysplex utilizes a hardware coupling facility to connect all of the servers, and exchange and maintain locking and page/block usage information. Instead of building a hardware coupling facility for the distributed platform, Oracle decided to try to emulate this functionality in software. They first built Oracle Parallel Server (OPS) that wrote locking information to the shared disks. That did not work well, so Oracle eventually built Real Application Clusters (RAC) to distribute the locking and usage information and coordination across all of the servers in the cluster.
I could write pages and pages comparing the efficiency of RAC to the zOS coupling facility, but the main point I want to make is that software is nowhere near as efficient as hardware for these repetitive, high-speed, low-latency operations. And don’t just take my word for it. Look at network switches, backup appliances, email and message archivers; they are all hardware-based appliances, not software that runs on a general-purpose server.
Well, jump forward 15 years to the rise of big data and analytics, and Oracle Exadata. In the meantime, Netezza brought to market a very successful, purpose-built data warehousing and analytic appliance. The magic behind the appliance is its hardware-accelerated, asymmetric massively parallel processing (AMPP) platform. The key to the hardware acceleration inside the Netezza appliance is the field programmable gate array (FPGA), which filters out extraneous data as early in the data stream as possible, and as fast as the data streams off the disks. This data elimination close to the source removes I/O bottlenecks and frees the downstream components (CPU, memory, and network) from processing superfluous data, and is therefore expected to have a significant multiplier effect on system performance.
In traditional databases like Oracle, a query is converted into an “access plan” that runs against the data after it has been moved into memory (buffer cache, buffer pool) on the database server(s). In Netezza the query is converted into what is called a snippet, which is compiled machine code that runs in silicon on the FPGAs. A snippet arriving at each IBM Netezza appliance S-Blade initiates reading of compressed data from disk. The FPGA then reads the data from memory buffers and, using its Compress Engine, decompresses it, transforming each block from disk into the equivalent of 4-8 data blocks within the FPGA, which accelerates the retrieval of the data. Next, within the FPGA, the data streams into the Project Engine, which filters out columns based on the SELECT clause of the SQL query being processed. Only the columns named in the SELECT clause are passed further downstream to the Restrict Engine, where rows not needed to process the query are blocked from passing through the gates, based on the restrictions specified in the WHERE clause. The Visibility Engine maintains ACID (Atomicity, Consistency, Isolation and Durability) compliance at streaming speeds. All this work, the constant pruning of unneeded columns and rows, is achieved in an energy-efficient FPGA measuring just one square inch. Put simply, if an IBM Netezza appliance doesn’t need to move data, it doesn’t.
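To make the order of those stages concrete, here is a rough software analogy of the pipeline written as Python generators. This is only a sketch of the data flow described above, not Netezza code: the real work happens in FPGA hardware, and every name here (compress_block, run_snippet, the created_tx/deleted_tx columns, zlib over JSON as the "compression") is an assumption made up for the illustration. The one wrinkle is that the projection step has to keep the columns the later stages still need, and trim to the SELECT list at the end.

```python
import json
import zlib
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]

def compress_block(rows: List[Row]) -> bytes:
    """Hypothetical stand-in for a compressed disk block (zlib over JSON)."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))

def decompress(blocks: Iterable[bytes]) -> Iterator[Row]:
    """'Compress Engine' analogue: expand each block into a stream of rows."""
    for block in blocks:
        yield from json.loads(zlib.decompress(block).decode("utf-8"))

def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """'Project Engine' analogue: drop every column the query does not need."""
    for row in rows:
        yield {c: row[c] for c in columns}

def restrict(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """'Restrict Engine' analogue: drop rows that fail the WHERE clause."""
    for row in rows:
        if predicate(row):
            yield row

def visible(rows: Iterable[Row], snapshot_tx: int) -> Iterator[Row]:
    """'Visibility Engine' analogue: keep rows this transaction may see.
    The created_tx/deleted_tx bookkeeping columns are an assumption."""
    for row in rows:
        deleted: Optional[int] = row["deleted_tx"]
        if row["created_tx"] <= snapshot_tx and (deleted is None or deleted > snapshot_tx):
            yield row

def run_snippet(blocks, select_cols, predicate_cols, predicate, snapshot_tx):
    # Keep SELECT columns plus whatever the later stages still need.
    needed = list(dict.fromkeys([*select_cols, *predicate_cols, "created_tx", "deleted_tx"]))
    stream = visible(restrict(project(decompress(blocks), needed), predicate), snapshot_tx)
    # Final trim to exactly the SELECT list.
    return [{c: row[c] for c in select_cols} for row in stream]

rows = [
    {"id": 1, "region": "east", "amount": 10, "notes": "a", "created_tx": 1, "deleted_tx": None},
    {"id": 2, "region": "west", "amount": 20, "notes": "b", "created_tx": 1, "deleted_tx": None},
    {"id": 3, "region": "east", "amount": 30, "notes": "c", "created_tx": 5, "deleted_tx": None},
    {"id": 4, "region": "east", "amount": 40, "notes": "d", "created_tx": 1, "deleted_tx": 2},
]
blocks = [compress_block(rows[:2]), compress_block(rows[2:])]

# SELECT id, amount FROM t WHERE region = 'east', as seen by transaction 3
result = run_snippet(blocks, ["id", "amount"], ["region"],
                     lambda r: r["region"] == "east", snapshot_tx=3)
print(result)  # [{'id': 1, 'amount': 10}]
```

Because each stage is a generator, no stage ever holds the full table; rows that are pruned early never reach the stages after them, which is the point the paragraph above is making.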
Knowing what Oracle did with the parallel sysplex on DB2 zOS, what do you think they did with IBM Netezza? If you said, “they tried to copy/recreate all that magic in software”, you’re right. And on January 27th, 2010 Larry Ellison admitted “Netezza was part of the inspiration for Exadata”.
IBM Netezza has 3 key features that dramatically speed up data warehouse and analytic workloads:
o FPGA decompression and filtering
o Zone Maps
o Very efficient compression
Oracle Exadata claims four main features, but one of them is really only useful for OLTP workloads (and even there it is not really optimal as I discussed in my previous post). Exadata’s 3 main warehousing enhancements are:
o Storage indexes
o Smart Scans
o Hybrid Columnar Compression
Do these look familiar? You bet.
But, let’s look a little more in depth at these “features” and compare them for analytic workloads.
Storage indexes and zone maps are intended to do the same thing: reduce I/O by understanding where the data that is needed for a query can, and more importantly cannot, be on disk. That way the system can skip reading blocks of data that cannot contain the data needed. In Netezza, zone maps are always on, and are stored on disk. In Exadata, storage indexes are built on first access and maintained in memory, so if/when the system is shut down they are lost and must be rebuilt. In addition, if the table has an index on it, Exadata will ignore the storage index altogether. Advantage IBM Netezza.
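The idea both features share can be sketched in a few lines of Python: keep the minimum and maximum value for each chunk of storage, and skip any chunk whose range cannot overlap the predicate. This is a generic illustration of min/max block skipping, not the actual Netezza or Exadata implementation; the Extent class, the extent size and the function names are assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass
class Extent:
    """A chunk of stored values plus its min/max summary (hypothetical layout)."""
    values: List[int]
    lo: int
    hi: int

def build_extents(values: Sequence[int], extent_size: int) -> List[Extent]:
    """Split the column into fixed-size extents and record min/max for each."""
    extents = []
    for start in range(0, len(values), extent_size):
        chunk = list(values[start:start + extent_size])
        extents.append(Extent(chunk, min(chunk), max(chunk)))
    return extents

def scan_range(extents: List[Extent], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] and the number of extents actually read."""
    matches: List[int] = []
    extents_read = 0
    for ext in extents:
        # The summary proves this extent cannot hold a match, so skip the I/O.
        if ext.hi < low or ext.lo > high:
            continue
        extents_read += 1
        matches.extend(v for v in ext.values if low <= v <= high)
    return matches, extents_read

# 100 ordered values in 10 extents; the query touches only 2 of them.
extents = build_extents(range(100), extent_size=10)
matches, extents_read = scan_range(extents, 25, 34)
print(len(matches), extents_read)  # 10 2
```

Note how much the benefit depends on data order: with the same values shuffled randomly, nearly every extent would span the whole range and almost nothing could be skipped.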
Hybrid columnar compression in Exadata can get quite good compression rates, but it is only used for data that is bulk loaded and not updated. If you bulk load your data and then update some of it, the updated rows will be moved out of the hybrid columnar compression area into the normal compression area, and get far less compression. And due to the performance impact of hybrid columnar compression, even Oracle suggests that it be used for read-only tables. In Netezza, compression is automatic for all tables, whether they are read-only or read/write. Once again, advantage IBM Netezza.
Now, let’s talk about the biggest feature, smart scans. Smart scans attempt to do what the FPGA does in Netezza. They will read data from disk (but in Exadata they read it into memory on the storage servers, not as a stream like the FPGA), uncompress it, apply predicates and filters, and do some simple joins, but there are a number of factors that can disable smart scan processing. For example, if a table is being written to (insert/update/delete of some rows), then smart scan is disabled for the duration of the write operation. If there are date or time functions (e.g. current date, current timestamp) in the predicates, they cannot be processed in the smart scan. With Netezza, all predicate filtering, including time/date function processing, is done on the FPGA, even if there are write operations happening at the same time. Advantage IBM Netezza.
While I talked about hardware vs. software implementations in this post, I am also talking about a purpose-built, expertly integrated system vs. an engineered system. IBM Netezza was built from the ground up for high-performance data warehousing and analytics, so it incorporates the optimal set of components to do this. On the other hand, Oracle Exadata is a system that was engineered to work better than it did in the past, by adding 14 more servers into a rack (in conjunction with the 8 database servers already there) whose sole purpose is to run the software that attempts to copy the magic that is done in Netezza’s FPGA. Oh, and to boot, you have to pay extra for the software that runs on these storage servers.
Remember to join the live PureSystems broadcast event on October 9th at 2pm EDT, Expert IT 2012: Accelerate Big Data and Cloud with Expert Integrated Systems. Customers and prospects will learn firsthand how to overcome the toughest challenges in IT. Built-for-the-cloud IBM PureSystems solutions make capturing value from their data faster, simpler and more cost-effective. Only IBM can deliver the built-in expertise and integration-by-design systems that simplify the entire IT experience, from data analysis to cloud computing.