Optimizing for Transactions, or Over Engineering for Everything

In this post I want to dig a little deeper into the PureData System for Transactions, and compare that to what Oracle was hyping at Oracle Open World just last month.

While Oracle Exadata adds an extra tier of storage to “speed up” processing, it is still a set of database servers using Oracle Real Application Clusters (RAC) that the applications interact with.

Oracle RAC uses a shared-disk architecture: no matter how many database nodes are in the cluster, they all connect to, and access, the same physical disks. In an Exadata system they all access the same Exadata storage servers over the same InfiniBand fabric and switches.

The component that makes RAC work is called cache fusion. Oracle touts this as:
• Unbreakable
• Scalable
• High Performance

It is not any of these. Since any node in the cluster can update any page (or data block) at any time, access to the data must be synchronized and serialized so that two nodes cannot update the same data page at the same time. This is needed to ensure that changes are not lost. Now, I know I would like the following scenario, but I’m sure the bank would not.

1. I go to bank branch 1 and check our account balance
2. At the same time my wife is at branch 2 and also checks our balance
3. We both see the balance is $500, so I decide to withdraw $300 and my wife also decides to withdraw $300.
4. If the database does not serialize access to the data while it processes the withdrawals (even if the withdrawal process validates the balance again before deducting the amount from the account), then we would both be allowed to withdraw the money
5. AND the second withdrawal to complete would still set the balance to $200, since it is working from the stale $500 balance and $500 – $300 = $200. The bank would have paid out $600 but only deducted $300 (see the sketch below).
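To make the lost update concrete, here is a quick sketch in Python. The threads, the sleep, and the lock are just stand-ins for two bank branches and the database's serialization; none of this is RAC or DB2 code.

```python
import threading
import time

def run(withdraw):
    """Two concurrent $300 withdrawals against a fresh $500 balance."""
    state = {"balance": 500, "payments": []}
    threads = [threading.Thread(target=withdraw, args=(state, 300)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state

def withdraw_unserialized(state, amount):
    # Read-check-write with nothing serializing access: both "branches"
    # read the same stale balance of 500, so both withdrawals succeed.
    current = state["balance"]
    time.sleep(0.01)                          # widen the race window for the demo
    if current >= amount:
        state["balance"] = current - amount   # 500 - 300 = 200, written by both
        state["payments"].append(amount)

lock = threading.Lock()

def withdraw_serialized(state, amount):
    # The whole read-check-write sequence is serialized, so the second
    # withdrawal re-checks the real balance ($200) and is rejected.
    with lock:
        current = state["balance"]
        time.sleep(0.01)
        if current >= amount:
            state["balance"] = current - amount
            state["payments"].append(amount)

print(run(withdraw_unserialized))  # {'balance': 200, 'payments': [300, 300]} -- the bank is out $300
print(run(withdraw_serialized))    # {'balance': 200, 'payments': [300]} -- second withdrawal rejected
```

Without serialization both withdrawals see the stale $500 balance and the bank pays out $600; with the read-check-write sequence serialized, the second withdrawal sees $200 and is rejected.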

Each node in a RAC cluster runs two processes to control this: the Global Cache Service (GCS) and the Global Enqueue Service (GES). GES handles locking for everything other than data pages (things like the catalog cache and package cache), so we won’t talk much about it here. GCS manages the synchronization and serialization of data (and index) page locks between the nodes, and will ship data pages between nodes when needed.

In order for any node in the cluster to access a piece of data, there must be some node in the cluster responsible for serializing access to that piece of data. Rather than centralize this work, Oracle uses a distributed locking mechanism whereby every node in the cluster is responsible for a subset of the data pages in the database. When a RAC cluster is started, each node takes responsibility for managing access to 1/nth of the database (where n is the number of nodes in the cluster). If a node is added to, or removed from, the cluster, ownership of the data pages must be redistributed across the new set of nodes.

Whenever a node in the cluster wants to access a data page, it must first ask permission from the node that is responsible for managing that data page. This way the master node for that data page knows which nodes have a copy of the data page, and which node has the most current copy, so that it can ensure that two nodes are not trying to update the same piece of data at the same time.
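The exact way Oracle decides which node masters which page is internal to RAC, but the idea (and the cost of remastering) can be sketched with a simple hash. The modulo scheme and node names below are my own illustration, not Oracle's implementation.

```python
def master_node(page_id, nodes):
    # Each data page is mastered by exactly one node; a simple hash spreads
    # the pages roughly 1/n per node (the real mapping is Oracle's own).
    return nodes[hash(page_id) % len(nodes)]

nodes_before = ["node1", "node2", "node3", "node4"]
nodes_after = ["node1", "node2", "node3"]        # one node leaves the cluster

pages = range(10_000)
remastered = sum(1 for p in pages
                 if master_node(p, nodes_before) != master_node(p, nodes_after))
print(f"{remastered / len(pages):.0%} of pages change master")   # ~75% with this naive scheme

# Before any node touches a page, it must ask that page's master for permission:
def request_access(requesting_node, page_id, nodes):
    master = master_node(page_id, nodes)
    # The master records which nodes hold copies and which copy is current.
    return f"{requesting_node} asks {master} for permission to access page {page_id}"

print(request_access("node1", 575, nodes_before))
```

The exact percentage depends on the scheme, but the point stands: every page has a single master, and any change in cluster membership forces that mastering work to be redistributed.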

Let’s step through how RAC handles a number of requests to illustrate all of the work that must go on under the covers. In the first example, a query is executed on Instance 1 and wants to access a row on a page with an SCN of 575. We won’t explain the concept of an SCN (System Change Number) in detail; think of it as a timestamp for the last operation/transaction that updated the page.

The data page where the required row exists is not in the buffer cache on Instance 1, so Instance 1 determines the master node for this data page (in this case, Instance 2). Instance 1 must make a call to Instance 2 to determine whether any other node has a copy of the page in memory.

Instance 2 indicates that no other instance has a copy of the page in memory, so it tells Instance 1 that it can read the page from disk.

Instance 1 performs the I/O and reads a copy of the page into its buffer cache, and obtains a share lock on the page. Instance 1 must then tell Instance 2 that it has a copy of the page in its buffer cache, and the type of lock it has on the page, since Instance 2 is the master node for this page and must keep track of what nodes or instances have copies of the page and what locks are held on the page.

This example shows that RAC requires FOUR operations just to read the page containing a single row into the buffer cache.

Now, if another instance, say Instance 3, wants to read a row on the same page, it must make a call to the master node for that page. Instance 2 now knows that Instance 1 has the latest copy of the requested page in its buffer cache, so it forwards Instance 3’s request for the page to Instance 1.

Instance 1 then ships a copy of the page to Instance 3, and Instance 3 obtains a lock on the page based on the operation it wants to perform.

Instance 3 must now tell Instance 2 that it has a copy of the page in its buffer cache, and the type of lock it has on the page, since Instance 2 is the master node for this page and must keep track of what nodes or instances have copies of the page and what locks are held on the page.
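Here is a small Python sketch of that protocol as described above. The class, method names, and message text are mine (Oracle's GCS obviously isn't written this way), but it captures the bookkeeping the master has to do and the number of operations involved, both for the cold read from disk and for the page shipped between instances.

```python
class PageMaster:
    """Toy model of the GCS master for one data page: it remembers which
    instances hold a copy of the page and who holds the current version."""

    def __init__(self, page_id):
        self.page_id = page_id
        self.holders = {}     # instance -> lock mode held on the page
        self.current = None   # instance holding the most recent copy

    def request(self, instance, mode):
        msgs = [f"1. {instance} -> master: request page {self.page_id} ({mode})"]
        if self.current is None:
            # Nobody has the page cached, so the requester does the disk I/O itself.
            msgs.append(f"2. master -> {instance}: no cached copy, read it from disk")
            msgs.append(f"3. {instance}: physical I/O, page cached, {mode} lock taken")
        else:
            # Another instance already has it: the master forwards the request
            # and the current holder ships the page across the interconnect.
            msgs.append(f"2. master -> {self.current}: ship page to {instance}")
            msgs.append(f"3. {self.current} -> {instance}: page shipped, {mode} lock taken")
        # Either way, the requester must register its copy (and lock) with the master.
        msgs.append(f"4. {instance} -> master: holding page {self.page_id} ({mode})")
        self.holders[instance] = mode
        self.current = instance
        return msgs

gcs = PageMaster(page_id=575)                   # Instance 2 is the master for this page
for m in gcs.request("Instance 1", "share"):    # the cold read: four operations
    print(m)
for m in gcs.request("Instance 3", "share"):    # second reader: page shipped from Instance 1
    print(m)
```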

Notice that the page is now in the buffer cache on two instances and this page could be up to 32K. As the nodes read more and more data, this can chew up a lot of extra memory.
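A quick back-of-the-envelope calculation illustrates the cost. The page size, page count, and number of cached copies below are made-up numbers, just to show the shape of the problem.

```python
page_size_kb = 32          # Oracle blocks can be up to 32 KB
hot_pages = 1_000_000      # hypothetical working set of frequently read pages
copies_per_page = 3        # each hot page cached on three instances

total_gb = page_size_kb * hot_pages * copies_per_page / 1024 / 1024
distinct_gb = page_size_kb * hot_pages / 1024 / 1024
print(f"~{total_gb:.0f} GB of buffer cache holding ~{distinct_gb:.0f} GB of distinct data")
# ~92 GB of buffer cache holding ~31 GB of distinct data
```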

Now what happens if another instance (Instance 1, 4, 5, 6, 7, … in this case) wants to update a row on the same page? The same thing all over again.

Unlike Oracle RAC, the PureData System for Transactions uses a centralized locking mechanism. Every time a member (the PureData System term for an instance or node) reads a page into its buffer pool (the equivalent of Oracle’s buffer cache), the centralized cluster caching facility (CF) becomes aware of it and keeps track of that fact. Any time a member wants to update a row on a page, the CF is aware of that as well.

In order to reduce communication between nodes in the cluster for lock management and global caching services, the PureData System for Transactions uses this cluster caching facility (referred to simply as the CF) along with Remote Direct Memory Access (RDMA) technology to deliver transparent application scalability.

RDMA allows each member in the cluster to directly access memory in the CF and for the CF to directly access the memory of each member. For example, assume that a member in a cluster (Member 1) wants to read a data page that is not in its local buffer pool. The PureData System for Transactions assigns an agent (or thread) to perform this transaction; the agent then uses RDMA to directly write into the memory of the CF to indicate that it has interest in a given page. If the page that Member 1 wants to read is already in the CF’s global centralized buffer pool, the CF will push that page directly into the memory of Member 1 instead of having the agent on that member perform the I/O operation to read it from disk. The use of RDMA allows Member 1’s agent to simply make a memcopy (memory copy) call to a remote server without the need for costly process-to-process communication calls, processor interrupts, IP stack calls, and so on. Quite simply, pureScale allows a member’s agent to perform what appears to be a local memory copy operation when in fact the target is the memory address of a remote machine.
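A rough sketch of that read path is below, assuming the CF keeps a page directory plus a global buffer pool. The class and method names are mine, not DB2's, and an ordinary method call stands in for the RDMA write into CF memory.

```python
class ClusterCachingFacility:
    """Toy model of the CF: a page directory plus a global buffer pool.
    In pureScale the member writes its request directly into CF memory
    with RDMA; here a plain method call stands in for that."""

    def __init__(self):
        self.global_buffer_pool = {}   # page_id -> page contents
        self.interest = {}             # page_id -> members holding a copy

    def register_and_fetch(self, member, page_id):
        # The member's agent registers its interest in the page with the CF.
        self.interest.setdefault(page_id, set()).add(member)
        if page_id in self.global_buffer_pool:
            # CF pushes the page straight into the member's memory: no disk I/O.
            return ("pushed from CF", self.global_buffer_pool[page_id])
        # Not in the global buffer pool: the member reads it from disk itself.
        # (Exactly when pages enter the CF's buffer pool is simplified here.)
        page = read_from_disk(page_id)
        self.global_buffer_pool[page_id] = page
        return ("read from disk by member", page)

def read_from_disk(page_id):
    return f"<contents of page {page_id}>"

cf = ClusterCachingFacility()
print(cf.register_and_fetch("Member 1", 575))   # first reference: disk I/O on Member 1
print(cf.register_and_fetch("Member 2", 575))   # served straight from the CF's global buffer pool
```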

These lightweight remote memory calls, along with a secondary centralized buffer pool and lock management facilities, mean that an application does not have to connect to the member where the data already resides. It is just as efficient for any member in the cluster to receive a data page from the global buffer pool, regardless of the size of the cluster. Most RDMA calls are so fast that the DB2 process making the call does not need to yield the CPU and does not have to wait to be rescheduled to complete the task. For example, to notify the CF that a row is about to be updated (and therefore an exclusive (X) lock is required), a member’s agent performs a Set Lock State (SLS) request by writing the lock information directly into memory on the CF. The CF confirms that no other member in the cluster already has this row X locked and writes directly into the requesting member’s memory to grant the lock. The entire round trip for this SLS request is so fast that the agent does not need to yield the CPU. The agent can continue to be productive rather than waiting on an IP interrupt (avoiding unnecessary context switches), as is the case with other scale-out architectures.
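The Set Lock State round trip can be sketched the same way. Again, the names are mine, and a dictionary update stands in for the RDMA write into the CF's lock table.

```python
class GlobalLockManager:
    """Toy stand-in for the CF's lock table. The member's agent writes the
    lock request straight into CF memory and gets the answer written back
    into its own memory, fast enough that it never yields the CPU."""

    def __init__(self):
        self.x_locks = {}   # row identifier -> member holding the exclusive (X) lock

    def set_lock_state(self, member, row_id):
        holder = self.x_locks.get(row_id)
        if holder is None or holder == member:
            self.x_locks[row_id] = member     # grant written directly back to the member
            return "granted"
        return f"denied: {holder} already holds the X lock"

cf_locks = GlobalLockManager()
print(cf_locks.set_lock_state("Member 1", ("ACCOUNTS", 42)))   # granted
print(cf_locks.set_lock_state("Member 2", ("ACCOUNTS", 42)))   # denied: Member 1 already holds the X lock
```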

Oracle now claims to support direct reads and writes of data blocks from other nodes in the RAC cluster, but the overhead and latency of its distributed lock management still persist.

 

