This post started from a conversation two days ago in my office.
A colleague asked me: “Are you our PureApplication expert?” It is always dangerous when someone calls you an “expert”!
“Do we have a Disaster Recovery solution for IBM PureApplication System?”
Another colleague behind me answered: “Two PureApplication Systems are the perfect solution: the first one runs the workload, the second stays on standby in another location, and you have your DR solution.”
For an IT architect, is this a good question or a good answer? No; it is a simple answer to an incorrect question, so the answer is not applicable. I will clarify my position with a real-life example. We use a car every day to manage our lives: taking the kids to school, buying food at the supermarket, shopping at the mall, playing soccer with friends, going on holiday with the family, and so on. So we need Continuous Availability for our car. Sure! Being without a car for a long time can become a problem for many aspects of our lives.
Just a definition:
Continuous Availability (CA): The infrastructure cannot be interrupted at all. This availability level is often referred to as “five nines”, or 99.999 percent availability (this means about five minutes per year of planned and unplanned unavailability in total).
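To see where the “five minutes per year” figure comes from, here is a minimal Python sketch of the arithmetic. The function name and the choice of an average year of 365.25 days are my own assumptions for illustration; they are not part of any IBM definition.

```python
# Assumption: an average year of 365.25 days (leap days included).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare a few common availability levels.
for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label} ({pct}%): {minutes:.2f} minutes/year")
```

For 99.999 percent this gives roughly 5.26 minutes per year, which is the “five minutes” quoted in the definition.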
How do we solve CA for our car? Simple! In our garage we keep a second car (the same model as the first), with a full tank, ready for use when the first car is not available! Is that true? How many people in the world apply this strategy? Many of us have two or more cars in the garage, but not for CA. Someone could use this strategy for CA, but it depends.
This is the problem: it depends! “It depends” means that Disaster Recovery, High Availability and Continuous Availability are solutions to a problem. First of all we need to know the problem; once we know it, we can find a solution that is proportionate to the investment. Sure, the money aspect is important. In the car example, if a car cost nothing, a second car might be perfect… but what if we have no space in the garage? How much money can we spend on a new garage?
It depends on the cost per day of being without a car and on the probability that my car will not be available. And now we have a new parameter: the probability that our system will fail. Without first defining these variables, which describe the problem, no answer about a DR solution is applicable.
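The trade-off between these variables can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions of my own: one possible failure per year, a fixed number of days of unavailability, and a standby that removes the loss completely. The function names and all the numbers are hypothetical.

```python
def expected_yearly_loss(cost_per_day: float,
                         failure_probability: float,
                         days_down: float) -> float:
    """Expected loss per year = P(failure in a year) * days down * cost per day."""
    return failure_probability * days_down * cost_per_day


def standby_is_worth_it(cost_per_day: float,
                        failure_probability: float,
                        days_down: float,
                        standby_cost_per_year: float) -> bool:
    """True when the expected loss exceeds the yearly cost of keeping a standby.

    Simplification: the standby is assumed to avoid the loss entirely.
    """
    loss = expected_yearly_loss(cost_per_day, failure_probability, days_down)
    return loss > standby_cost_per_year


# Hypothetical family car: a rental costs 80 per day, 10% chance of a 10-day outage.
print(standby_is_worth_it(80, 0.10, 10, standby_cost_per_year=3000))

# Hypothetical business system: 50,000 per day of outage, 5% chance of a 4-day outage.
print(standby_is_worth_it(50_000, 0.05, 4, standby_cost_per_year=8_000))
```

With these made-up numbers the second car is not justified, while the standby system is. The point is not the figures but that the answer changes as soon as the cost per day or the failure probability changes.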
Just another definition:
High Availability (HA): The infrastructure cannot undergo an unplanned outage for more than a few seconds or minutes at a time. It’s acceptable to bring down the application for scheduled maintenance.
Now let us classify the reasons why our infrastructure (or the applications that run on it) may be unavailable.
Connection Failure: The network connection between the client and the infrastructure fails. The failure can be located in the wide area network (WAN) or in the customer’s internal network (intranet). This kind of failure is normally resolved by duplicating the network components. This solution is usually cheap, because the components cost little and are easy to manage.
Hardware Failure: A hardware component breaks down and interrupts part of the services. The failure can be total (the services based on that component are completely unavailable) or partial (the service works with reduced performance). In an HA hardware solution all the critical components are duplicated. Sometimes this duplication is not visible from the administrator’s point of view. With hot-swap functionality it is sometimes possible to replace the broken component without interrupting the service.
Software Failure: A part of the software stack fails. The software component can be an element of the middleware, of the operating system or of the application code. By this kind of failure we mean an accidental failure. Bugs in the code are out of scope because they are resolved through the software’s production cycle.
Human Error: An incorrect action may cause an error in a hardware or software component. This kind of error is unpredictable, but it can be reduced with specific procedures and a system management strategy. Normally these procedures rely in part on infrastructure components that support them.
Disaster Event: This event covers the total loss of the infrastructure’s functionality. All the infrastructure components are unavailable.
We do not consider Connection Failure here because it is a typical network infrastructure concern outside the PureApplication System box. Inside the PureApplication rack, network communication is contained within the chassis.
In a traditional design, blade-to-blade communication flows north-south through the top-of-rack (TOR) switch. This adds latency, and the TOR switch can also become a single point of failure. The PureApplication System is based on the PureFlex architecture, and node-to-node communication happens directly within the chassis (without the TOR switch). In this way the solution not only reduces switch latency but also provides a robust solution for node connectivity. The communication between the PureApplication rack and your network can be based.
As for Hardware Failure, the PureApplication System has been carefully designed to have no single point of failure within each rack.
Every hardware component is redundant, from the power supplies to the top-of-rack (TOR) switches. The management environment itself is also redundant within the rack, as each PureApplication System contains two management nodes, one of which is a backup for the other.
A Software Failure in a traditional WebSphere system (including DB2) can happen when one or more infrastructure nodes fail.
We can assume that the following software components are offered built-in with a PureApplication System:
- HTTP Server
- WebSphere Application Server
- DB2 database server
The IBM PureApplication System offers built-in expertise in the form of WebSphere and DB2 patterns, including the strategy to eliminate single points of failure, so that the HA provided by the hardware components also extends to the complete software middleware stack. The end-to-end high availability expertise available in the workload patterns eliminates single points of failure for all nodes of the system and can minimize both planned and unplanned downtime.
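To give a feeling for why removing single points of failure matters across the whole stack, here is a small Python sketch of the classic series/parallel availability arithmetic. It assumes that components fail independently and uses a made-up availability of 99 percent per node; these are illustrative values of my own, not IBM figures, and real failures are rarely fully independent.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies where one working copy is enough.

    Assumes the copies fail independently of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every tier must be working."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


NODE = 0.99  # hypothetical availability of a single node

# HTTP Server -> WebSphere Application Server -> DB2, one node per tier.
single = series_availability(NODE, NODE, NODE)

# The same three tiers, each one duplicated.
pair = parallel_availability(NODE, 2)
duplicated = series_availability(pair, pair, pair)

print(f"single nodes:     {single:.6f}")
print(f"duplicated tiers: {duplicated:.6f}")
```

Under these assumptions the chain of single nodes is less available than any one of its nodes, while duplicating every tier brings the whole stack well above the availability of a single node.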


I find it hard to believe that the system doesn't have any possible weak links among its racks. I'd rather have something ready just in case the supposedly impossible suddenly becomes possible.