How many cars do I need for my family? Disaster recovery versus high availability considerations

 

This post started from a conversation two days ago in my office.

My colleague said to me: “Are you our PureApplication expert?” It is always dangerous when someone calls you an “expert”!

“Do we have a Disaster Recovery solution for IBM PureApplication System?”

Another colleague behind me answered: “Two PureApplication Systems are the perfect solution: the first one works, the second stays on standby in another location, and you have your DR solution.”

For an IT architect, is this a good question or a good answer? No; it is a simple answer to an incorrect question, and the answer is not applicable. I will clarify my position with a real-life example. We use a car every day to manage our lives: taking the kids to school, going to the supermarket for food, shopping at the mall, playing soccer with friends, going on holiday with the family, and so on. So we need Continuous Availability for our car. Sure! Being without a car for a long time can become a problem in many aspects of our lives.

Just a definition:

Continuous Availability (CA): The infrastructure cannot be interrupted at all. This availability level is often referred to as “five nines” or 99.999 percent availability (this means roughly five minutes per year of planned or unplanned unavailability in total).
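To see where the “five minutes per year” comes from, here is a minimal sketch of the arithmetic. It assumes a non-leap year of 365 days; the function name is my own and is only for illustration.

```python
# Minutes in a non-leap year: 365 days * 24 hours * 60 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare a few common availability levels.
for label, percent in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = downtime_minutes_per_year(percent)
    print(f"{label} ({percent}%): {minutes:.2f} minutes per year")
```

For 99.999 percent this gives a budget of a little more than five minutes per year, planned and unplanned outages together.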

How do we achieve CA for our car? Simple! In our garage we keep a second car (the identical model to the first), with a full tank and ready for use whenever the first car is not available. Is that true? How many people in the world apply this strategy? Many of us have two or more cars in the garage, but not for CA. Someone might use this strategy for CA, but it depends.

This is the problem: it depends! “It depends” means that Disaster Recovery, High Availability and Continuous Availability are solutions, and a solution presupposes a problem. First of all we need to know the problem; only then can we find a solution that is proportionate to the investment. Sure, the money aspect is important. In the car example, if a car cost nothing, a second car might be perfect… but what if we have no space in the garage? How much money can we spend on a new garage?

It depends on the cost per day of being without a car and on the probability that my car will not be available. And now we have a new parameter: the probability that our system will fail. Without first defining the variables that describe the problem, no answer about a DR solution is applicable.
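One simple way to put these variables together is an expected-value comparison. The sketch below is my own simplification, not a formal method: it assumes a single possible outage per period, and every figure in it is hypothetical and chosen only for illustration.

```python
def expected_outage_cost(cost_per_day, outage_probability, outage_days):
    """Expected loss over the period: daily cost * chance of an outage * its length in days."""
    return cost_per_day * outage_probability * outage_days


def spare_pays_off(spare_cost, cost_per_day, outage_probability, outage_days):
    """True when the spare costs less than the loss it is expected to prevent."""
    return spare_cost < expected_outage_cost(cost_per_day, outage_probability, outage_days)


# Hypothetical figures: same car and same failure probability, different cost per day.
family = dict(cost_per_day=50.0, outage_probability=0.10, outage_days=5)
taxi_driver = dict(cost_per_day=400.0, outage_probability=0.10, outage_days=5)
spare_cost = 100.0  # cost of keeping a second car for the period, also hypothetical

print("family:", spare_pays_off(spare_cost, **family))
print("taxi driver:", spare_pays_off(spare_cost, **taxi_driver))
```

With these numbers the second car does not pay off for the family but does for the taxi driver: the same solution is right or wrong depending on the problem it has to solve.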

Just another definition:

High Availability (HA): The infrastructure cannot undergo an unplanned outage for more than a few seconds or minutes at a time. It is acceptable to bring down the application for scheduled maintenance.

Now let us classify the reasons why our infrastructure (or the applications that run on it) may not be available.

Connection Failure: The network connection between the client and the infrastructure fails. The failure can be located in the wide area network (WAN) or in the customer’s internal network (intranet). This kind of failure is normally resolved by duplicating the network components. This solution is normally cheap, because the components cost little and are easy to manage.

Hardware Failure: A hardware component breaks down and interrupts part of the services. The failure can be total (the services based on that component are completely unavailable) or partial (the service works with reduced performance). In an HA hardware solution all the critical components are duplicated. Sometimes this duplication is not visible from the administrator’s point of view. With hot-swap functionality it is sometimes possible to replace the broken component without interrupting the service.

Software Failure: A part of the software stack fails. The software component can be an element of the middleware, of the operating system or of the application code. By this kind of failure we mean an accidental failure. Bugs in the code are out of scope because they are solved within the production cycle of the software.

Human Error: An incorrect activity may produce an error in a hardware or software component. This kind of error is unpredictable, but it can be reduced with specific procedures and a system management strategy. Normally these procedures rely on a part of the infrastructure that supports these processes.

Disaster Event: This event covers the total loss of the functionality of the infrastructure. All the infrastructure components are unavailable.
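The effect of the duplication mentioned above for network and hardware components can be estimated with a standard reliability formula. The sketch below assumes that the redundant copies fail independently of each other; the function name and the 99 percent figure are hypothetical.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of redundant copies, assuming independent failures.

    The group is unavailable only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical component that is available 99 percent of the time.
single = 0.99
for copies in (1, 2, 3):
    print(f"{copies} copies: {parallel_availability(single, copies):.6f}")
```

The independence assumption is exactly what a disaster event breaks: it takes out all the copies in the same location at once, which is why duplication inside one rack addresses hardware failure but not disaster recovery.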

We do not take the Connection Failure into consideration here, because it is a typical network infrastructure aspect outside the PureApplication System box. Inside the PureApplication rack, network communication is contained within the chassis.

In a traditional architecture, blade-to-blade communication flows north-south through the top-of-rack (TOR) switch. This not only adds latency; the TOR can also become a single point of failure. The PureApplication System solution is based on the PureFlex architecture, and node-to-node communication happens directly within the chassis (without the TOR). In this way the solution not only reduces switch latency but also provides a robust solution for the node connections. The communication from the PureApplication rack and your network can be based.

For Hardware Failure, the PureApplication System has been carefully designed so that there is no single point of failure within each of its racks.

Every hardware component is redundant, from the power supplies to the top-of-rack (TOR) switches. The management environment itself is also redundant within the rack, as each PureApplication System device contains two management nodes, one of which is a backup for the other.

A Software Failure in a traditional WebSphere system (including DB2) can happen when one (or more) infrastructure nodes fail.

We can assume that the following software components are offered built in to a PureApplication System:

  • HTTP Server
  • WebSphere Application Server
  • DB2 database server

The IBM PureApplication System offers built-in expertise based on WebSphere and DB2 patterns, including a strategy to eliminate single points of failure and to extend the HA offered by the hardware components to the complete software middleware stack. The end-to-end high availability expertise available in the workload patterns eliminates single points of failure for all nodes of the system, and can minimize both planned and unplanned downtime.

Luca Amato

About Luca Amato

Luca Amato is a certified IT architect in the eBusiness solution area, mainly focused on eCloud solutions. Method Exponent and member of the methodology teachers group. IGS teacher for basic and advanced methodology courses. He represents IBM (for Italy) in Joint Technical Committee 1 of the International Organization for Standardization (ISO) for SOA standardization. From 1996 to 2000 he taught at Pavia University the course (for the Mathematics and Informatics degree) "Formal language and object Oriented implementation". Now he is a Tiger Team member on PureApplication System. Follow Luca on Twitter @luke_ita

One Response to How many cars do I need for my Family? Disaster recovery versus high availability consideration

  1. I find it hard to believe that the system doesn't have any possible weak links among its racks. I'd rather have something ready just in case the supposedly impossible suddenly becomes possible.
