DNA Methylation, “Jumping Genes”, and Hadoop in the Enterprise

I just finished reading a truly thought-provoking book, “Survival of the Sickest – The Surprising Connections Between Disease and Longevity” by Dr. Sharon Moalem. Dr. Moalem started his nonfiction thriller with the strange case of a healthy man who became not-so-healthy quite quickly but without an apparent cause. It was eventually discovered that this man was suffering from hemochromatosis, a disease which disrupts the body’s ability to metabolize iron properly. In looking further into hemochromatosis, Dr. Moalem and others found that it is actually quite a common genetic trait for anyone of Western European descent – about one in three of whom carries the gene for hemochromatosis. Why should such a strange and dangerous affliction be so common in the genetic code for those people? The theory presented was that it was because at some point in the past, it was a very good thing to have that affliction.

I’m no genetic biologist, but I find this fascinating, so I’d like to share some of what I learned. We have diseases today which, under different living and environmental conditions in the past, would actually have allowed people to survive those conditions longer than those without those “diseases”. So what we consider a disease today was actually a genetic adaptation (or mutation) – something that the body (or group of bodies) did, whether by accident or intentionally (that’s where it gets really interesting).

Dr. Moalem poses some very interesting riddles – “[W]hy would you take a drug that is guaranteed to kill you in forty years? It’s the only thing that will stop you from dying tomorrow. Why would we select for a gene that will kill us through iron-loading [hemochromatosis] by the time we reach what is now middle age? Because it will protect us from a disease that is killing everyone else long before that.”

Diseases that are very bad today, like diabetes and malaria and sickle cell anemia, may actually have helped people in the past to live longer than those “unafflicted” by those diseases or conditions. Granted, they may not have lived long by our current standards but they may have lived longer than others in those times. And that is one of the things that matters to living creatures – more time alive means more time to procreate and that means passing on genes for those traits (or diseases) that allowed them to live longer.

The human genome is a very large and complex structure that is still not fully understood. There are now thought to be 20,000-25,000 genes in the human genome, the collection of genes in your cells that create the proteins and make you “you”. But only 3% of your total DNA contains instructions for building stuff in your body — the rest, a full 97%, was considered to be completely inactive. Many scientists, in fact, considered most of our genetic material to be “junk”. Retroviruses have the ability to insert genetic material into our DNA and it appears that through our many years on the planet they have done just that; we’ve collected a lot of junk through our evolution that does nothing for us – it is just along for the ride. Well now it appears that that “junk” might have a purpose and not be junk at all. For example, there are things called “jumping genes”, or transposons, which can actually cut or copy a piece of themselves and paste that piece into an “active” gene, one that actually is doing something, thus altering its behavior or processing. So instead of waiting for random mutation for change, our bodies appear to be able to intentionally change our genetic structure. And it seems that the trigger for a “jump” to happen might be some change in environmental conditions or some kind of stress. Whoa.

There is also a field of genetics called epigenetics, which covers changes to the expression of genes without changing the underlying DNA structure. One method of doing that is DNA methylation, which allows the body to suppress (or enhance) the functioning of a particular gene. When I read about that, what popped into my mind was recently hearing a somewhat overweight person blaming their “fat gene”. I thought that a humorous retort to that next time would be “you just haven’t figured out how to methylate that gene yet” but then I saw that it isn’t so humorous – it was just addressed in a New York Times blog by Gretchen Reynolds called “How Exercise Changes Fat and Muscle Cells”.

What has prompted me to wax so poetic, er, genetic? And what’s the link to Hadoop? Well, an obvious connection is that you can’t find all these relationships, dependencies, and causations between genes and diseases and living conditions, as Dr. Moalem and the scientists he cites have done, unless you have lots of data to analyze and good ways to analyze it. Hadoop provides that ability.

But a bit more subtle than that — consider the junk genes we mentioned before. Think of all that genetic material as data. Companies have in the past not been able to deal with the large volume of “junk” data, so they would throw it away. Now with the Big Data movement, companies are holding onto all their data, even if today it is still considered to be junk, because maybe one day it won’t be. One small bit of data might have very little or no value alone today, but if it gets combined with some other little bit of data tomorrow, it might lead to some extremely valuable business insight. Or business conditions might change and that bit of data may become quite valuable, and thus, no longer be “junk”.

Also, consider the enterprise data warehouse. The business environment is changing and the EDW needs to change, to “mutate”, in order to survive. A business can no longer deal with just structured data; it has to be able to handle unstructured data (which is growing in volume much more quickly than structured data). Hadoop does that quite nicely. So what we are seeing is that even if the EDW remains focused on structured data, enterprises are augmenting their data warehouses with Hadoop to allow them to analyze all their data, regardless of structure.

Today we are announcing general availability for our new Hadoop appliance, the IBM PureData System for Hadoop. One of the key value propositions for our new Pure Systems appliance is that it lets enterprise customers get up and running on Hadoop, specifically IBM’s distribution of Hadoop, InfoSphere BigInsights, very quickly – much more quickly than if they tried to spin their own Hadoop cluster. We have also added a feature in the appliance to make it easy to move data from your Netezza data warehouse into Hadoop to allow you to keep the data online longer so that you might be able to find value in your junk.

DNA and Hadoop – a bizarre metaphor? Perhaps, but it gave me a chance to promote a cool book and to spend some quality time pondering our existence on this planet, how we have evolved based on changing conditions, and how big data might be used to enable businesses to do the same. And I got to announce our new PureData System for Hadoop. For my next blog, maybe I’ll explore salicylates in foods and discuss how vegetables might be good for us because they are actually bad for us.

Dennis Duckworth

About Dennis Duckworth

Dennis Duckworth, Director of Product Marketing, IBM PureData Systems & Netezza, has been in the data game for quite a while, doing everything from Lisp programming in artificial intelligence to managing a sales territory for an RDBMS company. His passion is helping companies and people get real value out of cool technology. He is currently contributing to IBM efforts to create a unified comprehensive Big Data platform. In his previous role, Dennis was Director of Competitive and Market Intelligence for Netezza where he helped ensure that Netezza won a majority of its deals against our competitors like Oracle, Teradata, and Greenplum. He holds a degree in Electrical Engineering from Stanford University but has spent most of his life on the East Coast. When not working, Dennis enjoys sailing and fishing off his backyard on Buzzards Bay and he continues his quest for wine enlightenment.