A couple of weeks ago, I had the opportunity to attend a Big Data briefing from HP Chief Technologist for Data Management, Greg Battas. Battas is part of the newly formed Converged Systems group in HP. He was a pioneer of Very Large Databases (VLDBs) and analytics in the telecom industry and has worked in business intelligence and analytics for a couple of decades. He speaks internationally on data integration and holds several patents in the areas of relational databases, parallel query optimization and real-time infrastructure architectures.
Coming from a mid-market company, Big Data seems like a problem that doesn’t affect me. It’s a concept I have a difficult time wrapping my head around, and for that reason, I’ve not written about it in the past. But it seems that Big Data is more than just a buzzword of the year, and there is a lot of innovation occurring around it in attempts to solve customer problems with these datasets.
So, what is Big Data? Wikipedia seems to sum it up best by saying “Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” That helps.
So within the realm of Big Data, you begin to think about supercomputers, hyper-scale arrays of servers and new innovations like in-memory databases. So where is this going and what’s happening in Big Data today?
Direction: Consolidation of Clusters
Big Data architectures in the past were based around very simple servers with direct-attached storage. That led to a first Hadoop cluster, then a second, third and fourth. A lot of the thought in Big Data was about moving away from proprietary storage and databases toward parallel programming and distributed file systems, like Google File System, on standard hardware. Common wisdom dictates that Big Data depends on massive IO to read huge amounts of data from disk, and that to optimize this you move compute closer to the data to reduce overhead. There are certain operations that benefit from being pushed down to the same location as the data.
In reality, what has been learned from Big Data workloads is that only a portion of the processing can be close to the data. Keeping data and compute local to one another is actually difficult and does not always result in optimal performance, because the data may not be in the appropriate form for processing and may need to be shuffled or reduced. In addition, it has been observed that the majority of CPU power is still needed for analytics and aggregation. And when most work is done in-memory, where the data sits on disk matters far less. This becomes particularly important with NVRAM and other persistent RAM technologies in the future. What all this allows is the consolidation and re-tasking of clusters to meet multiple needs.
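To make the point about shuffling concrete, here is a minimal Python sketch of a map, shuffle and reduce job over three simulated nodes. It is only an illustration under my own assumptions, not anything shown in the briefing: the node names, the call records and the helper functions (map_local, shuffle, reduce_all, run_job) are all hypothetical. The map step runs entirely on local data, but the totals cannot be finished until partial results are moved between nodes.

```python
from collections import defaultdict

# Hypothetical data: each simulated node holds a local block of
# (customer, call_minutes) records.
NODE_BLOCKS = {
    "node-1": [("alice", 5), ("bob", 3), ("alice", 2)],
    "node-2": [("bob", 7), ("carol", 1)],
    "node-3": [("alice", 4), ("carol", 6)],
}


def map_local(records):
    """Runs next to the data: pre-aggregate one node's block, no network IO."""
    partial = defaultdict(int)
    for key, minutes in records:
        partial[key] += minutes
    return dict(partial)


def shuffle(partials, reducers):
    """Route every key's partial sums to a single reducer node.

    Returns the routed values and a count of partial results that had to
    leave the node they were computed on.
    """
    routed = {r: defaultdict(list) for r in reducers}
    moved = 0
    for node, partial in partials.items():
        for key, value in partial.items():
            # Deterministic toy partitioner (an assumption for this sketch).
            target = reducers[sum(key.encode()) % len(reducers)]
            routed[target][key].append(value)
            if target != node:
                moved += 1
    return routed, moved


def reduce_all(routed):
    """Each reducer sums the partial values it received for its keys."""
    totals = {}
    for groups in routed.values():
        for key, values in groups.items():
            totals[key] = sum(values)
    return totals


def run_job(blocks):
    partials = {node: map_local(records) for node, records in blocks.items()}
    routed, moved = shuffle(partials, sorted(blocks))
    return reduce_all(routed), moved


if __name__ == "__main__":
    totals, moved = run_job(NODE_BLOCKS)
    print("totals:", totals)
    print("partial results moved across nodes:", moved)
```

Even in this toy version, only the first step benefits from sitting next to the disk. The shuffle and the final aggregation are network and CPU work, which is the part of the job that does not care which cluster the data originally lived on.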
Direction: Software Defined Storage
Big Data workloads, particularly the largest datasets in the world, are running on distributed file systems on industry standard server hardware. These are parallel file systems rather than traditional storage arrays or databases. The Hadoop Distributed File System (HDFS) is becoming a very common interface across multiple platforms and deployments, and vendors are adopting HDFS and integrating it under their technologies. Today, there is a mix of proprietary and open source technologies.
HP is observing a lot of vendors running their proprietary systems on top of HDFS, or a very similar parallel filesystem. HP’s internal blueprint with HAVEn is much the same. HAVEn combines the unstructured data handling of Autonomy with the structured analytics of Vertica and runs them on top of HDFS. In addition to storage, HDFS allows data to be passed from one tool to another.
Direction: NoSQL is being adopted by a lot of software partners
The first wave of Big Data was around batch processing, where Hadoop was used for analytics and ETL offloading. It was often coupled with a company’s SQL databases. There is a growing interest in NoSQL products from independent software vendors (ISVs) because of their large dependence on the traditional database vendors. ISVs are seeing a large portion of their total sales being directed back to a database vendor at the close of a sale, so many are beginning to move their commercial products onto NoSQL products that do not carry those large costs. It isn’t without challenges, since moving onto a NoSQL product means there is no longer a SQL interface or transaction management. But several ISVs have active projects porting their products onto a NoSQL product like HBase.
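A rough way to picture that porting challenge is to write the same small report twice. The Python sketch below is my own illustration under stated assumptions: the orders data is made up, and the KeyValueStore class is a hypothetical in-memory stand-in for a NoSQL table (it is not the HBase API). The SQL version uses the sqlite3 module from the Python standard library.

```python
import sqlite3

# Hypothetical sample data: (order_id, customer, amount).
ORDERS = [("o1", "acme", 120), ("o2", "acme", 80), ("o3", "globex", 50)]


def totals_with_sql(orders):
    """Relational path: the database handles the transaction and the grouping."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT, amount INTEGER)"
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall()
    conn.close()
    return dict(rows)


class KeyValueStore:
    """Hypothetical stand-in for a NoSQL table: put, get and prefix scan only."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def scan(self, prefix=""):
        # Rows come back in key order, as in many sorted key-value stores.
        for key in sorted(self._rows):
            if key.startswith(prefix):
                yield key, self._rows[key]


def totals_with_key_value(orders):
    """Key-value path: the application designs the keys and does the grouping."""
    store = KeyValueStore()
    for order_id, customer, amount in orders:
        # Customer goes first in the row key so a scan can group by prefix.
        store.put(f"{customer}#{order_id}", amount)
    totals = {}
    for key, amount in store.scan():
        customer = key.split("#", 1)[0]
        totals[customer] = totals.get(customer, 0) + amount
    return totals


if __name__ == "__main__":
    print(totals_with_sql(ORDERS))
    print(totals_with_key_value(ORDERS))
```

Both functions return the same totals, but in the second one the grouping logic, the key layout and any all-or-nothing guarantee on the writes have become the application's job. That is roughly the work an ISV takes on in exchange for dropping the database license.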
Direction: Shift to optimized hardware
With a specialized workload like Big Data, there is an opportunity to tune and tailor hardware to handle the work better than industry standard hardware can. Within HP, Moonshot is one example of hardware with a hyper-scale, simple node architecture suited for Big Data (I previously covered Moonshot here). On each cartridge, co-processors or GPUs could be added to better handle Big Data workloads, and there are even possibilities within the system-on-a-chip on the cartridge.
Battas mentioned the idea of dark silicon, or unused and unpowered transistors within today’s chips. Today, the industry is capable of packing more transistors onto a piece of silicon than we have connections to power, leaving some of them dark. The interesting idea is tailoring the dark silicon cores to each handle a specific task well and then rotating the customized cores in and out of use to increase efficiency. This is a particularly interesting topic, enough so that I have written a second post today about dark silicon.