As we ring in 2011, everyone is talking about Big Data.
The previous decade has been dominated by the explosion of the Web and the big names behind it — Google, Facebook, Yahoo, Twitter, LinkedIn. This decade will be defined by the ongoing explosion of data. Driven in part by the ubiquity of Web activity, but also by the proliferation of more and more devices that generate information, data is being produced at a feverish pace and there’s no end in sight.
The challenges surrounding Big Data — how to define it, how to manage and store it, how to make use of it — have commanded the attention of big industry players and start-ups alike, spawning new companies and open source projects. Hadoop, MongoDB, Cassandra and CouchDB are all examples of new, or now not-so-new, initiatives that are taking the market by storm. There are good reasons why MapReduce and NoSQL are hot topics: in the coming year and beyond, there is clear momentum around managing ultra-large data sets. These data sets may not contain structured data at all, and may not be well suited to document-oriented or key/value stores. Hadoop and HDFS are maturing, and the others will in short order.
In the midst of all this activity, there is one growing class of data that has a particularly unique set of characteristics and database requirements: Machine-generated data.
Rise of the Machines
Machine-generated data is defined in different ways by different people. Some say it is strictly data generated without any direct human intervention; others say it also includes the machine tracking of human activities, such as web log data. But whether the definition is precise or a little more open, certain key characteristics nearly always apply: new records are added at high frequency, and the data itself is seldom if ever changed. Examples beyond web logs include all other types of logs, such as computer, network and security logs. Other examples include data coming from sensors. These might be sensors recording temperatures, force and other factors during a satellite stress test, sensors recording readings from your car, or even the data your own body generates during a surgical procedure. Finally, machine-generated data can also encompass call detail records, financial trading data, ATM transactions and RFID tags from shipping containers or cars on a tollway. One trend is unmistakable: the volume and diversity of data generated by machines is enormous, and tomorrow there will be more. Organizations need to understand this information, but all too often, extracting useful intelligence is like finding the proverbial needle in a haystack.
While traditional databases are well suited to initially storing machine-generated data, they quickly become ill suited to analyzing it. They simply run out of headroom in terms of volume, query speed, and the disk and processing infrastructure required to keep up. The instinctive response is to throw more people (database administrators) or money (expanded disk storage subsystems) at the problem, or to scale back, archiving more aggressively to shrink the dataset. All of these efforts generally buy a minimal short-term reprieve, but the issues quickly resurface — which leads us to the data warehouse. Many see this approach as the only solution to the myriad information management challenges machine-generated data presents. The problem is that data warehouse projects are generally very costly in terms of people, hardware, software and maintenance. So what's the answer?
New Data Challenges, New Data Solutions
Just as there are growing numbers of purpose-built databases for super-large social networks and SaaS environments, databases specifically suited to the management of machine-generated data are critical to smart, efficient and affordable information analysis. In an ideal world, organizations should be able to quickly and easily load, query, analyze and report on volumes of machine-generated information without a huge amount of maintenance or an army of database administrators needed to change, index or partition the data. Databases should allow for complex and dynamic queries, as well as ad-hoc analysis. Taking our “ideal” even further, data compression capabilities would also be purpose-built, and take advantage of the characteristics of machine-generated data to achieve 10:1, 20:1, even 30:1 compression (as opposed to 3:1 or 4:1 in more general environments). The good news is that all of this can be done today. The advances in database management run far and wide. From the compelling nature of complex event processing and sophisticated main memory techniques to purpose-built transactional systems and purpose-built storage and analytic databases, the technology world is evolving to allow for specialized solutions that co-exist in ways never anticipated before.
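Why should machine-generated data compress so much better than general-purpose data? Because it is highly regular: timestamps arrive at near-fixed intervals, and many fields repeat. A minimal sketch of the principle — delta encoding followed by run-length encoding — on a stream of fixed-interval timestamps (this is an illustration of why such ratios are plausible, not Infobright's actual algorithm):

```python
# Illustrative sketch only: delta-encode monotonically increasing
# timestamps, then run-length-encode the (mostly constant) deltas.
# Machine-generated data is regular enough that this collapses
# thousands of values into a handful of runs.

def delta_encode(values):
    """Keep the first value, then store successive differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse runs of identical values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Hypothetical sensor log: 1,000 readings taken every 5 seconds.
timestamps = [1_000_000 + 5 * i for i in range(1000)]
deltas = delta_encode(timestamps)       # [1000000, 5, 5, 5, ...]
compressed = run_length_encode(deltas)  # [[1000000, 1], [5, 999]]

print(len(timestamps), "values ->", len(compressed), "runs")
```

One thousand timestamps reduce to two runs; real engines layer dictionary and bit-packing schemes on top, but the regularity of the data is what makes 10:1 and better achievable.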
Specifically, there are several new approaches that today's businesses can take to get the speed, analytic flexibility and scalability required to extract useful intelligence from machine-generated data. To start, columnar databases (which store data column by column, rather than row by row as traditional databases do) have emerged as a logical choice for high-volume analytics. Because most analytic queries involve only a subset of the columns in a table, a columnar database retrieves just the data required, speeding queries while reducing disk I/O and computing resources. These databases can also deliver significant compression and accelerated query processing, which means users don't need as many servers or as much storage to analyze large volumes of information. Taking these benefits further are analytic solutions, like Infobright's, that combine column orientation with capabilities that use knowledge about the data itself to intelligently isolate relevant information and return results more quickly. The goal is to minimize the need to access unnecessary data, or to index or partition data, in order to run queries — even ones that are fairly complex.
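The I/O argument for column orientation can be made concrete in a few lines. The sketch below uses a hypothetical four-column web-log table (the names and figures are illustrative, not any product's engine): a row store must scan whole rows, while a column store touches only the columns the query names.

```python
# Toy illustration of row vs. column layout for an analytic query.
# Table of web-log records: (timestamp, host, status, bytes_sent).
rows = [
    (1, "web01", 200, 512),
    (2, "web02", 500, 128),
    (3, "web01", 200, 2048),
    (4, "web03", 404, 64),
]

# Row-oriented scan: every query reads all 4 fields of every row.
row_cells_scanned = len(rows) * 4

# Column-oriented layout: the same table stored column by column.
columns = {
    "timestamp":  [r[0] for r in rows],
    "host":       [r[1] for r in rows],
    "status":     [r[2] for r in rows],
    "bytes_sent": [r[3] for r in rows],
}

# Query: SELECT SUM(bytes_sent) WHERE status = 200
# touches only 2 of the 4 columns.
total = sum(b for s, b in zip(columns["status"], columns["bytes_sent"])
            if s == 200)
col_cells_scanned = 2 * len(rows)

print("sum:", total)  # 2560
print("row store scanned", row_cells_scanned,
      "cells; column store scanned", col_cells_scanned)
```

Here the column store reads half the cells; on a real log table with dozens of columns, a two-column query skips proportionally more, which is where the disk I/O savings come from.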
New data challenges require new data solutions. As we enter into the era of Big Data, the reward for investigating and understanding newer, innovative technologies can be equally big, both in terms of time-to-market and cost savings.
About the author
Don DeLoach is CEO of Infobright, the open source analytic database company. Infobright is being used by enterprises, SaaS and software companies in online businesses, telecommunications, financial services and other industries to provide rapid access to critical business data. For more information, please visit www.infobright.com.