A few posts back, we encountered the evolving term “big data”, which describes the gigantic mass of data that big business enterprises are eyeing to mine for whatever value can be obtained from it.
Examples of big data may be found in the unimaginable collection of facts, figures, image/video/multimedia data that the Google search engines have piled up from 1997 to the present, as well as in the staggering amount of personal and related data that Facebook has collected from its more than 1.35 billion registered users worldwide since Mark Zuckerberg established it in 2004. Other organizations have their own sets of big data from their own sources.
The process of big data collection is itself an enormous effort that requires the backend support of data centers running 24/7 all year round, along with the advanced technology packed inside them. With the extremely high cost of collecting big data, it is only natural for the business enterprise involved to recover that cost by making use of the Godzilla-sized data waiting to be tapped in the enterprise’s storage devices. An important step in using big data is data analytics, and this too requires the use of advanced technology.
Fortunately, such technology exists, thanks to hardware/software vendors and open-source software developers who are coming up with more powerful processing capability, increased levels of memory, advances in bandwidth, and highly distributed architectures that measure up to the challenge of big data.
One particular technology that stands out from the many offerings in the market is Apache Hive, which the Apache Software Foundation itself describes as “a data warehouse software (that) facilitates querying and managing large datasets residing in distributed storage”.
Hive does not work alone. It is built on top of, and works with, Apache Hadoop, open-source software that allows distributed processing of large subsets of big data across clustered computers using simple programming models. Hadoop is designed for scalability; user organizations can start with a single server and scale up to hundreds or thousands of machines, each capable of local computation and storage. The Hadoop software library is designed to detect and handle failures at the application layer, so the cluster as a whole can deliver a highly available service even when individual machines fail.
Hive has tools to easily extract, transform, and load subsets of big data that are stored in HDFS (Hadoop Distributed File System) or in other compatible storage systems such as Apache HBase. It can impose structure on data held in a variety of formats, which makes it possible to query that data using HiveQL (a query language that resembles SQL). The ability to query, in turn, provides the ability to analyze the data and extract value out of it.
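To make the idea of imposing structure on stored data a little more concrete, here is a minimal sketch in plain Python. It is not Hive and does not use any Hive API; the sample records, the `SCHEMA` list, and the function names are all made up for illustration. It only shows the general pattern: the raw text carries no schema of its own, a schema is applied when the data is read, and after that a SQL-like aggregation becomes possible.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data: tab-delimited text with no schema embedded in it.
RAW = "1001\tPH\t25.50\n1002\tUS\t40.00\n1003\tPH\t10.25\n"

# The structure is declared separately and applied only at read time.
# Column names and types here are assumptions for this example.
SCHEMA = [("order_id", int), ("country", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Yield one dict per line, casting each field according to the schema."""
    reader = csv.reader(io.StringIO(raw_text), delimiter="\t")
    for fields in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def total_by_country(rows):
    """Rough equivalent of a SQL-style query such as:
    SELECT country, SUM(amount) FROM orders GROUP BY country
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(total_by_country(read_with_schema(RAW, SCHEMA)))
```

In a real Hive deployment the structure would be declared as a table definition and the aggregation written in HiveQL, with the heavy lifting spread across the cluster rather than done in a single Python process.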
Data queries on Hive are executed via Hadoop MapReduce, a software framework for writing applications that process multi-terabyte data sets in parallel on clusters consisting of thousands of nodes. Hive itself compiles each HiveQL query into a sequence of MapReduce jobs behind the scenes. Apache Pig, a related data analysis platform, does the same for scripts written in its own language, Pig Latin. MapReduce and HDFS typically run on the same set of nodes.
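For readers who have not met the MapReduce model before, the following is a toy, single-machine sketch of it in Python, using the classic word-count example. It is only an illustration of the map, shuffle, and reduce steps; the function names are my own, and nothing here calls Hadoop. On a real cluster the framework runs many map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "Hive queries big data"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, both steps can be split across many machines, which is what lets the model scale to very large data sets.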
Apache Hive and all the collaborating software need appropriate IT infrastructure to host them. Unless you have the necessary talent in your business, you need to consult qualified IT professionals to help you plan infrastructure acquisition and configuration, because there will be plenty of technical details to attend to before Apache Hive can make big data analytics a reality in your business.
These days there’s much talk in IT circles about big data. No, it’s not a new phenomenon taking the world by storm. Neither is it about data having morphed into something so huge that the earth may no longer be enough to contain it. Rather, it’s more a matter of greatly heightened interest in the gigatons of data that are being migrated from traditional physical storage devices into cloud computing (and storage) facilities.
At what point does data become “big data”? It’s not clear yet. One article I read on the Web says that big data is a term still in the process of “evolving”, and that it is used to describe an extraordinary quantity of structured, not-so-structured, and entirely unstructured data that can potentially be mined for precious information, because these are not just any data: they are enterprise data. Big data is said to have three characteristics: volume (an exceedingly large one), variety of data types (SQL/MySQL/NoSQL/XML datasets, multimedia, SMS, photos, plain text, etc.), and velocity (the speed at which such data is processed).
Does a terabyte (a trillion bytes) of data qualify as big data? Not quite. Big data is in the magnitude of petabytes (quadrillions of bytes) and exabytes (quintillions of bytes). Now that is really big, I should say.
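To put those magnitudes side by side, here is a small Python sketch of the arithmetic. It assumes the decimal (power-of-ten) definitions of the units, which match the "trillion", "quadrillion", and "quintillion" figures above; the names `UNITS` and `how_many` are just for this example.

```python
# Decimal (SI) byte units, matching the figures quoted in the text.
UNITS = {
    "terabyte": 10**12,  # a trillion bytes
    "petabyte": 10**15,  # a quadrillion bytes
    "exabyte": 10**18,   # a quintillion bytes
}


def how_many(smaller_unit, larger_unit):
    """Return how many of the smaller unit fit into one of the larger unit."""
    return UNITS[larger_unit] // UNITS[smaller_unit]


if __name__ == "__main__":
    print(how_many("terabyte", "petabyte"))  # terabytes in one petabyte
    print(how_many("terabyte", "exabyte"))   # terabytes in one exabyte
```

So a petabyte is a thousand terabytes, and an exabyte is a million of them.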
For years, much of the world’s data have been stashed away in relational databases, which are designed according to an established “schema” and therefore highly structured. But lately there has been a rapid deluge of data that follow either a “schema on the fly” architecture or no schema at all (structureless) and these contribute to the formation of a massive data minefield.
Business enterprises have to mine the data to extract value out of them, and doing this is not only exceedingly costly but also inordinately time-consuming. Novel ways of storage and analysis that don’t rely much on data quality or RDBMS (relational database management systems) need to be explored. Some of the new ways being considered are: combining extended metadata with unprocessed data in a huge data pool, teaching the computer system to learn the properties of the data being processed, and using artificial intelligence software to find and analyze repeatable data patterns. Big players (the ones with the big data) hope to find the solution in the much-touted cloud platform.
Whatever the platform to be used in dealing with big data, there is going to be a corresponding increase in demand for highly skilled computer/systems engineers to deploy the required infrastructure and ensure that it works as expected. In addition, there will be a need for more data scientists to do the “dirty work” of extracting data at the minefield and converting them into profitable enterprise information.