A few posts back, we encountered the evolving term big data which describes gigantic mass of data that big business enterprises are eyeing to mine for whatever value can be obtained from the data.
Examples of big data may be found in the unimaginable collection of facts, figures, image/video/multimedia data that the Google search engines have piled up from 1997 to the present, as well as in the staggering amount of personal and related data that Facebook has collected from its more than 1.35 billion registered users worldwide since Mark Zuckerberg established it in 2004. Other organizations have their own sets of big data from their own sources.
The process of big data collection alone is itself is an enormous effort that requires backend support of data centers running on a 24/7 basis the whole year — and the advanced technology packed inside the data centers. With the extremely high cost of collecting big data, it is only natural for the business enterprise involved to recover that cost by making use of the Godzilla-sized data waiting to be tapped in the enterprise’s storage devices. An important step in using big data is data analytics, and this too requires the use of advanced technology.
Fortunately such a technology exists, thanks to hardware/software vendors and open-source software developers who are coming up with more powerful processing capability, increased levels of memory, advances in bandwidth, and highly distributed architectures that measure up to the challenge of big data.
One particular technology that stands out from the many offerings in the market is Apache Hive, which the Apache Software Foundation itself describes as “a data warehouse software (that) facilitates querying and managing large datasets residing in distributed storage“.
Hive does not work alone. It is built on top of — and works with — Apache Hadoop, an open-source software that allows distributed processing of large subsets of big data across clustered computers using simple programming models. Hadoop is designed for scalability; user organizations can start with single server machines and scale up to hundreds or thousands, and each machine is capable of local computation and storage. The Hadoop software library is designed for detecting and handling failures at the application layer. This means highly available service over clustered machines.
Hive has tools to easily extract, transform, and load subsets of big data that are stored in HDFS (Hadoop Distributed File System) or in other compatible storage systems such as Apache HBase. It can impose structure on various data formats, which makes it possible to query it using HiveQL (a query language that resembles SQL). The ability to query, in turn, provides the ability to analyze data and extract value out of it.
Data queries on Hive are done via Hadoop MapReduce, a software framework for easily writing applications which process multi-terabyte data sets in parallel on clusters consisting of thousands of nodes. Sequences of MapReduce programs are produced by a powerful data analysis platform working behind the scenes: Apache Pig. MapReduce and HDFS run in the same set of nodes.
Apache Hive and all the collaborating software need appropriate IT infrastructure to host them. Unless you have the necessary talent in your business, you need to see qualified IT professionals to help you plan infrastructure acquisition and configuration because there will be plenty of technical details to attend to before Apache Hive can make big data analytics a reality in your business.