These days there’s much talk in IT circles about big data. No, it’s not a new phenomenon taking the world by storm. Neither is it about data having morphed into something so huge that the earth may no longer be enough to contain it. Rather, it’s more of super heightened interest in gigatons data that are being migrated from traditional physical storage devices into cloud computing (and storage) facilities.
At what point does data become “big data”? It’s not clear yet. One article I read on the Web says that big data is a term still in the process of “evolving”, and that the term is used to describe an extraordinary quantity of structured, not-so-structured, and entirely unstructured data that can potentially be mined for precious information because these are not just any data — they are enterprise data. Big data is said to have these characteristics: volume (an exceedingly large one), variety of data types (SQL/MySQL/NoSQL/XML datasets, multimedia, SMS, photos, plain text, etc.), and velocity at which processing of such data takes place.
Does a terabyte (a trillion bytes) of data qualify for big data? Not quite. Big data is in the magnitude of petabytes (quadrillions) and exabytes (quintillions). Now that is really big, I should say.
For years, much of the world’s data have been stashed away in relational databases, which are designed according to an established “schema” and therefore highly structured. But lately there has been a rapid deluge of data that follow either a “schema on the fly” architecture or no schema at all (structureless) and these contribute to the formation of a massive data minefield.
Business enterprises have to mine the data to extract value out of them, and doing this is not only exceedingly costly but also inordinately time consuming. Novel ways of storage and analysis that don’t rely much on data quality or RDBMS (relational database management systems) need to be explored. Some the new ways being considered are: combining extended metadata with unprocessed data in a huge data pool, teaching the computer system to learn the properties of the data being processed, and using artificial intelligence software to find and analyze repeatable data patterns. Big players (the ones with the big data) hope to find the solution in the much touted cloud platform.
Whatever the platform to be used in dealing with big data, there is going to be a corresponding increase in demand for highly skilled computer/systems engineers to deploy the required infrastructure and ensure that they work as expected. In addition, there will be a need for more data scientists to do the “dirty work” of extracting data at the minefield and convert them into profitable enterprise information.
– Founder of Key4ce & FuseCP.
– More then 13 years of experience with Windows high availability and Microsoft Exchange.
– More then 13 years of experience with Linux and Unix.
– Open source enthusiast and a large contributor for multiple large Open Source projects.
My current main focus of attention is Geo-Clustering.