Things you should know about Big Data – Part 1 (Introduction to Hadoop)
Big Data – Is It Big?
Big Data isn’t “big”. It is diverse.
Big Data is a collection of data sets so large and complex that they become difficult to process using traditional data processing applications.
We can define Big Data in terms of the 3 V’s:
- Volume – The quantity of generated and stored data.
- Velocity – The speed at which the data is generated and processed.
- Variety – The type and nature of the data.
Although big data doesn’t refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data, much of which cannot be integrated easily.
“Big Data is like teenage s*x: everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it.” – Dan Ariely
Big data can be analyzed for insights that lead to better decisions and strategic business moves.
Big data analytics is often associated with cloud computing because the analysis of large data sets in real-time requires a platform like Hadoop to store large data sets across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources.
Still, Big Data presents challenges: understanding and architecting solutions that incorporate Big Data technologies, knowing which Big Data solutions best meet your needs, and delivering analytics results from Big Data volumes and varied data types.
We will discuss these challenges and their solutions in the next article.
Moving to the next topic: whenever we discuss Big Data, Hadoop stands front and centre in the conversation about how to implement a big data strategy.
What does it really mean when someone says ‘Hadoop’?
Hadoop is not really a database or a single product…!
Hadoop is an open source framework that offers a powerful distributed platform to store and manage big data over clusters of computers using simple programming models.
It consists of multiple open source products such as HDFS (Hadoop Distributed File System), MapReduce, Pig, Hive, HBase, Ambari, Mahout, Flume and HCatalog.
Basically, Hadoop is an ecosystem — a family of open source products and technologies overseen by the Apache Software Foundation (ASF).
Currently, four core modules are included in the basic framework from the Apache Foundation:
- Hadoop Common – the libraries and utilities used by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – the distributed file system that lets Hadoop scale across commodity servers and, importantly, store data on the compute nodes in order to boost performance (and potentially save money).
- MapReduce – a parallel-processing engine that allows Hadoop to churn through large data sets in relatively short order.
- YARN – resource management framework for scheduling and handling resource requests from distributed applications.
Among these, two main parts stand out – a distributed file system for data storage (HDFS) and a data processing framework (MapReduce).
Hadoop Distributed Filesystem (HDFS):
The distributed file system is that far-flung array of storage clusters noted above – i.e., the Hadoop component that holds the actual data. By default, Hadoop uses the cleverly named Hadoop Distributed File System (HDFS), although it can use other file systems as well.
HDFS is like the bucket of the Hadoop system: You dump in your data and it sits there all nice and cozy until you want to do something with it, whether that’s running an analysis on it within Hadoop or capturing and exporting a set of data to another tool and performing the analysis there.
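The “bucket” workflow above is usually driven through Hadoop’s `hdfs dfs` command-line client. As a minimal sketch, the Python snippet below just builds a couple of typical commands (the file and directory names are hypothetical examples); on a real cluster you would execute them with `subprocess.run`.

```python
# Sketch of the HDFS "bucket" workflow: dump data in, list it later.
# The `hdfs dfs` subcommands used here (-put, -ls) are standard Hadoop
# shell commands; the paths are made-up examples.
import subprocess

def hdfs_cmd(*args):
    """Build an `hdfs dfs` command as an argument list for subprocess.run."""
    return ["hdfs", "dfs", *args]

# Dump a local file into HDFS ...
put = hdfs_cmd("-put", "sales-2016.csv", "/data/raw/")
# ... and later list the directory before analysing or exporting the data.
ls = hdfs_cmd("-ls", "/data/raw/")

for cmd in (put, ls):
    print(" ".join(cmd))
    # On a machine with Hadoop installed, you would actually run:
    # subprocess.run(cmd, check=True)
```

Building the command lists separately from running them keeps the sketch runnable anywhere, even without a Hadoop installation.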
Data Processing Framework & MapReduce:
Hadoop stores data and you can pull data out of it, but there are no queries involved – SQL or otherwise. Hadoop is more of a data warehousing system – so it needs a system like MapReduce to actually process the data.
MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed. Using MapReduce instead of a query gives data seekers a lot of power and flexibility, but also adds a lot of complexity.
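To make the map/reduce job shape concrete, here is a toy, single-process word count in plain Python. It only illustrates the three phases – map emits key/value pairs, a shuffle groups them by key, reduce combines each group; a real Hadoop job distributes these phases across the cluster and is typically written in Java.

```python
# Toy illustration of the MapReduce model (not a distributed job).
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop's shuffle/sort step does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all counts for one word into a total.
    return key, sum(values)

lines = ["big data is diverse", "big data is not just big"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # "big" appears three times across both lines
```

Even this tiny example shows why MapReduce adds complexity: what a single SQL aggregate would express takes three explicit phases of code.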
There are tools to make this easier: Hadoop’s ecosystem includes Hive, another Apache project, which converts HiveQL, a SQL-like query language, into MapReduce jobs. But MapReduce’s complexity and its limitation to one-job-at-a-time batch processing tend to result in Hadoop being used more often as a data warehousing tool than as a data analysis tool.
It’s not necessary to stick with just HDFS and MapReduce. Amazon Web Services has adapted its own S3 filesystem for Hadoop. Apache Spark offers an in-memory data-flow engine that can replace legacy MapReduce for analytics. The Apache Cassandra database is a strong choice when you need scalability and high availability without compromising performance, and so on.
Hadoop is not always a complete, out-of-the-box solution for every Big Data task. MapReduce is mainly used for batch processing, and many teams use the framework primarily for its ability to store lots of data fast and cheaply.
Hadoop delivers a proven solution for storing and processing large data sets, enabling businesses to leverage the big, diverse data that was previously too expensive or complex to use effectively. Despite its advantages, the technology is not a replacement for a data warehouse or data integration tools. Instead, Hadoop’s value increases when it is integrated with other data or analytics solutions.
That’s why Hadoop is likely to remain the elephant in the Big Data room for some time to come.