What is Big Data?
Big data is generated wherever digital processes and social media exchange data at large scale, to the point where processing and storage in traditional structured databases become extremely complicated and error-prone. At this point, a big data strategy is applied to increase productivity and reduce errors. Big data is characterized by volume, variety, velocity, variability, and veracity.
Ingestion Layer
The responsibility of this layer is to separate the noise from the relevant information. The ingestion layer should be able to validate, cleanse, transform, reduce, and integrate the data into the big data tech stack for further processing.
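As a minimal sketch of these ingestion steps (validate, cleanse, transform); the record schema and the rules applied here are hypothetical, chosen only to illustrate each stage:

```python
def ingest(records):
    """Validate, cleanse, and transform raw records before loading.

    Hypothetical rules: a record must have a non-empty 'id' (validate);
    string fields are stripped of surrounding whitespace (cleanse);
    'amount' is cast to float so downstream processing sees a uniform
    type (transform).
    """
    clean = []
    for rec in records:
        if not rec.get("id"):          # validate: drop noise records
            continue
        rec = {k: v.strip() if isinstance(v, str) else v
               for k, v in rec.items()}  # cleanse: normalize strings
        rec["amount"] = float(rec.get("amount", 0))  # transform: uniform type
        clean.append(rec)
    return clean

raw = [{"id": "1 ", "amount": "3.5"}, {"id": "", "amount": "9"}]
loaded = ingest(raw)
# loaded == [{"id": "1", "amount": 3.5}]
```

In a production stack these rules would be driven by schemas and data-quality policies rather than hard-coded checks.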
Distributed (Hadoop) Storage Layer
The data is stored in massively distributed storage systems such as the Hadoop Distributed File System (HDFS). Hadoop is an open-source framework that allows us to store huge volumes of data in a distributed fashion across low-cost machines. Hadoop can support petabytes of data and provides a massively scalable MapReduce engine that computes results in batches. Reading and writing data directly against HDFS requires complex programs written by skilled developers; NoSQL databases built on top of HDFS, such as HBase, simplify this access.
MapReduce was introduced by Google for efficiently executing a set of functions against a large amount of data in batch mode. The map component distributes the problem or tasks across a large number of systems and handles the placement of the tasks in a way that distributes the load and manages recovery from failures. After the distributed computation is completed, another function called reduce combines all the elements back together to provide a result. MapReduce simplifies the creation of processes that analyze large amounts of unstructured and structured data in parallel. Underlying hardware failures are handled transparently for user applications, providing a reliable and fault-tolerant capability.
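The map, shuffle, and reduce phases described above can be sketched in plain Python with the classic word-count example; this is a single-process illustration of the programming model only, not the distributed framework itself:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is distributed"]
result = reduce_phase(shuffle(map_phase(docs)))
# result == {"big": 2, "data": 2, "is": 2, "distributed": 1}
```

In a real cluster, each map task runs on the node holding its block of input data, and the shuffle moves intermediate pairs across the network to the reducers.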
Hive is a data-warehouse system for Hadoop that provides the capability to aggregate large volumes of data through a SQL-like query language (HiveQL). Hive also compresses stored data for improved storage-resource utilization without significantly affecting access speed.
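A typical Hive job aggregates rows with SQL-style GROUP BY semantics. As a rough sketch of those semantics only, using Python's built-in sqlite3 as a stand-in for HiveQL (the table and data here are hypothetical):

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table of page-view logs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (page TEXT, user TEXT)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("home", "a"), ("home", "b"), ("about", "a")],
)

# The same GROUP BY aggregation would be expressed almost identically
# in HiveQL, but Hive compiles it into MapReduce jobs over HDFS data.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM logs GROUP BY page ORDER BY page"
).fetchall()
# rows == [("about", 1), ("home", 2)]
```

The point of Hive is precisely this: analysts write the declarative query, and the engine generates the parallel batch jobs.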
Pig is a scripting language that allows us to manipulate data in HDFS in parallel. Its intuitive syntax simplifies the development of MapReduce jobs, providing an alternative programming language to Java.
HBase is a column-oriented database that provides fast access to big data. The most common file system used with HBase is HDFS. It has no real indexes, supports automatic partitioning, and scales linearly and automatically as new nodes are added. It is Hadoop compliant, fault tolerant, and suitable for batch processing.
Security Layer
As big data analysis becomes a mainstream capability for companies, the security of that data becomes a prime concern. Security requirements have to be part of the big data fabric from the beginning, not an afterthought.
Monitoring Layer
Monitoring systems have to be aware of large distributed clusters deployed in a federated mode. The system should also provide tools for data storage and visualization. The monitoring itself must impose very low overhead and support high parallelism. Open source tools like Ganglia and Nagios are widely used for monitoring big data tech stacks.
Visualization Layer
A huge volume of big data can lead to information overload. However, if visualization is incorporated early on as an integral part of the big data tech stack, it helps data analysts and scientists gain insights faster and increases their ability to look at different aspects of the data in various visual modes. Once the aggregated output of Hadoop processing is moved (for example, with Sqoop) into the traditional ODS, data warehouse, and data marts for further analysis alongside the transaction data, the visualization layers can work on top of this consolidated aggregated data. Additionally, if real-time insight is required, real-time engines powered by complex event processing (CEP) engines and event-driven architectures (EDAs) can be utilized.