
MapReduce

Last updated Aug 9, 2024

MapReduce is a programming model that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster.

As the processing component, MapReduce is the heart of Apache Hadoop. The term “MapReduce” refers to the two distinct tasks that Hadoop programs perform: Map and Reduce.

The model is a specialization of the split-apply-combine strategy for data analysis. It is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework differs from their original forms.

It operates in two main phases: the Map phase, which processes input data and converts it into key-value pairs suitable for analysis, and the Reduce phase, which aggregates and summarizes those intermediate results.
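
As a sketch of how the two phases fit together, here is a minimal in-memory word count in Python. The names `map_phase` and `reduce_phase` and the explicit grouping step are illustrative only; in Hadoop, the framework itself performs the shuffle and sort between the two phases and distributes the work across the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: turn each line of raw input into (key, value) pairs.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Hadoop's shuffle/sort step, simulated here: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: aggregate all values that share a key.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(lines)))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```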

The strength of MapReduce lies in this scalability: it enables parallel processing of data across numerous servers.

Good to Know: Relation to HDFS
While HDFS is the framework for data storage, MapReduce is the framework for data processing.

A typical use case combines processing with storage: in a Hadoop application, data is stored in HDFS and MapReduce processes it. MapReduce reads its input from HDFS, performs the required computation in the Map and Reduce phases, and writes the results back to HDFS.
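
To make this read-compute-write flow concrete, here is a sketch using mrjob, one third-party Python library for writing Hadoop jobs (an assumption for illustration, not something this note prescribes). When launched with the Hadoop runner, the job reads its input from HDFS and writes its output back to HDFS.

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Map phase: emit a (word, 1) pair for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce phase: sum the counts emitted for each word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run on a cluster with, e.g., `python word_count.py -r hadoop hdfs:///data/input -o hdfs:///data/output` (hypothetical paths); mrjob submits the job to Hadoop, which handles reading from and writing back to HDFS.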

HDFS provides the storage required for massive datasets, and MapReduce provides the tools to process these datasets.

Both HDFS and MapReduce are foundational components of the Hadoop ecosystem.


Origin:
References: HDFS
Created 2022-08-18