What & Why?
MapReduce is a programming model and implementation for processing and generating large data sets. Users express problems in the form of two functions:
- Map -> Generates a set of intermediate key-value pairs.
- Reduce -> Merges all intermediate values associated with the same key.
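The two functions can be sketched for the classic word-count problem. This is a minimal illustration; the function names and signatures here are my own, not the framework's exact API:

```python
def map_fn(filename, contents):
    """Map: emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: sum all counts emitted for the same word."""
    return sum(values)
```

Note that both functions are plain sequential code; all the parallelism lives in the runtime around them.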
These two functions are simple and sequential. The MapReduce runtime takes care of:
- Partitioning the input data.
- Scheduling execution across a cluster.
- Handling machine failures.
- Managing state and inter-machine communication.
This allows developers with minimal experience in distributed systems to write programs that run efficiently on large clusters.
Examples: tasks that can’t practically be performed by a single machine.
- Building search index (inverted index of content to pages)
- Sorting massive datasets
- Distributed grep
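For instance, the inverted index from the first example fits the model naturally. A sketch (function names are illustrative):

```python
def map_fn(doc_id, contents):
    """Map: emit (word, doc_id) for every distinct word in the page."""
    for word in set(contents.split()):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    """Reduce: collapse all postings for a word into a sorted list of pages."""
    return sorted(set(doc_ids))
```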
How?
We’ll walk through how a MapReduce job runs from start to end. This will solidify our understanding of the processes involved and the data flow in a typical MapReduce job execution.
Split Input Data
The input dataset is split into multiple parts. Typical size is 16 MB to 64 MB.
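The splitting itself is mechanical. A toy sketch (real systems also align splits to record boundaries, which this ignores):

```python
def split_input(data: bytes, split_size: int = 64 * 1024 * 1024):
    """Yield fixed-size chunks of the input; default split is 64 MB."""
    for start in range(0, len(data), split_size):
        yield data[start:start + split_size]
```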
Spawn M Map tasks
The coordinator schedules M map tasks, all running in parallel.
Map Phase: Emit intermediate key-value pairs
Each map task:
- Applies the user’s map function to its input split.
- Emits intermediate key-value pairs.
- Buffers the intermediate KV pairs in memory. Buffered KV pairs are spilled to disk in sorted order by key.
- Partitions the data across R intermediate output files using the default or a user-provided hash function. This ensures that the same key always ends up on the same reducer.
The MapReduce framework takes care of sorting the data twice: once during the spill-and-merge phase, and again during the shuffle phase (when each reducer reads its partition’s intermediate files from all mapper machines).
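The partition-then-sort step on the mapper side can be sketched like this. I use `crc32` as a stand-in hash so the example is deterministic; the real framework's default hash may differ:

```python
import zlib
from collections import defaultdict

def partition_and_sort(pairs, R):
    """Assign each key to one of R partitions, then sort each partition
    by key, mimicking the mapper's spill-time sort."""
    partitions = defaultdict(list)
    for key, value in pairs:
        p = zlib.crc32(key.encode()) % R  # same key -> same partition
        partitions[p].append((key, value))
    return {p: sorted(kvs) for p, kvs in partitions.items()}
```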
Shuffle Phase
Once all the map tasks are done, the coordinator spawns R reduce tasks.
Each reducer:
- Fetches the intermediate output files belonging to its partition from all map tasks over the network.
- The MR framework ensures the reducer gets data in sorted order by key.
- The reducer reads keys through an iterator interface, so value lists too large to fit in memory can still be processed.
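The merge-and-group step on the reducer side maps neatly onto a k-way merge of the sorted runs fetched from each mapper. A sketch using the standard library:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merged_groups(sorted_runs):
    """Merge sorted (key, value) runs from each mapper and group values
    by key, yielding an iterator of values per key as the reducer sees it."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, (value for _, value in group)
```

Because the runs are already sorted, the merge is streaming: no reducer ever needs to hold all of its input in memory at once.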
Reduce Phase
Each Reducer:
- Runs the user-defined reduce(key, values) function for each key.
- Writes output to a temporary file.
- Once done, atomically renames the temporary file to the final output file.
Reducer output files are stored in GFS (Google File System - Distributed File System).
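The write-then-rename commit can be sketched on a local filesystem (GFS provides the analogous atomic rename in the real system; function name is mine):

```python
import os
import tempfile

def write_output_atomically(path, lines):
    """Write reducer output to a temp file in the target directory, then
    atomically rename it into place so a partial file is never visible."""
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        for line in lines:
            f.write(line + "\n")
    os.rename(tmp_path, path)  # atomic on POSIX within one filesystem
```

The atomicity is what makes duplicate task execution safe: two copies of the same reduce task racing to rename simply leave one complete output file.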
Handling Failures:
Worker (Map/Reduce) Failures
The coordinator periodically pings all the workers. If it can’t reach a worker machine (machine failure / network partition), it:
- Reschedules all in-progress and completed map tasks on that worker. Completed map tasks must be rerun because their intermediate output lives on the failed machine’s local disk.
- Reschedules in-progress reduce tasks. There is no need to reschedule completed reduce tasks, as their output files are stored in GFS.
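The asymmetry between map and reduce tasks can be captured in a small sketch (the task-table schema here is illustrative, not the paper's):

```python
def handle_worker_failure(tasks, failed_worker):
    """Mark tasks on a dead worker idle so the scheduler reruns them.
    Completed map output lives on the worker's local disk, so completed
    map tasks rerun; completed reduce output is already safe in GFS."""
    for task in tasks.values():
        if task["worker"] != failed_worker:
            continue
        if task["type"] == "map" and task["state"] in ("in-progress", "completed"):
            task["state"] = "idle"
        elif task["type"] == "reduce" and task["state"] == "in-progress":
            task["state"] = "idle"
    return tasks
```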
Master Failure
In the original system, master failure was not handled, for two reasons:
- The master runs on a single machine, so it’s far less likely to fail than the fleet of workers.
- Users were responsible for retrying the job in case of master failure.
Optimizations
At the time the system was built at Google (2003), the network was not as fast as it is today. This was a factor in the design decisions and tradeoffs made.
Locality Aware Map Execution
The scheduler tries to schedule a map task on a machine that already holds the data it needs to process. GFS runs on the same machines, and each input file is replicated across several of them. If the scheduler can’t place the task on a machine holding the input file, it schedules the execution on a machine near one.
Combiner
Users can define a combiner function, which is similar to the reducer. It runs on the mapper before the shuffle and reduces the size of the intermediate files by partially aggregating values beforehand.
Example: Word Count
Map Output: <"a": "1">, <"b": "1">, <"a": "1">
Combiner Output: <"a": "2">, <"b": "1">
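For word count the combiner can literally reuse the reduce logic. A sketch of the local aggregation step:

```python
from collections import Counter

def combine(pairs):
    """Combiner for word count: sum counts per key on the mapper,
    shrinking the intermediate data before the shuffle."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())
```

This only works because summation is associative and commutative; a combiner must be safe to apply to any subset of a key's values.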
Dealing with Stragglers
A slow worker can delay the whole job. A worker could be slow for a variety of reasons: a bad disk, resource contention, etc.
To solve this, the coordinator schedules backup tasks near the end of an execution phase (map or reduce). This leads to duplicate tasks running, and whichever copy finishes first wins.
This improves tail latency.
Custom Hash Function
Users can override the default hash partitioning.
Example: ensure all URLs from a host go to the same reducer
Hash Function: hash(hostname(url)) % R
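That partitioner is a one-liner in practice. A sketch, again using `crc32` as a stand-in for whatever hash the user plugs in:

```python
import zlib
from urllib.parse import urlparse

def url_partition(url, R):
    """Partition by hostname so all URLs from one host land on the
    same reducer: hash(hostname(url)) % R."""
    host = urlparse(url).hostname
    return zlib.crc32(host.encode()) % R
```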
Annotated Paper: Google Drive