7 Tips for Improving MapReduce Performance
One service that Cloudera provides for our customers is help with tuning and optimizing MapReduce jobs. Since MapReduce and HDFS are complex distributed systems that run arbitrary user code, there’s no hard and fast set of rules to achieve optimal performance; instead, I tend to think of tuning a cluster or job much like a doctor would treat a sick human being. There are a number of key symptoms to look for, and each set of symptoms leads to a different diagnosis and course of treatment.
In medicine, there’s no automatic process that can replace the experience of a well seasoned doctor. The same is true with complex distributed systems — experienced users and operators often develop a “sixth sense” for common issues. Having worked with Cloudera customers in a number of different industries, each with a different workload, dataset, and cluster hardware, I’ve accumulated a bit of this experience, and would like to share some with you today.
In this blog post, I’ll highlight a few tips for improving MapReduce performance. The first few tips are cluster-wide, and will be useful for operators and developers alike. The latter tips are for developers writing custom MapReduce jobs in Java. For each tip, I’ll also note a few of the “symptoms” or “diagnostic tests” that indicate a particular remedy might bring you some good improvements.
Please note, also, that these tips contain lots of rules of thumb based on my experience across a variety of situations. They may not apply to your particular workload, dataset, or cluster, and you should always benchmark your jobs before and after any changes. For these tips, I’ll show some comparative numbers for a 40GB wordcount job on a small 4-node cluster. Tuned optimally, each of the map tasks in this job runs in about 33 seconds, and the total job runtime is about 8m30s.
Tip 1) Configure your cluster correctly
Diagnostics/symptoms:
- top shows slave nodes fairly idle even when all map and reduce task slots are filled up running jobs.
- top shows kernel processes like RAID (mdX_raid*) or pdflush taking most of the CPU time.
- Linux load averages are often seen more than twice the number of CPUs on the system.
- Linux load averages stay less than half the number of CPUs on the system, even when running jobs.
- Any swap usage on nodes beyond a few MB.
By Todd Lipcon read more