By Angus Kidman

Trying to choose the right tool for a big data project? This chart (and three simple rules) can help guide you through the options.

This chart is based on one shown by Microsoft Research senior research program manager Wenming Ye during a presentation at Build 2014 last week.

While choosing appropriate tools is important, skills remain the biggest challenge, Ye noted. “There’s a lot of talk about challenges in the tools and challenges in the data,” he said. “But what’s really important is actually the people. There’s a lack of understanding — we really have a lack of people who are able to understand and use these distributed tools. And it’s no-one’s fault — a lot of these tools are very difficult.”

Ye suggested three key rules when dealing with big data:

  • Make sure that you’re using data to drive decisions, and not merely tracking it for its own sake.
  • Continuously update and refine your metrics.
  • Use automation to conduct more experiments and ask more questions.

The chart divides big data tasks into three areas: batch processing, interactive analysis and real-time stream processing.

                    Batch processing    Interactive analysis           Stream processing
Query runtime       Minutes to hours    Milliseconds to minutes        Never-ending
Data volume         TBs to PBs          GBs to PBs                     Continuous stream
Programming model   MapReduce           Queries                        DAG
Users               Developers          Developers and analysts        Developers
Open source tools   Hadoop, Spark       Drill, Shark, Impala, HBase    Storm, Apache S4, Kafka
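To make the MapReduce programming model listed under batch processing a little more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop itself — frameworks like Hadoop run the same map, shuffle and reduce phases distributed across many machines over much larger data — but the shape of the computation is the same. The word-count task and input lines are just illustrative.

```python
# A plain-Python sketch of the MapReduce model: map emits (key, value)
# pairs, shuffle groups values by key, reduce combines each group.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values -- here, sum the counts.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data tools", "big data skills"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'skills': 1}
```

Because each phase only ever looks at one record (or one key's group) at a time, the work can be split across machines — which is what makes the model suit the TB-to-PB batch workloads in the chart.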