Yahoo SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams and applying distributed machine learning algorithms. You can think of SAMOA as Mahout for streaming.

SAMOA (Scalable Advanced Massive Online Analysis) is a framework for mining big data streams. As most of the big data ecosystem, it is written in Java. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm and S4. SAMOA includes distributed algorithms for the most common machine learning tasks such as classification and clustering. For a simple analogy, you can think of SAMOA as Mahout for streaming.

yahoo-samoaYahoo SAMOA

SAMOA is both a platform and a library. As a platform, it allows the algorithm developer to abstract from the underlying execution engine, and therefore reuse their code to run on different engines. It also allows to easily write plug-in modules to port SAMOA to different execution engines.

As a library, SAMOA contains state-of-the-art implementations of algorithms for distributed machine learning on streams. The first alpha release allows classification and clustering.

For classification, we implemented a Vertical Hoeffding Tree (VHT), a distributed streaming version of decision trees tailored for sparse data (e.g., text). For clustering, we included a distributed algorithm based on CluStream. The library also includes meta-algorithms such as bagging.

To get started,

  • Download SAMOA
  • Download the Forest CoverType dataset
  • Download a simple logging library
  • Run an Example: Classifying the CoverType dataset with the VerticalHoeffdingTree in local mode. Read more