Social networks generate colossal amounts of data that defy conventional data-processing tools, so it's no wonder their engineering teams have built their own toolsets, as Facebook has with its machine-learning tools.

Enter LinkedIn, now offering its own Apache-licensed, open-sourced data-processing solution: Pinot, a real-time analytics engine and datastore, designed to run at scale. Yes, Hadoop is one of its data sources, providing yet another option for those looking to perform SQL-style queries.

LinkedIn’s own OLAP

As originally discussed by LinkedIn’s engineers late last year, Pinot was designed to provide the company with a way to ingest “billions of events per day” and serve “thousands of queries per second” with low latency and near-real-time results — and provide analytics in a distributed, fault-tolerant fashion.

The original system was assembled from a congeries of existing pieces (an Oracle database here, a Project Voldemort key-value store there), but LinkedIn found the amount of data ingested was too great for solutions that weren't designed for OLAP-style jobs in the first place.

Like many other data-processing frameworks that live in or near Hadoop, Pinot is written in Java. It uses Apache Helix — also developed at LinkedIn — to perform cluster management. Real-time data comes in by way of Kafka, with historical data fetched from Hadoop.
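To give a rough sense of the real-time path, here is a minimal sketch of publishing events to a Kafka topic that a Pinot real-time table could then consume. It uses Kafka's standard Java producer API; the topic name, broker address, and event schema are hypothetical, not taken from LinkedIn's setup.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPublisher {
    public static void main(String[] args) {
        // Standard Kafka producer configuration; the broker address is illustrative.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical event: a JSON page-view record keyed by member ID.
            // A Pinot real-time table configured against this topic would pick up
            // events like this as they arrive, while older data comes from Hadoop.
            String event = "{\"memberId\": 42, \"page\": \"/jobs\", \"timestamp\": 1433116800000}";
            producer.send(new ProducerRecord<>("pageViews", "42", event));
        }
    }
}
```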

Some sacrifices were made

With querying, Pinot shows some of its limitations — although most are deliberate design decisions, reflecting Pinot’s focus on the specific conditions for which LinkedIn created it.

For instance, the SQL-like query language used with Pinot does not have the ability to perform table joins, “in order to ensure predictable latency” (according to LinkedIn’s engineers). There’s truth to this, since SQL-on-Hadoop solutions have been known to suffer from poor performance if they attempt to perform joins between data stored in highly disparate places. Full-text search and relevance ordering for results also aren’t supported.
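To illustrate the shape of that trade-off, the sketch below contrasts the kind of single-table aggregation such an engine is built for with the kind of join it deliberately omits. The table and column names are hypothetical, and the syntax is meant to be representative of a SQL-like dialect rather than verbatim Pinot query language.

```java
public class ExampleQueries {
    // In scope: a single-table aggregation with a filter and group-by.
    // Table and column names here are hypothetical.
    static final String AGGREGATION =
        "SELECT count(*) FROM pageViews "
      + "WHERE country = 'US' "
      + "GROUP BY page";

    // Out of scope: the query language omits joins to keep latency predictable,
    // so data like this would have to be denormalized ahead of time
    // (for example, in a Hadoop job) or answered with separate queries.
    static final String JOIN_NOT_SUPPORTED =
        "SELECT p.page, m.industry FROM pageViews p "
      + "JOIN members m ON p.memberId = m.memberId";
}
```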

Finally, data is strictly read-only — although given the number of other SQL-for-big-data solutions that work the same way, this won’t likely be a major letdown.

A fairly vertical solution

Each SQL-on-Hadoop solution has so far addressed a slightly different set of needs — some for real-time queries (Spark SQL), some for historical data (Hive), some to emulate as much of SQL’s existing behavior as possible without sacrificing performance (Stinger). Pinot is similarly narrow in focus, given that it was built to scratch LinkedIn’s specific itches.

With the project going open source, though, LinkedIn clearly hopes it can scratch other people's itches as well, especially if existing SQL-for-Hadoop and real-time-data solutions don't cut it. It's less clear whether LinkedIn wants Pinot to follow in the footsteps of other Hadoop projects and eventually become Apache-governed, although the project's Apache license would make such a transition a snap.