nosql to hadoopDespite the increasing interest in unstructured data, much of the world’s information still lives in some form of relational database.  Firing up your first Hadoop cluster often means moving data from existing SQL tables, maybe hosted in SQL Server, MySQL or PostgreSQL, into HDFS or Hive so it can be accessed by all of that MapReduce goodness.  Towards making that easier, there’s a cool Apache project called Sqoop that allows you to import this data in a streamlined, flexible way that can leverage the parallelism in your cluster using mappers.

There are some factors that affect how fast Sqoop can move data across the network (as my colleague recently observed with SnapMirror).  It turns out that moving data this way can really start to slog if are moving data between sites or clouds.  Here are three things you can do to speed things up:

#1:  Compare the use of direct import vs. JDBC.   Sqoop provides two ways of importing data for most SQL implementations:  using the well-known JDBC interface or doing a direct dump over a TCP connection.  Looking at an example with a 450MB CSV file, If we do the transfer in PostgreSQL and add 80ms of latency on the link, the direct method is clearly faster.  However, there are some situations (like with HSQLDB or with really large tables) where you’re out of luck.

If we set Sqoop to use 4 mappers and do the import with JDBC, with Cascade Pilot we can see that 4 TCP connections are open and the database traffic is spread across all of them.

#2:  Take a close look at the number of parallel connections.  If you are using the JDBC interface, adding more mappers (with the -m option) can increase throughput, but there are diminishing returns.  For example if we double the mappers in the above transfer to 8, we only get a 12% reduction in the time it takes to do the transfer:

#3: Riverbed can help.  If you already have Riverbed in your network or are considering adding some mojo, there are a few ways we can help:

Path Selection, new in RiOS 8.5, can put Sqoop traffic into a separate network path, freeing up additional bandwidth where it is needed.  This is easy to do with either import method, and can be part of a WAN QoS stratgy.

Steelheads can both improve  the network resources required for Sqoop imports, even for large databases and between clouds.  Here’s a screenshot of RiOS optimizing several Sqoop transfers to TCP port 5432:

The Riverbed Performance Management solution can show you what’s happening between the Hadoop cluster and other parts of the enterprise — including drill down into specific application tiers, traces and network connections with just a couple of clicks.

These are just a few things you can do to turbocharge moving SQL into Hadoop.  With more enterprises embracing multiple clouds and Hadoop clusters starting to get really big, network performance is a key component of getting better results. By Nick Amato source


Subscribe to our Newsletter

Get The Free Collection of 60+ Big Data & Data Science Cheat Sheets. Stay up-to-date with the latest Big Data news.