3 Ways to Move SQL into Hadoop Faster

There are some factors that affect how fast Sqoop can move data across the network (as my colleague recently observed with SnapMirror). It turns out that moving data this way can really start to slog if are moving data between sites or clouds. Here are three things you can do to speed things up:
#1: Compare the use of direct import vs. JDBC. Sqoop provides two ways of importing data for most SQL implementations: using the well-known JDBC interface or doing a direct dump over a TCP connection. Looking at an example with a 450MB CSV file, If we do the transfer in PostgreSQL and add 80ms of latency on the link, the direct method is clearly faster. However, there are some situations (like with HSQLDB or with really large tables) where you’re out of luck.
If we set Sqoop to use 4 mappers and do the import with JDBC, with Cascade Pilot we can see that 4 TCP connections are open and the database traffic is spread across all of them.
#2: Take a close look at the number of parallel connections. If you are using the JDBC interface, adding more mappers (with the -m option) can increase throughput, but there are diminishing returns. For example if we double the mappers in the above transfer to 8, we only get a 12% reduction in the time it takes to do the transfer:
#3: Riverbed can help. If you already have Riverbed in your network or are considering adding some mojo, there are a few ways we can help:
Path Selection, new in RiOS 8.5, can put Sqoop traffic into a separate network path, freeing up additional bandwidth where it is needed. This is easy to do with either import method, and can be part of a WAN QoS stratgy.
Steelheads can both improve the network resources required for Sqoop imports, even for large databases and between clouds. Here’s a screenshot of RiOS optimizing several Sqoop transfers to TCP port 5432:
The Riverbed Performance Management solution can show you what’s happening between the Hadoop cluster and other parts of the enterprise — including drill down into specific application tiers, traces and network connections with just a couple of clicks.
These are just a few things you can do to turbocharge moving SQL into Hadoop. With more enterprises embracing multiple clouds and Hadoop clusters starting to get really big, network performance is a key component of getting better results. By Nick Amato source






