Apache Hadoop is commonly used for data analysis: it is fast for data loads and it scales well. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format) and run reporting on top of that. For the examples below I will use the “ontime flight performance” data from my previous post (Increasing MySQL performance with parallel query execution). I’ve used Cloudera Manager v.4 to install Apache Hadoop and Impala. For this test I have intentionally used old hardware (servers from 2006) to show that Hadoop can utilize old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

Purpose: Namenode, Hive metastore, etc. + datanodes
Server specs: 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16GB RAM, RAID 10 with 8 SAS drives

Purpose: Datanodes only
Server specs: 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8GB RAM, single 4TB drive

As you can see, those are pretty old servers; the only thing I’ve changed is adding a 4TB drive to be able to store more data. Hadoop provides redundancy on the server level (it writes 3 copies of the same block across the datanodes), so we do not need RAID on the datanodes (we do still need redundancy for the namenode, though).

Data export

There are a couple of ways to export data from MySQL to Hadoop. For the purpose of this test I have simply exported the ontime table into a text file with:

select * into outfile '/tmp/ontime.psv'
FIELDS TERMINATED BY ','
from ontime;

(You can use “|” or any other symbol as a delimiter.) Alternatively, you can download the data directly from the www.transtats.bts.gov site using this simple script:
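The original script is not included here, but a minimal sketch of such a download loop could look like the following. The URL pattern and the year/month range are assumptions based on how the BTS publishes the On_Time_On_Time_Performance archives, so adjust them to the current site layout:

#!/bin/bash
# Sketch only: the download URL pattern and the year range are assumptions,
# not the exact script from the original post.
for y in {1988..2013}; do
  for m in {1..12}; do
    wget --no-check-certificate "http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_${y}_${m}.zip"
    unzip -o "On_Time_On_Time_Performance_${y}_${m}.zip"
  done
done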

Load into Hadoop HDFS

The first thing we need to do is load the data into HDFS as a set of files. Hive or Impala will work with a directory to which you have imported your data and will concatenate all the files inside this directory. In our case it is easiest to simply copy all our files into a directory inside HDFS.

$ hdfs dfs -mkdir /data/ontime/
$ hdfs dfs -copyFromLocal On_Time_On_Time_Performance_*.csv /data/ontime/
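To double-check that the files landed where expected, the directory can be listed; the output should simply show the copied CSV files:

$ hdfs dfs -ls /data/ontime/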

Create external table in Impala

Now that we have all the data files loaded, we can create an external table:

CREATE EXTERNAL TABLE ontime_csv (
  YearD int,
  Quarter tinyint,
  MonthD tinyint,
  DayofMonth tinyint,
  DayOfWeek tinyint,
  FlightDate string,
  UniqueCarrier string,
  AirlineID int,
  Carrier string,
  TailNum string,
  FlightNum string,
  OriginAirportID int,
  OriginAirportSeqID int,
  OriginCityMarketID int,
  Origin string,
  OriginCityName string,
  OriginState string,
  OriginStateFips string,
  OriginStateName string,
  OriginWac int,
  DestAirportID int,
  DestAirportSeqID int,
  DestCityMarketID int,
  Dest string,
  ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/ontime';

Note the “EXTERNAL” keyword and the LOCATION clause (LOCATION points to a directory inside HDFS, not a file). Impala will only create the metadata (it will not modify the data files). We can query this table right away; however, Impala will need to scan all the files (a full scan) for every query.
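As a quick illustration (this exact query is not from the original post, and any of the loaded columns would do), a simple aggregate can be run from impala-shell against the CSV-backed table; expect a full scan of all the text files:

$ impala-shell
> SELECT yeard, count(*) FROM ontime_csv GROUP BY yeard;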