5 Reasons to Choose Parquet for Spark SQL
It is well-known that columnar storage saves both time and space when it comes to big data processing. Parquet, for example, is shown to boost Spark SQL performance by 10X on average compared to using text, thanks to low-level reader filters, efficient execution plans, and in Spark 1.6.0, improved scan throughput!
To see the power of Parquet, we pick 24 TPC-DS-derived queries (out of 99 because some queries simply fail to run with flat CSV data files at 1TB scale factor. More on this below.) from the spark-perf-sql test kit for comparison. The queries represent all categories in TPC-DS: reporting, ad hoc, iterative and data mining. We also make sure to include both short (queries 12 and 91) and long-running queries (queries 24a and 25), and queries known to use 100% of CPU (query 97).
We use a 6-node on-prem Cisco UCS cluster similarly configured per Cisco Validated Designs. We tune the underlying hardware so there is no network or disk IO bottlenecks in any tests. The focus is to find differences in performance running these queries simply with text and Parquet storage formats in Spark 1.5.1 and in the just-released Spark 1.6.0. The total Spark working memory is 500GB. The TPC-DS scale factor is 1TB.
- Spark SQL is much faster with Parquet!
The chart below compares the sum of all execution times of the 24 queries running in Spark 1.5.1. Queries taking about 12 hours to complete using flat CVS files vs. taking less than 1 hour to complete using Parquet, a 11X performance improvement.
- Spark SQL works better at large-scale with Parquet
Poor choice of storage format often cause exceptions that are difficult to diagnose and fix. At 1TB scale factor for example, at least 1/3 of all runnable queries failed to complete using flat CSV files, but they all completed using Parquet files.
- Less disk IO
Parquet with compression reduces your data storage by 75% on average, i.e., your 1TB scale factor data files will materialize only about 250 GB on disk. This reduces significantly input data needed for your Spark SQL applications. But in Spark 1.6.0, Parquet readers used push-down filters to further reduce disk IO. Push-down filters allow early data selection decisions to be made before data is even read into Spark.
- Higher scan throughput in Spark 1.6.0
The Databricks’ Spark 1.6.0 release blog mentioned significant Parquet scan throughput because a “more optimized code path” is used. To show this in real world, we ran query 97 in Spark 1.5.1 and in 1.6.0 and captured nmon data. The improvement is very obvious.
- Efficient Spark execution graph
In addition to smarter readers such as in Parquet, data formats also directly impact Spark execution graph because one major input to the scheduler is RDD count. In our example where we run the same query 97 on Spark 1.5.1 using text and Parquet, we got the following execution pattern for the stages. Read more