Stefan Müller, Director Business Intelligence & Big Data, it-novum

Big Data seems to have rendered the traditional data warehouse concept obsolete, being too slow, not flexible enough and having poor analytics. However, this article shows that – implemented in in the right way – Hadoop, NoSQL and analytical databases can be very well used in conjunction with data warehouses.

Big Data, MapReduce and NoSQL (Not only SQL) may make the data warehouse concept which was developed in the Eighties look out of date.Sure, it has established itself as a proven solution for both data integration and deployment. However, facing the ever-growing volumes of data and the much higher demands that are placed on data analysis many have started to think that it is becoming increasingly unsuited for big data analytics.

State of the art in the Eighties

But first, let’s go a little back in time. About 20 years ago, the data warehouse was introduced asa new concept for gathering, integrating and analyzing data from different company sources. This information could then be used to define or fine-tune a company’s strategic decisions. Since then, not much has changed. Data warehouses combine and analyze data from different sources, using extraction, transformation and loading processes (ETL). Within a data warehouse, departments and divisions can have their own data martsto fulfill their demand for special analyses. As a rule, these data marts are multi-dimensional, which makes them optimally suited for analytic purposes.

Relational databases are mature business tools

Both data warehouses and data marts are used as central databases for a company’s analytics and reporting needs. Even though big data hasn’t really changed this, they still have a major drawback. Data warehouses are based on relational databases. These RDBMS are sophisticated, mature, extremely reliable systems that can be used with many front-ends and can be queried using a powerful language, SQL. Their potential weakness, however, lies with their lack of scalability when dealing with extremely large data volumes: Commercial database software tends to become quite expensive due to licensing costs while performance decreases. This in turn slows down ETL processes and querying. For large data structures, extending the database schema becomes a difficult task. Also, the frequency of data generation and storage will suffer. Another challenge is posed by unstructured data. Relational databases quickly reach their limits because they are just not designed for this kind of data.

Different technological approaches have emerged to overcome these limitations. Some of them I will consider below.

New players: analytical databases

Analytical databases present an easy way to extend data warehouses. Although they are based on relational databases, they are optimized for fast querying. This makes them especially suitable for large volumes of data as well as for analytic applications. Evaluation is done with the help of either SQL or BI software, tools that can be used by management accountants and business analysts.

Analytical databases are column-oriented and make use of massive parallel processing (MPP), data compression and in-memory storage. In practice, this means that

–          they can be used for data, where processing would otherwise be very slow or storage too expensive

–          SQL can be used for querying

–          it is possible to use BI front-ends

–          implementation is fast, with little administrative effort

–          companies are not required to invest in complex hardware architecture

Using data marts, analytical databases can easily be combined with data warehouses. Another alternative would consist of loading data warehouse data into the data marts, which would access the integration layer. This approach, however would involve a lot of effort for modeling and development.

Adding NoSQL Databases

NoSQL databases such as MongoDB or Cassandra are open source, schema-free, distributed, horizontally scalable and non-relational. With these features, they help overcome the traditional weaknesses of relational databases, mainly concerning scalability and performance. With the addition of more servers, the processing of large amounts of data becomes relatively cheap. Reliability can be increased by replicating the data between servers. It is also easier to implement schema extensions. As an added bonus, their simpler schemas are easier to adapt and customize due to their flexibility.

Although NoSQL systems offer many benefits when working with Big Data, their query languages are no match for SQL (the language used by BI applications). NoSQL databases normally do not offer direct SQL access. Some of these systems come with special interfaces, allowing for report creation directly from within the database. However, compared to SQL, their functionality is quite limited. Still, there are many uses for NoSQL databases in combination with traditional data warehouses: While the advantages and benefits of NoSQL come to play in the application databases, for analyses the data is loaded into the data warehouse, where relational databases can play to their strengths.

One and one makes three: Hadoop and data warehouses

Traditional data warehouse technology does not deal well with very large amounts of data, which drives operational costs up. Especially unstructured data tends to pose a problem. For this reason, Hadoop was developed. The tool’s architecture makes it extremely scalable and it is able to process even very large amounts of data without any difficulty. This is why Hadoop is especially well suited for batch processing. It also has a very good price-performance ratio, since it runs on standard hardware.

Hadoop is not a database but a software framework based on HDFS and MapReduce. It is both a storage platform and a tool for the large-scale processing of data-sets on clusters of commodity hardware. The system offers basic data warehouse functionality such as aggregation and sum and average operations.

Combined with a data warehouse, Hadoop becomes even more useful. After processing in Hadoop, the result can be stored in a data warehouse or a data mart, where all the advantages of a data warehouse platform come into play. The raw data itself is stored in Hadoop. While there are tools being developed to access Hadoop data through SQL, the scenario described here (combination of Hadoop with traditional data warehouse technology) is currently the approach which makes the most sense.

Big DataTechnology: Bringing it all together

Let’s look at an example where these different technologies come together, each with their own strengths: A company has a website which generates a lot of traffic. It would like to collect all that data in order to analyze visitor behavior and to gain insights into their needs. For a scenario like this, the combined approach is ideal:

The raw data for the analysis comes from the web server logs. Since NoSQL databases are very efficient in dealing with large amounts of log data, the data from the web application is best stored exactly there. This also makes it easy to continually add new data objects.

In our example hundreds of millions new records are generated every day. Traditional data warehouses would struggle to process such amounts of data in a timely fashion. That is why they are stored in Hadoop, allowing for powerful batch processing. At storage time, the log data is compacted into pages, hosts or hours. It is also enriched with data from third party sources.

Next, the data is aggregated and loaded into the data warehouse/the analytical database. Here, SQL and other technologies may be leveraged to perform powerful query operations, such as joins, grouping, complex filtering or online analytical processing (OLAP).

Breathing life into old technology – Data warehouses in coexistence with Big Data stores

With this example I want to show that even in times of Hadoop, MapReduce and the likes the data warehouse concept is more relevant than ever. The only challenge is to extend the concept so that its weaknesses are compensated for. Rather than dismissing the proven approach of data warehouses, combining old and new technology will lead to better and quicker data analysis as well as solid data management and storage.

About the author:

Stefan Müller_PorträtStefan Müller is Director Business Intelligence & Big Data at German-based it-novum . Having worked in Governance & Management Accounting and Sourcing Management for many years, Stefan’s job at it-novum deals with open source BI solutions for business applications. He is a frequent speaker at workshops and conferences and regularly publishes in leading media.