What Is Data Preparation and Why Is It Important?
Data is regularly described as the “new oil,” which is partly true: there’s good money to be made by those who use it in clever ways. But data is fundamentally unlike oil in that, in the right hands, it delivers more than profit: it provides insight and understanding.
Before a company or organization gains any understanding from their data, they must first organize it and make it ready for analysis. That’s where data preparation comes in.
What Is It?
In simple terms, data preparation is work that involves collecting, consolidating and “cleaning up” a collection of data prior to analyzing it. Data preparation is of greatest interest to parties that wish to:
- Combine data gathered from more than one source, such as reports, documents, live web pages and multiple cloud databases
- Correct problems and artifacts imported from “unstructured” sources like PDFs
- Bring order to non-standardized and unsorted data
- Find and resolve duplicates and inconsistencies, including combining “like” terms (e.g. “St.” and “Street”)
Put another way, data preparation involves gathering data from multiple sources, finding problems in the information and correcting them, and then repackaging the data for use by other applications, parties and analytics tools. When people say the world “runs on data,” what they really mean is that it runs on “ordered data.” Data preparation imposes that order by turning scattered or siloed information into something that can yield useful, actionable insights.
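To make the idea concrete, here is a minimal sketch in Python of combining two sources, standardizing “like” terms and dropping duplicates. The field names, the sample records and the `CANONICAL_TERMS` mapping are all hypothetical, chosen only for illustration. A real project would need a far more careful rule set.

```python
# Hypothetical mapping of "like" terms to one canonical spelling.
# Real data needs more care: "St." can also mean "Saint".
CANONICAL_TERMS = {
    "st": "Street", "st.": "Street", "street": "Street",
    "ave": "Avenue", "ave.": "Avenue", "avenue": "Avenue",
}


def normalize_address(address: str) -> str:
    """Collapse extra whitespace and swap known abbreviations for canonical terms."""
    words = address.split()
    return " ".join(CANONICAL_TERMS.get(word.lower(), word) for word in words)


def consolidate(*sources):
    """Merge records from several sources, skipping duplicates after normalization."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            cleaned = {
                # str.title() is a simplistic choice; it mangles names like "McDonald".
                "name": record["name"].strip().title(),
                "address": normalize_address(record["address"]),
            }
            key = (cleaned["name"].lower(), cleaned["address"].lower())
            if key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged


# Two made-up sources describing overlapping customers.
crm_export = [{"name": "Acme Corp", "address": "12 Main St."}]
web_form = [
    {"name": "acme corp ", "address": "12  Main Street"},
    {"name": "Globex", "address": "4 Elm Ave"},
]

customers = consolidate(crm_export, web_form)
for customer in customers:
    print(customer)
```

The two “Acme Corp” entries collapse into one record because they match once spacing, capitalization and the street abbreviation are normalized. Matching on an exact normalized key is the simplest possible rule; fuzzy matching is a common next step but is outside this sketch.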
Why Is It Important?
With the question “what is data preparation” answered, let’s focus on why it’s worth the bother. Above all, data preparation is the first step toward processing data and putting it to some useful analytical purpose.
One other part of the “data as oil” comparison that holds true is that companies often find themselves “sitting on” an untapped resource they didn’t know they had. By some estimates, as much as 73% of the institutional data available worldwide never has its analytical potential realized.
Clearly, companies and organizations don’t always know the value of the information they’ve already been gathering about things like:
- Equipment downtime and failure rates
- Webpage and keyword performance
- Product throughput in distribution hubs and warehouses
- Energy consumption by various processes and equipment
- Competitor and market research, including demand and pricing fluctuations
- Correlations and patterns in user behaviors and characteristics
Not all of the data a given company collects is useful. However, a great deal of it will never achieve usefulness because it’s disordered or hasn’t been made accessible by the department that compiled it.
Gathering, restructuring and correcting data from across an organization can be daunting. But it may unlock insights into how well your people and assets are performing and how likely different events are to disrupt your growth, and it can help pinpoint bottlenecks and growth opportunities.
Getting data preparation right takes the right approach, however. As calls grow from some quarters to more strictly regulate social media companies and other data brokers, data preparation is also an important transparency and compliance tool. Gathering and retaining business data, no matter how benign the intentions, can be lucrative, but it carries a constant threat of fines and regulatory scrutiny if care isn’t taken.
How to Engage in Data Preparation
There are essentially two ways to engage in data preparation: manually (what some refer to as “spreadsheet wrangling”) or with automation tools. Most companies will choose the latter, but you’ll have to judge the merits of such products for yourself.
No matter the broader mission, data preparation requires that stakeholders answer the following questions before beginning the process:
- What is the question we want to answer or the problem we need to solve?
- What data is most useful for answering this question?
- Is the data available? Where is it located?
By and large, data preparation projects follow similar blueprints and require similar degrees of cooperation from the parties handling the data, the parties using that data to reach business decisions, and the departments that generated the data.
Here’s what to know about each step in the process:
- Discovery: With or without the help of automation, the first step is to find the data that’s best suited to the analysis you need done or the problem you need solved.
- Cataloging: The discovery phase usually results in the creation of a data catalog detailing what data you have and where it came from. The catalog should be updated as new discoveries are made, to help users find the data again later.
- Cleansing and refining: This is where the data is purged of obvious errors that shouldn’t make it into the final data package.
- Blending and distillation: This is the stage where commonly substituted terms and duplicated entries are taken into account so they don’t cause abnormalities in the final data set. Distillation may involve applying custom data quality rules using automation.
- Documentation: Documentation is important in case other parties use the same data for new projects in the future. Metadata in the data catalog can include details on relationships between databases, definitions for technical and business terminology, source information, and a list of changes to the data during distillation and when they were implemented.
- Packaging and reformatting: This step is important because companies may use any number of tools and procedures for interacting with the data after it’s been prepared. The resulting data package should be ready for importation into other tools for visualization and further manipulation.
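As a rough illustration of how the later steps fit together, the Python sketch below cleanses a few rows, records each change for documentation, and packages the result as CSV. The dataset, the column names (`machine`, `downtime_hours`) and the validation rule are hypothetical assumptions rather than a prescribed method. Automation tools typically cover the same ground with far more sophistication.

```python
import csv
import io
from datetime import datetime, timezone


def cleanse(rows, change_log):
    """Drop rows whose downtime is missing, non-numeric or negative; log each removal."""
    kept = []
    for index, row in enumerate(rows):
        try:
            value = float(row.get("downtime_hours"))
        except (TypeError, ValueError):
            change_log.append({"row": index, "action": "dropped",
                               "reason": "downtime_hours is not a number"})
            continue
        if value < 0:
            change_log.append({"row": index, "action": "dropped",
                               "reason": "downtime_hours is negative"})
            continue
        kept.append({"machine": row.get("machine", "").strip(),
                     "downtime_hours": value})
    return kept


def package_csv(rows, fieldnames):
    """Serialize cleaned rows to CSV text that other tools can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up raw data, as it might arrive from a maintenance spreadsheet.
raw_rows = [
    {"machine": " Press A ", "downtime_hours": "3.5"},
    {"machine": "Press B", "downtime_hours": "n/a"},
    {"machine": "Press C", "downtime_hours": "-2"},
]

change_log = []
clean_rows = cleanse(raw_rows, change_log)
package = package_csv(clean_rows, ["machine", "downtime_hours"])

# Documentation: a catalog entry noting the source and what was changed, and when.
catalog_entry = {
    "dataset": "equipment_downtime",
    "source": "maintenance spreadsheet (hypothetical)",
    "prepared_at": datetime.now(timezone.utc).isoformat(),
    "changes": change_log,
}

print(package)
print(catalog_entry["changes"])
```

Keeping the change log alongside the packaged data means a later user can see which rows were removed and why, which is the point of the documentation step.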
Data Preparation Serves the Scientific Method
Businesses everywhere can leverage big data even on a small scale — but some may not know how to begin. For those companies, the answer is: “get organized.”
Analyzing business data is like any other use of the scientific method: It has to begin with high-quality information. Data preparation requires cooperation between departments, attention to detail, and a clear sense of the mission or problem being tackled. It also more than likely requires the right software to do well.
But getting it right means extracting “free” value from data you likely had already — including finding new answers to old questions and discovering new ways to reach customers, optimize performance, reduce waste, or realize any number of other business goals.