The secrets of designing and building big data apps
Software applications have traditionally been perceived as a unit of computation designed and used to solve a problem. Whether an application is a CRM tool that helps manage customer information or a complex supply-chain management system, the problem it solves is often rather specific. Applications are also frequently designed with a relatively static set of input and output interfaces, and communication to and from the application uses specially designed (or chosen) protocols.
Applications are also designed around data. The data that an application uses to solve a problem is stored using a data platform. This underlying data platform has historically been designed to enable optimal data storage and retrieval. Somewhere in the process of storing and retrieving data, the application applies computation to produce its results.
One unfortunate side effect of this optimized storage and retrieval design is that it requires data to be structured in a predefined way (both on disk and during information design and retrieval). In the world of big data, applications must draw on rigidly structured elements, such as names, addresses, quantities, and birthdays, as well as on loose, unstructured data such as images and free-form text.
Defining and building a big data application can be perplexing given the lack of rigidity in the underlying data. This lack of structure makes it more difficult to precisely define what a big data application will do. This applies to communication interfaces, computation on unstructured or semi-structured data and even communication with other applications.
While the traditional application may have solved a specific problem, the big data application doesn’t limit itself to a highly specific or targeted problem. Its objective is to provide a framework to solve many problems. A big data application manages the life cycles of data in a pragmatic and predictable way. Big data applications may include a batch (or high-latency) component, a low-latency (or real-time) component, or even an in-stream component. Big data applications do not replace traditional single-problem applications, but complement them.
Let’s use a CRM tool as an example. The traditional CRM tool might store information about customers, their purchase history and their customer loyalty level. Given a finite resource such as a customer call center, during peak loads the CRM must determine which customers should receive standard service and which should receive priority service. Typically, higher-loyalty customers will receive the priority service, with the levels of loyalty usually being pre-determined. Those levels might be driven by spend, spend ranges, or other rules, but the determination is dependent on the data, which is typically rigidly structured.
However, if the CRM tool has the ability to predict whether a given customer, even if she is not within the pre-determined loyalty range, is exhibiting behaviors known to lead to high loyalty, it would be able to make a smarter decision on how to prioritize resources and suggest prioritizing her call.
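To make the contrast concrete, here is a minimal Python sketch of the two approaches side by side. Everything in it is an assumption for illustration: the tier names, the spend thresholds, the `predicted_loyalty` score (imagined as the output of some behavioral model) and the 0.8 cutoff do not come from any particular CRM product.

```python
from dataclasses import dataclass

# Hypothetical spend ranges for pre-determined loyalty tiers (illustrative values only).
TIER_THRESHOLDS = [(10_000, "gold"), (2_000, "silver"), (0, "standard")]


@dataclass
class Customer:
    name: str
    annual_spend: float       # rigidly structured field
    predicted_loyalty: float  # hypothetical model score in [0, 1] from behavioral data


def loyalty_tier(customer: Customer) -> str:
    """Traditional rule: the tier depends only on structured spend data."""
    for minimum, tier in TIER_THRESHOLDS:
        if customer.annual_spend >= minimum:
            return tier
    return "standard"


def gets_priority(customer: Customer, score_cutoff: float = 0.8) -> bool:
    """Prioritize gold-tier customers, plus anyone whose predicted score clears the cutoff."""
    return loyalty_tier(customer) == "gold" or customer.predicted_loyalty >= score_cutoff


callers = [
    Customer("Avery", annual_spend=12_500, predicted_loyalty=0.40),
    Customer("Blake", annual_spend=900, predicted_loyalty=0.91),
    Customer("Casey", annual_spend=1_500, predicted_loyalty=0.15),
]
priority_queue = [c.name for c in callers if gets_priority(c)]
print(priority_queue)  # ['Avery', 'Blake']
```

Blake falls outside the pre-determined spend range, yet the predicted score moves her call into the priority queue, which is the kind of decision a purely rule-based tool would miss.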