AI and Machine Learning were two of 2017’s hottest technological buzzwords. It’s not difficult to understand why: the potential benefits of these technologies are exciting and profound. But artificial intelligence and machine learning both rely on other foundational technologies to achieve the results they promise. Consequently, innovation within the realm of AI is constrained by the limitations of those other technologies.

Access to high-quality, usable data is one factor that has significant implications for the development of AI. Even as AI is enjoying its moment in the spotlight, innovation within the realm of Big Data is becoming more crucial than ever for the continued development of AI technologies.

Data Integrity in the Third Wave of AI

The history of AI development can be divided into three distinct waves. First-wave AI was characterized by optimization and “knowledge engineering” programs, which took real-world problems and found efficient solutions. Second-wave AI was characterized by machine learning programs that automated pattern recognition based on statistical probabilities. We have now entered the third wave of AI: hypothesis generation programs, or “contextual normalization.” Third-wave AI programs can examine huge datasets, identify statistical patterns, and create algorithms that explain why those patterns exist.

In recent years, AI programs have taken significant leaps in their ability to analyze patterns in complex datasets and to generate novel insights, including insights that escape human analysts. When IBM Watson defeated its human competitors on Jeopardy!, it did so through advanced natural language processing and a remarkable breadth of general knowledge.

Pharmaceutical companies such as Johnson & Johnson and Merck & Co. have begun to invest in similar third-wave AI technologies in order to gain an advantage over their rivals. The adoption of such technologies by pharmaceutical companies has led to significant discoveries, such as the link between Raynaud’s disease and fish oil. AI also has the potential to dramatically accelerate the drug development process by reducing costly and time-consuming errors.

Of course, AI has also suffered several highly publicized failures. Many of these failures, such as MD Anderson’s problems with IBM Watson, stem from one glaring issue within the field of AI: dataset integrity. In the Watson example, things went wrong when MD Anderson changed its electronic medical record provider and Watson could no longer access the data it needed.

A deeper look at the issue of dataset integrity can reveal key insights for the future of AI development and implementation.

It All Depends on the Data

It doesn’t matter how advanced AI and machine learning algorithms become if they can’t access the data necessary to conduct analysis and generate insights.

Life science datasets are notoriously incomplete and difficult to work with, owing to the remarkable depth, density, and diversity of biological data. Consequently, biological research has relied heavily on manually curated datasets that must be created and cleaned in order to test manually conceived hypotheses. The labor involved in this highly manual process has driven up research costs, and with them the cost of biomedical products such as vaccines and biotechnology-based therapies. Its time-consuming nature means that by the time conclusions are published in academic journals, they may already be obsolete.

By creating and analyzing biological datasets in this slow, inefficient, and error-prone way, researchers have inadvertently created a huge problem of publication bias and inaccuracy in medical science data.

Biased and flawed datasets were a problem for first- and second-wave AI programs, but third-wave AI software suffers most from these limitations. Consider, for example, the issue of abbreviations and acronyms in medical terminology. A single acronym often has several meanings: depending on its context, “Ca” can mean either “cancer” or “calcium.” Third-wave AI programs rely on complex contextual information in order to perform, and messy, manually curated datasets reduce their effectiveness.
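To make the “Ca” example concrete, here is a minimal, purely illustrative sketch of context-based acronym disambiguation in Python. The cue-word lists, the disambiguate_ca helper, and the rule-based approach are assumptions for illustration only; real third-wave systems would rely on trained language models rather than hand-written rules.

```python
# Purely illustrative, rule-based disambiguation of the acronym "Ca".
# The cue-word lists and the disambiguate_ca() helper are hypothetical.

CONTEXT_CUES = {
    "cancer": {"tumor", "oncology", "malignant", "metastatic", "biopsy"},
    "calcium": {"serum", "mmol", "electrolyte", "bone", "hypercalcemia"},
}

def disambiguate_ca(sentence: str) -> str:
    """Guess whether 'Ca' means 'cancer' or 'calcium' from nearby words."""
    words = set(sentence.lower().split())
    scores = {meaning: len(words & cues) for meaning, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    # Without any contextual cue, the acronym stays ambiguous.
    return best if scores[best] > 0 else "ambiguous"

print(disambiguate_ca("Serum Ca was 2.3 mmol per litre"))        # calcium
print(disambiguate_ca("Ca of the lung with metastatic spread"))  # cancer
```

Even this toy example shows why clean, context-rich data matters: strip away the surrounding words and the acronym cannot be resolved at all.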

A Change in Data

The HITECH Act of 2009 ushered in the era of ubiquitous electronic medical record systems. As a result, rich datasets now exist that contain real-time, comprehensive biological information. These new datasets are being combined with data from biological patents, clinical trials, legislative bodies, academic theses, and other sources within the innovation ecosystem, creating complex pools of biological data.

Until recently, this wealth of unstructured data was useful to computer programs only after a great deal of human effort to clean and organize it. But AI is now sufficiently advanced to parse and analyze heterogeneous data using algorithms that combine machine learning, natural language processing, and advanced text analytics. We have gone from a world of outdated, incomplete, and inaccessible data to a new paradigm in which AI can structure previously unstructured data for real-time analysis and contextual normalization.
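As a rough illustration of what “structuring previously unstructured data” can look like, here is a small Python sketch that turns one sentence of free text into a structured record. The regular expressions, field names, and the TrialRecord class are hypothetical stand-ins; a pipeline of the kind described above would use trained NLP models rather than hand-written patterns.

```python
# Hypothetical sketch: extracting a structured record from free text.
# The regexes and field names are illustrative, not a production pipeline.
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialRecord:
    drug: Optional[str]
    condition: Optional[str]
    phase: Optional[str]

DRUG_RE = re.compile(r"\b([A-Z][a-z]+(?:mab|nib|stat|pril))\b")  # crude drug-name suffixes
PHASE_RE = re.compile(r"\bphase\s+(I{1,3}|IV)\b", re.IGNORECASE)
CONDITION_RE = re.compile(r"\bfor\s+([a-z ]+?)(?:\.|,|$)", re.IGNORECASE)

def structure(text: str) -> TrialRecord:
    """Turn one sentence of unstructured text into a structured record."""
    drug = DRUG_RE.search(text)
    phase = PHASE_RE.search(text)
    condition = CONDITION_RE.search(text)
    return TrialRecord(
        drug=drug.group(1) if drug else None,
        condition=condition.group(1).strip() if condition else None,
        phase=phase.group(1).upper() if phase else None,
    )

print(structure("Pembrolizumab entered a phase III trial for advanced melanoma."))
# TrialRecord(drug='Pembrolizumab', condition='advanced melanoma', phase='III')
```

The point is not the patterns themselves but the shape of the transformation: heterogeneous text goes in, queryable structure comes out, which is what makes real-time analysis possible.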

Third-wave AI gives us clean, centralized data that reflects the complexity of biological systems. By parsing this data, we can achieve deep insights into the current biomedical landscape.

This post is by Gunjan Bhardwaj, Founder and CEO of Innoplexus, a leader in AI, machine learning, and analytics as a service for healthcare, pharma, and the life sciences. Before founding Innoplexus, he was with the Boston Consulting Group; prior to that, he led Ernst & Young’s global business performance think-tank and served as a manager in its German practice, focusing on strategy and innovation.