Big Data Is Too Big for Scientists to Handle Alone
Seven years ago, when David Schimel was asked to design an ambitious data project called the National Ecological Observatory Network, it was little more than a National Science Foundation grant. There was no formal organization, no employees, no detailed science plan. Emboldened by advances in remote sensing, data storage and computing power, NEON sought answers to the biggest question in ecology: How do global climate change, land use and biodiversity influence natural and managed ecosystems and the biosphere as a whole?
“We don’t understand that very well,” Schimel said.
Splitting his time at first between the new project and his role as a senior scientist at the National Center for Atmospheric Research, Schimel said he was surprised by the magnitude of the challenge, by the “sheer number of different measurements required to address the key science questions.” Before any observatories could be erected or staff members hired, decisions had to be made about where to take measurements, what to measure, how to measure it and how to generate meaningful data.
Schimel began to explore site options across the country and to assemble NASA-inspired “tiger teams” that could develop rigorous scientific methodologies and data-processing requirements. The final plan called for hiring dozens of scientists with disparate backgrounds; building more than 100 data-collection sites across the continental United States, Alaska, Hawaii and Puerto Rico; recording approximately 600 billion raw measurements per year for 30 years; and converting the raw data into more user-friendly “data products” to be made freely available to scientists and the public. Building the observatory network is projected to take four more years and cost $434 million, and millions more will be needed to cover annual operating expenses.
In 2007, Schimel became NEON’s chief scientist and first full-time employee. “I’ve been interested in processes at the continental scale for a long time and it’s always been a data-starved activity,” he said. “The opportunity to actually design a system to collect the right data at that scale was irresistible.”
Across the sciences, similar analyses of large-scale observational or experimental data, dubbed “big science,” offer insights into many of the greatest mysteries. What is dark matter, and how is it distributed throughout the universe? Does life exist, or is it capable of existing, on another planet? What are the connections between genetic markers and disease? How will the Earth’s climate change over the next century and beyond? How do neural networks form thoughts, memories and consciousness?
Much of the recent data frenzy, from the physical and life sciences to the user-generated content aggregated by Google, Facebook and Twitter, has come in the form of largely unstructured streams of digital potpourri that require new, flexible databases, massive computing power and sophisticated algorithms to wring bits of meaning from them, said Matt LeMay, a former product manager at the URL shortening and bookmarking service Bitly.