by Gayle Eherenman, Associate Editor
Simulating
natural phenomena, mapping the human genome,
and discovering ways to improve product quality all have one thing in
common: They generate tremendous amounts of data.
Having terabytes of data at your disposal greatly increases the chances
that you can find the answer to even the toughest
questions, if you don't mind searching for a needle in a
giant digital haystack.
The method of choice for an increasing number of scientists and engineers is data
mining. Data mining draws upon extensive work in areas such as statistics,
machine learning, pattern recognition, databases, and high-performance
computing to discover interesting and previously unknown information in
data.
[Image caption: NASA scientists are using satellite data gathered from remote locations (such as Kilimanjaro) to discover changes in the global climate system.]
More specifically, data mining is the analysis of often-large data sets
to find relationships and patterns that aren't readily apparent,
and to summarize the data in new and useful ways.
Data mining technology has enabled earth scientists from NASA to discover
changes in the global carbon cycle and climate system, and biologists
to map and explore the human genome.
"We can create and collect complex data at tremendous rates, but
unless we can manage and analyze that data, it's useless,"
said Ron Musick, formerly the lead researcher on Lawrence Livermore National
Laboratory's Large-Scale Data Mining Pilot Project in Human Genome.
"To maximize the value of this data, we have to be able to access
it and manipulate it as structured, coherent, integrated information.
Data mining helps automate the exploration and characterization of the
data being generated, so the scientist or engineer is free to concentrate
on interpreting and using the data."
Musick is currently employed as the lead scientist at
iKuni Inc., a machine-learning start-up focused on behavior capture for
the entertainment industry. As a member of Lawrence Livermore's
Center for Applied Scientific Computing, he was the group leader for the
Data Science Group, and the project lead for the DataFoundry project,
which is an ongoing research effort to improve scientists' interactions
with large data sets.
Watching the Weather
Scientists working with NASA Ames Research Center in Moffett Field, Calif.,
are using satellite data to develop a detailed global picture of the interplay
of natural disasters, human activities, and the rise of carbon dioxide
in the Earth's atmosphere over the past 20 years.
"The new results come from data mining, which sorts through a huge
amount of satellite and scientific data to detect patterns and events
that otherwise would be overlooked," said Vipin Kumar, professor
of computer science and engineering at the University of Minnesota, and
director of the Army's High Performance Computing Center. Kumar
was the principal investigator on a recently ended joint project among
the University of Minnesota; California State University-Monterey Bay
in Seaside; Boston University; and NASA Ames to develop data mining techniques
to help earth scientists discover changes in the global carbon cycle and
climate system.
"Our goal was to explore teleconnections, to figure out what causes
disturbances in the global carbon cycle," Kumar said. Teleconnections
are atmospheric interactions between widely separated regions that have
been identified through statistical correlations. For example, the El
Niño teleconnection with the southwestern United States involves large-scale
changes in climatic conditions that are linked to increased winter rainfall.
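To make "identified through statistical correlations" concrete, here is a minimal sketch of one common measure, the Pearson correlation coefficient, applied to two yearly series. The function name, the variable names, and all of the numbers are invented for illustration. The article does not say which statistics the project used, so this is an assumption, not the researchers' method.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical yearly values, NOT project data: an ocean temperature index
# and winter rainfall (in inches) for a region far from that ocean.
ocean_index = [0.2, 1.5, -0.8, 2.1, -1.2, 0.9]
winter_rain = [11.0, 16.5, 8.0, 19.0, 7.5, 14.0]

r = pearson(ocean_index, winter_rain)
print(f"correlation: {r:.2f}")
```

A coefficient near 1 or -1 between series from widely separated regions is the kind of signal that would prompt a closer look. On its own it says nothing about cause.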
[Image caption: Finding a single gene amid the vast stretches of DNA that make up the human genome requires a set of powerful tools, such as data mining.]
The project drew on data collected since roughly 1950. Since 1980, the
data has been collected from NASA's Earth Observing Satellites
and the Advanced Very High Resolution Radiometer aboard National Oceanic
and Atmospheric Administration satellites. Prior to 1980, farmers, fishermen,
and small regional development centers collected the data, which was largely
anecdotal, Kumar said.
Data from the NOAA satellites was used to measure monthly changes in leafy
plant cover worldwide. Boston University used unique NASA computer codes
to produce global greenness values. These codes removed interfering data
from atmospheric effects. When statistics showed there was much less greenness
in specific areas for more than a year, scientists found a high probability
of ecological disturbances.
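The "much less greenness for more than a year" test can be sketched as a search for a long run of low monthly values. Everything specific here is an assumption made for illustration: the function names, the 30 percent drop threshold, the single fixed baseline (a real analysis would have to account for seasons), and the sample numbers. The article does not describe the project's actual criteria.

```python
def longest_low_run(values, baseline, drop_fraction=0.3):
    """Longest run of consecutive months with greenness below the cutoff."""
    cutoff = baseline * (1 - drop_fraction)  # hypothetical threshold
    longest = current = 0
    for value in values:
        if value < cutoff:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # the run is broken by a normal month
    return longest

def is_candidate_disturbance(values, baseline, drop_fraction=0.3, min_months=13):
    """True if greenness stays low for more than a year (13+ months)."""
    return longest_low_run(values, baseline, drop_fraction) >= min_months

# Invented monthly greenness values for two locations (not satellite data).
burned = [0.6] * 6 + [0.3] * 14 + [0.55] * 4      # 14 low months in a row
dry_spell = [0.6] * 10 + [0.3] * 5 + [0.6] * 9    # only 5 low months

print(is_candidate_disturbance(burned, baseline=0.6))     # True
print(is_candidate_disturbance(dry_spell, baseline=0.6))  # False
```

Requiring a sustained drop, rather than a single low month, is what separates a lasting disturbance from a short dry spell in this sketch.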
"Watching for changes in the amount of absorption of sunlight by
green plants is an effective way to look for ecological disasters,"
said Christopher Potter, a scientist at NASA Ames.
[Image caption: Satellite images and data, such as the one above of the Vatnajökull Glacier in Iceland, assist researchers in tracking teleconnections.]
The satellites generate roughly 1 terabyte of raw data per day, according
to Kumar. The data consists of a sequence of global snapshots of the Earth,
typically at monthly intervals, and includes various atmospheric, land,
and ocean variables, such as sea surface temperature, pressure, precipitation,
and net primary production, the net photosynthetic accumulation of carbon
by plants. Sudden changes in net primary production can have a direct
impact on a region's ecology.
The vast amount of data, and its complexity, meant that the project required
specially designed algorithms to mine it. "When you have a small
amount of data, it's possible to do the bulk of the analysis manually,"
Kumar said. "With terabytes of data, the analysis process has to
be more data-driven and automated. There's no way to go through
all that data manually."
Part of the challenge in analyzing this data is that it's not all
in a form that can be easily analyzed, according to Kumar. This means
that a fair amount of preprocessing must be performed to get the data
into shape to be analyzed. "We needed to see what we could do with
a mix of modest-quality data and very high-quality data," Kumar
said.
The sheer size of the data warehouse created an additional challenge.
All the data mining algorithms needed to be scalable, since more data
is continuously added. A combination of desktop PCs equipped with 2 to
3 gigabytes of memory each, and supercomputers with 100 times the memory
of a desktop PC, run the algorithms.
[Image caption: Researchers are using data mining to track the impact of natural disasters, like hurricanes (above), on the global carbon cycle.]
The typical data in the project are spatial time series data, where each
time series references a location on Earth. The data mining algorithms
look for spatio-temporal patterns that correlate extremes in weather with
an ecological disturbance, according to Kumar. An ecological disturbance
is an event that disrupts the physical makeup of an ecosystem and how
it works for longer than one growing season of native plants.
Natural disturbances may include fires, hurricanes, floods, droughts,
lava flows, and ice storms. Other natural disturbances are due to plant-eating
insects and mammals, and disease-causing microorganisms. Human-caused
disturbances could happen as a result of logging, deforestation, draining
wetlands, clearing, chemical pollution, and introducing non-native species
to an area.
According to Potter, "Ecosystem disturbances can contribute to
the current rise of carbon dioxide in the atmosphere." Nine billion
metric tons of carbon may have moved from the Earth's soil and
surface life forms into the atmosphere in 18 years beginning in 1982 due
to wildfires and other disturbances, according to the study. A metric
ton is 2,205 pounds, equivalent to the weight of a small car. In comparison,
fossil fuel emission of carbon dioxide to the atmosphere was about seven
billion metric tons in 1990.
"In the new era of worldwide carbon accounting and management,
we need an accurate method to tell us how much carbon dioxide is moving
from the biosphere and into the atmosphere," Potter said. "Global
satellite images go beyond the capability of human eyesight. All we need
to do is look at the data with the proper formulas to filter out just
what we need."
According to Kumar, "The more data we have, the more information
we have, and the more we can correlate changes in weather to human actions.
If we have the data, and the tools to mine it, we might be able to avoid
ruining the planet."
Big Data, Little Source
Mapping the human genome meant creating a vast database of information
that researchers needed to easily update, query, retrieve components from,
and ultimately analyze. None of that would be possible without data mining
technology, according to researchers.
Lawrence Livermore National Laboratory worked on a pilot project to address
many of the mission-critical questions in large-scale data mining, as
applied to the Human Genome study.
Among the challenges were the size of the data sets, the complexity of
the data and algorithms, the data management demand, and the need to qualify
the results of the data mining.
"Large-scale data sets pose a particular challenge," said
Ron Musick. "The genomic research branched off from work we did
with physics and astrophysics data. Astrophysics data sets range up to
1 or 2 terabytes. They're large enough so that they won't
fit into a computer's core memory. This creates a need for scalable
interfaces and input-output architectures for moving the data."
In the Livermore Lab's project, data mining queries against the
genomics database were run on desktop PCs. When working with physics data,
the data mining was performed on mini-clusters of PCs, not because computing
power was limited, but rather to get results more quickly. Musick said
they striped the data across as many as 30 hard disks, so that a query
could be run more quickly.
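The idea behind striping can be sketched in a few lines: spread the records evenly across several stores, scan every store at the same time, and merge the results. In this illustration, in-memory lists stand in for the hard disks, and the function names and sample records are hypothetical. It shows the general technique only, not the Livermore system.

```python
from concurrent.futures import ThreadPoolExecutor

def stripe(records, num_stripes):
    """Distribute records round-robin across num_stripes lists (stand-ins for disks)."""
    stripes = [[] for _ in range(num_stripes)]
    for i, record in enumerate(records):
        stripes[i % num_stripes].append(record)
    return stripes

def query_striped(stripes, predicate):
    """Scan every stripe in its own worker, then merge and sort the matches."""
    def scan(chunk):
        return [record for record in chunk if predicate(record)]
    with ThreadPoolExecutor(max_workers=len(stripes)) as pool:
        partial_results = list(pool.map(scan, stripes))
    return sorted(record for part in partial_results for record in part)

# Invented data: 1,000 integer "records" striped across 30 stores.
records = list(range(1000))
stripes = stripe(records, 30)
matches = query_striped(stripes, lambda record: record % 250 == 0)
print(matches)  # [0, 250, 500, 750]
```

With real hardware the gain comes from many disks reading at once. The threads here only mimic that structure and will not speed up an in-memory scan.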
The pilot project relied on custom-built data mining algorithms because
of the size of the data sets and complexity of the information.
[Image caption: The discovery of how various genetic components work, such as DNA proteins (above and below), was made possible through data mining.]
"We couldn't use commercial software, unless we broke the
dataset into independent chunks," Musick said. "In a lot
of ways, we had to first define the language the system would use to solve
the problem, in much the same way you need the appropriate terminology
to be able to describe how you hit a ball with a bat."
The National Center for Biotechnology Information has built on the research
done at Livermore and other places to build a variety of publicly available
database and data mining tools. These have been developed by a multidisciplinary
group of computer scientists, mathematicians, biologists, physicians,
and researchers. These tools allow researchers to generate testable hypotheses
regarding the function or structure of a gene or protein by identifying
similar sequences in better-characterized organisms.
One NCBI data mining tool, the Basic Local Alignment Search Tool, or BLAST,
is used for comparing gene and protein sequences against others in public
databases. BLAST performs sequence similarity searches against a DNA database
of more than two million sequences in less than 15 seconds, according
to an NCBI spokesperson. In essence, it's all about looking for
patterns and correlations.
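A toy version of "looking for patterns" in sequences is to count the short words (k-mers) that a query shares with each database entry and rank the entries by that count. This is a deliberately simplified stand-in and not BLAST itself, which goes on to extend and score word matches as local alignments. The function names, the word length of 4, and the sequences are all invented for illustration.

```python
def kmers(seq, k):
    """Set of all length-k substrings (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rank_by_shared_kmers(query, database, k=4):
    """Rank database entries by how many k-mers they share with the query."""
    query_words = kmers(query, k)
    scores = []
    for name, seq in database.items():
        shared = len(query_words & kmers(seq, k))
        scores.append((name, shared))
    # Highest number of shared words first.
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical sequences, not drawn from any real database.
database = {
    "seq_a": "ATGCGTACGTTAGC",
    "seq_b": "TTTTTTTTTTTTTT",
    "seq_c": "GGATGCGTACCCAA",
}

ranking = rank_by_shared_kmers("ATGCGTACG", database)
for name, shared in ranking:
    print(name, shared)
```

Entries that share many words with the query rise to the top, which is the intuition behind using similarity to a better-characterized sequence as a starting hypothesis.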
Finding those patterns can help researchers identify the genes linked
to a particular disease or those that help suppress a disease, for example.
The number of BLAST queries sent to the server continues to increase,
growing from about 100,000 per weekday at the beginning of 2002 to about
140,000 per weekday in early 2004, according to the NCBI.
Data on a Smaller Scale
Data mining isn't restricted solely to vast banks of data with unlimited
ways of analyzing it. Manufacturers such as W.L. Gore (the maker of Gore-Tex)
use commercially available data mining tools to warehouse and analyze
their data, and improve their manufacturing process.
Gore uses data mining tools from analytic software vendor SAS for statistical
modeling in its manufacturing process. According to SAS, the software
allows Gore to tweak the manufacturing process at a remote plant and to
identify variations in output at specific points during the manufacturing
process to fix or avoid production quality problems.
The University of Minnesota's Kumar is in the early stages of developing
data mining techniques that could be used in the detection of defects
in mechanical structures. He sees data mining technology as another tool
for engineers to use in non-destructive analysis.
"Data mining has many uses," Kumar said. "The key is coming up with the
right questions to ask. Once you have the question, you can develop the
language to ask that question, then build a feature set and the algorithms
to carry out your search. But it all starts with the data. You can't do
data mining without the data."
And somewhere, in all that data, may just lie the needle you're looking
for, or another you never knew existed.
© 2005 by The American Society of Mechanical Engineers