DNA sequencing has become one of the most important tools in modern biology. It helps researchers understand genetic causes of cancers, neurological disorders, infectious diseases, and many other health conditions. But as sequencing technology has advanced, scientists around the world have started generating enormous amounts of genetic data. These datasets are so large, measured in petabytes, that they are difficult to store, search, and analyze.
To
solve this problem, researchers at ETH Zurich have created a new system
called MetaGraph, described in research published in the journal Nature.
MetaGraph works like a search engine for DNA. It combines huge amounts of
sequencing data into one organized platform that allows scientists to quickly
look up specific genes, mutations, or sequences. The system currently brings
together almost 600 million unique sequences and around 21 million
gigabytes of data.
How MetaGraph Works
Since
the early days of sequencing, beginning with Fred Sanger’s chain-termination
method in 1977, scientists have tried to make DNA analysis faster and more
accurate. Today’s next-generation sequencing produces massive amounts of
information, but until now, searching through this information has been slow
and costly.
|
Read
More Predicting
Human Genetic Variants in Mice: A Game Changer for Genomic Research |
MetaGraph
changes this by converting raw sequencing files into compact, searchable
indexes. The system cleans the data, corrects errors, and organizes the
sequences into mathematical graphs that can be merged into one unified
structure. This process compresses the data enormously, for example, large
datasets like GTEx and TCGA, which normally take up 100 terabytes, can be
reduced to about 10 gigabytes each.
The
database includes sequences from viruses, bacteria, fungi, plants, humans, and
even environmental samples such as the human gut microbiome. The refined graph
structure removes redundant information, allowing the data to be stored
efficiently and searched rapidly.
One major advantage is that researchers no longer need to download entire datasets. Instead, they can perform detailed searches directly within MetaGraph. This saves both time and money. In fact, the full public sequencing database. normally far too large to store on regular computers, can now fit on only a few hard drives, and each search costs only a few cents. The team estimates that the entire system can operate for roughly $2,500.
What MetaGraph Means for the Future
Currently,
MetaGraph includes about half of the world’s publicly available sequencing
data, and the team expects the rest to be added by the end of 2025. The system
is designed to grow without slowing down, making it valuable for large-scale
genetic research. Because the platform is open-source, it can be used by
scientists, pharmaceutical companies, educators, and even interested
individuals.
Researchers
believe MetaGraph could make genetic studies much easier. For example,
scientists who tracked the SARS-CoV-2 genome during the COVID-19 pandemic
relied on fast sequencing tools. Others use genome data to study how species
evolve or how microbes spread. With MetaGraph’s search capabilities, these
tasks can be done faster, more efficiently, and at a much lower cost.
|
Read More How Scientists Are Learning to Measure Ethical AI: Why It Matters More Than Ever? |
As
one ETH researcher noted, even Google didn’t know all the ways a search engine
would be used when it was first created. Similarly, as DNA sequencing continues
to expand, tools like MetaGraph may eventually become part of everyday life, perhaps
even helping people identify plants or microbes around them.
0 Comments