Participant Profile

Yasufumi Sakakibara

A genome is the complete set of genetic information that an organism possesses. It is stored within cells in a medium called DNA. As the technical term "DNA base sequence" suggests, genomic information is recorded as a string of symbols: a linear arrangement of four types of nucleotides (abbreviated as A, C, G, and T). In other words, just as a Japanese sentence is written as a string of hiragana, katakana, and kanji characters, a genome can be regarded as a text written in an alphabet of four symbols. For this reason, many researchers have proposed an approach called "genetic linguistics."
For example, a portion of the *Bacillus subtilis natto* genome, which our laboratory recently sequenced, is represented by the following string of characters:
GTGCGTGAAAAAAAATATTATGAATTAGTGGAACAATT
AAAAGACAGAACACAAGACGTAACATTTTCAGCTACAA
AAGCACTAAGTCTTCTTATGCTGTTCAGCAGATATTTG
GTCAATTACACCAATGTCGAATCAGTAAATGACATTAA
TGAGGAATGCGCCAAACATTATTTTAACTACTTAATGA
AAAACCATAAGCGATTAGGAATTAATCTGACAGATATA
AAAAGGTCGATGCATCTAATCAGCGGGTTATTGGATGT
GGATGTAAACCACTATTTAAAGGATTTTTCACTATCGA
ATGTCACGCTGTGGATGACGCAAGAGAGATAA
Few people, however, could tell what is written here just by looking at the string. It contains the genetic information necessary for *Bacillus subtilis natto* to ferment soybeans and produce its characteristic sticky substance (see Figure 1).
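To make the "genome as text" idea concrete, here is a minimal Python sketch that handles the first line of the fragment above as an ordinary string. The function names `is_dna_text` and `symbol_counts` are illustrative choices for this sketch, not part of any established tool.

```python
from collections import Counter

# First line of the fragment shown above, held as an ordinary Python string.
FRAGMENT = "GTGCGTGAAAAAAAATATTATGAATTAGTGGAACAATT"

# The four-symbol alphabet described in the text.
DNA_ALPHABET = frozenset("ACGT")


def is_dna_text(text: str) -> bool:
    """Return True if every character belongs to the four-symbol alphabet."""
    return set(text) <= DNA_ALPHABET


def symbol_counts(text: str) -> Counter:
    """Count how often each symbol occurs, as one would count letters in a sentence."""
    return Counter(text)


if __name__ == "__main__":
    print("Valid DNA text:", is_dna_text(FRAGMENT))
    print("Length in characters:", len(FRAGMENT))
    for symbol, count in sorted(symbol_counts(FRAGMENT).items()):
        print(symbol, count)
```

Nothing here is specific to biology: the same string operations used on any text apply directly, which is what makes the linguistic analogy workable on a computer.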
A further difficulty in deciphering a genome is the sheer size of the data. The string above is 336 characters long (technically, 336 base pairs), but the entire genome of *Bacillus subtilis natto* is 4.09 million base pairs, about 12,000 times longer. The human genome is 3 billion base pairs, roughly 8.9 million times longer than the string above.
To analyze such vast amounts of data, researchers are actively pursuing bioinformatics, which uses computers together with a method called comparative genomics to analyze genomic information. For example, the two DNA sequences in Figure 2 are parts of the genomes of *Mycobacterium tuberculosis* and *Mycobacterium leprae*, both bacteria that infect humans and cause disease. The figure finds and compares the parts of the two text strings that correspond to similar "words." DNA sequences that share many such similar words can be regarded as sentences that mean the same thing. Furthermore, a comprehensive comparison of the entire genomes of many higher organisms, including humans, yields an overview like the one shown in Figure 3.
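The size figures and the idea of comparing shared "words" can both be sketched in a few lines of Python. The text does not say how a "word" is defined or which method produced Figure 2, so this sketch makes two assumptions: a word is any substring of a fixed length k, and similarity is the fraction of distinct words the two sequences share (Jaccard similarity). The function names are hypothetical, and the short sequences in the usage section are made up for illustration; they are not taken from the *Mycobacterium* genomes.

```python
def size_ratio(genome_length: int, fragment_length: int) -> float:
    """How many times longer a genome is than a given fragment."""
    return genome_length / fragment_length


def words(sequence: str, k: int) -> set[str]:
    """All overlapping substrings of length k (the assumed 'words')."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_word_fraction(a: str, b: str, k: int) -> float:
    """Jaccard similarity: shared words divided by all distinct words."""
    words_a, words_b = words(a, k), words(b, k)
    union = words_a | words_b
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return len(words_a & words_b) / len(union)


if __name__ == "__main__":
    fragment_length = 336  # length of the fragment quoted in the text
    print(round(size_ratio(4_090_000, fragment_length)))      # B. subtilis natto
    print(round(size_ratio(3_000_000_000, fragment_length)))  # human

    # Made-up toy sequences, only to show the comparison running.
    print(shared_word_fraction("ACGTA", "CGTAC", k=3))
```

Real comparative genomics relies on far more elaborate alignment methods that tolerate insertions, deletions, and substitutions; counting exact shared substrings is only the simplest form of the idea.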
As of December 2010, genome sequences have been determined for 153 eukaryotic and 1,384 prokaryotic species, 1,537 species in total. In addition, genome sequencing projects are underway for 1,689 eukaryotic and 5,673 prokaryotic species. Yet tens of millions of species are estimated to exist on Earth (though the exact number is completely unknown). Furthermore, even within the same species, genomes differ slightly among individuals. This is called polymorphism; in humans, too, your genome differs from those of your parents, children, siblings, and unrelated people. These differences appear as differences in physical traits such as facial features and hair color, and even as predispositions such as susceptibility to cancer. If we think of each organism as possessing a book called a genome, then the Earth can be thought of as a "Genome Library" that holds all of these genomic books. This vast library contains the book of your own genome, along with many books that no one has yet read. Among them, there are surely books more interesting than a Haruki Murakami novel.