Species classification using deep learning of oligonucleotide frequencies in genomic sequences

Extant species that are thought to have evolved from a common ancestor share many physiological characteristics, yet show remarkable diversity in morphology and other traits. The genome sequence, which plays the most important role in determining these phenotypes, is likewise similar in functionally important regions but differs substantially when viewed as a whole. With the advent of next-generation sequencers (NGS), it has become easier to obtain a wide variety of sequence information, including whole genomes. Analyzing that information can still be computationally demanding: for example, counting the frequency of every oligonucleotide of a given length means handling 4^10 = 1,048,576 (more than one million) distinct sequences for a length of 10 bases, and methods based on machine learning are attracting attention for this kind of analysis. In this study, we overcame this difficulty by using existing machine learning frameworks with linear regression models together with our own implementation of deep learning, and, as an example, attempted to classify the species and the NGS analysis method from data given as FASTQ. We created datasets comprising tens of thousands of fixed-length oligonucleotide frequency profiles from genome reference sequences or FASTQ files of humans, mice, frogs, lancelets, sea urchins, and other species, trained the models, and verified the results using different FASTQ files as test data, obtaining an accuracy of over 90% for all species. For humans, FASTQ files obtained from whole-genome and whole-exome analyses could also be roughly distinguished. The sequence length used to generate the training and test data is an important hyperparameter that affects not only the sensitivity and specificity of the system but also the data processing time; interestingly, its value was also found to be biologically suggestive for individual species.
This approach could also serve as a verification mechanism, for example to confirm the species or analysis type of incoming data, when handling the growing volume of next-generation sequencing data.
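
To make the feature size mentioned above concrete, the following is a minimal sketch of how fixed-length oligonucleotide (k-mer) frequencies might be computed from sequencing reads. It is not the authors' implementation: the function names (`kmer_to_index`, `kmer_frequencies`, `read_fastq_sequences`), the choice to skip windows containing ambiguous bases such as N, and the assumption of uncompressed FASTQ files with exactly four lines per record are all illustrative.

```python
# Illustrative sketch only; names and design choices are assumptions,
# not taken from the study described above.

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_index(kmer):
    """Encode a k-mer as a base-4 integer in [0, 4**k); None if it has a non-ACGT base."""
    index = 0
    for base in kmer:
        code = BASE_CODE.get(base)
        if code is None:
            return None
        index = index * 4 + code
    return index


def kmer_frequencies(reads, k):
    """Return a list of 4**k relative frequencies of all k-mers seen in the reads.

    Windows containing characters other than A, C, G, T are skipped.
    """
    counts = [0] * (4 ** k)
    total = 0
    for read in reads:
        seq = read.upper()
        for start in range(len(seq) - k + 1):
            index = kmer_to_index(seq[start:start + k])
            if index is not None:
                counts[index] += 1
                total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


def read_fastq_sequences(path):
    """Yield sequence lines from an uncompressed FASTQ file (assumes 4 lines per record)."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                yield line.strip()
```

With `k = 10` the resulting vector has 4^10 = 1,048,576 entries, which is the feature size quoted in the text. A vector like this, computed per sample, is the kind of input that a linear model or a neural network classifier could be trained on; `k` plays the role of the sequence-length hyperparameter discussed above.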