Classification of Metagenome Fragments With Agglomerative Hierarchical Clustering
Unlike genomics, which studies culturable microorganisms specifically, metagenomics studies microbial samples retrieved directly from the environment. When sequenced, such samples produce widely varying fragments, many of which are still unidentified or unknown. Assembling these fragments in order to identify the species they contain is therefore difficult, so binning techniques are needed to classify the mixed fragments at particular levels of the phylogenetic tree. This research implements k-mer counting for feature extraction, linear discriminant analysis (LDA) for dimensionality reduction, and agglomerative hierarchical clustering (AGNES) for taxonomic classification to the genus level. Experiments are run across several settings, including the length of the observed metagenome fragment, which ranges from 0.5 Kbp to 10 Kbp, for both the 3-mer and 4-mer contexts (k = 3 and k = 4). The average validity scores of the resulting clusters, computed with the silhouette index for the training and test sets respectively, are 0.6945 and 0.0879 for the 3-mer context and 0.5219 and 0.1884 for the 4-mer context.
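As an illustration of the k-mer feature extraction step, the following is a minimal sketch and not the authors' implementation. It assumes overlapping windows, the alphabet A, C, G, T, and normalization to relative frequencies; the function name `kmer_frequencies` and the example fragment are hypothetical. Under these assumptions a fragment maps to a vector of 4^k entries (64 for k = 3, 256 for k = 4).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(fragment: str, k: int = 3) -> list[float]:
    """Return the normalized k-mer frequency vector of a DNA fragment.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other symbol (e.g. N) are ignored.
    This is an illustrative assumption, not the paper's exact procedure.
    """
    alphabet = "ACGT"
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    fragment = fragment.upper()

    # Count every overlapping window of length k.
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))

    # Normalize only over windows made of valid nucleotides.
    valid_total = sum(counts[kmer] for kmer in vocabulary)
    if valid_total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / valid_total for kmer in vocabulary]


# Hypothetical toy fragment; real fragments span 0.5 Kbp to 10 Kbp.
features = kmer_frequencies("ACGTACGTAC", k=3)
print(len(features))
```

Vectors of this kind would then be passed to the dimensionality reduction and clustering stages; normalizing by the number of valid windows keeps fragments of different lengths comparable.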
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
The submitted papers are assumed to contain no proprietary material unprotected by patent or patent application; responsibility for technical content and for protection of proprietary material rests solely with the author(s) and their organizations and is not the responsibility of the ULTIMATICS or its Editorial Staff. The main (first/corresponding) author is responsible for ensuring that the article has been seen and approved by all the other authors. It is the responsibility of the author to obtain all necessary copyright release permissions for the use of any copyrighted materials in the manuscript prior to the submission.