ISSN: 1304-7191 | E-ISSN: 1304-7205
Feature extraction for DNA capillary electrophoresis signals based on discrete wavelet transform combined with multi-scale permutation entropy
Yildiz Technical University, Department of Statistics, Istanbul, Turkey
Sigma J Eng Nat Sci 2022; 40(3): 475-490 DOI: 10.14744/sigma.2022.00051

Abstract

DNA sequence classification is an important challenge in genomic studies because the DNA oxidation signals of the Adenine, Cytosine, Guanine, and Thymine bases behave in a non-linear and chaotic manner. To identify the genotype of samples derived from biological sources accurately, Machine Learning (ML) methods are commonly preferred over expert-based methods because they can handle such complex-structured biological sequences. Reducing the dimension of the data without discarding information that matters for classification is an important task in ML applications. This study presents a new feature extraction method to detect two sub-types of hepatitis from nucleic acid trace files. The proposed method combines the discrete wavelet transform (DWT) with entropy. The DWT decomposes the base signals up to three levels, with the aim of capturing the necessary information hidden in both the spatial and frequency domains. To summarize DNA trace files of different lengths, multi-scale permutation entropy (MPE) measures are then computed from the approximation and detail coefficients of the signals stored in the sub-bands. Different feature sets are extracted with the proposed method from real data covering 200 hepatitis DNA trace files and then fed to a simple memory-based learning classifier, k-NN. The classification performance of the proposed feature extraction method is compared against a method based on MPE features without wavelet decomposition. The results indicate that, in classifying hepatitis DNA trace files, the average accuracy reaches nearly 99% with feature sets based on the proposed method, even when only 30% of the samples are used for training.
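The pipeline described in the abstract (three-level DWT, then MPE on each sub-band, then k-NN) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the wavelet family, the embedding order, the set of scales, or the value of k, so the Haar wavelet, order 3, scales 1 to 3, k = 1, and Euclidean distance are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import math

def haar_dwt(signal):
    """One Haar DWT level (assumed wavelet): (approximation, detail)."""
    n = len(signal) - len(signal) % 2  # drop a trailing odd sample
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, n, 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, n, 2)]
    return approx, detail

def wavedec(signal, levels=3):
    """Multi-level decomposition, returned as [A_L, D_L, ..., D_1]."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return [approx] + details[::-1]

def coarse_grain(x, scale):
    """Average non-overlapping windows of length `scale`."""
    n = len(x) // scale
    return [sum(x[i * scale:(i + 1) * scale]) / scale for i in range(n)]

def permutation_entropy(x, order=3):
    """Normalized permutation entropy in [0, 1]; ties keep index order."""
    counts = {}
    for i in range(len(x) - order + 1):
        window = x[i:i + order]
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        counts[pattern] = counts.get(pattern, 0) + 1
    total = sum(counts.values())
    if total == 0:  # series too short for this order
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(math.factorial(order))

def mpe(x, order=3, scales=(1, 2, 3)):
    """Permutation entropy of the coarse-grained series at each scale."""
    return [permutation_entropy(coarse_grain(x, s), order) for s in scales]

def extract_features(signal, levels=3, order=3, scales=(1, 2, 3)):
    """Fixed-length feature vector: MPE of every DWT sub-band."""
    features = []
    for band in wavedec(signal, levels):
        features.extend(mpe(band, order, scales))
    return features

def knn_predict(train_X, train_y, query, k=1):
    """Majority vote among the k nearest training vectors (Euclidean)."""
    order_idx = sorted(range(len(train_X)),
                       key=lambda i: math.dist(train_X[i], query))
    votes = [train_y[i] for i in order_idx[:k]]
    return max(set(votes), key=votes.count)
```

Under these assumptions, each base signal yields (levels + 1) × len(scales) features regardless of its length, which is how trace files of different lengths end up as vectors of the same size. A full trace file would contribute one such vector per base (A, C, G, T), concatenated before being passed to `knn_predict`.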