Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features


RESEARCH ARTICLE

Open Access

Zhixun Zhao1, Xiaocai Zhang1, Fang Chen2, Liang Fang3 and Jinyan Li1*

Abstract

Background: DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because they lack effective sequence features and choose learning algorithms in an ad hoc manner. This paper aims to propose a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem.

Results: The feature importance score distributions in datasets of six species are first reported and analyzed. Then the impact of feature selection on model performance is evaluated by independent testing on benchmark datasets, where ACC and MCC after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using independent tests and 10-fold cross-validation, and it outperforms them on all datasets, improving the ACC by 3.02% to 7.89% and the MCC by 0.06 to 0.15 in the independent test. Two detailed case studies have confirmed the excellent overall performance of the proposed method, which correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene.

Conclusions: The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies demonstrate the effectiveness of our method in practical situations.

Keywords: DNA N4-methylcytosine, Sequence feature, Feature selection, Site prediction
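The abstract reports results as ACC and MCC without defining them. The sketch below shows how the two measures are conventionally computed from confusion-matrix counts, assuming ACC denotes accuracy and MCC the Matthews correlation coefficient in their standard forms. The function name `acc_mcc` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
import math


def acc_mcc(tp, tn, fp, fn):
    """Return (accuracy, Matthews correlation coefficient).

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives and false negatives of a binary site predictor.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# Hypothetical counts, for illustration only (not results from the paper).
acc, mcc = acc_mcc(tp=90, tn=85, fp=15, fn=10)
print(f"ACC = {acc:.3f}, MCC = {mcc:.3f}")
```

ACC ranges from 0 to 1 and is often quoted as a percentage, whereas MCC ranges from -1 to 1, which would explain why the improvements above are stated in percentage points for ACC and in absolute units for MCC.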

*Correspondence: [email protected]
1 Advanced Analytics Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, PO Box 123, Broadway, NSW 2007 Sydney, Australia
Full list of author information is available at the end of the article

Background

As an essential epigenetic modification, DNA base methylation expands the DNA content and plays crucial roles in regulating various cellular processes [1–3]. According to the location where the methyl group occurs in the DNA sequence, there are many kinds of DNA base methylation. For example, 5-methylcytosine (5mC), N6-methyladenine (6mA) and N4-methylcytosine (4mC) are the most common types [4–6]. 5mC occurs at the C5-position of cytosine and is the dominant methylation type in eukaryotic genomes, actively involved in differentiation, gene expression, genomic imprinting, preservation of chromosome stability, aging, suppression of repetitive elements, and X chromosome i