Machine learning to predict retention time of small molecules in nano-HPLC


RESEARCH PAPER

Sergey Osipenko¹, Inga Bashkirova¹, Sergey Sosnin¹, Oxana Kovaleva¹, Maxim Fedorov¹, Eugene Nikolaev¹, Yury Kostyukevich¹

Received: 26 May 2020 / Revised: 29 July 2020 / Accepted: 20 August 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Retention time is an important parameter for identification in untargeted LC-MS screening. Precise retention time prediction facilitates the annotation process and is well established in proteomics. For small molecules, however, the lack of available experimental data has long limited prediction accuracy. Recently introduced large databases of small-molecule retention times make reliable machine learning–based predictions possible across the whole diversity of compounds. Applying simple projections may extend these predictions to various LC systems and conditions. In this work, we describe a combined approach to predicting retention times for nano-HPLC. It includes the consecutive deployment of binary and regression gradient boosting models trained on the METLIN small-molecule dataset, followed by a simple projection of the results onto nano-HPLC separations using a small number of easily available compounds. The proposed model outperforms previous attempts to use machine learning for such predictions, with a mean absolute error of 46 s. After transfer to nano-LC conditions, the median absolute error is below 155 s (median relative error 10.8%). To illustrate the applicability of the described approach, we eliminated on average 25 to 42% of false positives with a filter threshold derived from ROC curves. Thus, the proposed approach should be used in addition to other well-established in silico methods, and their integration may broaden the range of correctly identified molecules.

Keywords Retention time prediction · Nano-HPLC · Machine learning
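The projection step summarized in the abstract can be illustrated with a minimal sketch. It assumes that the "simple projection" is a straight-line fit between model-predicted and measured retention times of a few calibration compounds; the abstract does not state the functional form, so this is an assumption. The function names and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean, median


def fit_projection(predicted, measured):
    """Least-squares line: measured ~ slope * predicted + intercept.

    Assumption: a linear mapping between the two LC systems.
    """
    p_mean, m_mean = mean(predicted), mean(measured)
    sxx = sum((p - p_mean) ** 2 for p in predicted)
    sxy = sum((p - p_mean) * (m - m_mean) for p, m in zip(predicted, measured))
    slope = sxy / sxx
    return slope, m_mean - slope * p_mean


def project(predicted, slope, intercept):
    """Map model-predicted retention times (s) onto the target LC system."""
    return [slope * p + intercept for p in predicted]


def median_errors(projected, measured):
    """Return (median absolute error in s, median relative error in %)."""
    abs_err = [abs(p - m) for p, m in zip(projected, measured)]
    rel_err = [100.0 * e / m for e, m in zip(abs_err, measured)]
    return median(abs_err), median(rel_err)


# Hypothetical calibration compounds: predicted vs. measured retention times (s).
calib_predicted = [120.0, 300.0, 480.0, 660.0, 840.0]
calib_measured = [300.0, 660.0, 1020.0, 1380.0, 1740.0]

slope, intercept = fit_projection(calib_predicted, calib_measured)

# Hypothetical test compounds measured on the target system.
projected = project([200.0, 500.0], slope, intercept)
med_abs, med_rel = median_errors(projected, [450.0, 1080.0])
print(f"slope={slope:.3f} intercept={intercept:.1f} s")
print(f"median absolute error={med_abs:.1f} s, median relative error={med_rel:.2f}%")
```

The two error measures computed here correspond to the median absolute and median relative errors quoted in the abstract; the fitted line is only one possible choice of projection.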

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s00216-020-02905-0) contains supplementary material, which is available to authorized users.

* Eugene Nikolaev [email protected]
* Yury Kostyukevich [email protected]

¹ Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Nobel Str., 3, 121205 Moscow, Russia

Introduction Untargeted screening of small molecules based on liquid chromatography coupled with mass spectrometry (LC-MS) has become common practice in forensic analysis [1], doping control [2], drug discovery [3], medicine [4], food [5], and environmental chemistry [6]. The bottleneck of all untargeted approaches is compound annotation, which is mainly based on matching fragmentation mass spectra to publicly available databases [7]. This matching significantly reduces the number of candidates obtained after an accurate mass search; however, the fragmentation pattern depends on the particular instrument and collision energy settings and may result in a high ratio of fals