ReactionCode: format for reaction searching, analysis, classification, transform, and encoding/decoding


Journal of Cheminformatics Open Access

Victorien Delannée and Marc C. Nicklaus*

*Correspondence: [email protected]
Computer‑Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, NIH, 376 Boyles Street, Frederick, MD 21702, USA

Abstract

Over the past two decades, many different formats for molecules and reactions have been created. These formats were mostly developed to serve as identifiers and to support representation, classification, analysis and data exchange. Much effort has gone into molecule formats, but far less into reaction formats, where most of the work has been done by companies and has led to proprietary formats. Here, we present ReactionCode: a new open-source format that allows one to encode and decode a reaction into a multi-layer, machine-readable code, which aggregates reactants and products into a condensed graph of reaction (CGR). This format is flexible and can be used for reaction similarity searching and classification. It is also designed for database organization, machine learning applications and as a new reaction transform language.

Keywords: ReactionCode, Reaction, Encoding, Decoding, Searching, Classification

Introduction

Different proprietary and open formats for reactions have been invented over the past 50 years. The first reaction format can probably be attributed to E. J. Corey and W. T. Wipke. They implemented a rule-based format for generating new molecules and integrated it into the first computer-aided organic synthesis program: OCSS (Organic Chemical Simulation of Synthesis) [1]. This project later split, giving birth to LHASA (Logic and Heuristics Applied to Synthetic Analysis) [2–4] and SECS (Simulation and Evaluation of Chemical Synthesis) [5]. The LHASA team designed the language CHMTRN (CHeMistry TRaNslator), while the SECS group created the ALCHEM (A Language for CHEMistry) language [6]. After their launch, various additional reaction transform languages emerged alongside programs such as CLASS, IGOR and IGOR2.

However, the arrival of SMILES (Simplified Molecular Input Line Entry System) in the late 1980s led to the development of ReactionSMILES and SMIRKS (SMIles ReaKtion Specification). These two formats were widely adopted by the community and are still in broad use today [7–10].

Work on reaction formats has also responded to the need for representations and identifiers suited to data exchange. In the 1990s, Molecular Design Limited (MDL) developed the Chemical Table file (CTfile) format [11]. In this context, the RXNfile and RDfile formats were defined to store reaction data and quickly became a reference. An RXNfile stores the structural information for the reactants and products of a single reaction [11], while an RDfile stores a set of reactions with their associated data [11]. Since then, additional formats have emerged or are under develo