Benchmark AFLOW Data Sets for Machine Learning
TECHNICAL ARTICLE
Conrad L. Clement¹ · Steven K. Kauwe¹ · Taylor D. Sparks¹
Received: 8 March 2020 / Accepted: 28 April 2020
© The Minerals, Metals & Materials Society 2020
Abstract
Materials informatics is increasingly finding ways to exploit machine learning algorithms. Techniques such as decision trees, ensemble methods, support vector machines, and a variety of neural network architectures are used to predict likely material characteristics and property values. Supplemented with laboratory synthesis, applications of machine learning to compound discovery and characterization represent one of the most promising research directions in materials informatics. A shortcoming of this trend, in its current form, is a lack of standardized materials data sets on which to train, validate, and test model effectiveness. Applied machine learning research depends on benchmark data to make sense of its results. Fixed, predetermined data sets allow for rigorous model assessment and comparison. Machine learning publications that do not refer to benchmarks are often hard to contextualize and reproduce. In this data descriptor article, we present a collection of data sets of different material properties taken from the AFLOW database. We describe them, the procedures that generated them, and their use as potential benchmarks. We provide a compressed ZIP file containing the data sets and a GitHub repository of associated Python code. Finally, we discuss opportunities for future work incorporating the data sets and creating similar benchmark collections.
Keywords: AFLOW · Benchmark data sets · Machine learning · Materials informatics
* Taylor D. Sparks [email protected]
¹ Department of Materials Science and Engineering, University of Utah, Salt Lake City, USA

Introduction
The previous decade saw widespread interest in machine learning, affecting and sometimes transforming fields and industries. Machine learning has existed for more than half a century as an area of research [1]. The recent swell of attention has been driven largely by advances in neural network algorithms, and in deep neural networks in particular. However, algorithm development is only part of the story. Other key factors that have made machine learning so useful include special-purpose GPU chips, the general increase in computing power associated with Moore’s Law, and the unprecedented availability of training data.
Successfully applying machine learning tools to a scientific or industrial domain depends on access to large, high-quality data sets [2], together with the software infrastructure needed to process them. In the case of materials science and informatics, sources of training data include shared research databases such as AFLOW [3], the Materials Project [4], the Inorganic Crystal Structure Database (ICSD) [5], and the Open Quantum Materials Database (OQMD) [6]. A more comprehensive list of materials databases can be found in Hill et al. [7]. Examples of software specifically designed fo