Benchmark AFLOW Data Sets for Machine Learning


TECHNICAL ARTICLE

Conrad L. Clement¹ · Steven K. Kauwe¹ · Taylor D. Sparks¹

Received: 8 March 2020 / Accepted: 28 April 2020

© The Minerals, Metals & Materials Society 2020

Abstract

Materials informatics is increasingly finding ways to exploit machine learning algorithms. Techniques such as decision trees, ensemble methods, support vector machines, and a variety of neural network architectures are used to predict likely material characteristics and property values. Supplemented with laboratory synthesis, applications of machine learning to compound discovery and characterization represent one of the most promising research directions in materials informatics. A shortcoming of this trend, in its current form, is a lack of standardized materials data sets on which to train, validate, and test model effectiveness. Applied machine learning research depends on benchmark data to make sense of its results. Fixed, predetermined data sets allow for rigorous model assessment and comparison. Machine learning publications that do not refer to benchmarks are often hard to contextualize and reproduce. In this data descriptor article, we present a collection of data sets of different material properties taken from the AFLOW database. We describe them, the procedures that generated them, and their use as potential benchmarks. We provide a compressed ZIP file containing the data sets and a GitHub repository of associated Python code. Finally, we discuss opportunities for future work incorporating the data sets and creating similar benchmark collections.

Keywords: AFLOW · Benchmark data sets · Machine learning · Materials informatics

* Taylor D. Sparks

¹ Department of Materials Science and Engineering, University of Utah, Salt Lake City, USA

Introduction

The previous decade saw widespread interest in machine learning, affecting and sometimes transforming fields and industries. Machine learning has existed as an area of research for more than half a century [1]. The recent swell of attention has been driven largely by advances in neural network algorithms, deep neural networks in particular. However, algorithm development is only part of the story. Other key factors that have made machine learning so useful include special-purpose GPU chips, the general increase in computing power associated with Moore's Law, and the unprecedented availability of training data.

Successfully applying machine learning tools to a scientific or industrial domain depends on access to large, high-quality data sets [2], together with the software infrastructure needed to process them. In the case of materials science and informatics, sources of training data include shared research databases such as AFLOW [3], the Materials Project [4], the Inorganic Crystal Structure Database (ICSD) [5], and the Open Quantum Materials Database (OQMD) [6]. A more comprehensive list of materials databases can be found in Hill et al. [7]. Examples of software specifically designed fo
