Is Domain Knowledge Necessary for Machine Learning Materials Properties?



TECHNICAL ARTICLE

Ryan J. Murdock¹ · Steven K. Kauwe¹ · Anthony Yu‑Tung Wang² · Taylor D. Sparks¹

Received: 4 June 2020 / Accepted: 9 July 2020
© The Minerals, Metals & Materials Society 2020

Abstract

New featurization schemes for describing materials as composition vectors in order to predict their properties using machine learning are common in the field of Materials Informatics. However, little is known about the comparative efficacy of these methods. This work sets out to make clear which featurization methods should be used across various circumstances. Our findings include, surprisingly, that simple fractional and random-noise representations of elements can be as effective as traditional and new descriptors when using large amounts of data. However, in the absence of large datasets or for data that is not fully representative, we show that the integration of domain knowledge offers advantages in predictive ability.

Keywords  Materials informatics · Machine learning · Featurization · Descriptors · Neural networks

* Taylor D. Sparks [email protected]

¹ Materials Science and Engineering Department, University of Utah, Salt Lake City, UT 84109, USA

² Technische Universität Berlin, Fachgebiet Keramische Werkstoffe/Chair of Advanced Ceramic Materials, 10623 Berlin, Germany

Introduction

In Materials Informatics (MI), composition-based machine learning (ML) entails the creation of a composition-based feature vector (CBFV) that represents materials based on expertly curated element properties. Traditionally, descriptive statistics (average, range, sum, and variance) of the constituent elements' properties form the core of a CBFV scheme (see Fig. 1). An exemplar of the CBFV method is the Magpie [1] descriptor. This domain-derived approach has been successfully employed in materials informatics studies in the literature [2–7]. Not only has this approach been successful, but the information it contains is also human-readable, potentially allowing for physically interpretable results.

In contrast to the CBFV are data-driven techniques such as CGCNN [8], mat2vec [9], SchNet [10], and ElemNet [11]. These represent a new philosophy: when featurization relies primarily on data, domain knowledge is less important, and the representation of chemical systems is no longer left to expert opinion. When used within learning frameworks, these data-driven techniques allow for materials insight that may lie outside of current scientific understanding. This removal of materials experts stands in contrast to traditional learning, which uses hand-engineered materials representations such as the classic CBFV.

Although a variety of data-driven approaches can be utilized, works such as mat2vec rely heavily on curated materials knowledge. For mat2vec, this knowledge comes in the form of materials science abstracts. Natural language processing techniques are applied to these abstracts, reportedly yielding ve