High Entropy Alloy Discovery by Text Mining

What lies beyond the known High Entropy Alloy territory?
We obtained 500 promising lightweight ultrahigh-entropy alloys with 6 and 7 components out of 2.6 million candidate systems by text mining 6.4 million papers.

The full paper is here.


Design of ultrahigh-entropy alloys via mining six million texts

Text mining (TM) is an artificial intelligence method to analyze and discover scientific knowledge that is hidden in the literature. TM has been used in several fields of materials science.


Given sufficiently large corpora, TM has the potential for automatic materials discovery. One such case is the material group of high- and medium-entropy alloys (HEAs, MEAs), on which more than 10,000 papers have been published.

Several TM methods have been suggested that build on corpora as training data. One group of TM algorithms, known as word-embedding algorithms, represents words as vectors. Operations on these vectors provide meaningful information. For example, the difference between the vectors “FCC” and “Al” is approximately equal to that between the vectors “BCC” and “W”, since the chemical element Al is commonly found with a face-centered-cubic (FCC) crystal structure and W with a body-centered-cubic (BCC) structure. The vectors are determined by maximizing the co-occurrence probability of an embedded word and its neighbors within the corpora. The cosine of the angle between two vectors measures the similarity of the words they represent. For instance, when the frequency of the word “CoCrFeNiV” as a neighbor of “CoCrFeMnNi” is increased tenfold in a TM (skip-gram) model, its similarity ranking rises by 13 places.
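The vector arithmetic described above can be illustrated with a minimal sketch. The three-dimensional vectors below are invented toy values, not embeddings from the trained model, and the helper names `cosine` and `subtract` are hypothetical. The sketch only shows how an offset such as “FCC” minus “Al” can be compared with “BCC” minus “W” through cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Hypothetical toy embeddings (assumption: real models use hundreds of dimensions).
vectors = {
    "Al":  [1.0, 0.0, 0.2],
    "FCC": [1.0, 1.0, 0.3],
    "W":   [0.0, 0.0, 0.9],
    "BCC": [0.1, 1.0, 0.9],
}

# Offset from each element to its usual crystal structure.
offset_al = subtract(vectors["FCC"], vectors["Al"])
offset_w = subtract(vectors["BCC"], vectors["W"])

# A value close to 1 means the two offsets point in nearly the same direction.
offset_similarity = cosine(offset_al, offset_w)
print(f"similarity of the two offsets: {offset_similarity:.3f}")
```

In a trained model the offsets would match only approximately, which is why they are compared by cosine similarity rather than tested for equality.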

TM models trained on specially selected corpora tend to be more predictive, since the presence of less relevant text items can reduce the relative frequency of keywords¹.

Here we have developed a highly optimized TM model for metallic materials, focusing on HEAs. Unfortunately, TM methods can only identify target materials that are, in principle, already present in the corpora, which does not in itself amount to discovering new materials. A key challenge in designing HEAs, however, is the search for similar elements with high mutual solubility. To this end, we propose the design concept of “context-similar elements” to overcome this limitation of existing TM methods in this field. The context-similar elements approach aims to capture the similarity of chemical elements in the alloy-design context used by scientists. Similarity in this context is not a metric that can be calculated from simple elemental properties, but a more comprehensive one that also reflects researchers’ experience in materials research and design. This approach will enrich the portfolio of existing alloy-design methods and can accelerate the alloy discovery process by replacing the laborious literature search, review, and knowledge extraction with TM models. With this approach, researchers with less domain-specific experience can design complex HEAs with many components, assisted by TM models that not only “read” huge numbers of publications but also “analyze” them in a more context-sensitive way.
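One way to picture the idea of context-similar elements is to rank candidate elements by the cosine similarity of their word vectors to a chosen element. The sketch below is an illustration under assumptions and not the procedure used in the paper: the vectors are invented toy values, and the function name `rank_context_similar` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_context_similar(target, vectors, top_n=2):
    """Return the top_n elements most similar to `target`, best first."""
    scores = [
        (name, cosine(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]

# Hypothetical toy embeddings for a few chemical elements (assumption).
element_vectors = {
    "Co": [0.9, 0.1, 0.3],
    "Ni": [0.8, 0.2, 0.3],
    "Fe": [0.7, 0.3, 0.1],
    "W":  [0.1, 0.9, 0.2],
    "Li": [0.0, 0.2, 0.9],
}

ranking = rank_context_similar("Co", element_vectors, top_n=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With vectors learned from the literature, such a ranking would reflect how elements are used together in alloy-design texts, not a similarity computed from tabulated elemental properties.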