ABSTRACT Materials datasets usually contain many redundant (highly similar) materials due to the tinkering approach historically used in material design. This redundancy skews the
performance evaluation of machine learning (ML) models when using random splitting, leading to overestimated predictive performance and poor performance on out-of-distribution samples. This
issue is well known in bioinformatics for protein function prediction, where tools like CD-HIT are used to reduce redundancy by ensuring that no pair of samples has sequence similarity greater than a given threshold. In this paper, we survey the overestimated ML performance in materials science for material property prediction and propose MD-HIT, a redundancy reduction algorithm for
material datasets. Applying MD-HIT to composition- and structure-based formation energy and band gap prediction problems, we demonstrate that with redundancy control, the prediction performance of ML models on test sets tends to be lower than that of models evaluated on highly redundant datasets, but better reflects the models’ true prediction capability.
INTRODUCTION Density functional theory (DFT)-level accuracy for material property prediction1 and >0.95 _R_2 for thermal
conductivity prediction2 with less than a hundred training samples have been routinely reported recently by an increasing list of machine learning algorithms in the material informatics
community. In3, an AI model was shown to predict the formation energies of a hold-out test set containing 137 entries from structure and composition with a mean absolute error (MAE) of 0.064 eV/atom, significantly outperforming DFT computations for the same task (discrepancies of >0.076 eV/atom). In another related work in Nature Communications by the same group4, an MAE of 0.07 eV/atom was achieved for composition-based formation energy prediction using deep transfer learning, which is comparable to the MAE of DFT computation. Pasini et al.5
reported that their multitasking neural networks can estimate the material properties (total energy, charge density, and magnetic moment) for a specific configuration hundreds of times
faster than first-principles DFT calculations while achieving comparable accuracy. In6, the authors claimed their graph neural network (GNN) models can predict the formation energies, band
gaps, and elastic moduli of crystals with better-than-DFT accuracy over a much larger dataset. In7, Faber et al. showed numerical evidence that ML model predictions deviate from DFT less than DFT deviates from experiments for all nine properties that they evaluated over the QM9 molecule dataset. They also claimed that the out-of-sample prediction errors with respect to the hybrid DFT reference were on par with, or close to, chemical accuracy. In8, Tian et al. reported that current ML models can achieve accurate property prediction (formation energy, band gap, bulk
and shear moduli) using composition alone without using structure information, especially for compounds close to the thermodynamic convex hull. However, this good performance may be
partially due to the over-represented redundancy in their test samples obtained with 6:2:2 random selection from Matminer datasets without redundancy control. To illustrate this point, Fig.
1 shows the formation energy and band gap landscape over the Materials Project (MP)9 composition space, which is generated by mapping the MatScholar features of all MP unique compositions to
the 2D space using t-SNE10 and then plotting the surface. In the upper-left corner of each panel, we additionally show the X-Y projection colored by the corresponding property values, revealing detailed property ranges in specific areas. Both figures show a large number of local areas with smooth or similar property values. Random splitting of
samples in those areas into training and test sets may lead to information leakage and over-estimation of the prediction performance. Despite these encouraging successes, the DFT accuracy
reports of these ML models for material property prediction should be interpreted cautiously, as they are average performances evaluated over mostly randomly held-out samples drawn
from unexpectedly highly redundant datasets. Materials databases such as Materials Project and Open Quantum Materials Database (OQMD)11,12 are characterized by the existence of many
redundant (highly similar) materials due to the tinkering approach historically used in material design13,14,15. For example, the Materials Project database has many perovskite cubic
structure materials similar to SrTiO3. This sample redundancy causes random-split evaluation of machine learning models to fail, leading ML models to achieve
over-estimated predictive performance which is misleading for the materials science community. This issue is well known in the area of ecology16 and bioinformatics for protein function
prediction, in which a redundancy reduction procedure (CD-HIT17) is required to reduce the sample redundancy by ensuring no pair of samples has a sequence similarity greater than a given
threshold, e.g., 95% sequence identity. In a recent 2023 study, it was also shown that an excellent benchmark score may not imply good generalization performance18. The overestimation of ML
performance for materials has been investigated in a few studies. In19, Meredig et al. examined the extrapolation performance of ML methods for material discovery. They found that
traditional ML metrics, even with cross-validation (CV), overestimate model performance for material discovery, and introduced the leave-one-(material)-cluster-out cross-validation (LOCO CV)
to objectively evaluate the extrapolation performance of ML models. They especially highlighted that material scientists often intend to extrapolate with trained ML models, rather than
interpolate, to discover new functional materials. Additionally, the sampling in materials training data is typically highly non-uniform. Thus, the high interpolation performance of ML
models trained with datasets with high sample redundancy (e.g., due to doping) does not indicate a strong capability to discover new materials or predict out-of-distribution (OOD) samples. They
showed that current ML models have much higher difficulty in generalizing from the training clusters to distinct test clusters. They suggested the use of uncertainty quantification (UQ) on
top of ML models to evaluate and explore candidates in new regions of design space. Stanev et al.20 also discussed this generalization issue across different superconductor families. In21,
Xiong et al. proposed K-fold forward cross-validation (FCV) as a new way for evaluating exploration performance in material property prediction by first sorting the samples by their property
values before CV splitting. Under FCV and their proposed exploratory prediction accuracy metric, the prediction performance of current ML models turned out to be very low. A similar study on thermal conductivity prediction22 also showed that ML models trained on samples with low property values are usually not good at predicting
samples with high property values, indicating a weak extrapolation capability. A recent large-scale benchmark study by Omee et al.23 of the OOD performance of structure-based graph neural network (GNN) models across diverse materials properties showed that most state-of-the-art GNN models suffer significantly degraded prediction performance on OOD samples. All these studies
show the need for material property model developers to focus more on extrapolative prediction performance rather than average interpolation performance over test samples with high
similarity to training samples due to dataset redundancy. The redundancy issue of material datasets has also been studied recently from the point of view of training efficient ML models or
achieving sample efficiency. Magar and Farimani24 proposed an adaptive sampling strategy to generate/sample informative samples for training machine learning models with the lowest amounts
of data. They treated as informative those samples with the K highest MAEs (e.g., K = 250) in the test set, which are iteratively added to an initial training set of 1000 samples. Another of their selection approaches adds samples similar to the training points with the maximum MAE during training. They showed that their sampling algorithms can create smaller
training sets that achieve better performance than the baseline CGCNN (Crystal Graph Convolutional Neural Network) model trained with all training samples. This approach can be used with
active learning to build high-performance ML models in a data-efficient way. In a more recent work13, Li et al. studied the redundancy in large material datasets and found that a significant
degree of redundancy across multiple large datasets is present for various material properties and that up to 95% of data can be removed from ML model training with little impact on
prediction performance for test sets sampled randomly from the same distribution. They further showed that the redundant data is due to over-represented material types and does not
help improve the low performance on out-of-distribution samples. They proposed a pruning algorithm similar to24 which first splits the training set into A and B, then trains an ML model on A,
and evaluates the prediction errors on samples in B. After that, the test samples with low MAEs are pruned and the remaining samples are merged and split into A and B again, and so on. Both
approaches rely on the iterative training of ML models and are specific to a given material property. They also proposed an uncertainty quantification-based active learning method to
generate sample-efficient training sets for model training. While these works recognize the possibility to build data-efficient training sets, they did not mention how redundancy has led to
the overestimated ML model performance commonly seen in the literature. Moreover, all approaches for building informative training sets are material property specific, making it difficult to
generate a single non-redundant benchmark dataset for benchmarking material property prediction algorithms across all material properties. Another limitation of these methods is that they yield different effective similarity thresholds when applied to different datasets, so the resulting non-redundant datasets have different minimum distances among their samples. Since material
property prediction research is now pivoting toward developing ML models with high accuracy that are generalizable and transferable between different materials (including those of different
families), a healthy evaluation of ML algorithms is needed to recognize the limitations of existing ML models and to invent new models for material property prediction. Within this context,
reducing the dataset redundancy of both training and test sets can avoid the overestimation of ML model performance, ameliorate the training bias towards samples in crowded areas, and push
the model developers to focus on improving extrapolation performance instead of only interpolation performance. Our work aims to address two major limitations of the latest data redundancy
study on material property prediction13: (1) their redundancy removal procedure is specific to a given material property of interest, and they showed that such redundancy removal may
deteriorate the prediction performance, though only modestly. In materials property prediction problems, however, having too many training samples is usually not the major concern; rather, it is the out-of-distribution performance of the property prediction model that most interests material researchers, and their work does not show how redundancy removal may affect the OOD prediction performance; (2) the ’OOD’ samples of their study are not defined rigorously, as they are simply ’new materials included in a more recent version of the database’.
However, such new samples in a new Materials Project version do not guarantee they are OOD samples that are significantly different from the training set. In this paper, we discuss the
importance of redundancy control in the training and test set selection to achieve objective performance evaluation, especially for extrapolative predictions. Neglecting this aspect has led
to many overestimated ML performances as reported in the literature for both composition-based and structure-based material property prediction. We conduct experiments to demonstrate that
the overestimated ML models often fail for samples that are distant from training samples, indicating a lack of extrapolation capability. To address this issue, we developed two
redundancy-reducing algorithms (MD-HIT-composition and MD-HIT-structure) with open-sourced code for reducing the dataset redundancy of both composition datasets and structure datasets. These
algorithms utilize composition- and structure-based distance metrics to retain only samples whose distance to all previously selected samples exceeds a defined threshold. After this data redundancy control, the dataset can be
randomly split into training, validation, and test sets to achieve objective performance evaluation. We show that with this dataset redundancy control, the reported performance of ML models tends to reflect their true prediction capability more accurately. RESULTS MD-HIT-COMPOSITION ALGORITHM FOR REDUNDANCY REDUCTION OF COMPOSITION DATASETS The early version of the CD-HIT algorithm17 of
bioinformatics was originally developed to handle large-scale sequence datasets efficiently. It employs a clustering approach to group similar sequences together based on a defined sequence
identity threshold. Within each cluster, only one representative sequence, called the “centroid,” is retained, while the rest of the highly similar sequences are considered duplicates and
removed. However, the clustering approach is still inefficient in dealing with datasets containing hundreds of thousands of sequences. The next generation of CD-HIT further improved the
efficiency by using a greedy algorithm25. Our MD-HIT-composition and MD-HIT-structure redundancy reduction algorithms are designed based on this idea, utilizing greedy incremental
algorithms. In our case, MD-HIT starts the selection process with a seed material (defaulting to H2O) and then sorts the remaining materials by their number of atoms rather than by formula length. Subsequently, it classifies each material as redundant or representative, depending on its similarity to the representatives already selected. Composition
similarities are estimated using the ElMD (Element Mover’s Distance)26 package, which offers the option to choose linear, chemically derived, and machine-learned similarity measures. By
default, we utilized the Mendeleev similarity and the MatScholar similarity27 for our non-redundant composition dataset generation. Mendeleev similarity measures the similarity between
chemical compositions by comparing the properties of their constituent elements, such as atomic radius and electronegativity, based on the principles used by Dmitri Mendeleev in organizing
the periodic table. The MatScholar distance function is defined as the Euclidean distance between two MatScholar feature vectors27 for a given pair of material compositions. This distance
function is based on a literature-derived word embedding for materials that captures the underlying structure of the periodic table and structure-property relationships in materials. The
Matminer (Materials Data Mining) package28 provides several other material composition descriptors that can also be employed. In this study, our focus was on the ElMD package and the
MatScholar feature-based distance function for redundancy control of composition datasets for material property prediction. The complete list of composition similarity metrics can be found in Table 1.
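For concreteness, below is a minimal sketch of this greedy incremental selection loop, assuming the ElMD and pymatgen packages (the ElMD call signature follows the package README); the function name md_hit_composition and its defaults are illustrative rather than the exact API of our released code.

```python
# Minimal sketch of the greedy MD-HIT-composition loop (illustrative, not the
# exact released API). Assumes the ElMD and pymatgen packages are installed.
from ElMD import ElMD
from pymatgen.core import Composition

def md_hit_composition(formulas, threshold=1.0, metric="mendeleev", seed="H2O"):
    """Greedily select a non-redundant subset of composition strings."""
    # Sort candidates by number of atoms, mirroring the ordering described above.
    candidates = sorted(formulas, key=lambda f: Composition(f).num_atoms)
    selected = [seed]
    for formula in candidates:
        cand = ElMD(formula, metric=metric)
        # Keep a candidate only if it is farther than `threshold` from every
        # representative selected so far; otherwise discard it as redundant.
        if all(cand.elmd(rep) > threshold for rep in selected):
            selected.append(formula)
    return selected

subset = md_hit_composition(["SrTiO3", "BaTiO3", "CaTiO3", "NaCl"], threshold=1.5)
```

MD-HIT-STRUCTURE ALGORITHM FOR REDUNDANCY REDUCTION OF STRUCTURE DATASETS The MD-HIT-structure algorithm uses the same greedy adding approach as MD-HIT-composition, except that it uses a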
structure-based distance metric. However, due to the varying numbers of atoms in different crystals, comparing the similarity of two given structures is non-trivial, given
that most structure descriptors tend to have different dimensions for structures with different numbers of atoms. In this study, we chose two structure distances for redundancy reduction.
One is the distance metric based on XRD (X-ray diffraction) features calculated from crystal structures. We first computed the XRD pattern with the Pymatgen XRDCalculator module29, applied a Gaussian smoothing operation, and then sampled 900 points evenly distributed between 0 and 90 degrees, yielding fixed 900-dimensional XRD features. We also selected the OFM
(OrbitalFieldMatrix) feature to calculate the distances of two structures. This kind of feature has also been used in ref. 24 to select informative samples for ML model training. It is a set
of descriptors that encode the electronic structure of a material. These features, which have a fixed dimension (1024), provide information about the distribution of electrons in different atomic orbitals within a crystal structure, and thus a comprehensive representation of the electronic structure and bonding characteristics of materials.
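A minimal sketch of the XRD-based structure distance is shown below, assuming pymatgen and SciPy; binning the diffraction peaks onto a fixed grid before smoothing is one plausible reading of the procedure above, and the structure file names are hypothetical. The OFM features can be computed analogously with matminer's OrbitalFieldMatrix featurizer.

```python
# Sketch of a fixed-length XRD fingerprint and Euclidean structure distance
# (a plausible reading of the procedure above, not the exact released code).
import numpy as np
from scipy.ndimage import gaussian_filter1d
from pymatgen.core import Structure
from pymatgen.analysis.diffraction.xrd import XRDCalculator

def xrd_feature(structure, n_points=900, two_theta_max=90.0, sigma=2.0):
    """Bin diffraction peaks onto a 900-point 0-90 degree grid, then smooth."""
    pattern = XRDCalculator().get_pattern(structure, two_theta_range=(0, two_theta_max))
    grid = np.zeros(n_points)
    width = two_theta_max / n_points
    for angle, intensity in zip(pattern.x, pattern.y):
        grid[min(int(angle / width), n_points - 1)] += intensity
    return gaussian_filter1d(grid, sigma=sigma)  # Gaussian smoothing

s1 = Structure.from_file("SrTiO3.cif")  # hypothetical local structure files
s2 = Structure.from_file("BaTiO3.cif")
distance = np.linalg.norm(xrd_feature(s1) - xrd_feature(s2))
```

Similar to MD-HIT-composition, the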
MD-HIT-structure algorithm also starts the selection process with a seed material (defaulting to H2O), which is put in the non-redundant set. It then sorts the remaining materials in the
candidate set by their number of atoms rather than by formula length, and classifies them one by one as redundant or representative materials based on their similarities (we use the Euclidean
distance of XRD features or OFM features) to the existing representatives already selected into the non-redundant set. Redundant samples are discarded, while non-redundant ones are added to
the non-redundant set until the candidate set is empty. DATASETS GENERATION We downloaded 125,619 cif files with material structures from the Materials Project database, which includes
89,354 materials with unique compositions. In cases where a composition corresponded to multiple polymorphs, we adopted the average material property value by default, with the exception of the formation energy, for which we used the minimum value. Additionally, we excluded mp-101974 (HeSiO2) due to issues with calculating MatScholar features. After eliminating
formulas with over 50 atoms, we obtained a non-duplicate composition dataset with 86,741 samples and then used different similarity (distance) thresholds to generate non-redundant datasets.
For Mendeleev similarity, we used distance thresholds of 0.5, 0.8, 1, 1.5, 2, 2.5, and 3 to generate seven non-redundant datasets (Mendeleev-nr). The dataset sizes range from 86,740 to 3177.
Similarly, we generated eight MatScholar non-redundant datasets (Matscholar-nr) whose sizes range from 50.82% down to 2.33% of the full dataset. We also applied the MD-HIT-structure algorithm to
all 125,619 structures and used different thresholds to generate seven XRD non-redundant datasets and eight OFM non-redundant datasets. After removal of redundancy based on varying degrees
of sample identity using the MD-HIT algorithms, we obtained all non-redundant datasets, whose details are shown in Table 2.
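The polymorph-collapsing step described above can be sketched with pandas as follows; the DataFrame column names are illustrative, not those of our released data files.

```python
# Sketch of collapsing polymorphs to one entry per composition: properties are
# averaged by default, while the minimum formation energy is kept.
import pandas as pd

df = pd.DataFrame({
    "formula": ["SrTiO3", "SrTiO3", "NaCl"],    # two hypothetical SrTiO3 polymorphs
    "band_gap": [1.8, 2.1, 5.0],                # eV (illustrative values)
    "formation_energy": [-3.55, -3.40, -2.10],  # eV/atom (illustrative values)
})
unique = (
    df.groupby("formula")
      .agg(band_gap=("band_gap", "mean"),
           formation_energy=("formation_energy", "min"))
      .reset_index()
)
```

To visually understand the effect of redundancy removal on the datasets,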
Fig. 2 shows the material distribution t-SNE maps of the whole dataset and two non-redundant datasets. For each dataset, we calculated the MatScholar composition features for all samples.
Then, we used the t-SNE dimension reduction algorithm to map the features to a two-dimensional space.
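A minimal sketch of this featurize-then-embed step is given below, assuming matminer's ElementProperty featurizer with the "matscholar_el" preset as a stand-in for the MatScholar composition features (an assumption) and scikit-learn's t-SNE implementation.

```python
# Sketch of the t-SNE map: MatScholar-style composition features -> 2D embedding.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

featurizer = ElementProperty.from_preset("matscholar_el")  # assumed preset name
formulas = ["SrTiO3", "BaTiO3", "NaCl", "GaAs", "Fe2O3"]   # illustrative subset
X = np.array([featurizer.featurize(Composition(f)) for f in formulas])
emb = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
plt.scatter(emb[:, 0], emb[:, 1])
plt.show()
```

Figure 2a shows the distribution of the whole dataset, which is filled with crowded samples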
with high redundancy. Figure 2b shows the less redundant dataset Matscholar-nr generated with the threshold of 0.1. It contains only 50.82% of the samples. Figure 2c shows the Mendeleev-nr
non-redundant dataset with only 4930 samples, which has only 5.68% of the samples of the whole dataset while still covering the entire map with much lower redundancy. The non-redundant
datasets thus allow us to test the true generalization capability of models trained and tested on them. COMPOSITION BASED MATERIAL PROPERTY PREDICTION WITH REDUNDANCY CONTROL To investigate the
impact of redundancy control on the performance of ML models for predicting material properties, we conducted experiments using datasets filtered by Mendeleev and Matscholar distances. We
evaluated two state-of-the-art composition-based property prediction algorithms, Roost and CrabNet (See Methods section), on non-redundant datasets derived from the MP composition dataset
with 86,740 samples using different distance thresholds. The datasets were randomly divided into training, validation, and test sets with an 8:1:1 ratio. Figures 3 and 4 compare the
performances of Roost and CrabNet for formation energy and band gap prediction on datasets of varying sizes, filtered by Mendeleev distance thresholds of 0, 0.5, 0.8, 1, 1.5, 2, 2.5 and 3
and Matscholar distance thresholds of 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4. Please note that we have chosen to report the results from a single random split for each dataset in
Figs. 3 and 4. This decision was made due to the large number of experiments conducted and our verification that the standard deviations of performances across multiple repeat experiments
are small relative to the mean values. This approach allows for a clear presentation of results while maintaining statistical reliability. For formation energy prediction (Figs. 3a and 4a),
both models exhibit a deteriorating trend with increasing thresholds (i.e., lower data redundancy), as evidenced by decreasing _R_2 and increasing MAE scores. Matscholar distance yields
higher correlations between prediction performance and thresholds compared to Mendeleev distance, indicating that it generates more evenly distributed non-redundant datasets. For band gap
prediction (Figs. 3b and 4b), the _R_2 scores of both models gradually decrease with increasing thresholds. However, the MAE scores show a general uptrend with abrupt jumps at certain
points, possibly due to outliers in the band gap datasets, highlighting the challenges in band gap prediction. The inconsistent trends in MAE and _R_2 for band gap prediction using
Matscholar distance (Fig. 4b) may be attributed to the large percentage of zero band gap samples. Overall, removing dataset redundancy allows for more realistic performance evaluations of ML
models in real-world applications, where query materials often differ from training samples. Our experiments reveal that samples within dense areas tend to have lower prediction errors (Fig. 9).
Without reducing redundancy, a significant portion of test samples may be located in areas crowded with similar training samples, leading to low prediction errors and over-estimated
performance. This occurs because the model may overly rely on information from redundant samples during training while disregarding more diverse samples. STRUCTURE BASED MATERIAL PROPERTY
PREDICTION WITH REDUNDANCY CONTROL To investigate the impact of redundancy control on structure-based material datasets, we utilized the Materials Project database of 123,108 crystal
structures with their formation energies per atom and band gaps. We employed the XRD and OFM features of crystal structures to define the similarity between pairs of structures, which we used to control structural redundancy by enforcing a minimum XRD/OFM distance between any pair of samples. For XRD-based non-redundant datasets (XRD-nr), we used thresholds of 0.5, 0.6,
0.8, and 0.9. We evaluated the material property prediction performances of two state-of-the-art graph neural network algorithms, ALIGNN30 and DeeperGATGNN31 (See Methods section), on these
datasets. For formation energy prediction (Fig. 5a), XRD-distance provides effective control of data redundancy, as evidenced by the gradual increase in MAEs and decrease in _R_2 scores for
both algorithms with increasing XRD thresholds. For band gap prediction (Fig. 5b), the effect of dataset redundancy on the performance of both algorithms is more complex. While the _R_2
scores decrease with increasing thresholds, the MAEs of ALIGNN for thresholds 0.8 and 0.9 are lower than for the threshold of 0.6, despite lower _R_2 scores. This discrepancy suggests higher
nonlinearity and the influence of outlier band gap values in the prediction problem, a phenomenon also observed in the composition-based results (Figs. 3 and 4). We further evaluated the
impact of OFM-controlled data redundancy on the algorithms’ performance (Fig. 6). Both algorithms showed high consistency in formation energy prediction (Fig. 6a), with _R_2 scores decreasing and MAE scores increasing with increasing thresholds, indicating that OFM distance is an effective redundancy control method for crystal structure datasets. However, for band gap prediction
(Fig. 6b), while the _R_2 scores decrease with increasing thresholds as expected, the MAE scores also decrease, which is counter-intuitive. Analysis of the test sets revealed that the MD-HIT
algorithm inadvertently selected higher percentages of near-zero band gap samples (<0.01 eV) at higher thresholds, making the prediction task easier. In particular, while the whole redundant dataset contains only 48.64% near-zero band gap samples, our MD-HIT algorithm selected 64.09%, 67.81%, 84.52%, and 92.43% near-zero band gap samples for thresholds 0.15, 0.2, 0.45,
and 0.7, respectively. This data bias explains the unexpected decrease in MAE scores. To further elucidate the data bias, we constructed scatter plots depicting the band gaps predicted by
DeeperGATGNN across the entire dataset and two non-redundant datasets, as illustrated in Fig. 7. The analysis reveals a striking predominance (92.43%) of near-zero samples in the
non-redundant dataset with a threshold of 0.7. Choosing a seed structure other than SrTiO3, which has a band gap close to zero, may reduce this bias. These findings highlight the
importance of monitoring data bias, which can easily lead to overestimated ML model performance in material property prediction. PERFORMANCE COMPARISONS BETWEEN ID AND OOD SETS Our
experiments have demonstrated that redundant material datasets often lead to the overestimated high performance for material property prediction reported in the current literature. When we reduce the dataset redundancy, the ML performance decreases significantly. Previous work has also shown that removing redundant samples enables the training of efficient ML models with reduced
data13, which can achieve comparable in-distribution (ID) prediction performance while eliminating up to 95% of samples. In this study, we aim to showcase the additional potential benefit of
redundancy removal: enhancing the ML performance for out-of-distribution (OOD) samples. We first selected 1000 OOD test samples to create the MatscholarOOD test set, based on the densities
calculated using MatScholar features of compositions from the entire MP dataset (86,740 samples) for formation energy prediction. The remaining samples were then used to prepare the training
sets. We selected a non-redundant training set (non-rdfe) from the MP dataset with a threshold of 0.1, resulting in approximately 40,000 samples. A redundant training set (rdfe) of equal size was then randomly selected from the entire dataset, excluding the OOD samples. The Roost model trained on non-rdfe is referred to as Roost_nr and the Roost model trained on rdfe as Roost_red. Our experimental results in the previous sections do not demonstrate whether ML models trained with non-redundant sets can achieve performance
improvements for OOD test sets. We found that models trained with non-redundant training samples exhibit lower performance for the randomly split leave-out test set, which is reasonable as
the reduction of redundancy between the training set and the test set makes it more challenging for the models to predict test samples. In this section, we aim to illustrate the effect of
reducing dataset redundancy on ML performance for OOD samples. However, our Roost model trained with the naively created non-redundant training set based on the MatScholar feature space
achieves an MAE of 0.1322 eV, which is worse than that of the Roost model trained with the redundant training set (MAE: 0.1224 eV). Upon close examination, we discovered that the sparse
samples in the MatscholarOOD test set are not necessarily located in the sparse areas of the embedding space for the deep learning models such as Roost, CrabNet, and DeeperGATGNN, indicating
that these OOD samples are not true OOD samples. This finding also explains why our Roost model trained with the non-redundant set has an MAE of 0.1322 eV when evaluated on the MatscholarOOD test set, while it achieves a higher MAE of 0.1728 eV on the random-split test set. To compare the true OOD performance of models trained on non-redundant and redundant
datasets, we prepared another OOD test set named EmbeddingOOD. First, we used a pretrained Roost model as an encoder to obtain the latent representations of all samples in the entire dataset (86,740 samples). We then calculated the pairwise distances between all samples using their latent representations and selected the 1000 samples that are furthest away, on average, from their three nearest neighbors, forming our EmbeddingOOD test set.
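The selection rule can be sketched as follows, assuming scikit-learn; the random latent matrix is a placeholder for the actual Roost encoder outputs.

```python
# Sketch of the EmbeddingOOD selection: rank samples by the mean distance to
# their three nearest neighbors in latent space and keep the 1000 most isolated.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_ood_indices(latent, n_ood=1000, k=3):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(latent)  # +1: nearest hit is self
    dists, _ = nn.kneighbors(latent)
    mean_knn_dist = dists[:, 1:].mean(axis=1)  # drop the zero self-distance
    return np.argsort(mean_knn_dist)[-n_ood:]  # indices of most isolated samples

latent = np.random.rand(86740, 128)  # placeholder for Roost encoder outputs
ood_idx = select_ood_indices(latent)
```

We then compared the performance of the Roost models on the random-split test sets (split from non-rdfe and rdfe in a 9:1 ratio) and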
this EmbeddingOOD test set. It should be noted that we removed all OOD samples from the original non-rdfe and rdfe datasets. Figure 8 shows the performance of two Roost models on two test
sets (ID and OOD sets). The MAE of Roost_red increases from 0.1278 eV on the random-split test set to 0.4089 eV on the EmbeddingOOD test set, while _R_2 significantly decreases from 0.9168 to 0.5318 (Fig. 8c, d), indicating that our EmbeddingOOD samples pose a significant challenge for the Roost_red model. In contrast, for Roost_nr, the MAE increases from 0.1994 eV to 0.3679 eV, and _R_2 drops from 0.8544 to 0.6998 (Fig. 8a, b). On the OOD test set, Roost_nr thus significantly outperforms the Roost_red model, with a 10.03% improvement in MAE and a 31.6% improvement in _R_2. This result demonstrates that removing redundant data can steer an ML model away from focusing on crowded samples, ensuring
equitable attention to all other samples, and consequently improving OOD prediction performance. Moreover, to demonstrate that prediction errors tend to be lower in areas of high sample density, we created scatter plots showing the correlation between the MAEs of the Roost models and sample density. The Roost models were trained with both redundant and non-redundant training
sets. We first sorted all test samples in the random-split or OOD test sets according to their densities, calculated using their latent representations. The sorted samples were then split
into 50 bins, and the average MAE was calculated for each bin, resulting in 50 (density & MAE) data points for each test set, as shown in Fig. 9. Fitted curves for the 50 data points of
the random-split and OOD sets were added to the scatter plots in Fig. 9a, b. The fitted curves in both Fig. 9a, b show a trend of decreasing MAEs as sample density increases. However, the
MAEs for OOD samples have much higher variance compared to those of the random-split test samples. Furthermore, for the Roost model trained with the non-redundant dataset (Fig. 9b), the two
fitted curves are closer to each other than those for the predictions by Roost_red in Fig. 9a. This indicates that the ML model trained on the non-redundant dataset has more consistent performance across the random-split and OOD test sets.
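The binning analysis behind Fig. 9 can be sketched as follows, with random arrays standing in for the real densities and absolute errors.

```python
# Sketch of the density-vs-error analysis: sort test samples by local density,
# split into 50 nearly equal bins, and average density and absolute error per bin.
import numpy as np

def binned_mae(density, abs_errors, n_bins=50):
    order = np.argsort(density)
    bins = np.array_split(order, n_bins)
    # Each row is one (mean density, mean MAE) point of the Fig. 9 scatter plots.
    return np.array([(density[b].mean(), abs_errors[b].mean()) for b in bins])

rng = np.random.default_rng(0)
points = binned_mae(rng.random(1000), np.abs(rng.normal(size=1000)))
```

Another interesting question arises as to why the deep learning models (Roost, CrabNet, and DeeperGATGNN) trained with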
the non-redundant dataset perform worse than those trained with the redundant dataset when tested on the MatscholarOOD test set. In contrast, random forest (RF) models behave oppositely: the RF_nr model achieves better performance than the RF_red model on the MatscholarOOD test set. A possible explanation is that deep learning models project raw composition or structural inputs into a high-level latent representation space, which differs from the MatScholar feature space used to build the MatscholarOOD test set. This difference makes our MatscholarOOD samples not sparse in the latent space used for decision-making by the deep learning models, explaining why deep learning models trained with redundant samples work better than those trained with non-redundant samples. In contrast, RF models lack representation learning capability and use the MatScholar feature space directly as their decision space. This characteristic allows RF_nr, trained with the non-redundant training set, to work better than the RF_red model on the true MatScholar OOD set. To further explore the performance of ML models on OOD
samples selected using MatScholar features, we trained two RF models, RF_nr and RF_red, on the non-rdfe and rdfe datasets, respectively. We then selected the 1000 OOD samples that are furthest away, on average, from their three nearest neighbors according to their MatScholar features. The performance of the RF models was tested on both the random-split test sets
and the OOD test set. As shown in Fig. 10c, d, for the RF model trained with the redundant rdfe dataset (RF_red), there is a significant performance difference between the random-split test set and the OOD test set: the MAE increases from 0.4014 eV to 0.7562 eV, while _R_2 decreases sharply from 0.7094 to 0.0527. In contrast, for RF_nr, the MAE increases from 0.4330 eV to 0.6427 eV, and _R_2 is reduced from 0.6382 to 0.2668 (Fig. 10a, b). Although the performance of RF_nr on the OOD test set is worse than on the random-split test set, it is still much better than the 0.0527 _R_2 of the RF_red model. This indicates that removing data redundancy can improve OOD prediction performance for RF models. Another interesting
question is how to determine the threshold for redundancy control. Instead of relying on a commonly agreed value, as used by the bioinformatics community in the CD-HIT code, we can use standard validation-based hyper-parameter tuning to find the optimal threshold value.
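As a sketch, such threshold tuning could look like the following, reusing md_hit_composition from the earlier sketch together with a hypothetical train_and_eval helper that returns the validation MAE of a model trained on the given subset.

```python
# Sketch of validation-based threshold tuning. `train_and_eval` is a
# hypothetical helper that trains a property model on the given formulas and
# returns its validation MAE; here it is stubbed out for illustration.
def train_and_eval(formulas):
    return 0.1 * len(formulas) ** -0.5  # stub: lower MAE with more data

all_formulas = ["SrTiO3", "BaTiO3", "CaTiO3", "NaCl", "GaAs"]  # illustrative
best_threshold = min(
    (0.5, 0.8, 1.0, 1.5, 2.0),
    key=lambda t: train_and_eval(md_hit_composition(all_formulas, threshold=t)),
)
```

DISCUSSION Large material databases such as the Materials Project usually contain a high degree of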
redundancy, which causes biased ML models and overestimated performance evaluations due to the redundancy between randomly selected test samples and the remaining training samples. The
claimed DFT accuracy reported in the literature, averaged over all data samples, is at odds with the common needs of material scientists, who usually want to discover new materials that differ from known training samples. It is therefore important to evaluate and report extrapolation rather than interpolation performance for material property prediction, and performance comparisons across different datasets should be interpreted within the context of their data redundancy levels. Here we propose and develop two material dataset redundancy-reducing algorithms based on a
greedy algorithm inspired by the bioinformatics CD-HIT algorithm. We use two composition distance metrics and two structure distance metrics, with corresponding distance thresholds, to control the sample
redundancy of our composition and structure datasets. Our benchmark results with two composition-based and two structure-based material property prediction models on two material properties (formation energy and band gap) showed that the prediction performance of current ML models tends to degrade when redundant samples are removed, yielding a more realistic measure of how current ML material property models perform in practice. The more a query sample differs from the training samples, the more difficult it is to predict it accurately with
current machine learning models that focus on interpolation. The out-of-distribution prediction problem is under active research in the machine learning community, with a focus on OOD generalization performance32,33, including works that use domain adaptation to improve OOD prediction performance for composition-based property prediction34. More investigation is needed to
establish the exact relationship between dataset redundancy and machine learning model generalization performance. The availability of our easy-to-use open-source code for MD-HIT-composition and
MD-HIT-structure makes it easy for researchers to conduct objective evaluations and report realistic performances of their ML models for material property prediction. It should also be
noted that the current multi-threaded implementation of our MD-HIT algorithms is still slow, and further improvements are highly desirable. METHODS COMPOSITION-BASED MATERIAL PROPERTY PREDICTION
ALGORITHMS We evaluated two state-of-the-art composition-based material property prediction algorithms, Roost35 and CrabNet36, to study the impact of dataset redundancy on their
performance. The Roost algorithm is a deep learning model specifically designed for material property prediction from material composition. It utilizes a graph neural network framework to learn relationships between material compositions and their corresponding properties. CrabNet is a transformer self-attention-based model for composition-only material property prediction that matches or exceeds current best-practice methods on nearly all of 28 benchmark datasets. STRUCTURE-BASED MATERIAL PROPERTY PREDICTION ALGORITHMS We evaluated two state-of-the-art
structure-based material property prediction algorithms, ALIGNN (Atomistic Line Graph Neural Network)30 and DeeperGATGNN (a global attention-based GNN with differentiable group normalization and residual connections)31, to compare the impact of dataset redundancy on their performance. The ALIGNN model addresses a major limitation of most current GNN models used for atomistic predictions, which rely only on atomic distances while overlooking bond angles. Bond angles play a crucial role in distinguishing various atomic structures, and small deviations in bond angles can significantly impact several material properties. ALIGNN is a GNN architecture that conducts message passing on both the interatomic bond graph and its corresponding line graph, which is specifically designed for bond angles. It has achieved state-of-the-art performance on most benchmark problems of Matbench37. The DeeperGATGNN model is a global
attention-based graph neural network that uses differentiable group normalization and residual connections to build high-performance deep graph neural networks without performance degradation. It has achieved superior results on a range of material property prediction benchmarks. EVALUATION CRITERIA We use the following performance metrics for evaluating dataset
redundancy’s impact on model performance, including Mean Absolute Error (MAE) and R-squared (_R_2). $$MAE=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left\vert
{y}_{i}-{\hat{y}}_{i}\right\vert$$ (1) $${R}^{2}=1-\frac{\mathop{\sum }\nolimits_{i = 1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}{\mathop{\sum }\nolimits_{i = 1}^{n}{({y}_{i}-\bar{y})}^{2}}$$ (2)
where \({y}_{i}\) represents the observed or true values, \({\hat{y}}_{i}\) represents the predicted values, \(\bar{y}\) represents the mean of the observed values, and _n_ represents the number of data points in the dataset.
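Both metrics can be computed directly with scikit-learn, equivalent to Eqs. (1) and (2); the values below are purely illustrative.

```python
# Computing MAE and R^2 with scikit-learn, matching Eqs. (1) and (2).
from sklearn.metrics import mean_absolute_error, r2_score

y_true = [0.12, -1.30, 0.45, -0.88]  # illustrative reference values (eV/atom)
y_pred = [0.10, -1.25, 0.60, -0.80]  # illustrative model predictions
print(mean_absolute_error(y_true, y_pred))  # MAE, Eq. (1)
print(r2_score(y_true, y_pred))             # R^2, Eq. (2)
```

DATA AVAILABILITY The non-redundant datasets can be freely accessed at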
https://github.com/usccolumbia/MD-HIT. CODE AVAILABILITY The source code can be freely accessed at https://github.com/usccolumbia/MD-HIT. REFERENCES * Xie, T. & Grossman, J. C. Crystal
graph convolutional neural networks for an accurate and interpretable prediction of material properties. _Phys. Rev. Lett._ 120, 145301 (2018). Article CAS PubMed Google Scholar * Chen,
L., Tran, H., Batra, R., Kim, C. & Ramprasad, R. Machine learning models for the lattice thermal conductivity prediction of inorganic materials. _Comput. Mater. Sci._ 170, 109155 (2019).
Article CAS Google Scholar * Jha, D., Gupta, V., Liao, W.-k, Choudhary, A. & Agrawal, A. Moving closer to experimental level materials property prediction using ai. _Sci. Rep._ 12,
1–9 (2022). Article Google Scholar * Jha, D. et al. Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning. _Nat. Commun._
10, 5316 (2019). Article CAS PubMed PubMed Central Google Scholar * Pasini, M. L. et al. Fast and stable deep-learning predictions of material properties for solid solution alloys. _J.
Phys.: Condens. Matter_ 33, 084005 (2020). Google Scholar * Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. Graph networks as a universal machine learning framework for molecules and
crystals. _Chem. Mater._ 31, 3564–3572 (2019). Article CAS Google Scholar * Faber, F. A. et al. Prediction errors of molecular machine learning models lower than hybrid DFT error. _J. Chem. Theory Comput._ 13, 5255–5264 (2017). Article CAS PubMed Google Scholar * Tian, S. I. P., Walsh, A., Ren, Z., Li, Q. & Buonassisi, T. What information is necessary and sufficient to predict materials properties using machine learning? _arXiv preprint arXiv:2206.04968_ (2022). * Jain, A. et al. Commentary: The Materials Project: A materials genome approach
to accelerating materials innovation. _APL Mater._ 1 (2013). * Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. _J. Mach. Learn. Res._ 9 (2008). * Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). _JOM_ 65, 1501–1509 (2013). Article CAS Google Scholar * Kirklin, S. et al. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. _npj Comput. Mater._ 1, 1–15 (2015).
Article Google Scholar * Li, K. et al. Exploiting redundancy in large materials datasets for efficient machine learning with less data. _Nat. Commun._ 14, 7283 (2023). Article CAS PubMed
PubMed Central Google Scholar * Trabelsi, Z. et al. Superconductivity phenomenon: Fundamentals and theories. In _Superconducting Materials: Fundamentals, Synthesis and Applications_,
1–27 (Springer, 2022). * Zunger, A. & Malyi, O. I. Understanding doping of quantum materials. _Chem. Rev._ 121, 3031–3060 (2021). Article CAS PubMed Google Scholar * Roberts, D. R.
et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. _Ecography_ 40, 913–929 (2017). Article Google Scholar * Li, W. & Godzik,
A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. _Bioinformatics_ 22, 1658–1659 (2006). Article CAS PubMed Google Scholar * Li, K.,
DeCost, B., Choudhary, K., Greenwood, M. & Hattrick-Simpers, J. A critical examination of robustness and generalizability of machine learning prediction of materials properties. _npj
Comput. Mater._ 9, 55 (2023). Article CAS Google Scholar * Meredig, B. et al. Can machine learning identify the next high-temperature superconductor? examining extrapolation performance
for materials discovery. _Mol. Syst. Des. Eng._ 3, 819–825 (2018). Article CAS Google Scholar * Stanev, V. et al. Machine learning modeling of superconducting critical temperature. _npj
Comput. Mater._ 4, 29 (2018). Article Google Scholar * Xiong, Z. et al. Evaluating explorative prediction power of machine learning algorithms for materials discovery using k-fold forward
cross-validation. _Comput. Mater. Sci._ 171, 109203 (2020). Article CAS Google Scholar * Loftis, C., Yuan, K., Zhao, Y., Hu, M. & Hu, J. Lattice thermal conductivity prediction using
symbolic regression and machine learning. _J. Phys. Chem. A_ 125, 435–450 (2020). Article PubMed Google Scholar * Omee, S. S., Fu, N., Dong, R., Hu, M. & Hu, J. Structure-based
out-of-distribution (OOD) materials property prediction: a benchmark study. _Npj Comput. Mater._ 10, 144 (2024). Article Google Scholar * Magar, R. & Farimani, A. B. Learning from
mistakes: Sampling strategies to efficiently train machine learning models for material property prediction. _Comput. Mater. Sci._ 224, 112167 (2023). Article CAS Google Scholar * Fu, L.,
Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. _Bioinformatics_ 28, 3150–3152 (2012). Article CAS PubMed PubMed Central
Google Scholar * Hargreaves, C. J., Dyer, M. S., Gaultois, M. W., Kurlin, V. A. & Rosseinsky, M. J. The earth mover’s distance as a metric for the space of inorganic compositions.
_Chem. Mater._ 32, 10610–10620 (2020). Article CAS Google Scholar * Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. _Nature_
571, 95–98 (2019). Article CAS PubMed Google Scholar * Ward, L. et al. Matminer: An open source toolkit for materials data mining. _Comput. Mater. Sci._ 152, 60–69 (2018). Article
Google Scholar * De Graef, M. & McHenry, M. E. _Structure of Materials: An Introduction to Crystallography, Diffraction and Symmetry_ (Cambridge University Press, 2012). * Choudhary, K.
& DeCost, B. Atomistic line graph neural network for improved materials property predictions. _npj Comput. Mater._ 7, 185 (2021). Article Google Scholar * Omee, S. S. et al. Scalable
deeper graph neural networks for high-performance materials property prediction. _Patterns_ 3, 100491 (2022). Article PubMed PubMed Central Google Scholar * Arjovsky, M. Out of
distribution generalization in machine learning. Ph.D. thesis, New York University (2020). * Krueger, D. et al. Out-of-distribution generalization via risk extrapolation (REx). In
_International Conference on Machine Learning_, 5815–5826 (PMLR, 2021). * Hu, J., Liu, D., Fu, N. & Dong, R. Realistic material property prediction using domain adaptation based machine
learning. _Digital Discov._ 3, 300–312 (2024). Article CAS Google Scholar * Goodall, R. E. & Lee, A. A. Predicting materials properties without crystal structure: Deep representation
learning from stoichiometry. _Nat. Commun._ 11, 6280 (2020). Article CAS PubMed PubMed Central Google Scholar * Wang, A. Y.-T., Kauwe, S. K., Murdock, R. J. & Sparks, T. D.
Compositionally restricted attention-based network for materials property predictions. _Npj Comput. Mater._ 7, 77 (2021). Article Google Scholar * Dunn, A., Wang, Q., Ganose, A., Dopp, D.
& Jain, A. Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm. _npj Comput. Mater._ 6, 138 (2020). Article Google Scholar
ACKNOWLEDGEMENTS The research reported in this work was supported in part by the National Science Foundation under grant number 2311202. The views, perspectives, and content do not necessarily represent the official views of the NSF. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * College of Big Data and Statistics, Guizhou University of Finance and
Economics, Guiyang, China Qin Li * Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA Nihang Fu, Sadman Sadeed Omee & Jianjun Hu CONTRIBUTIONS Conceptualization, J.H.; methodology, Q.L., N.F., J.H., S.O.; software, Q.L., N.F., J.H., and S.O.; resources, J.H.; writing–original draft preparation, J.H.,
Q.L., N.F., and S.O.; writing–review and editing, J.H. and N.F.; visualization, N.F., J.H., and S.O.; supervision, J.H.; funding acquisition, J.H. CORRESPONDING AUTHOR Correspondence to
Jianjun Hu. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ABOUT THIS ARTICLE CITE THIS ARTICLE Li, Q., Fu, N., Omee, S.S. et al. MD-HIT: Machine learning for material property prediction with dataset redundancy control. _npj Comput. Mater._ 10, 245 (2024). https://doi.org/10.1038/s41524-024-01426-z * Received: 22 April 2024 * Accepted: 23 September 2024 * Published: 18 October 2024