A machine learning platform to estimate anti-sars-cov-2 activities

ABSTRACT Strategies for drug discovery and repositioning are urgently needed with respect to COVID-19. Here we present REDIAL-2020, a suite of computational models for estimating small

molecule activities in a range of SARS-CoV-2-related assays. Models were trained using publicly available, high-throughput screening data and by employing different descriptor types and

various machine learning strategies. Here we describe the development and use of eleven models that span across the areas of viral entry, viral replication, live virus infectivity, in vitro

infectivity and human cell toxicity. REDIAL-2020 is available as a web application through the DrugCentral web portal (http://drugcentral.org/Redial). The web application also provides

similarity search results that display the most similar molecules to the query, as well as associated experimental data. REDIAL-2020 can serve as a rapid online tool for identifying active

molecules for COVID-19 treatment.

MAIN There is currently an urgent need to find effective drugs for treating coronavirus disease 2019 (COVID-19).

Here we present REDIAL-2020, a suite of machine learning models that forecast activities for live viral infectivity, viral entry and viral replication, specifically for severe acute

respiratory syndrome coronavirus 2 (SARS-CoV-2), in vitro infectivity, and human cell toxicity. This application could serve the scientific community when prioritizing compounds for in vitro

screening and may ultimately accelerate the identification of novel drug candidates for COVID-19 treatment. REDIAL-2020 consists of eleven independently trained machine learning models and

includes a similarity search module that queries the underlying experimental dataset for similar compounds. These models were developed using experimental data generated by the following

assays: the SARS-CoV-2 cytopathic effect (CPE) assay and its host cell cytotoxicity counterscreen, the Spike–ACE2 protein–protein interaction (AlphaLISA) assay and its TruHit counterscreen,

the angiotensin-converting enzyme 2 (ACE2) enzymatic activity assay, the 3C-like (3CL) proteinase enzymatic activity assay, the SARS-CoV pseudotyped particle entry (CoV-PPE) assay and its

counterscreen (CoV-PPE_cs), the Middle-East respiratory syndrome coronavirus (MERS-CoV) pseudotyped particle entry assay (MERS-PPE) and its counterscreen (MERS-PPE_cs), and the human

fibroblast toxicity (hCYTOX) assay. Such assays represent five distinct categories: viral entry (CPE1 and host cell cytotoxicity counterscreen2), viral replication (3CL enzymatic activity),

live virus infectivity (AlphaLISA, TruHit counterscreen and ACE2 enzymatic activity)3, in vitro infectivity (CoV-PPE with associated counterscreens for two other coronaviruses, SARS-CoV and

MERS) and hCYTOX, as described in the National Center for Advancing Translational Sciences (NCATS) COVID-19 portal4. We retrieved these datasets from the NCATS COVID-19 portal5. The NCATS

team is committed to performing a range of COVID-19-related viral and host target assays, as well as analysing the results6. A more exhaustive description of each assay is provided in the

Methods. For model development, three different types of descriptors were employed and a best model for each descriptor type was developed by employing various machine learning algorithms.

The three best models from each descriptor type were then combined using a voting method to give an ensemble model. These ensemble machine learning models are integrated into a user-friendly

web portal that accepts input in three different formats: (1) the drug name, either as the international non-proprietary name (for example, remdesivir) or as a trade name (for example,

Veklury); (2) the PubChem compound ID number (PubChem CID)7 (for example, 121304016 for remdesivir); or (3) using the chemical structure encoded in the simplified molecular-input line-entry

system (SMILES) format8. The workflow and output, regardless of input format, are identical and described below. Drug repositioning requires computational support9 and data-driven decision

making offers a pragmatic approach to identifying optimal candidates while minimizing the risk of failure. As molecular properties and bioactivities can be described as a function of

chemical structure, cheminformatics-based predictive models are becoming increasingly useful in drug discovery and repositioning research. Specifically, anti-SARS-CoV-2 models based on

high-throughput data could be used as a prioritization step when planning experiments, particularly for large molecular libraries, thus decreasing the number of experiments and reducing

downstream costs. REDIAL-2020 could serve such a purpose and help the scientific community reduce the number of molecules before experimental tests for anti-SARS-CoV-2 activity. This suite

of machine learning models can also be used via the command line for large-scale virtual screening. As new datasets become available in the public domain, we plan to tune the machine

learning models further, add additional models based on SARS-CoV-2 assays and make these models available in future releases of REDIAL-2020. RESULTS DATA MINING All workflows and procedures

were performed using the KNIME platform10. The NCATS data associated with the aforementioned assays were downloaded from the COVID-19 portal4,5. The files contained over 23,000 data points

generated by high-throughput screening (HTS) experiments. When possible, each compound was cross-linked to drugs annotated in DrugCentral11,12,13 to retrieve the chemical structure in SMILES

format (see Methods), otherwise the original SMILES strings were retained. Bioactivity data were mined according to the curve class and maximum response parameters14. The activity class and

a significance class were defined using criteria reported in Supplementary Tables 1 and 2, respectively. As a final data-wrangling step, all compounds were categorized and assay data

grouped to have a unique record per molecule for each assay. When more than one assay was measured for the same molecule, only the datapoint with the best curve class was retained. At the

end of this process, 4,954 unique molecules were stored. The compounds were labelled as positive or negative for each assay. The compounds with a low-activity class were treated as negative,

whereas compounds with high- and moderate-activity classes were treated as positive. Finally, calculated physicochemical property filters were applied to remove compounds with log[_P_] < 1,

log[_P_] > 9, log[_S_] > −3 or log[_S_] < −7.5, where log[_P_] is the log10 of the octanol/water partition coefficient and log[_S_] is the log10 of the aqueous solubility. These

thresholds were initially used to maximize the number of inactive compounds removed while minimizing the number of active compounds excluded (see Discussion). Following use of the

physicochemical property filters, each dataset was reduced in size (see Table 1). As shown in Table 1, certain datasets would have resulted in 15% or more of the active compounds being

excluded; log[_P_] and log[_S_] filters were therefore not applied to those datasets. Chemical structures were standardized in terms of SMILES representation (see Methods). Following

standardization, desalting, neutralizing and tautomer normalization, multiple input SMILES can resolve into the same output SMILES string. Hence, the final step was removal of duplicate

chemical structures. MODEL DEVELOPMENT Several prediction models were developed for each assay, employing three categories of features and 22 distinct machine learning classification

algorithms from the scikit-learn package15 (see the Methods for the complete description of features categories; Supplementary Fig. 1 shows the workflow for model generation). The three

different categories of features employed were based on chemical fingerprints, physicochemical descriptors and topological pharmacophore descriptors. Briefly, 19 different RDKit fingerprints

were tested as fingerprint-based descriptors, VolSurf+ and RDKit descriptors were employed as physicochemical descriptors, and topological pharmacophore atom triplets fingerprints (TPATF)

from MayaChemTools were used as pharmacophore descriptors. Input data were split into a 70% training set, 15% validation set and 15% test set for each model using stratified sampling

(Supplementary Table 3 reports the number of compounds used in training, validation and test sets for each model). Six assays (CPE, cytotox, AlphaLISA, TruHit, ACE2 and 3CL) were initially

trained with 22 different classifiers available in scikit-learn (see Methods)16; however, some did not output probability estimates of the class labels (for example, OneVsOne, ridge, nearest

centroid, linear SVC and so on). As our probability-based consensus model relies on the predicted probability of each class label, only classifiers that output class probabilities were

used for training. Two more classifiers—support vector machines and quadratic discriminant analysis—were evaluated. Finally, 15 classifiers and 22 features of three distinct categories (see

Methods) were trained across eleven assays, using hypopt for hyperparameter tuning17. APPLICABILITY DOMAIN Machine learning models have boundaries for predictability16, traditionally called

the applicability domain18. The applicability domain is defined by the parameter space of the training set on which machine learning models are built. Machine learning predictions are deemed

reliable when they fall within the applicability domain of that specific model and less reliable when outside of it19. There are two categories of methods to determine the applicability

domain for classification models: novelty detection and confidence estimation. Novelty detection defines the applicability domain in terms of molecular (feature) space, whereas confidence

estimation defines it in terms of expected prediction reliability20. As confidence estimation is more efficient at reducing the error rate than novelty detection20, we implemented this

method for evaluating the applicability domain (see Methods). For each query molecule, the confidence scores from the three models used by default are averaged and

reported alongside the predictions on the results page. The confidence score of each individual model can be examined by hovering over the averaged value shown on the results page. SUBMISSION WEB

PAGE Accessing REDIAL-2020 (http://drugcentral.org/Redial) from any web browser, including on mobile devices, displays the submission page (Fig. 1). The web server accepts SMILES, drug

names or PubChem CIDs as input. The user interface at the top of the page allows users to navigate various options (Fig. 1). The user interface provides a summary of the models, such as

model type, which descriptor categories were used for training and the evaluation scores. The user interface further depicts the processes of cleaning the chemical structures (encoded as

SMILES) before training the machine learning models. Input queries such as drug name and PubChem CID are converted to SMILES before processing. Each SMILES string input is subject to four

different steps, namely, converting the SMILES into canonical SMILES21, removing salts (if present), neutralizing formal charges (except permanent ones) and standardizing tautomers.
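The four cleaning steps listed above can be sketched with RDKit as follows. This is a minimal illustration rather than the REDIAL-2020 code: the function name `standardize_smiles` is hypothetical, and the use of the default `SaltRemover`, `Uncharger` and `TautomerEnumerator` settings is an assumption.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit.Chem.SaltRemover import SaltRemover


def standardize_smiles(smiles):
    """Return a cleaned canonical SMILES string, or None if parsing fails.

    Hypothetical helper; default RDKit settings are assumed throughout.
    """
    # Step 1: parse the input; unparsable SMILES are discarded.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Step 2: strip salt fragments, keeping the last fragment if the
    # whole input consists of salts.
    mol = SaltRemover().StripMol(mol, dontRemoveEverything=True)
    # Step 3: neutralize formal charges where possible.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    # Step 4: pick one canonical tautomer.
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    # Write the result back as canonical SMILES.
    return Chem.MolToSmiles(mol)
```

Because different input strings can collapse to the same output after these steps, the returned SMILES can serve as a deduplication key.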

REDIAL-2020 predicts input compound activity across all eleven assays: CPE, cytotox, AlphaLISA, TruHit, ACE2, 3CL, CoV-PPE, CoV-PPE_cs, MERS-PPE, MERS-PPE_cs and hCYTOX. The workflow of

operations performed on the submitted query SMILES through the REDIAL-2020 web application is summarized in Supplementary Fig. 2. Figure 2 shows an example of the output panel, which is loaded

onto the same web page. REDIAL-2020 links directly to DrugCentral11,12,13 for approved drugs and to PubChem for chemicals (where available), enabling easy access to further information on

the query molecule. Using REDIAL-2020 estimates, promising anti-SARS-CoV-2 compounds would ideally be active in the CPE assay while inactive in cytotox and in hCYTOX; active in the AlphaLISA

assay and inactive in the TruHit assay while not blocking (inactive) ACE2; active in CoV-PPE while inactive in CoV-PPE_cs; active in MERS-PPE while inactive in MERS-PPE_cs; or active in the

3CL assay with any combination of the above. A schematic representation of the best profile that a molecule can show across all of the prediction models is depicted in

Fig. 3. SIMILARITY SEARCH A similarity search tool is implemented in the web portal. The similarity is determined using Tanimoto coefficient calculations on ECFP4 bit-vector fingerprints of

length 1,024. The Tanimoto coefficient represents the overlap of features between two molecules as the ratio of the number of features they share to the total number of distinct features present in either fingerprint.
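As a worked illustration of this ratio, the sketch below computes the Tanimoto coefficient on fingerprints represented as sets of on-bit indices and ranks a small library against a query. The function names `tanimoto` and `rank_by_similarity`, the set-based representation and the value returned for two empty fingerprints are assumptions made for the example, not details of the REDIAL-2020 implementation.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen here: two empty fingerprints score 0.
        return 0.0
    # Shared on-bits divided by on-bits present in either fingerprint.
    return len(a & b) / union


def rank_by_similarity(query_bits, library, k):
    """Return the k (name, score) pairs most similar to the query.

    `library` maps a molecule name to its collection of on-bit indices.
    """
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

For example, fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two bits out of four distinct bits, giving a coefficient of 0.5.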

Tanimoto coefficient values range from 0 to 1, with 1 corresponding to identical fingerprints. A fingerprint-based Tanimoto22 similarity search is thus conducted for each query molecule

against training set molecules, based on NCATS COVID-19 portal5 data. The ten molecules most similar to the query molecule, based on Tanimoto coefficient23 scores, are displayed in

the results page. DISCUSSION Before developing machine learning models, unsupervised learning can detect patterns that might guide successive steps. Hence, after establishing the

experimental categories (see above), we inspected the data using principal component analysis (PCA)24 on VolSurf+25 descriptors. For both CPE and cytotox, clusters emerge along the first

principal component (PC1; Fig. 4). For CPE data, the majority of compounds showing high-to-moderate CPE activity are grouped in the right-hand region of Fig. 4a. At the same time, compounds with

high-to-moderate cytotoxicity are grouped in the right-hand region of Fig. 4b. By inspecting the loading score plot for VolSurf+ descriptors that are likely to contribute to these patterns,

we identified membrane permeability (estimated using log[_P_]) and water solubility (estimated using log[_S_]) as major contributors to the first latent variable (see Supplementary Fig. 3).
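How a loading plot points to the descriptors that drive a principal component can be illustrated with a small NumPy sketch. It runs on synthetic data, not on VolSurf+ descriptors; the function name `pca_loadings` and the three toy columns are hypothetical. Two columns are built to be perfectly anti-correlated, standing in for a pair of descriptors that vary in opposite directions, and the third is uncorrelated with both.

```python
import numpy as np


def pca_loadings(X, n_components=2):
    """PCA on autoscaled data via SVD.

    Returns (scores, loadings, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    # Autoscale each descriptor column to zero mean and unit variance.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # one row per compound
    loadings = Vt[:n_components].T                    # one row per descriptor
    explained = S**2 / np.sum(S**2)
    return scores, loadings, explained[:n_components]


# Synthetic example: columns 0 and 1 are anti-correlated, column 2 is not.
t = np.linspace(-1.0, 1.0, 51)
X_toy = np.column_stack([t, -t, np.cos(7.0 * t)])
scores, loadings, explained = pca_loadings(X_toy)
print(loadings[:, 0])  # PC1 loadings: opposite signs for columns 0 and 1
print(explained)
```

In this toy case the two anti-correlated columns dominate PC1 with loadings of opposite sign, which is the kind of pattern read off a loading plot when identifying major contributors to the first component.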

Compounds with low log[_P_]/high log[_S_]—clustered in the left-hand region of the score plot—are less likely to be active in the CPE assay and more likely to be non-cytotoxic. The

distribution of actives was also visualized for AlphaLISA and TruHit compounds in Fig. 4c and Fig. 4d, respectively (see also Table 1). Although clustering is less pronounced for the

AlphaLISA assay with respect to CPE (Fig. 4a), the right-hand part of the plot does capture most of the high/moderate-activity compounds. Such distribution of actives in the right-hand

region was not observed for ACE2 actives (Fig. 4e); thus, permeability and solubility are not the major determinants of this ACE2 inhibition assay. This preliminary analysis can point to

filtering data before machine learning. For example, the majority of compounds placed on the left side of the Fig. 4 PCA plot are inactive (except for ACE2); therefore, before developing the

machine learning models, we used ALOGPS26 on every dataset except for ACE2 to apply cutoff filters on the basis of the compounds’ calculated log[_P_] and log[_S_] values. These filters narrow

the focus of machine learning models on features derived only from compounds for which simple property criteria (for example, log[_P_] and log[_S_]) cannot be used to distinguish actives

from inactives, specifically, the right-hand regions in Fig. 4. As the fraction of active compounds excluded from the ACE2 dataset was quite high (34%), log[_P_] and log[_S_] filters were

not applied for ACE2 inhibition. For 3CL enzymatic activity, data from NCATS were retrieved separately. The initial set contained 12,263 data points; however, data wrangling identified 2,100

duplicates and 2,366 inconclusive entries, which were discarded. More entries were removed during the desalting and physicochemical feature generation as VolSurf+ descriptors could not be

computed for some of the compounds. The final 3CL dataset contains 7,716 entries, with 286 active and 7,430 inactive compounds. Given that the fraction of active 3CL compounds filtered would

have been 30%, the physicochemical property filters were not applied. There were no notable activity clusters detected in the 3CL dataset via PCA-VolSurf+ (see Supplementary Fig. 4).
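The filter decision discussed above for ACE2 and 3CL can be sketched as follows. The thresholds and the 15% limit are those stated in the Results; the function names, the dictionary layout of a compound record and the assumption that a compound is removed when it meets any one of the four conditions are illustrative choices, not the published implementation.

```python
def is_excluded(log_p, log_s):
    """True if a compound falls outside the property window.

    Assumes the four conditions are combined with a logical OR.
    """
    return log_p < 1 or log_p > 9 or log_s > -3 or log_s < -7.5


def apply_property_filter(compounds, max_active_loss=0.15):
    """Filter a dataset unless too many actives would be lost.

    `compounds` is a list of dicts with keys "active", "log_p" and "log_s"
    (a hypothetical record layout). Returns (kept_compounds, active_loss).
    """
    actives = [c for c in compounds if c["active"]]
    lost = [c for c in actives if is_excluded(c["log_p"], c["log_s"])]
    active_loss = len(lost) / len(actives) if actives else 0.0
    if active_loss >= max_active_loss:
        # 15% or more of the actives would be excluded: skip the filter.
        return list(compounds), active_loss
    kept = [c for c in compounds if not is_excluded(c["log_p"], c["log_s"])]
    return kept, active_loss
```

With this rule, a dataset in which 30% or 34% of the actives fall outside the window is returned unfiltered, matching the treatment of the 3CL and ACE2 datasets.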

Furthermore, NCATS released data for five completely new HTS assays—and updated assay data for the other six after additional testing—between June and October 2020. Hence, we reevaluated the

entire set of assays. The total number of compounds after data wrangling was 10,074. Our analysis showed that only the CPE and the cytotoxicity assays were enriched with more compounds.

There were 2,354 more compounds, with 158 new actives in the CPE dataset and 2,332 more compounds (295 new actives) in the cytotox dataset. As the fraction of active compounds filtered out

on applying physicochemical property filters was over 15%, these filters were not applied for the five new datasets (see also Table 1). With respect to actives versus inactives, all eleven

NCATS assays are highly unbalanced, with a disproportionate ratio of the active (few) compounds compared with inactive (many) compounds. For example, there were approximately nine times more

inactives than actives and approximately three times more non-cytotoxic compounds than cytotoxic compounds for the CPE and cytotoxicity assays, respectively. Thus, to avoid overtraining for

the dominant category, each model was derived using random selection wherein compounds from the majority class were selected in equal proportion to those of the minority class. Our balanced

dataset numbers were as follows: 996 for CPE, 2,252 for cytotox, 1,260 for AlphaLISA, 1,668 for TruHit, 206 for ACE2, 572 for 3CL, 1,782 for CoV-PPE, 320 for CoV-PPE_cs, 760 for hCYTOX, 970

for MERS-PPE and 368 for MERS-PPE_cs. We implemented eleven predictive models based on consensus methods to evaluate anti-SARS-CoV-2 activities of novel chemicals. Of the two consensus

methods evaluated (voting-based and probability score-based), the voting-based consensus model exhibited better performance (see Supplementary Figs. 5–10) and was thus implemented in the

REDIAL-2020 web application. Consensus models were generated on the basis of the top-three performing models trained on fingerprint, pharmacophore and physicochemical descriptors. First, we

selected a fingerprint model from an initial evaluation of 19 different fingerprint descriptor methods; this was combined with a TPATF model. Finally, RDKit or VolSurf+ provided a third

model, which was based on physicochemical properties. Supplementary Fig. 11a–d summarizes our initial evaluation and the comparison between various features and machine learning algorithms:

Supplementary Fig. 11a,b compares the performance of each feature across 22 machine learning algorithms (classifiers) and six assays, and Supplementary Fig. 11c,d compares the performance of

each classifier across 22 features and six assays (CPE, cytotoxicity, AlphaLISA, TruHit, ACE2 and 3CL). For example, the violin plot for the Avalon feature (see Supplementary Fig. 11a)

summarizes F1 scores from all six assays (and 22 classifiers). Among descriptors, VolSurf+ and LFCFP6 outperformed others, whereas the gradient boost and the multilayer perceptron

classifiers performed better among machine learning algorithms (see Supplementary Figs. 12 and 13 for comparisons of each feature across 15 machine learning algorithms and eleven assays;

Supplementary Figs. 14–47 depict more detailed comparisons across different features and machine learning algorithms with respect to individual models). Two options for the consensus model

were initially considered based on the potential overlap between VolSurf+ and RDKit descriptors: fingerprint+TPATF+RDKit and fingerprint+TPATF+VolSurf+, respectively. RDKit descriptors

outperformed VolSurf+ in cytotox, AlphaLISA, ACE2, 3CL, MERS-PPE_cs, CoV-PPE, CoV-PPE_cs and hCYTOX, whereas VolSurf+ descriptors outperformed RDKit in CPE and hCYTOX, along with similar

results in MERS-PPE and TruHit for the tested evaluation metrics such as accuracy, F1 score and AUC in validation sets (see Supplementary Figs. 48–58). However, the situation slightly

changed when considering consensus models. Inclusion of VolSurf+ yielded a better consensus model for the CPE, whereas including RDKit yielded better consensus models for the cytotox, 3CL,

TruHit, AlphaLISA, MERS-PPE_cs, CoV-PPE and CoV-PPE_cs assays (Supplementary Figs. 5–10 compare the best models from each feature category). As the NCATS team released data for more

compounds for the six initial assays plus five new assays in October 2020, we updated the initial six models and developed models for the five new assays (comparisons of models from each

category for the new and updated models are shown in Supplementary Figs. 53–57). Among the eleven assay models, the voting-based consensus model performed slightly better than individual

feature type models for validation F1 score results; in three assays (ACE2, MERS-PPE and hCYTOX), the voting-based consensus model was not the top performer, but its performance was close to

the top-performing model. For the web platform, we implemented voting-based consensus models for all eleven assay models using RDKit descriptors as opposed to VolSurf+ descriptors, as RDKit

is open-source software that can be ported and dockerized without restrictions. Table 2 summarizes the evaluation scores for all models implemented in REDIAL-2020. To confirm the utility of

our models, we collected three additional datasets from the literature and submitted these molecules (external to our training/validation/test sets) as input for prediction. First, we used a

database for COVID-19 experiments27 to explore and download recently published28,29,30,31,32,33,34,35,36,37 in vitro COVID-19 bioactivity data of the reported compounds. After removing

compounds already included in the NCATS experiments, we identified 27 external compounds active in anti-SARS-CoV-2 CPE assays (see Supplementary Table 4). Out of 27 compounds, three were

excluded on applying the log[_P_]/log[_S_] filters, and the remaining 24 were predicted by the CPE model. Sixteen compounds were correctly predicted as active by the consensus model (that

is, at least two models, see Supplementary Fig. 59), with eight compounds predicted as inactive. Among those predicted to be inactive, the majority stem from the work by Ellinger and

colleagues37, which were derived from Caco-2 cells for CPE experiments. There is a high degree of variability between these two CPE assays (Caco-2 versus Vero E6), which explains the lack of

predictivity using Vero E6-trained CPE models for Caco-2 data. The second dataset of 3CL (Mpro) inhibitors36 identified six inhibitors: ebselen (0.67 µM), disulfiram (9.35 µM), tideglusib

(1.55 µM), carmofur (1.82 µM), shikonin (15.75 µM) and PX-12 (21.39 µM) (see Supplementary Table 5). Among these six inhibitors, our consensus 3CL model correctly predicted four of them as

actives, and five of them as actives by at least one of the three 3CL machine learning models. The REDIAL-2020 suite of models therefore correctly predicted 67% of the external compounds for

CPE and 3CL inhibitors36. Although the external predictivity of CPE model seems to underestimate previous model performance in the validation and external sets (see Supplementary Table 6),

it has been noted that CPE experiments are affected by considerable intra- and interexperiment variability27. Hence, we cannot exclude the possibility that some of the experiments performed

by other laboratories are not directly comparable with NCATS COVID-19 portal5 results. CONCLUSION Here we described REDIAL-2020, an open-source, open-access machine learning suite for

estimating anti-SARS-CoV-2 activities from molecular structure. By leveraging data available from NCATS, we developed eleven categorical machine learning models: CPE, cytotox, AlphaLISA,

TruHit, ACE2, 3CL, CoV-PPE, CoV-PPE_cs, MERS-PPE, MERS-PPE_cs and hCYTOX. These models are exposed on the REDIAL-2020 portal, and the output of a similarity search using input data as a

query is provided for every submitted molecule. The top-ten most similar molecules to the query molecule from the existing COVID-19 databases, together with associated experimental data, are

displayed. This allows users to evaluate the confidence of the machine learning predictions. The REDIAL-2020 platform provides a fast and reliable way to screen novel compounds for

anti-SARS-CoV-2 activities. REDIAL-2020 is available on GitHub and DockerHub as well, and the command-line version supports large-scale virtual screening purposes. Future developments of

REDIAL-2020 could include additional machine learning models, for example, by using the TMPRSS2 inhibition assay38 data from the NCATS COVID-19 portal or additional NCATS data as they become

available in the public domain. We will continue to update and enhance the machine learning models and make these models available in future releases of REDIAL. METHODS HTS ASSAYS The

SARS-CoV-2 CPE assay measures the ability of a compound to reverse the cytopathic effect induced by the virus in Vero E6 host cells. As cell viability is reduced by a viral infection, the

CPE assay measures the compound’s ability to restore cell function (cytoprotection). Although this assay does not provide any information concerning the mechanism of action, it can be used

to screen for antiviral activity in a high-throughput manner; however, there is the possibility that the compound itself may exhibit a certain degree of cytotoxicity, which could also reduce

cell viability. As this confounds the interpretation of CPE assay results, masking the cytoprotective activity, a counterscreen to measure host (Vero E6) cell cytotoxicity is used to detect

such compounds; thus, a net-positive result from the combined CPE assays consists of a compound showing a protective effect but no cytotoxicity. The AlphaLISA assay measures a compound’s

ability to disrupt the interaction between the viral Spike protein and its human receptor protein, ACE239. The surface of the ACE2 protein is the primary host factor recognized and targeted

by SARS-CoV-2 virions40. This binding event between the SARS-CoV-2 Spike protein and the host ACE2 protein initiates binding of the viral capsid and leads to viral entry into host cells.

Thus, disrupting the Spike–ACE2 interaction is likely to reduce the ability of SARS-CoV-2 virions to infect host cells. This assay has two counterscreens, as follows. The TruHit

counterscreen is used to determine false positives, that is, compounds that interfere with the AlphaLISA readout in a non-specific manner, or with assay signal generation and/or detection.

It uses the biotin–streptavidin interaction (one of the strongest known non-covalent drug–protein interactions) as other compounds are unlikely to disturb it. Consequently, any compound

showing interference with this interaction is most likely a false positive. Common interfering agents are oxygen scavengers or molecules with spectral properties sensitive to the 600–700 nm

wavelengths used in AlphaLISA. The second counterscreen is an enzymatic assay that measures human ACE2 inhibition to identify compounds that could potentially disrupt endogenous enzyme

function. ACE2 lowers blood pressure by catalysing the hydrolysis of angiotensin II (a vasoconstrictor octapeptide) into the vasodilator angiotensin (1–7)41. Although blocking the Spike–ACE2

interaction may stop viral entry, drugs effective in this manner could potentially cause unwanted side-effects by blocking the endogenous vasodilating function of ACE2. The ACE2 assay thus

serves to detect such eventualities and to de-risk such off-target events. Following entry into the host cell, the main SARS-CoV-2 replication enzyme is 3CL, also called main protease or

Mpro36, which cleaves the two SARS-CoV-2 polyproteins into various proteins (for example, RNA polymerases, helicases, methyltransferases and so on), which are essential to the viral life

cycle. As inhibiting the 3CL protein disrupts the viral replication process, this makes 3CL an attractive drug target42. The SARS-CoV-2 3CL biochemical assay measures the ability of

compounds to inhibit recombinant 3CL cleavage of a fluorescently labelled peptide substrate. The in vitro infectivity category comprises four assays: SARS-CoV pseudotyped particle entry and its

counterscreen, and MERS-CoV pseudotyped particle entry and its counterscreen. The pseudotyped particle assay measures the inhibition of viral entry in cells but it does not require a BSL-3

facility (BSL-2 is sufficient) to be performed, as it does not use a live virus to infect cells. It instead uses pseudotyped particles that are generated by the fusion of the coronavirus

Spike protein with a murine leukaemia virus core. As they have the coronavirus spike protein on their surface, the particles behave like their native coronavirus counterparts for entry

steps. This makes them excellent surrogates of native virions for studying viral entry into host cells. The experimental protocol of such an assay is described in detail elsewhere43. The

cell lines used are Vero E6 for SARS-CoV and Huh7 for MERS-CoV. At the time of data extraction, compound data were available for one human cell toxicity assay: human fibroblast toxicity. With the

human fibroblast toxicity assay, it is possible to assess the general human cell toxicity of compounds by measuring host cell ATP content as a readout for cytotoxicity (similarly to what is

done in the various counter screenings). This assay is therefore intended for discarding compounds that are likely to show high toxicity in human cells (that is, side effects in the

organism). Hh-WT fibroblast cells are used in this assay and the highly cytotoxic drug bortezomib is used as a reference compound. DATA MATCHING OPERATIONS The matching of NCATS compounds to

DrugCentral was conducted in three sequential steps: by InChI (international chemical identifier)44, by synonym (name), and by matching Chemical Abstracts Service registry numbers. First,

NCATS molecules were matched by InChI. Molecules that did not match were then queried by drug name and associated synonyms, as annotated in DrugCentral. Finally, if not matched by either

InChI or name, molecules were matched by Chemical Abstracts Service number. If none of the above steps resulted in a match, then the molecule in question was not classified as an approved

drug. At the end of this process, 4,954 unique molecules (2,273 approved drugs and 2,681 chemicals) were stored. SMILES were retrieved from DrugCentral whenever possible, otherwise the

original SMILES strings were retained. SMILES STANDARDIZATION Chemical structures were standardized to ensure rigorous deduplication, accurate counts and performance measures, and consistent

descriptor generation, preserving stereochemistry, which is required for conformer-dependent descriptors. This workflow uses the MolStandardize SMARTS-based functionality in RDKit45 to

transform input SMILES into standardized molecular representations. Four different filters were implemented via RDKit: (1) input SMILES were standardized into canonical (isomeric where

appropriate) SMILES strings, and input SMILES that failed to convert were discarded; (2) the RDKit Salt Stripper was used to de-salt input compounds (that is, to remove the salt structures); the

donotRemoveEverything feature retains the last salt structure when the entire canonical SMILES string consists only of salts; (3) the RDKit Uncharger was used to neutralize input molecules by

adding or removing hydrogen atoms and setting formal charges to zero (except where this is not possible, as in, for example, quaternary ammonium cations); (4) canonical SMILES were then converted into specific tautomers

using RDKit. MOLECULAR FEATURES/DESCRIPTORS A total of 22 features of three distinct types (19 fingerprint-based, 1 pharmacophore-based and 2 physicochemical descriptor-based) were

implemented. Fingerprints were converted into bit vectors of length 1,024 or 16,384. The pharmacophore type was also a bit vector, of size 2,692, whereas the RDKit and VolSurf+ descriptor vectors

were of length 200 and 128, respectively. The fingerprint-based description includes circular, path-based and substructure-key fingerprints46,47. Circular fingerprints include the

extended-connectivity fingerprints (ECFP_x_) and feature-connectivity fingerprints (FCFP_x_), where _x_ (0, 2, 4 or 6) is the diameter, in bonds, of each circular atom environment.
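As an illustration, the path from a raw SMILES string to a circular fingerprint can be sketched with RDKit as follows. This is a minimal sketch under stated assumptions, not the authors' code: it assumes that RDKit's Morgan fingerprint stands in for ECFP_x_/FCFP_x_ (with Morgan radius equal to _x_/2), it shows only one plausible mapping of the four standardization filters described earlier onto RDKit calls, and the helper names `standardize` and `circular_fingerprint` as well as the demo molecule are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit.Chem.SaltRemover import SaltRemover


def standardize(smiles):
    """Hypothetical helper: apply four filters in order; return SMILES or None."""
    mol = Chem.MolFromSmiles(smiles)  # (1) parse; unparsable input is discarded
    if mol is None:
        return None
    # (2) de-salt; in RDKit's Python API the keyword is spelled dontRemoveEverything
    mol = SaltRemover().StripMol(mol, dontRemoveEverything=True)
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # (3) neutralize charges
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # (4) pick one tautomer
    return Chem.MolToSmiles(mol, isomericSmiles=True)  # canonical, isomeric SMILES


def circular_fingerprint(smiles, diameter=4, n_bits=1024, use_features=False):
    """Hypothetical helper: ECFP-like (or FCFP-like) bits as a list of 0/1 integers."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    radius = diameter // 2  # ECFP4 corresponds to Morgan radius 2
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, useFeatures=use_features
    )
    return list(fp)


clean = standardize("O=C([O-])c1ccccc1.[Na+]")  # sodium benzoate, demo input only
ecfp4 = circular_fingerprint(clean, diameter=4)                      # 1,024 bits
lecfp4 = circular_fingerprint(clean, diameter=4, n_bits=16384)       # long variant
fcfp4 = circular_fingerprint(clean, diameter=4, use_features=True)   # feature-based
print(clean, len(ecfp4), len(lecfp4), sum(ecfp4), sum(fcfp4))
```

Setting `use_features=True` switches the atom invariants from element-level properties to pharmacophore-like feature classes, which mirrors the ECFP/FCFP distinction; the longer `n_bits` value reduces bit collisions at the cost of a sparser vector.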

The ECFP consists of the element, number of heavy atoms, isotope, number of hydrogen atoms and ring information, whereas the FCFP consists of pharmacophore features. Avalon and the molecular

access system (MACCS) are two distinct types of substructure keys (fingerprints). The Avalon fingerprint, used here, is a bit vector of size 1,024. It includes feature classes such as atom

count, atom symbol path, augmented atom, augmented symbol path and so on. MACCS structural keys are 166-bit structural key descriptors. Each bit here is associated with a SMARTS pattern and

belongs to the dictionary-based fingerprint class. Path-based fingerprints include RDK_x_ (where _x_ is 5, 6 or 7), topological torsion (TT), HashTT, atom pair (AP) and HashAP. The size of

each fingerprint is 1,024 bits. The longer, 16,384-bit versions of the fingerprints, marked by the prefix L (LAvalon, LECFP6, LECFP4, LFCFP6 and LFCFP4), were used for comparison. Topological

pharmacophore atomic triplet fingerprints (TPATFs) were obtained using MayaChemTools48; TPATFs describe the ligand sites that are necessary for molecular recognition of a macromolecule or a

ligand, and pass that information to the machine learning model to be trained. Ligand SMILES strings were passed through a Perl script to generate TPATF. The basis sets of atomic triplets

were generated using two different constraints: (1) the triangle rule, that is, the length of each side of a triangle cannot exceed the sum of the lengths of the other two sides; and (2)

elimination of redundant pharmacophores related by symmetry. The default pharmacophore atomic types, namely hydrogen-bond donor (HBD), hydrogen-bond acceptor (HBA), positively ionizable (PI),

negatively ionizable (NI), hydrophobic (H) and aromatic (Ar), were used during generation of TPATF49. The physicochemical description includes the RDKit molecular descriptors and VolSurf+

descriptors. For the RDKit descriptors, a set of 200 descriptors obtained from RDKit45 was used. They are either experimental properties or theoretical descriptors, such as

molar refractivity, log[_P_], heavy atom count, bond count, molecular weight and topological polar surface area. A total of 128 descriptors were obtained using the VolSurf+ software.
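The RDKit descriptors named above can be computed along the following lines. This is a minimal sketch that assumes the descriptor functions exposed by `rdkit.Chem.Descriptors`; the demo molecule is arbitrary, and the number of entries in `Descriptors.descList` depends on the installed RDKit version, so it will not necessarily equal the 200 descriptors used in this work.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, demo input only

# A few of the descriptor types mentioned in the text.
selected = {
    "MolMR": Descriptors.MolMR(mol),              # molar refractivity (Crippen)
    "MolLogP": Descriptors.MolLogP(mol),          # log P (Crippen)
    "HeavyAtomCount": Descriptors.HeavyAtomCount(mol),
    "BondCount": mol.GetNumBonds(),               # bonds between heavy atoms
    "MolWt": Descriptors.MolWt(mol),              # average molecular weight
    "TPSA": Descriptors.TPSA(mol),                # topological polar surface area
}

# Full descriptor vector: one value per (name, function) pair in descList.
names = [name for name, _ in Descriptors.descList]
full_vector = [fn(mol) for _, fn in Descriptors.descList]

for key, value in selected.items():
    print(f"{key}: {value}")
print("vector length:", len(full_vector))
```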

VolSurf+ is a computational approach aimed at describing the structural, physicochemical and pharmacokinetic features of a molecule starting from a three-dimensional map of the interaction

energies between the molecule and chemical probes (grid-based molecular interaction fields)50. VolSurf+ compresses the information present in molecular interaction fields into numerical

descriptors, which are simple to use and interpret25,51. MACHINE LEARNING CLASSIFIERS Using assay data as input (specifically, CPE, cytotox, AlphaLISA, TruHit, ACE2 and 3CL), we trained

machine learning models using the following 24 different classifiers: complement naive Bayes, extreme gradient boosting, KNeighbors, gradient boosting, perceptron, OneVsRest, extra-tree,

ridge, OneVsOne, bagging, random forest, output code, passive aggressive, linear SVC, stochastic gradient descent, logistic regression, extra trees, multinomial naive Bayes, AdaBoost,

decision tree, nearest centroid, multilayer perceptron, support vector machines and quadratic discriminant analysis. All of these algorithms are implemented in the scikit-learn package16.
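To make the setup concrete, the comparison of several scikit-learn classifiers in their default configurations on a held-out split can be sketched as below. This is an illustrative sketch only: the data are random bits standing in for fingerprints, the activity label is synthetic, only three of the 24 classifiers are shown, and the F1 score is used for the comparison because it is the selection metric reported in this section. The `random_state` arguments are added solely for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 64))        # synthetic stand-in for fingerprint bits
y = (X[:, :8].sum(axis=1) >= 4).astype(int)   # synthetic active/inactive label

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Three of the classifier families listed in the text, with default settings.
candidates = {
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(),
    "complement naive Bayes": ComplementNB(),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = f1_score(y_val, clf.predict(X_val))  # validation F1

best = max(scores, key=scores.get)
print(scores, "->", best)
```

In practice the same loop would run over every descriptor type as well, so that each (classifier, descriptor) pair receives its own validation score.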

The 22 types of features (ECFP0, ECFP2, ECFP4, LECFP4, ECFP6, LECFP6, FCFP2, FCFP4, LFCFP4, FCFP6, LFCFP6, RDK5, RDK6, RDK7, Avalon, LAvalon, MACCS, HashTT, HashAP, VolSurf+, TPATF and RDKit

descriptors) that served as input to the machine learning classifiers are described above. All classifiers were trained on their default configurations. For hyperparameter tuning we used

hypopt17 and the best-suited combination of classifiers and features (see Supplementary Table 7). All models were optimized and selected based on the validation F1 score. The best-performing

models were saved and used for the evaluation of external datasets. CONFIDENCE SCORES One way to calculate the certainty of prediction is provided by the classification algorithms framework

applied here, as implemented in the scikit-learn package. The confidence estimate associated with the prediction for each object (small molecule) relies on a basic feature of scikit-learn,

predict_proba. For example, in the random forest classifier, a vote is recorded for each tree (sub-model); thus, for each class, predict_proba returns the number of votes divided by the number of

trees in that particular forest (model). This confidence score, which estimates the reliability of the model’s prediction, is used to gauge the applicability domain. DATA AVAILABILITY All data used

for the model described in this work are available at Zenodo (https://doi.org/10.5281/zenodo.4606720). These datasets were originally collected from the following links (please note that

these data are subject to change without notice): CPE: https://opendata.ncats.nih.gov/covid19/assay?aid=14, cytotox: https://opendata.ncats.nih.gov/covid19/assay?aid=15, AlphaLISA:

https://opendata.ncats.nih.gov/covid19/assay?aid=1, TruHit: https://opendata.ncats.nih.gov/covid19/assay?aid=2, ACE2: https://opendata.ncats.nih.gov/covid19/assay?aid=6, 3CL:

https://opendata.ncats.nih.gov/covid19/assay?aid=9, CoV-PPE: https://opendata.ncats.nih.gov/covid19/assay?aid=22, CoV-PPE_cs: https://opendata.ncats.nih.gov/covid19/assay?aid=23, MERS-PPE:

https://opendata.ncats.nih.gov/covid19/assay?aid=24, MERS-PPE_cs: https://opendata.ncats.nih.gov/covid19/assay?aid=25, hCYTOX: https://opendata.ncats.nih.gov/covid19/assay?aid=21. CODE

AVAILABILITY All of the code and the trained models are available at Zenodo (https://doi.org/10.5281/zenodo.4606720). REFERENCES * Gorshkov, K., Chen, Z.C., Bostwick, R. et al. The

SARS-CoV-2 cytopathic effect is blocked by lysosome alkalizing small molecules. _ACS Infect. Dis._ https://doi.org/10.1021/acsinfecdis.0c00349 (2021). * Sun, H., Wang, Y., Cheff, D. M.,

Hall, M. D. & Shen, M. Predictive models for estimating cytotoxicity on the basis of chemical structures. _Bioorg. Med. Chem._ 28, 115422 (2020). * Hanson, Q. M.

et al. Targeting ACE2–RBD interaction as a platform for COVID-19 therapeutics: development and drug-repurposing screen of an AlphaLISA proximity assay. _ACS Pharmacol. Transl. Sci_. 6,

1352–1360 (2020). * Brimacombe, K. R. et al. An OpenData portal to share COVID-19 drug repurposing data in real time. Preprint at https://www.biorxiv.org/content/10.1101/2020.06.04.135046v1

(2020). * _SARS-CoV-2 Assays_ (NCATS, accessed 25 September 2020); https://opendata.ncats.nih.gov/covid19/assays * Huang, R., Xu, M., Zhu, H. et al. Biological activity-based modeling

identifies antiviral leads against SARS-CoV-2. _Nat. Biotechnol._ https://doi.org/10.1038/s41587-021-00839-1 (2021). * Kim, S. et al. PubChem substance and compound databases. _Nucl. Acids

Res._ 44, D1202–D1213 (2016). * Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. _J. Chem. Inf.

Comput. Sci._ 28, 31–36 (1988). * Oprea, T. I. et al. Associating drugs, targets and clinical outcomes into an integrated network affords a new platform for

computer-aided drug repurposing. _Mol. Inform._ 30, 100–111 (2011). * Berthold, M. R. et al. in _Data Analysis, Machine Learning and Applications_ 319–326

(Springer, 2007). * Ursu, O. et al. DrugCentral: online drug compendium. _Nucl. Acids Res._ 45, D932–D939 (2017). * Ursu, O. et al. DrugCentral 2018: an update.

_Nucl. Acids Res._ 47, D963–D970 (2019). * Avram, S. et al. DrugCentral 2021 supports drug discovery and repositioning. _Nucl. Acids Res_. 49, D1160–D1169 (2020). *

Markossian, S. et al. (eds) _Assay Guidance Manual_ (Eli Lilly & Company and the National Center for Advancing Translational Sciences, 2004). * Pedregosa, F. et al. Scikit-learn: machine

learning in Python. _J. Mach. Learning Res._ 12, 2825–2830 (2011). * Oprea, T. I. & Waller, C. L. in _Reviews in Computational Chemistry_ Vol. 11,

127–182 (John Wiley and Sons, 2007). * _hypopt_ (Github, accessed 24 July 2020); https://github.com/cgnorthcutt/hypopt * Eriksson, L. et al. Methods for reliability and uncertainty

assessment and for applicability evaluations of classification- and regression-based QSARs. _Environ. Health Perspect._ 111, 1361–1375 (2003). * Liu, R. &

Wallqvist, A. Molecular similarity-based domain applicability metric efficiently identifies out-of-domain compounds. _J. Chem. Inf. Model._ 59, 181–189 (2019). *

Mathea, M., Klingspohn, W. & Baumann, K. Chemoinformatic classification methods and their applicability domain. _Mol. Inform._ 35, 160–180 (2016). * Weininger,

D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. _J. Chem. Inf. Comput. Sci._ 29, 97–101 (1989). * Rogers, D.

J. & Tanimoto, T. T. A computer program for classifying plants. _Science_ 132, 1115–1118 (1960). * Whittle, M., Gillet, V. J., Willett, P., Alex, A. &

Loesel, J. Enhancing the effectiveness of virtual screening by fusing nearest neighbor lists: a comparison of similarity coefficients. _J. Chem. Inf. Comput. Sci._ 44, 1840–1848 (2004).

* Carey, R. N., Wold, S. & Westgard, J. O. Principal component analysis. Alternative to referee methods in method comparison studies. _Anal. Chem._ 47, 1824–1829

(1975). * Cruciani, G., Pastor, M. & Guba, W. VolSurf: a new tool for the pharmacokinetic optimization of lead compounds. _Eur. J. Pharm. Sci._ 11, S29–S39

(2000). * Tetko, I. V. et al. Virtual computational chemistry laboratory – design and description. _J. Comput. Aided Mol. Des._ 19, 453–463 (2005). *

Kuleshov, M. V. et al. The COVID-19 drug and gene set library. _Patterns_ 1, 100090 (2020). * Jeon, S. et al. Identification of antiviral drug candidates

against SARS-CoV-2 from FDA-approved drugs. _Antimicrob. Agents Chemother._ 64, e00819-20 (2020). * Weston, S. et al. Broad anti-coronavirus activity of Food and

Drug Administration-approved drugs against SARS-CoV-2 in vitro and SARS-CoV in vivo. _J. Virol._ 94, e01218-20 (2020). * Touret, F. et al. In vitro screening of a

FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication. _Sci. Rep._ 10, 13093 (2020). * Xing, J. et al. Reversal of infected host gene

expression identifies repurposed drug candidates for COVID-19. Preprint at https://www.biorxiv.org/content/10.1101/2020.04.07.030734v1 (2020). * Riva, L. et al. Discovery of SARS-CoV-2

antiviral drugs through large-scale compound repurposing. _Nature_ 586, 113–119 (2020). * Choy, K.-T. et al. Remdesivir, lopinavir, emetine, and homoharringtonine

inhibit SARS-CoV-2 replication in vitro. _Antiviral Res._ 178, 104786 (2020). * Mirabelli, C. et al. Morphological cell profiling of SARS-CoV-2 infection identifies

drug repurposing candidates for COVID-19. Preprint at https://www.biorxiv.org/content/10.1101/2020.05.27.117184v4 (2020). * Riva, L. et al. Discovery of SARS-CoV-2 antiviral drugs through

large-scale compound repurposing. _Nature_ 586, 113–119 (2020). * Jin, Z. et al. Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. _Nature_ 582, 289–293 (2020).

* Ellinger, B. et al. A SARS-CoV-2 cytopathicity dataset generated by high-content screening of a large drug repurposing collection. _Sci. Data_ 8, 70 (2021). *

Shrimp, J. H. et al. An enzymatic TMPRSS2 assay for assessment of clinical candidates and discovery of inhibitors as potential treatment of COVID-19. _ACS Pharmacol. Transl.

Sci_. 5, 997–1007 (2020). * Hoffmann, M. et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. _Cell_ 181, 271–280.e8 (2020).

* Millet, J. K. & Whittaker, G. R. Physiological and molecular triggers for SARS-CoV membrane fusion and entry into host cells. _Virology_ 517, 3–8 (2018).

* Keidar, S., Kaplan, M. & Gamliel-Lazarovich, A. ACE2 of the heart: from angiotensin I to angiotensin (1–7). _Cardiovasc. Res._ 73, 463–469 (2007).

* Pillaiyar, T., Manickam, M., Namasivayam, V., Hayashi, Y. & Jung, S.-H. An overview of severe acute respiratory syndrome–coronavirus (SARS-CoV) 3CL protease inhibitors:

peptidomimetics and small molecule chemotherapy. _J. Med. Chem._ 59, 6595–6628 (2016). * Millet, J. K. et al. Production of pseudotyped particles to study highly

pathogenic coronaviruses in a biosafety level 2 setting. _J. Vis. Exp_. https://doi.org/10.3791/59010 (2019). * Heller, S. R., McNaught, A., Pletnev, I., Stein, S. & Tchekhovskoi, D.

InChI, the IUPAC international chemical identifier. _J. Cheminform._ 7, 23 (2015). * Landrum, G. et al. _RDKit: Open-source Cheminformatics Software_ (RDKit,

accessed 10 May 2020); https://www.rdkit.org/ * Riniker, S. & Landrum, G. A. Open-source platform to benchmark fingerprints for ligand-based virtual screening. _J. Cheminform._ 5, 26

(2013). * Rogers, D. & Hahn, M. Extended-connectivity fingerprints. _J. Chem. Inf. Model._ 50, 742–754 (2010). * Sud, M. MayaChemTools:

an open source package for computational drug discovery. _J. Chem. Inf. Model._ 56, 2292–2297 (2016). * Bonachéra, F., Parent, B., Barbosa, F., Froloff, N. &

Horvath, D. Fuzzy tricentric pharmacophore fingerprints. 1. Topological fuzzy pharmacophore triplets and adapted molecular similarity scoring schemes. _J. Chem. Inf. Model._ 46, 2457–2477

(2006). * Goodford, P. J. A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. _J. Med. Chem._

28, 849–857 (1985). * Zamora, I., Oprea, T., Cruciani, G., Pastor, M. & Ungell, A.-L. Surface descriptors for protein–ligand affinity prediction. _J. Med. Chem_.

46, 25–33 (2003). ACKNOWLEDGEMENTS We thank the High-Performance Computing support staff (M. T. Hertlein and L. A. Hernandez) and J. D. Garcia at The University of Texas

at El Paso for assistance in using the Chanti cluster and web portal maintenance. We also acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for

providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu. Access to unpublished SARS-CoV-2 experimental data from C.

Jonsson (University of Tennessee Health Sciences Center) and S. Bradfute (University of New Mexico Health Sciences Center) is gratefully acknowledged. S. Sirimulla acknowledges support from

the National Science Foundation through NSF-PREM grant no. DMR-1827745. The DrugCentral component of this work is funded by NIH Common Fund U24 CA224370. AUTHOR INFORMATION Author notes *

These authors contributed equally: Govinda B. KC, Giovanni Bocci. * These authors jointly supervised this work: Suman Sirimulla, Tudor I. Oprea. AUTHORS AND AFFILIATIONS * Department of

Pharmaceutical Sciences, School of Pharmacy, The University of Texas at El Paso, El Paso, TX, USA Govinda B. KC, Srijan Verma, Md Mahmudulla Hassan & Suman Sirimulla * Computational

Science Program, The University of Texas at El Paso, El Paso, TX, USA Govinda B. KC & Suman Sirimulla * Translational Informatics Division, Department of Internal Medicine, University of

New Mexico School of Medicine, Albuquerque, NM, USA Giovanni Bocci, Jayme Holmes, Jeremy J. Yang & Tudor I. Oprea * Department of Pharmacy, Birla Institute of Technology and Science,

Pilani, Pilani Campus, Rajasthan, India Srijan Verma * Department of Computer Science, The University of Texas at El Paso, El Paso, TX, USA Md Mahmudulla Hassan & Suman Sirimulla *

Autophagy Inflammation and Metabolism Center of Biomedical Research Excellence, University of New Mexico Health Sciences Center, Albuquerque, NM, USA Tudor I. Oprea * Department of

Rheumatology and Inflammation Research, Institute of Medicine, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden Tudor I. Oprea * Novo Nordisk Foundation Center for Protein

Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark Tudor I. Oprea

CONTRIBUTIONS S.S. and T.I.O. designed the research study. G.B.K. and S.V. developed the prediction models. G.B. curated the public data. G.B.K., S.V., M.M.H., J.J.Y., J.H. and S.S.

developed the web application. S.S., G.B.K., G.B. and T.I.O. wrote the paper. All authors read and approved the manuscript. CORRESPONDING AUTHORS Correspondence to Suman Sirimulla or Tudor

I. Oprea. ETHICS DECLARATIONS COMPETING INTERESTS T.I.O. has received honoraria from or consulted for Abbott, AstraZeneca, Chiron, Genentech, Infinity Pharmaceuticals, Merz Pharmaceuticals,

Merck Darmstadt, Mitsubishi Tanabe, Novartis, Ono Pharmaceuticals, Pfizer, Roche, Sanofi and Wyeth, and is on the Scientific Advisory Board of ChemDiv and InSilico Medicine. ADDITIONAL

INFORMATION PEER REVIEW INFORMATION _Nature Machine Intelligence_ thanks Feixiong Cheng, Junmei Wang and Kemal Yelekçi for their contribution to the peer review of this work. PUBLISHER’S

NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION Supplementary

Figs. 1–59. SUPPLEMENTARY TABLE Supplementary Tables 1–7. ABOUT THIS ARTICLE CITE THIS ARTICLE KC, G.B., Bocci, G., Verma, S. _et al._ A

machine learning platform to estimate anti-SARS-CoV-2 activities. _Nat Mach Intell_ 3, 527–535 (2021). https://doi.org/10.1038/s42256-021-00335-w * Received: 12 September

2020 * Accepted: 17 March 2021 * Published: 03 May 2021 * Issue Date: June 2021 * DOI: https://doi.org/10.1038/s42256-021-00335-w