Dataset for quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules

Play all audios:

ABSTRACT We here introduce the Aquamarine (AQM) dataset, an extensive quantum-mechanical (QM) dataset that contains the structural and electronic information of 59,783 low-and high-energy

conformers of 1,653 molecules with a total number of atoms ranging from 2 to 92 (mean: 50.9), and containing up to 54 (mean: 28.2) non-hydrogen atoms. To gain insights into the solvent

effects as well as collective dispersion interactions for drug-like molecules, we have performed QM calculations supplemented with a treatment of many-body dispersion (MBD) interactions of

structures and properties in the gas phase and implicit water. Thus, AQM contains over 40 global and local physicochemical properties (including ground-state and response properties) per

conformer computed at the tightly converged PBE0+MBD level of theory for gas-phase molecules, whereas PBE0+MBD with the modified Poisson-Boltzmann (MPB) model of water was used for solvated

molecules. By addressing both molecule-solvent and dispersion interactions, AQM dataset can serve as a challenging benchmark for state-of-the-art machine learning methods for property

modeling and _de novo_ generation of large (solvated) molecules with pharmaceutical and biological relevance. SIMILAR CONTENT BEING VIEWED BY OTHERS QMUGS, QUANTUM MECHANICAL PROPERTIES OF

DRUG-LIKE MOLECULES Article Open access 07 June 2022 QM7-X, A COMPREHENSIVE DATASET OF QUANTUM-MECHANICAL PROPERTIES SPANNING THE CHEMICAL SPACE OF SMALL ORGANIC MOLECULES Article Open

access 02 February 2021 THE QD_Π_ DATASET, TRAINING DATA FOR DRUG-LIKE MOLECULES AND BIOPOLYMER FRAGMENTS AND THEIR INTERACTIONS Article Open access 25 April 2025 BACKGROUND & SUMMARY

INTRODUCTION In pharmaceutical research and development, computational chemistry can play an integral role in expediting candidate drugs into the clinic. Particularly, quantum-mechanical

(QM) methods (_e.g_., density-functional theory (DFT), post-Hartree-Fock approaches, and quantum Monte Carlo) have been utilized to describe covalent and non-covalent interatomic

interactions and to estimate diverse physicochemical properties of molecular systems1,2. QM methods can for instance be used to understand the reactivity of covalent binders3,4, evaluate

conformational energy landscapes of ligands5, study the stability of potential active pharmaceutical ingredients, calculate theoretical acidity constants6, or calculating theoretical charges

to more accurately capture electrostatic properties and surfaces. However, the computational cost and the challenge of conducting QM calculations at a large scale present a limitation to

their widespread use in drug discovery pipelines. Scanning conformational landscapes through QM calculations performed at DFT levels of theory typically takes several hours for a single

ligand of typical pharmaceutical size (_e.g_., 30-40 heavy atoms) on a single computer. As these calculations are readily parallelizable, supercomputers can be employed to enhance throughput

enabling the screening of a few hundred compounds, but it remains challenging to perform these calculations at the scale of virtual libraries which can easily be composed of tens of

thousands of compounds. Accelerated QM methods have emerged as promising solutions in recent years, offering a balance between accuracy and computational efficiency. These can take the form

of quantum fragmentation methods7,8, semi-empirical methods (_e.g_., parametric method series9, density functional tight-binding (DFTB)10,11 or its extended version (GFNn-xTB)12,13) as well

as machine learning (ML) models14,15,16,17,18,19,20 capable of optimizing geometries or estimating physicochemical properties. The resulting acceleration enables researchers to include

QM-based knowledge as a part of their workflow. Accordingly, relevant QM datasets of small organic molecules have widely assisted the development of ML-based approaches for a fast and

accurate estimation of structural, vibrational, and electronic properties of complex organic molecules21,22,23,24. Among them, one can find QM725,26,27, QM928,29, QM7-X30, MD1715,31, MD2232,

ANI-114,33, ANI-1x/ANI-1ccx34 and AIMNet-NSE35. While these QM datasets have significantly advanced the field of computational chemistry, they do exhibit certain limitations that stem from

three important facts. First, they primarily consist of molecules that are considerably smaller than what is commonly encountered in modern medicinal chemistry. Second, their structures have

been optimized using theoretical models that do not account for molecule-solvent and collective dispersion interactions. Lastly, they have not fully explored the vast conformational

landscape inherent in these molecules. Especially, the interaction between the molecule and the chemical environment (_i.e_., solvent) is crucial when investigating molecules of

pharmaceutically relevant size, as it is well-known that drug binding does not occur _in vacuo_ but rather _in solutio_36. Indeed, there is extensive literature discussing the solvent

effects on the properties of specific molecular systems37,38,39,40,41. For instance, Gorges _et al_.38 found that solvation can have a substantial effect of several cal/mol ⋅ K on the

entropy of 25 commercially available drug molecules as a result of large conformational changes. The geometry, energetics, HOMO/LUMO energies, dipole moment, and polarizability of

formaldehyde and thioformaldehyde have also been reported to change upon solvation in solvents of low polarity39. Similarly, the molecule-solvent interaction can affect the dynamic stability

and antiviral inhibitory potential of Cissampeline40 as well as the chemical reaction type SN2 between Cl− and CH3Cl41. Omitting solvent effects can thus result in inappropriate treatment

of conformations, tautomers, physicochemical properties, or molecular reactivity. In the computational modeling of molecular systems, solvents can either be considered implicitly or

represented explicitly. However, owing to the computational cost and intricate nature of explicitly representing the solvent, most QM studies opt for the utilization of implicit solvent

models such as conductor-like screening model for real solvents (COSMO-RS)42, modified Poisson-Boltzmann (MPB)43, and Generalized Born (GB)44 model augmented with the hydrophobic solvent

accessible surface area term (GBSA)45. Lately, to overcome the molecular size limitation within benchmark QM datasets, several efforts have been made to generate datasets that

comprehensively explore the conformational space of large and flexible molecules along with their associated QM properties calculated in gas phase or solvent, see Table 1. For instance, the

QMugs46 collection comprises 19 QM properties of circa 2 M gas-phase conformers of 665, 911 molecules with up to 100 non-hydrogen atoms computed using _ω_B97X-D47 density functional and the

def2-SVP basis set. The OE6248 dataset covers 61, 489 molecules with up to 92 non-hydrogen atoms that were optimized in gas phase using PBE(tight) level of theory supplemented with

Tkatchenko-Scheffler van der Waals (TS) interaction49. This dataset also contains 3 QM properties for 30, 876 structures evaluated using PBE0(tight) level of theory together with implicit

water defined by the Multipole Expansion (MPE) model50. Regarding the vast GEOM collection (which stands for Geometric Ensemble Of Molecules)51, only 1.3 M conformers corresponding to 1, 511

BACE52 molecules were generated considering molecule-solvent interactions described by the analytical linearized Poisson-Boltzmann (ALPB)53 model of water. From here, 455, 000 conformers of

534 BACE molecules were selected and used for further geometry optimization calculations using r2scan-3c functional with C-PCM54 (which stands for conductor-like polarizable continuum

model) implicit model of water and, posteriorly, 6 QM properties were collected. Moreover, Eastman _et al_.55 have recently introduced the SPICE dataset (which is short for

Small-molecule/Protein Interaction Chemical Energies) that explicitly considers the interaction between the molecule and water molecules _via_ Amber14 classical force field to get 1, 300

(equilibrium and non-equilibrium) structures of 26 amino acids. Energies, atomic forces, and other 6 QM properties were computed using the _ω_B97M-D3(BJ) functional and def2-TZVPPD basis

set. Despite these efforts, challenges remain to enable a better understanding of solvent effects as well as collective dispersion interactions in the chemical space of large drug-like

molecules, including: (i) assessing the accuracy and reliability of QM structures and properties with respect to the employed density-functional approximation, especially for larger and more

flexible molecules in which van der Waals (vdW) and molecule-solvent interactions are stronger, (ii) offering a large set of molecular (global) and atom-in-a-molecule (local)

physicochemical properties that would enable a comprehensive exploration of these interactions in structure-property and property-property relationships throughout chemical space, and (iii)

providing accurate and reliable QM data that will enable the construction of models for describing covalent and non-covalent vdW interactions in large (solvated) molecules. In this work, we

introduce the Aquamarine (AQM) dataset with the aim of addressing these challenges. The current version of AQM contains an extensive conformational sampling of 1, 653 molecules with up to 54

(mean: 28.2) non-hydrogen atoms (including C, N, O, F, P, S, and Cl), producing a total of 59, 783 low-and high-energy conformers with a total number of atoms _N_ ranging from 2 until 92

(mean: 50.9), see Fig. 1. In doing so, QM conformers were generated using the conformational search workflow implemented in CREST code56 (which is short for Conformer-Rotamer Ensemble

Sampling Tool) that considers semi-empirical GFN2-xTB13 with GBSA implicit solvent model of water45. Since vdW interactions have a significant impact on the conformations of large molecules,

we have optimized a set of representative conformers using third-order DFTB method10,57,58 (or DFTB3) supplemented with a treatment of many-body dispersion (MBD) interactions59,60,61,62

(see “Methods”). Moreover, to have a better understanding of solvent effects, we have performed these calculations in gas phase and in implicit water described by the GBSA model. For each of

the (gas-phase and solvated) optimized conformers, AQM also includes an extensive number (over 40) of global (molecular) and local (atom-in-a-molecule) QM properties computed at a high

level of theory that depends on the chemical environment used during the geometry optimization. The majority of QM properties for gas-phase structures were evaluated using non-empirical

hybrid DFT with MBD interactions (_i.e_., PBE0+MBD) in conjunction with tightly-converged numeric atom-centered orbitals63. In addition, MPB implicit solvent model of water was considered to

obtain the properties for solvated structures. Hence, we have two different AQM subsets, namely AQM-gas and AQM-sol, which contain the QM structural and property data of molecules in gas

phase and implicit water, respectively. Based on its design, AQM holds the potential to enhance the comprehension of the influence of molecule-solvent and collective dispersion interactions

in structure-property and property-property relationships of molecules of pharmaceutically relevant size and composition. KEY ADVANCEMENTS Our main motivation for proposing AQM as a

benchmark dataset is to advance in the development of the next generation of ML models that enable a fast and accurate property estimate and ideation of drug-like molecules synthesized in a

chemical environment. In pursuit of this aim, we have ensured that the AQM dataset exhibits the following characteristics, * The idea of combining the CREST conformational search workflow

with the subsequent DFTB3+MBD geometry optimization, both in gas phase and implicit water, has provided us with access to a more extensive and reliable exploration of low- and high-energy

(compact/extended) conformers of large molecules. To the best of our knowledge, this procedure has not been considered in previous works. Also, notice that MBD interaction is a key factor in

accurately describing and identifying diverse conformations in large molecular complexes, due to their anisotropic shapes. * The AQM dataset provides a more accurate set of over 40 global

(molecular) and local (atom-in-a-molecule) QM properties of gas-phase and solvated structures when compared to already public datasets (see Table 1). These properties can assist in the

estimation and comprehension of the impact of molecule-solvent interactions in the structure-property and property-property relationships of large molecules, _e.g_., _via_ delta learning

approach. * The property data stored in AQM-gas and AQM-sol can potentially be used to develop more robust QM descriptors for large molecules, enabling fast and accurate calculations of

their physicochemical or biological properties. Moreover, such quantum property-based molecular descriptors are complementary to the widely used geometric descriptors and both can be used in

synergy for estimating molecular observables measured in experiment. * The gas-phase and solvated conformations of AQM molecules, along with their highly accurate QM properties make the AQM

dataset a valuable resource for in silico assisted methods. METHODS SELECTION OF REPRESENTATIVE CHEMISTRIES We sought to select a set of compounds from the public domain that approximate a

typical corporate library including H, C, N, O, F, P, Cl, and S atoms. To this end, we sampled 5000 compounds from ChEMBL64 and compared them to the Johnson & Johnson Innovative

Medicines corporate database. Compounds with molecular weights over 1200, more than 30 rotatable bonds, a quantitative estimate of drug-likeness (QED) score65 under 0.4, and heavy atom count

over 200 were removed. Then, we got a reduced ChEMBL set of compounds with similar molecular weights, numbers of rotatable bonds, and fraction of sp3 to the corporate database (see Fig. S1

of the Supplementary Information (SI)). This subset was subjected to diversity selection, followed by manual inspection to remove molecules containing undesirable or unusual chemical

substructures. As a result, we initially selected SMILES (which stands for Simplified Molecular Input Line Entry System) of molecular building blocks and typical lead-like compounds as well

as a few protein degraders and macrocycles to produce a total of 2, 635 unique molecules with up to 60 non-hydrogen atoms (_N _≤ 116). To have a more extensive sampling of the chemical space

described by the selected SMILES, we generated all possible stereoisomers for each structure using the RDKit code—an open source toolkit developed for cheminformatics66,67. In our script,

we generate unique stereoisomers with the option to perturb each stereocenter while keeping the same atomic connectivity (_i.e_., tautomers are not considered), yielding a larger number of

isomers whose stability is later checked _via_ quantum mechanical (QM) calculations. Accordingly, the new number of molecular structures considering the stereoisomers is circa 10 k. Initial

3D structures were subsequently generated with RDKit and optimized using the MMFF94 force field68,69,70,71,72. GENERATION OF MOLECULAR CONFORMERS Conformational sampling plays a crucial role

in the generation of AQM dataset. We have meticulously explored different conformational search workflows to identify the most effective approach for comprehensively sampling the potential

energy surface (PES) and the molecular property space of large drug-like molecules (see “Technical Validation” and Sec. 2 of the SI). In doing so, we opted to use the approach implemented in

CREST56 code which uses extensive sampling based on the much faster and yet reliable semi-empirical extended tight-binding method (GFN2-xTB)12,13 to generate 3D conformations. The

semi-empirical energies and structures are thought to be more accurate than classical force fields, accounting for electronic effects, rare functional groups, and bond-breaking/formation of

labile bonds13,46,51,73. Moreover, the CREST search algorithm is based on metadynamics (MTD), a well-established thermodynamic sampling approach that can efficiently explore the low-energy

search space. The collective variables used for the MTD sampling are the atomic root-mean-squared deviation (RMSD) values between the previous structures on the PES of a given molecule56.

The atomic RMSD values are introduced into the expression of the bias potential, which is used to compute the guiding forces. These forces are responsible for driving the structure further

away from previous geometries, providing an extensive exploration of PES. Conformers are thus generated in an iterative manner of MTD and GFN2-xTB optimization, where those geometries are

added to the conformer rotamer ensemble (CRE) that overcome certain energy ( > 12.0 kcal/mol) and root-mean-square deviation (Δ_R_ > 0.1 Å) thresholds concerning the input structure.

The procedure is restarted using the conformer as input if a new conformer has a lower energy than the input structure. The three conformers of lowest energy undergo two normal molecular

dynamics (MD) simulations at 400K and 500K, which are used to sample low-energy barrier crossings, such as simple torsional motions. Finally, a genetic Z-matrix crossing algorithm is used

and the results are added to the CRE. Then, a normal-type convergence optimization separates the geometries into conformers, rotamers, and duplicates, where duplicates are deleted and

conformers and rotamers added to the CRE. Both geometry optimization and conformational search calculations were carried out considering implicit water described by the Generalized Born (GB)

model augmented with the hydrophobic solvent accessible surface area term (GBSA)45. Finally, we obtained 2, 242, 490 conformers for the initial set of 2, 635 molecules. Unlike other already

public datasets of conformers of large molecules (see Table 1), we have here defined a method to select a set of representative conformers per molecule (_i.e_., per SMILE) instead of

considering all conformers generated by CREST. The purpose of using this method is to filter out conformers that are similar in regions of the chemical space defined by the atomic structure,

total energy _E_tot and many-body dispersion (MBD) energy _E_MBD. We have here considered _E_MBD due to its relevance in the definition of stability rankings in large molecules and

molecular crystals59,60,61,62. Accordingly, our initial step involves determining clusters that consist of conformers exhibiting a root-mean-square deviation (Δ_R_) between their structures

of less than 1.5 Å. Δ_R_ is computed with the help of DockRMSD tool74. After clustering the conformers, we obtained _E_tot and _E_MBD of all conformers per cluster _via_ single-point

calculations using third-order self-consistent charge density functional tight binding (DFTB3)10,57,58 supplemented with a treatment of MBD interactions59,60,61,62, making use of 3ob

parameters75,76. Then, we select the conformers with the most distinct values of both energies (_i.e_., _E_tot > 0.24 eV and _E_MBD > 0.048 eV) per cluster, see Fig. 1. This

exemplifies a new approach for selecting conformers of large molecules within chemical space, utilizing an in-depth analysis of their electronic properties. To showcase its efficacy, we have

considered only 1, 653 molecules ( ≈ 60% of total initial unique molecules) with up to 54 non-hydrogen atoms (_N _≤ 92), reducing the number of conformers for these molecules from 280, 182

to 59, 783. While this method does indeed yield a more diverse set of conformers, it remains essential to confirm the energetic and mechanical stability of these molecular structures,

especially, taking into account the treatment of MBD interactions. To maintain consistency with our earlier publication of the QM7-X dataset for small organic molecules, we have conducted

the geometry optimization calculations utilizing the DFTB3+MBD level of theory. Moreover, to construct a dataset that can be used to understand the influence of molecule-solvent interaction

on the physicochemical properties of large drug-like molecules, our generation procedure considers the optimization of structures in gas phase and in implicit water described by the GBSA

model, as implemented in the DFTB+ code77. We have stored the gas-phase and solvated optimized structures into two subsets named AQM-gas and AQM-sol, respectively. These DFTB calculations

were performed by interfacing DFTB+ code with the Atomic Simulation Environment (ASE)78. Despite the majority of these molecular structures being identified as local minima at the DFTB3+MBD

level, both in gas phase and in implicit water, it is worth mentioning that some of them are situated at saddle points on the respective PES. CALCULATION OF PHYSICOCHEMICAL PROPERTIES These

≈ 60 k DFTB optimized structures were now utilized for more accurate QM single-point calculations using dispersion-inclusive hybrid DFT. Energies, forces, and several other physicochemical

properties (as detailed in Table 2) were calculated at a higher level of theory that varied depending on the chemical environment used in the structure optimization process. Property

calculations for AQM-gas molecules were computed using PBE0+MBD59,79,80 level, while, for AQM-sol molecules, the modified Poisson-Boltzmann (MPB)43,81 model of water was also considered. The

MPB model solves the size-modified Poisson-Boltzmann equation for the implicit inclusion of electrolytic solvation effects into DFT calculations. It also includes a model for the well-known

Stern layer that separates the diffusing ions from the solvation cavity by introducing non-mean-field ion-solute interactions. For these calculations, we have used the FHI-aims code82,83

(version 221103) together with “tight” settings for basis functions and integration grids. Energies were converged to 10−6 eV and the accuracy of the forces was set to 10−4 eV/Å. The

convergence criteria used during self-consistent field (SCF) optimizations were 10−3 eV for the sum of eigenvalues and 10−6 electrons/Å3 for the charge density. The MBD energies and MBD

atomic forces were here computed using the range-separated self-consistent screening (rsSCS) approach60, while the atomic _C_6 coefficients, isotropic atomic polarizabilities, molecular _C_6

coefficients and molecular polarizabilities (both isotropic and tensor) were obtained _via_ the SCS approach59. Hirshfeld ratios correspond to the Hirshfeld volumes divided by the free atom

volumes. The TS dispersion energy refers to the pairwise Tkatchenko-Scheffler (TS) dispersion energy in conjunction with the PBE0 functional49. The vdW radii were also obtained using the

SCS approach _via_\({R}_{{\rm{vdW}}}={\left({\alpha }^{{\rm{SCS}}}/{\alpha }^{{\rm{TS}}}\right)}^{1/3}{R}_{{\rm{vdW}}}^{{\rm{TS}}}\), where _α_TS and \({R}_{{\rm{vdW}}}^{{\rm{TS}}}\) are the

atomic polarizability and vdW radius computed according to the TS scheme, respectively. Atomization energies were obtained by subtracting the atomic PBE0 energies from the PBE0 total energy

of each gas-phase and solvated molecular conformation (see Table S1 of the SI). The exact exchange energy is the amount of exact (or Hartree-Fock) exchange that has been admixed into the

exchange-correlation energy. DATA RECORDS The AQM dataset is provided in two HDF5 files in a ZENODO.ORG data repository84. The QM structural and property data of the 59, 783 conformations

corresponding to 1, 653 molecules in both gas phase and implicit water were stored in the AQM-gas.hdf5 and AQM-sol.hdf5 files, respectively. Additionally, we have uploaded the

AQM-initial.hdf5 file which only contains the structural data of the 2, 242, 490 conformations corresponding to the initial set of 2, 635 molecules (obtained by using CREST code). One can

also find there a README file with technical usage details and an example of how to access the information stored in AQM (see readAQM.py file). HDF5 FILE FORMAT Independent of the AQM

subset, the information for each molecular structure is stored in a Python dictionary (dict) type containing all relevant properties and recorded in _groups_ in HDF5 file format30. HDF5 keys

to access the atomic numbers, atomic positions (coordinates), and physicochemical properties in each dictionary are provided in Table 2. The dimension of each array depends on the number of

atoms _N_ and the required property, _e.g_., for a methane (CH4) molecule, ’atNUM’ is a 1D array of _N_ = 5 elements ([6, 1, 1, 1, 1]) while ’atXYZ’ is a 2D array comprised of _N_ = 5 rows

and three columns (_x_, _y_, _z_ coordinates). All structures are labeled as _Geom-mr-ct_, where _r_ enumerates the SMILE strings and _t_ the considered conformer. Note that the indices _t_

used in the AQM dataset reflect the order in which a given structure was generated and do not correspond to sorted xTB/DFTB (or DFT) total energies. TECHNICAL VALIDATION A significant

challenge in simulating the physicochemical properties of large drug-like molecules lies in the fact that, in experiments, their conformations and electronic structures are influenced by

interactions with the surrounding solvent. However, the standard approach in contemporary QM simulations involves running them in gas phase, without accounting for molecule-solvent

interactions. Unlike another recently published dataset of large molecules (see Table 1), AQM dataset considers the molecule-solvent interactions as well as a treatment of van der Waals

(vdW) interactions in its generation procedure—two important physical and chemical effects in determining structural conformations and stability rankings of molecules of pharmaceutically

relevant size. As mentioned above, the AQM comprises the structural and electronic data of 59, 783 gas-phase and solvated (low-and high-energy) conformers of 1, 653 molecules with up to 54

non-hydrogen atoms (_N _≤ 92), including C, N, O, F, P, S and Cl. The structures of solvated conformers were obtained using DFTB3+MBD method supplemented with the GBSA implicit model of

water. This model has been successfully used in the study of free solvation energies of neutral/ionic molecules45 and the folding of short peptides85. Whereas, the level of theory selected

to compute the QM properties per conformer was PBE0+MBD supplemented with the MPB implicit model of water. The MPB model has been shown to provide a more accurate description in the study of

diverse electrochemical reactions86,87,88,89. In all calculations, many-body dispersion (MBD) interactions have been included to deal with long-range interactions that are not adequately

represented by the baseline level of theory. These advanced theoretical models have thus generated a more accurate collection of molecular and atom-in-a-molecule (as well as ground state and

response) QM properties of conformers in implicit water, which are stored in AQM-sol. Moreover, when integrated with the property information in AQM-gas, these data can assist in

fine-tuning ML models for the precise estimation of electronic properties of solvated molecules, _e.g_., _via_ a delta learning approach. An essential step in the generation of AQM dataset

involved the thoughtful selection of the conformational search workflow. Here, we exhaustively analyzed the sampling method implemented in four different codes: CREST56, Maestro, Omega90 and

RDKit66,91. The last three codes are of standard use in cheminformatics, primarily relying on stochastic algorithms for their application. They explore the conformational space of molecules

very sparsely through a combination of pre-defined distances and stochastic samples92 and can miss many low-energy conformations. Moreover, in most standalone applications, conformer

energies are typically computed using classical force fields without the incorporation of solvent models, which made these values rather inaccurate93 (for more details of these methods see

Sec. 2 of the SI). On the contrary, CREST code utilizes a robust sampling strategy, leveraging the semi-empirical extended tight-binding method (GFN2-xTB)13, supplemented by the GBSA

implicit solvent model of water, in order to generate more reliable 3D conformations compared to those obtained using classical force fields (see “Methods”). The conformational search

workflow implemented in CREST provides access to a more extensive exploration of low- and high-energy conformers, generating molecular structures inaccessible by distance geometry methods

(see Fig. 2)46,51,73. To gain a better understanding of the influence of these sampling methods on the conformational search of large drug-like molecules, we analyzed the structural and

energetic data of conformers corresponding to 18 randomly selected compositions, each containing approximately _N_ = 50 atoms. Fig. 2(a) displays the output for the variation of the averaged

number of clusters, denoted as ⟨_M_⟩, as a function of the root-mean-square deviation, Δ_R_, among conformers that constitute a cluster. This calculation was first done per molecule and

then averaged over the 18 cases. This showed that more diverse conformers become part of the same cluster when Δ_R_ increases, resulting in a reduction of the number of distinct clusters.

The decrease in ⟨_M_⟩ for Maestro and Omega is faster compared to CREST, which may indicate that conformational search workflows based on distance geometry methods are insufficient for

probing the PES of more flexible molecules. However, thanks to the random ensemble option for conformer generation, RDKit produces a larger ⟨_M_⟩ than CREST for Δ_R_ > 1.0 Å. To examine

this result further, we compute _E_AT and _E_MBD for a set of representative conformers per molecule (see selection criterion in “Methods”) using DFTB3+MBD level of theory. Notice that the

resulting total number of conformers depends on the code employed for their generation, _i.e_., CREST → 3747 conformers, Maestro → 100 conformers, Omega → 204 conformers and RDKit → 1872

conformers. Thus, we have found that, compared to other methods, the conformers generated by CREST code show a more organized coverage of the energetic space defined by DFTB-_E_AT and

DFTB-_E_MBD (see the well-defined cluster in Fig. 2(b)). Moreover, the larger coverage of DFTB-_E_MBD values is a clear indicator that CREST sampling method can generate conformations of

increased complexity, characterized by a more folded structure and a stronger dispersion interaction, as illustrated with the inserted structures in Fig. 2(b). This underscores the relevance

of taking vdW interactions into account when generating and identifying conformers of large drug-like molecules. Similarly, this provides compelling evidence of the efficient conformational

search workflow implemented in CREST code, which improves coverage of both conformational and property molecular space. It is noteworthy that the calculations executed by the CREST code

incurred higher computational expenses compared to those carried out by chemoinformatics codes. After generating all conformers and selecting the representative conformers for each molecule,

we proceed to optimize the molecular structures. This optimization is carried out using the DFTB3+MBD level of theory in the gas phase as well as in implicit water described by the GBSA

model. Once this optimization step is complete, we obtain the final set of molecular structures for AQM-gas (gas phase) and AQM-sol (implicit water) subsets. To quantitatively analyze the

impact of molecule-solvent interaction on the structure of AQM molecules, we have computed Δ_R_ between the molecular structures stored in AQM-gas and AQM-sol. Indeed, Fig. 3(a) displays the

size dependence of the averaged Δ_R_ (blue dots) together with the total range of Δ_R_ values spanned at different _N_ (blue shadow). The results show that the structures of small molecules

(_N_ ≤ 20 atoms) present ⟨Δ_R_⟩ < 0.1 Å, _i.e_., they are minimally affected by the interaction with the solvent. In contrast, when dealing with molecules with _N_ > 40 atoms, solvent

effects become significantly more pronounced, and, as a result, we observe greater deviations in the Δ_R_ values (> 2.0 Å). A similar effect can be observed when comparing the gyration

radius _R__g_ for gas-phase and solvated molecular structures, see Fig. 3(b). Notice that there also are large compounds (_N_ ≈ 90) characterized by extensively constrained structures, which

remain unaffected by the interaction with implicit water. These findings hold significant relevance in the context of advancing QM-based pipelines for the creation of datasets of molecules

of pharmaceutical relevant size. Particularly in cases where the research objectives encompass the generation of non-equilibrium structures for training ML force fields since the PES of

these molecules will be largely modified by solvent effects—a phenomenon that can be inferred from the outcomes of the geometry optimization process. The AQM dataset considers an extensive

array of more than 40 distinct molecular (global) and atom-in-a-molecule (local) QM properties, which were computed to gain insights into the effect of molecule-solvent interactions on

structure-property and property-property relationships of large molecules. In Table 2, we list the properties of gas-phase and solvated molecules, derived from QM calculations performed at

the PBE0+MBD level (AQM-gas) and further enhanced with the MPB implicit solvent model of water (AQM-sol), respectively. PBE0+MBD has been chosen as our baseline level of theory for property

calculations due to its well-established accuracy and reliability, demonstrated in the description of intramolecular degrees of freedom as well as intermolecular interactions in organic

molecular dimers, supramolecular complexes, and molecular crystals59,79,80,94,95,96,97. The use of these DFT methods also provides interesting insights into the effect of molecule-solvent

interactions on the potential energy surface of molecules, highlighting the importance of the dataset generation procedure. In Fig. S2 of the SI, one can see that the energy range and the

energetic ranking of molecules in AQM-gas and AQM-sol are different, but further analysis is required. We therefore consider that our QM calculations are suitable to validate the quality of

future research utilizing the AQM dataset. To understand the relevance of accessing QM data for solvated molecules, we first discuss the influence of implicit water on extensive and

intensive molecular QM properties. In doing so, as an illustrative example, we have analyzed the 2D property space defined by two contrasting properties98,99 such as the isotropic molecular

polarizability _α_ and the HOMO-LUMO gap _E_gap (_i.e_., \(\left(\alpha ,{E}_{{\rm{gap}}}\right)\)-space) for the 59, 783 conformations in AQM-gas and AQM-sol as well as the set of most

stable conformer per molecule in AQM-sol (only 1, 653 conformations), see Fig. 4(a). For comparison, the values corresponding to QM7-X equilibrium molecules are also plotted (green circles).

Our findings reveal that AQM molecules exhibit a significantly broader coverage of the _α_ range, surpassing QM7-X molecules by a factor of 6. This expanded coverage is attributed to the

inherently extensive character of _α_. Whereas, _E_gap range now covers molecules with circa 2.5 eV of energy gap, and the mean value for the entire dataset reduced from 7.0 eV to 4.5 eV,

see distribution plots on the top panel of Fig. 4(a). The slight differences between the \(\left(\alpha ,{E}_{{\rm{gap}}}\right)\)-space covered by AQM-gas and AQM-sol may be attributed to

compensation between the pronounced fluctuations in _α_, which are predominantly observed in molecules with \(\alpha > 300\,{a}_{0}^{3}\), and the more sensitive behavior of _E_gap to the

presence of implicit water, as displayed in the correlation plots in Fig. 4(b). Accordingly, these findings could carry crucial implications in the “freedom of design” when searching for

large drug-like molecules with targeted \(\left(\alpha ,{E}_{{\rm{gap}}}\right)\) values98,99,100. Notice that the conformational sampling per molecule largely improved the coverage of both

properties, connecting isolated regions associated with a single molecular structure with specific size and chemical composition. Fig. 4(b) also shows the correlation plots between HOMO

energy _E_HOMO and total dipole moment _D__s_ of the 59, 783 conformers contained in AQM-gas and AQM-sol. Thus, it becomes evident that intensive properties are more sensitive to the

incorporation of molecule-solvent interactions in the QM calculations when contrasted with extensive properties. On the other hand, atom-in-a-molecule QM properties can provide important

insight into the distinct chemical environments within large drug-like molecules. To illustrate this, Fig. 5(a) shows the 2D property space defined by Hirshfeld charges _q_H and atomic

polarizabilities \({\widetilde{\alpha }}_{{\rm{s}}}\) (_i.e_., \(\left({q}_{{\rm{H}}},{\widetilde{\alpha }}_{{\rm{s}}}\right)\)-space) for the 59, 783 conformations in AQM-gas and AQM-sol.

For comparison, the values corresponding to QM7-X equilibrium molecules are also plotted (green circles). Fig. 5(a) shows the existence of well-defined clusters that are mostly related to a

specific atom type. The slight overlap between these clusters is a clear example of the need to develop more robust geometric and electronic descriptors capable of effectively representing

intricate chemical environments in large drug-like molecules for ML applications. Furthermore, our calculations have revealed that implicit solvation has a pronounced influence on _q_H

values, particularly for heavier atoms such as P, S, and Cl—relevant atoms in the design of pharmaceutical compounds as well as in the determination of their physicochemical and biological

properties. Certainly, the molecule-solvent interaction has a stronger effect on the local properties compared to global ones, as illustrated by the correlation plots in Fig. 5(b). This

becomes more evident by observing the atomic forces _F_tot, where the significant variations in values can strongly affect the accuracy of ML force fields when applied to run the dynamics of

large molecules. Up to now, we have been focused on the impact of considering molecule-solvent interaction when computing molecular/atom-in-a-molecule QM properties of AQM molecules.

However, the data of the non-electrostatic part of solvation energy due to molecule-solvent interaction _E_nelec in conformers contained in AQM-sol can also be crucial for having a better

understanding of the solvation effect on structure-property and property-property relationships of large molecules. As an example, Fig. 6(a) shows the correlation plot between _E_nelec and

dispersion interaction energy _E_disp calculated using two well-established methods: many-body dispersion (MBD) and Tkatchenko-Scheffler (TS). The datapoints are colored according to the

gyration radius _R_g of each solvated structure. The high degree of correlation between these properties underscores the importance of considering both molecule-solvent and dispersion

interactions when investigating large molecules, as we did to generate AQM dataset. Moreover, the growing difference in energies obtained by MBD and TS methods, particularly as the system

size increases, highlights the significant influence of many-body interactions in the energetic description of these compounds. To further elucidate the role of both interactions in the

generation of AQM, we have selected the molecule C29H39N5O3S2 (_N_ = 78 atoms, ID in dataset: 2070) with 720 conformers and then examined their respective energy values. These conformers

show a difference in dispersion energies Δ_E_disp of up to ≈1.0 eV, where the smallest Δ_E_disp values correspond to more compact molecular structures while the largest Δ_E_disp values are

observed for more extended ones (see Fig. 6(b)). Besides presenting a size dependence, the data plotted in Fig. 6(c) demonstrate that, similar to _E_MBD, _E_nelec also depends on the

structural conformation of molecules. These findings show the significance of both interactions in the generation of a robust QM dataset comprising large and more flexible molecules that

bear pharmaceutical relevance. In summary, we demonstrated that the extensive structural and property data contained in AQM dataset hold the potential to enhance the understanding of how

molecular-solvent interactions influence both structure-property and property-property relationships of large drug-like molecules. As such, AQM may in some cases be employed as a benchmark

dataset for direct/delta learning and generative methods, estimating the properties of pharmaceutical compounds in solution from their gas-phase counterparts. CODE AVAILABILITY The initial

structure generation was carried out using RDKit 2020.09.566,67. Further structure optimization and the creation of conformers were performed by utilizing CREST13,56 and DFTB+10,77 codes

together with ASE78. Note that all necessary features regarding the utilized DFTB3+MBD (with and without GBSA implicit solvent) approach are available in the current DFTB+ version11. All DFT

calculations were carried out using FHI-aims82 (version 221103). Additional conformer generation experiments were performed with RDKit 2020.09.5, OpenEye’s Omega 4.0.0.490 and Schrodinger’s

Maestro suite v2020-4 (see SI for detailed procedures). REFERENCES * Friesner, R. A. ab initio quantum chemistry: Methodology and applications. _Proc. Natl. Acad. Sci._ 102, 6648–6653

(2005). Article ADS CAS PubMed PubMed Central Google Scholar * Marzari, N., Ferretti, A. & Wolverton, C. Electronic-structure methods for materials design. _Nat. Mater._ 20,

736–749 (2021). Article ADS CAS PubMed Google Scholar * Palazzesi, F., Grundl, M. A., Pautsch, A., Weber, A. & Tautermann, C. S. A fast ab initio predictor tool for covalent

reactivity estimation of acrylamides. _J. Chem. Inf. Model_ 59, 3565–3571 (2019). Article CAS PubMed Google Scholar * Mihalovits, L. M., Ferenczy, G. G. & Keserũ, G. M. Affinity and

selectivity assessment of covalent inhibitors by free energy calculations. _J. Chem. Inf. Model_ 60, 6579–6594 (2020). Article CAS PubMed Google Scholar * Hofmans, S. _et al_. Tozasertib

analogues as inhibitors of necroptotic cell death. _J. Medicinal Chem_ 61, 1895–1920 (2018). Article CAS Google Scholar * Prasad, S., Huang, J., Zeng, Q. & Brooks, B. R. An

explicit-solvent hybrid QM and MM approach for predicting pKa of small molecules in SAMPL6 challenge. _J. Comput. Mol. Des._ 32, 1191–1201 (2018). Article CAS Google Scholar *

Raghavachari, K. & Saha, A. Accurate composite and fragment-based quantum chemical models for large molecules. _Chem. Rev._ 115, 5643–5677 (2015). Article CAS PubMed Google Scholar *

Pruitt, S. R., Bertoni, C., Brorsen, K. R. & Gordon, M. S. Efficient and accurate fragmentation methods. _Acc. Chem. Res._ 47, 2786–2794 (2014). Article CAS PubMed Google Scholar *

Stewart, J. J. P. Optimization of parameters for semiempirical methods II. applications. _J. Comput. Chem._ 10, 221–264 (1989). Article CAS Google Scholar * Seifert, G., Porezag, D. &

Frauenheim, T. Calculations of molecules, clusters, and solids with a simplified LCAO-DFT-LDA scheme. _Int. J. Quantum Chem._ 58, 185–192 (1996). Article CAS Google Scholar * Hourahine,

B. _et al_. DFTB+, a software package for efficient approximate density functional theory based atomistic simulations. _J. Chem. Phys_ 152, 124101 (2020). Article ADS CAS PubMed Google

Scholar * Bannwarth, C. _et al_. Extended tight-binding quantum chemistry methods. _WIREs Comput. Mol. Sci._ 11, e1493 (2021). Article CAS Google Scholar * Bannwarth, C., Ehlert, S.

& Grimme, S. GFN2-xTB—an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion

contributions. _J. Chem. Theory Comput._ 15, 1652–1671 (2019). Article CAS PubMed Google Scholar * Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1: An extensible neural network

potential with DFT accuracy at force field computational cost. _Chem. Sci._ 8, 3192–3203 (2017). Article CAS PubMed PubMed Central Google Scholar * Chmiela, S., Sauceda, H. E., Müller,

K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. _Nat. Commun._ 9, 3887 (2018). Article ADS PubMed PubMed Central Google Scholar

* Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. _Nat. Commun._ 8, 13890 (2017). Article ADS

PubMed PubMed Central Google Scholar * Unke, O. T. _et al_. Spookynet: Learning force fields with electronic degrees of freedom and nonlocal effects. _Nat. Commun._ 12, 7273 (2021).

Article ADS CAS PubMed PubMed Central Google Scholar * Batzner, S. _et al_. E (3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. _Nat.

Commun._ 13, 2453 (2022). Article ADS CAS PubMed PubMed Central Google Scholar * Musaelian, A. _et al_. Learning local equivariant representations for large-scale atomistic dynamics.

_Nat. Commun._ 14, 579 (2023). Article ADS CAS PubMed PubMed Central Google Scholar * Batatia, I. _et al_. (eds.) Advances in Neural Information Processing Systems, vol. 35,

11423–11436 (Curran Associates, Inc., 2022). * Huang, B., von Rudorff, G. F. & von Lilienfeld, O. A. The central role of density functional theory in the AI age. _Science_ 381, 170–175

(2023). Article ADS CAS PubMed Google Scholar * Kulik, H. J. _et al_. Roadmap on machine learning in electronic structure. Electron. _Struct_ 4, 023004 (2022). CAS Google Scholar *

Stöhr, M., Medrano Sandonas, L. & Tkatchenko, A. Accurate many-body repulsive potentials for density-functional tight binding from deep tensor neural networks. _J. Phys. Chem. Lett_ 11,

6835–6843 (2020). Article PubMed Google Scholar * Qiao, Z., Welborn, M., Anandkumar, A., Manby, F. R. & Miller, T. F. OrbNet: Deep learning for quantum chemistry using

symmetry-adapted atomic-orbital features. _J. Chem. Phys_ 153, 124111 (2020). Article ADS CAS PubMed Google Scholar * Blum, L. C. & Reymond, J.-L. 970 million druglike small

molecules for virtual screening in the chemical universe database GDB-13. _J. Am. Chem. Soc._ 131, 8732–8733 (2009). Article CAS PubMed Google Scholar * Montavon, G. _et al_. Machine

learning of molecular electronic properties in chemical compound space. _New J. Phys._ 15, 095003 (2013). Article ADS CAS Google Scholar * Yang, Y. _et al_. Quantum mechanical static

dipole polarizabilities in the QM7b and AlphaML showcase databases. _Sci. Data_ 6, 152 (2019). Article PubMed PubMed Central Google Scholar * Ruddigkeit, L., van Deursen, R., Blum, L. C.

& Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. _J. Chem. Inf. Model._ 52, 2864–2875 (2012). Article CAS PubMed Google

Scholar * Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. _Sci. Data_ 1, 140022 (2014). Article CAS

PubMed PubMed Central Google Scholar * Hoja, J. _et al_. QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules. _Sci.

Data_ 8, 43 (2021). Article CAS PubMed PubMed Central Google Scholar * Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet – a deep learning

architecture for molecules and materials. _J. Chem. Phys_ 148, 241722 (2018). Article ADS PubMed Google Scholar * Chmiela, S. _et al_. Accurate global machine learning force fields for

molecules with hundreds of atoms. _Sci. Adv._ 9, eadf0873 (2023). Article PubMed PubMed Central Google Scholar * Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1, A data set of 20

million calculated off-equilibrium conformations for organic molecules. _Sci. Data_ 4, 170193 (2017). Article CAS PubMed PubMed Central Google Scholar * Smith, J. S. _et al_. The

ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. _Sci. Data_ 7, 134 (2020). Article CAS PubMed PubMed Central Google Scholar *

Zubatyuk, R., Smith, J. S., Nebgen, B. T., Tretiak, S. & Isayev, O. Teaching a neural network to attach and detach electrons from molecules. _Nat. Commun._ 12, 4870 (2021). Article ADS

CAS PubMed PubMed Central Google Scholar * Decherchi, S. & Cavalli, A. Thermodynamics and kinetics of drug-target binding by molecular simulation. _Chem. Rev._ 120, 12788–12833

(2020). Article CAS PubMed PubMed Central Google Scholar * Hirata, F. Molecular theory of solvation, vol. 24 (Springer Science & Business Media, 2003). * Gorges, J., Grimme, S.,

Hansen, A. & Pracht, P. Towards understanding solvation effects on the conformational entropy of non-rigid molecules. _Phys. Chem. Chem. Phys._ 24, 12249–12259 (2022). Article CAS

PubMed Google Scholar * Matczak, P. & Domagała, M. Heteroatom and solvent effects on molecular properties of formaldehyde and thioformaldehyde symmetrically disubstituted with

heterocyclic groups C4H3Y (where Y= O–Po). _J. Mol. Model._ 23, 1–11 (2017). Article CAS Google Scholar * Odey, M. O. _et al_. Unraveling the impact of polar solvation on the molecular

geometry, spectroscopy (ft-ir, uv, nmr), reactivity (elf, nbo, homo-lumo) and antiviral inhibitory potential of cissampeline by molecular docking approach. _Chem. Phys. Impact_ 7, 100346

(2023). Article Google Scholar * Ensing, B., Meijer, E. J., Blöchl, P. & Baerends, E. J. Solvation effects on the sn 2 reaction between ch3cl and cl-in water. _J. Phys. Chem. A_ 105,

3300–3310 (2001). Article CAS Google Scholar * Klamt, A. Conductor-like screening model for real solvents: A new approach to the quantitative calculation of solvation phenomena. _J. Phys.

Chem_ 99, 2224–2235 (1995). Article CAS Google Scholar * Ringe, S., Oberhofer, H., Hille, C., Matera, S. & Reuter, K. Function-space-based solution scheme for the size-modified

poisson–boltzmann equation in full-potential DFT. _J. Chem. Theory Comput._ 12, 4052–4066 (2016). Article CAS PubMed Google Scholar * Onufriev, A. V. & Case, D. A. Generalized born

implicit solvent models for biomolecules. _Annu. Rev. Biophys._ 48, 275–296 (2019). Article CAS PubMed PubMed Central Google Scholar * Xie, L. & Liu, H. The treatment of solvation

by a generalized born model and a self-consistent charge-density functional theory-based tight-binding method. _J. Comput. Chem_ 23, 1404–1415 (2002). Article CAS PubMed Google Scholar *

Isert, C., Atz, K., Jiménez-Luna, J. & Schneider, G. QMugs, quantum mechanical properties of drug-like molecules. _Sci. Data_ 9, 273 (2022). Article CAS PubMed PubMed Central Google

Scholar * Chai, J.-D. & Head-Gordon, M. Systematic optimization of long-range corrected hybrid density functionals. _J. Chem. Phys_ 128, 084106 (2008). Article ADS PubMed Google

Scholar * Stuke, A. _et al_. Atomic structures and orbital energies of 61,489 crystal-forming organic molecules. _Sci. Data_ 7, 58 (2020). Article CAS PubMed PubMed Central Google

Scholar * Tkatchenko, A. & Scheffler, M. Accurate molecular van der Waals interactions from ground-state electron density and free-atom reference data. _Phy. Rev. Lett._ 102, 073005

(2009). Article ADS Google Scholar * Sinstein, M. _et al_. Efficient implicit solvation method for full potential DFT. _J. Chem. Theory Comput._ 13, 5582–5603 (2017). Article CAS PubMed

Google Scholar * Axelrod, S. & Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. _Sci. Data_ 9, 185 (2022). Article

CAS PubMed PubMed Central Google Scholar * Subramanian, G., Ramsundar, B., Pande, V. & Denny, R. A. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based

approaches. _J. Chem. Inf. Model._ 56, 1936–1949 (2016). Article CAS PubMed Google Scholar * Ehlert, S., Stahn, M., Spicher, S. & Grimme, S. Robust and efficient implicit solvation

model for fast semiempirical methods. _J. Chem. Theory Comput._ 17, 4250–4261 (2021). Article CAS PubMed Google Scholar * Barone, V. & Cossi, M. Quantum calculation of molecular

energies and energy gradients in solution by a conductor solvent model. _J. Phys. Chem. A_ 102, 1995–2001 (1998). Article CAS Google Scholar * Eastman, P. _et al_. SPICE, A Dataset of

Drug-like Molecules and Peptides for Training Machine Learning Potentials. _Sci. Data_ 10, 11 (2023). Article CAS PubMed PubMed Central Google Scholar * Pracht, P., Bohle, F. &

Grimme, S. Automated exploration of the low-energy chemical space with fast quantum chemical methods. _Phys. Chem. Chem. Phys._ 22, 7169–7192 (2020). Article CAS PubMed Google Scholar *

Elstner, M. _et al_. Self-consistent-charge density-functional tight-binding method for simulations of complex materials properties. _Phys. Rev. B_ 58, 7260–7268 (1998). Article ADS CAS

Google Scholar * Gaus, M., Cui, Q. & Elstner, M. DFTB3: Extension of the self-consistent-charge density-functional tight-binding method (SCC-DFTB). _J. Chem. Theory Comput._ 7, 931–948

(2011). Article CAS Google Scholar * Tkatchenko, A., DiStasio, R. A. Jr, Car, R. & Scheffler, M. Accurate and efficient method for many-body van der Waals interactions. _Phys. Rev.

Lett._ 108, 236402 (2012). Article ADS PubMed Google Scholar * Ambrosetti, A., Reilly, A. M., DiStasio, R. A. Jr & Tkatchenko, A. Long-range correlation energy calculated from

coupled atomic response functions. _J. Chem. Phys_ 140, 18A508 (2014). Article PubMed Google Scholar * Stöhr, M., Michelitsch, G. S., Tully, J. C., Reuter, K. & Maurer, R. J.

Communication: Charge-population based dispersion interactions for molecules and materials. _J. Chem. Phys_ 144, 151101 (2016). Article ADS PubMed Google Scholar * Mortazavi, M.,

Brandenburg, J. G., Maurer, R. J. & Tkatchenko, A. Structure and stability of molecular crystals with many-body dispersion-inclusive density functional tight binding. _J. Phys. Chem.

Lett_ 9, 399–405 (2018). Article CAS PubMed Google Scholar * Havu, V., Blum, V., Havu, P. & Scheffler, M. Efficient O(N) integration for all-electron electronic structure calculation

using numeric basis functions. _J. Comput. Phys_ 228, 8367–8379 (2009). Article ADS CAS Google Scholar * Gaulton, A. _et al_. ChEMBL: a large-scale bioactivity database for drug

discovery. _Nucleic Acids Res_ 40, D1100–D1107 (2012). Article CAS PubMed Google Scholar * Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the

chemical beauty of drugs. _Nat. Chem._ 4, 90–98 (2012). Article CAS PubMed PubMed Central Google Scholar * Landrum, G. _et al_. RDKit: Open-source cheminformatics.

https://www.rdkit.org (2020). * Landrum, G. _et al_. rdkit/rdkit: 2020_03_1 (q1 2020) release https://doi.org/10.5281/zenodo.3732262 (2020). * Halgren, T. A. Merck molecular force field. I.

basis, form, scope, parameterization, and performance of MMFF94. _J. Comput. Chem._ 17, 490–519 (1996). Article CAS Google Scholar * Halgren, T. A. Merck molecular force field. II. MMFF94

van der Waals and electrostatic parameters for intermolecular interactions. _J. Comput. Chem_ 17, 520–552 (1996). Article CAS Google Scholar * Halgren, T. A. Merck molecular force field.

III. molecular geometries and vibrational frequencies for MMFF94. _J. Comput. Chem._ 17, 553–586 (1996). Article CAS Google Scholar * Halgren, T. A. & Nachbar, R. B. Merck molecular

force field. IV. conformational energies and geometries for MMFF94. _J. Comput. Chem._ 17, 587–615 (1996). Article CAS Google Scholar * Halgren, T. A. Merck molecular force field. V.

extension of MMFF94 using experimental data, additional computational data, and empirical rules. _J. Comput. Chem_ 17, 616–641 (1996). Article CAS Google Scholar * Cremer, J., Medrano

Sandonas, L., Tkatchenko, A., Clevert, D.-A. & De Fabritiis, G. Equivariant graph neural networks for toxicity prediction. _Chem. Res. Toxicol._ 36, 1561–1573 (2023). CAS PubMed PubMed

Central Google Scholar * Bell, E. W. & Zhang, Y. DockRMSD: an open-source tool for atom mapping and RMSD calculation of symmetric molecules through graph isomorphism. _J.

Cheminformatics_ 11, 40 (2019). Article Google Scholar * Gaus, M., Goez, A. & Elstner, M. Parametrization and benchmark of DFTB3 for organic molecules. _J. Chem. Theory Comput._ 9,

338–354 (2013). Article CAS PubMed Google Scholar * Gaus, M., Lu, X., Elstner, M. & Cui, Q. Parameterization of DFTB3/3OB for sulfur and phosphorus for chemical and biological

applications. _J. Chem. Theory Comput._ 10, 1518–1537 (2014). Article CAS PubMed PubMed Central Google Scholar * Aradi, B., Hourahine, B. & Frauenheim, T. DFTB+, a sparse

matrix-based implementation of the DFTB method. _J. Phys. Chem. A_ 111, 5678–5684 (2007). Article CAS PubMed Google Scholar * Larsen, A. H. _et al_. The atomic simulation environment—a

python library for working with atoms. _J. Phys. Condens. Matter_ 29, 273002 (2017). Article Google Scholar * Perdew, J. P., Ernzerhof, M. & Burke, K. Rationale for mixing exact

exchange with density functional approximations. _J. Chem. Phys_ 105, 9982–9985 (1996). Article ADS CAS Google Scholar * Adamo, C. & Barone, V. Toward reliable density functional

methods without adjustable parameters: The PBE0 model. _J. Chem. Phys._ 110, 6158–6170 (1999). Article ADS CAS Google Scholar * Ringe, S., Oberhofer, H. & Reuter, K. Transferable

ionic parameters for first-principles Poisson-Boltzmann solvation calculations: Neutral solutes in aqueous monovalent salt solutions. _J. Chem. Phys_ 146, 134103 (2017). Article ADS PubMed

Google Scholar * Blum, V. _et al_. Ab initio molecular simulations with numeric atom-centered orbitals. _Comp. Phys. Commun._ 180, 2175–2196 (2009). Article ADS CAS Google Scholar *

Ren, X. _et al_. Resolution-of-identity approach to Hartree–Fock, hybrid density functionals, RPA, MP2 and GW with numeric atom-centered orbital basis functions. _New J. Phys._ 14, 053020

(2012). Article ADS Google Scholar * Medrano Sandonas, L. _et al_. Aquamarine: Quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules. _ZENODO_

https://doi.org/10.5281/zenodo.10208010 (2024). * Ho, B. K. & Dill, K. A. Folding very short peptides using molecular dynamics. _PLOS Comput. Biol._ 2, 1–10 (2006). ADS Google Scholar

* Ringe, S. _et al_. Understanding cation effects in electrochemical CO2 reduction. _Energy Environ. Sci._ 12, 3001–3014 (2019). Article CAS Google Scholar * Abidi, N., Lim, K. R. G.,

Seh, Z. W. & Steinmann, S. N. Atomistic modeling of electrocatalysis: Are we there yet? WIREs Comput. _Mol. Sci._ 11, e1499 (2021). Article CAS Google Scholar * Gauthier, J. A. _et

al_. Unified approach to implicit and explicit solvent simulations of electrochemical reaction energetics. _J. Chem. Theory Comput._ 15, 6895–6906 (2019). Article CAS PubMed Google

Scholar * Ringe, S., Hörmann, N. G., Oberhofer, H. & Reuter, K. Implicit solvation methods for catalysis at electrified interfaces. _Chem. Rev._ 122, 10777–10820 (2022). Article CAS

PubMed Google Scholar * Hawkins, P. C., Skillman, A. G., Warren, G. L., Ellingson, B. A. & Stahl, M. T. Conformer generation with omega: algorithm and validation using high quality

structures from the protein databank and cambridge structural database. _J. Chem. Inf. Model_ 50, 572–584 (2010). Article CAS PubMed PubMed Central Google Scholar * Wang, S., Witek, J.,

Landrum, G. A. & Riniker, S. Improving conformer generation for small rings and macrocycles based on distance geometry and experimental torsional-angle preferences. _J. Chem. Inf.

Model_ 60, 2044–2058 (2020). Article CAS PubMed Google Scholar * Spellmeyer, D. C., Wong, A. K., Bower, M. J. & Blaney, J. M. Conformational analysis using distance geometry methods.

_J. Mol. Graph. Model._ 15, 18–36 (1997). Article CAS PubMed Google Scholar * Kanal, I. Y., Keith, J. A. & Hutchison, G. R. A sobering assessment of small-molecule force field

methods for low energy conformer predictions. _Int. J. Quantum Chem._ 118, e25512 (2018). Article Google Scholar * Ernzerhof, M. & Scuseria, G. E. Assessment of the

Perdew–Burke–Ernzerhof exchange-correlation functional. _J. Chem. Phys._ 110, 5029–5036 (1999). Article ADS CAS Google Scholar * Lynch, B. J. & Truhlar, D. G. Robust and affordable

multicoefficient methods for thermochemistry and thermochemical kinetics: the MCCM/3 suite and SAC/3. _J. Phys. Chem. A_ 107, 3898–3906 (2003). Article CAS Google Scholar * Reilly, A. M.

& Tkatchenko, A. Understanding the role of vibrations, exact exchange, and many-body van der Waals interactions in the cohesive properties of molecular crystals. _J. Chem. Phys_ 139,

024705 (2013). Article ADS PubMed Google Scholar * Hoja, J. _et al_. Reliable and practical computational description of molecular crystal polymorphs. _Sci. Adv._ 5, eaau3338 (2019).

Article ADS PubMed PubMed Central Google Scholar * Góger, S., Medrano Sandonas, L., Müller, C. & Tkatchenko, A. Data-driven tailoring of molecular dipole polarizability and frontier

orbital energies in chemical compound space. _Phys. Chem. Chem. Phys._ 25, 22211–22222 (2023). Article PubMed PubMed Central Google Scholar * Medrano Sandonas, L. _et al_. “Freedom of

design” in chemical compound space: towards rational in silico design of molecules with targeted quantum-mechanical properties. _Chem. Sci._ 14, 10702–10717 (2023). Article CAS PubMed

PubMed Central Google Scholar * Fallani, A., Medrano Sandonas, L. & Tkatchenko, A. Enabling inverse design in chemical compound space: Mapping quantum properties to structures for

small organic molecules. _ArXiv_ https://doi.org/10.48550/arXiv.2309.00506 (2023). Download references ACKNOWLEDGEMENTS LMS and AT acknowledge financial support from Janssen Pharmaceuticals

(Aquamarine project). AF and MH are grateful for financial support from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Innovative Training

Network - European Industrial Doctorate grant agreement No 956832, “Advanced Machine Learning for Innovative Drug Discovery” (AIDD). The results presented in this publication have been

partially obtained using the HPC facilities of the University of Luxembourg and Meluxina supercomputer (PoC project). This research also used computational resources provided by the Center

for Information Services and High-Performance Computing (ZIH) at TU Dresden. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Department of Physics and Materials Science, University of

Luxembourg, L-1511, Luxembourg City, Luxembourg Leonardo Medrano Sandonas, Alessio Fallani, Mathias Hilfiker & Alexandre Tkatchenko * Institute for Materials Science and Max Bergmann

Center of Biomaterials, TU Dresden, 01062, Dresden, Germany Leonardo Medrano Sandonas * Drug Discovery Data Sciences (D3S), Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium

Dries Van Rompaey, Alessio Fallani, Jonas Verhoeven, Joerg Kurt Wegner & Hugo Ceulemans * Computational Chemistry, Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium David

Hahn, Laura Perez-Benito & Gary Tresadern * Drug Discovery Data Sciences (D3S), Johnson & Johnson Innovative Medicine, 301 Binney Street, MA 02142, Cambridge, USA Joerg Kurt Wegner

Authors * Leonardo Medrano Sandonas View author publications You can also search for this author inPubMed Google Scholar * Dries Van Rompaey View author publications You can also search for

this author inPubMed Google Scholar * Alessio Fallani View author publications You can also search for this author inPubMed Google Scholar * Mathias Hilfiker View author publications You can

also search for this author inPubMed Google Scholar * David Hahn View author publications You can also search for this author inPubMed Google Scholar * Laura Perez-Benito View author

publications You can also search for this author inPubMed Google Scholar * Jonas Verhoeven View author publications You can also search for this author inPubMed Google Scholar * Gary

Tresadern View author publications You can also search for this author inPubMed Google Scholar * Joerg Kurt Wegner View author publications You can also search for this author inPubMed

Google Scholar * Hugo Ceulemans View author publications You can also search for this author inPubMed Google Scholar * Alexandre Tkatchenko View author publications You can also search for

this author inPubMed Google Scholar CONTRIBUTIONS D.V.R. and J.V. selected relevant compounds from public datasets to include in the dataset, with input from G.T. L.M.S. generated the 3D

molecular structures with CREST/xTB and DFTB3+MBD. D.V.R., L.P.B., and D.H. generated the additional molecular structures with RDKit, Maestro, and Omega. LMS performed the PBE0+MBD

calculations in gas phase and implicit water for all structures. L.M.S. and D.V.R. designed and wrote the manuscript. A.F. and M.H. contributed to the curation and technical validation of

the dataset. A.T., J.K.W., and H.C. supervised and revised all stages of the work. All authors discussed the results and contributed to the final manuscript. CORRESPONDING AUTHORS

Correspondence to Leonardo Medrano Sandonas, Dries Van Rompaey or Alexandre Tkatchenko. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL

INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION SUPPLEMENTARY

INFORMATION RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution

and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if

changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the

material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to

obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Reprints and permissions ABOUT THIS ARTICLE CITE THIS

ARTICLE Medrano Sandonas, L., Van Rompaey, D., Fallani, A. _et al._ Dataset for quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules. _Sci Data_ 11,

742 (2024). https://doi.org/10.1038/s41597-024-03521-8 Download citation * Received: 18 March 2024 * Accepted: 13 June 2024 * Published: 07 July 2024 * DOI:

https://doi.org/10.1038/s41597-024-03521-8 SHARE THIS ARTICLE Anyone you share the following link with will be able to read this content: Get shareable link Sorry, a shareable link is not

currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative