Dataset for quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules


ABSTRACT

We introduce the Aquamarine (AQM) dataset, an extensive quantum-mechanical (QM) dataset that contains the structural and electronic information of 59,783 low- and high-energy


conformers of 1,653 molecules with a total number of atoms ranging from 2 to 92 (mean: 50.9), and containing up to 54 (mean: 28.2) non-hydrogen atoms. To gain insights into the solvent


effects as well as collective dispersion interactions for drug-like molecules, we have performed QM calculations of structures and properties in the gas phase and in implicit water, supplemented with a treatment of many-body dispersion (MBD) interactions. Thus, AQM contains over 40 global and local physicochemical properties (including ground-state and response properties) per


conformer computed at the tightly converged PBE0+MBD level of theory for gas-phase molecules, whereas PBE0+MBD with the modified Poisson-Boltzmann (MPB) model of water was used for solvated


molecules. By addressing both molecule-solvent and dispersion interactions, the AQM dataset can serve as a challenging benchmark for state-of-the-art machine learning methods for property


modeling and _de novo_ generation of large (solvated) molecules with pharmaceutical and biological relevance.

BACKGROUND & SUMMARY


INTRODUCTION

In pharmaceutical research and development, computational chemistry can play an integral role in expediting candidate drugs into the clinic. In particular, quantum-mechanical


(QM) methods (_e.g_., density-functional theory (DFT), post-Hartree-Fock approaches, and quantum Monte Carlo) have been utilized to describe covalent and non-covalent interatomic


interactions and to estimate diverse physicochemical properties of molecular systems1,2. QM methods can, for instance, be used to understand the reactivity of covalent binders3,4, evaluate

conformational energy landscapes of ligands5, study the stability of potential active pharmaceutical ingredients, calculate theoretical acidity constants6, or calculate theoretical charges


to more accurately capture electrostatic properties and surfaces. However, the computational cost and the challenge of conducting QM calculations at a large scale present a limitation to


their widespread use in drug discovery pipelines. Scanning conformational landscapes through QM calculations performed at DFT levels of theory typically takes several hours for a single


ligand of typical pharmaceutical size (_e.g_., 30-40 heavy atoms) on a single computer. As these calculations are readily parallelizable, supercomputers can be employed to enhance throughput,

enabling the screening of a few hundred compounds, but it remains challenging to perform these calculations at the scale of virtual libraries, which can easily comprise tens of


thousands of compounds. Accelerated QM methods have emerged as promising solutions in recent years, offering a balance between accuracy and computational efficiency. These can take the form


of quantum fragmentation methods7,8, semi-empirical methods (_e.g_., parametric method series9, density functional tight-binding (DFTB)10,11 or its extended version (GFNn-xTB)12,13) as well


as machine learning (ML) models14,15,16,17,18,19,20 capable of optimizing geometries or estimating physicochemical properties. The resulting acceleration enables researchers to include


QM-based knowledge as a part of their workflow. Accordingly, relevant QM datasets of small organic molecules have widely assisted the development of ML-based approaches for a fast and


accurate estimation of structural, vibrational, and electronic properties of complex organic molecules21,22,23,24. Among them, one can find QM725,26,27, QM928,29, QM7-X30, MD1715,31, MD2232,


ANI-114,33, ANI-1x/ANI-1ccx34 and AIMNet-NSE35. While these QM datasets have significantly advanced the field of computational chemistry, they do exhibit certain limitations that stem from


three important facts. First, they primarily consist of molecules that are considerably smaller than what is commonly encountered in modern medicinal chemistry. Second, their structures have


been optimized using theoretical models that do not account for molecule-solvent and collective dispersion interactions. Lastly, they have not fully explored the vast conformational


landscape inherent in these molecules. In particular, the interaction between the molecule and the chemical environment (_i.e_., solvent) is crucial when investigating molecules of


pharmaceutically relevant size, as it is well-known that drug binding does not occur _in vacuo_ but rather _in solutio_36. Indeed, there is extensive literature discussing the solvent


effects on the properties of specific molecular systems37,38,39,40,41. For instance, Gorges _et al_.38 found that solvation can have a substantial effect of several cal/(mol·K) on the


entropy of 25 commercially available drug molecules as a result of large conformational changes. The geometry, energetics, HOMO/LUMO energies, dipole moment, and polarizability of


formaldehyde and thioformaldehyde have also been reported to change upon solvation in solvents of low polarity39. Similarly, the molecule-solvent interaction can affect the dynamic stability


and antiviral inhibitory potential of Cissampeline40 as well as the SN2 reaction between Cl− and CH3Cl41. Omitting solvent effects can thus result in inappropriate treatment


of conformations, tautomers, physicochemical properties, or molecular reactivity. In the computational modeling of molecular systems, solvents can either be considered implicitly or


represented explicitly. However, owing to the computational cost and intricate nature of explicitly representing the solvent, most QM studies opt for the utilization of implicit solvent


models such as the conductor-like screening model for real solvents (COSMO-RS)42, the modified Poisson-Boltzmann (MPB)43 model, and the Generalized Born (GB)44 model augmented with the hydrophobic solvent


accessible surface area term (GBSA)45. Lately, to overcome the molecular size limitation within benchmark QM datasets, several efforts have been made to generate datasets that


comprehensively explore the conformational space of large and flexible molecules along with their associated QM properties calculated in gas phase or solvent, see Table 1. For instance, the


QMugs46 collection comprises 19 QM properties of circa 2 M gas-phase conformers of 665,911 molecules with up to 100 non-hydrogen atoms computed using the _ω_B97X-D47 density functional and the

def2-SVP basis set. The OE6248 dataset covers 61,489 molecules with up to 92 non-hydrogen atoms that were optimized in gas phase using the PBE(tight) level of theory supplemented with

Tkatchenko-Scheffler van der Waals (TS) interaction49. This dataset also contains 3 QM properties for 30,876 structures evaluated using the PBE0(tight) level of theory together with implicit

water defined by the Multipole Expansion (MPE) model50. Regarding the vast GEOM collection (which stands for Geometric Ensemble Of Molecules)51, only 1.3 M conformers corresponding to 1,511

BACE52 molecules were generated considering molecule-solvent interactions described by the analytical linearized Poisson-Boltzmann (ALPB)53 model of water. From these, 455,000 conformers of

534 BACE molecules were selected and used for further geometry optimization calculations using the r2scan-3c functional with the C-PCM54 (which stands for conductor-like polarizable continuum

model) implicit model of water and, subsequently, 6 QM properties were collected. Moreover, Eastman _et al_.55 have recently introduced the SPICE dataset (which is short for

Small-molecule/Protein Interaction Chemical Energies) that explicitly considers the interaction between the molecule and water molecules _via_ the Amber14 classical force field to get 1,300

(equilibrium and non-equilibrium) structures of 26 amino acids. Energies, atomic forces, and 6 other QM properties were computed using the _ω_B97M-D3(BJ) functional and def2-TZVPPD basis


set. Despite these efforts, challenges remain to enable a better understanding of solvent effects as well as collective dispersion interactions in the chemical space of large drug-like


molecules, including: (i) assessing the accuracy and reliability of QM structures and properties with respect to the employed density-functional approximation, especially for larger and more


flexible molecules in which van der Waals (vdW) and molecule-solvent interactions are stronger, (ii) offering a large set of molecular (global) and atom-in-a-molecule (local)


physicochemical properties that would enable a comprehensive exploration of these interactions in structure-property and property-property relationships throughout chemical space, and (iii)


providing accurate and reliable QM data that will enable the construction of models for describing covalent and non-covalent vdW interactions in large (solvated) molecules. In this work, we


introduce the Aquamarine (AQM) dataset with the aim of addressing these challenges. The current version of AQM contains an extensive conformational sampling of 1,653 molecules with up to 54

(mean: 28.2) non-hydrogen atoms (including C, N, O, F, P, S, and Cl), producing a total of 59,783 low- and high-energy conformers with a total number of atoms _N_ ranging from 2 to 92

(mean: 50.9), see Fig. 1. To this end, QM conformers were generated using the conformational search workflow implemented in the CREST code56 (which is short for Conformer-Rotamer Ensemble

Sampling Tool), which uses the semi-empirical GFN2-xTB13 method with the GBSA implicit solvent model of water45. Since vdW interactions have a significant impact on the conformations of large molecules,

we have optimized a set of representative conformers using the third-order DFTB method10,57,58 (or DFTB3) supplemented with a treatment of many-body dispersion (MBD) interactions59,60,61,62

(see “Methods”). Moreover, to have a better understanding of solvent effects, we have performed these calculations in gas phase and in implicit water described by the GBSA model. For each of


the (gas-phase and solvated) optimized conformers, AQM also includes an extensive number (over 40) of global (molecular) and local (atom-in-a-molecule) QM properties computed at a high


level of theory that depends on the chemical environment used during the geometry optimization. The majority of QM properties for gas-phase structures were evaluated using non-empirical


hybrid DFT with MBD interactions (_i.e_., PBE0+MBD) in conjunction with tightly-converged numeric atom-centered orbitals63. In addition, the MPB implicit solvent model of water was considered to


obtain the properties for solvated structures. Hence, we have two different AQM subsets, namely AQM-gas and AQM-sol, which contain the QM structural and property data of molecules in gas


phase and implicit water, respectively. Based on its design, AQM holds the potential to enhance the comprehension of the influence of molecule-solvent and collective dispersion interactions


in structure-property and property-property relationships of molecules of pharmaceutically relevant size and composition.

KEY ADVANCEMENTS

Our main motivation for proposing AQM as a benchmark dataset is to advance the development of the next generation of ML models that enable fast and accurate property estimation and ideation of drug-like molecules synthesized in a chemical environment. In pursuit of this aim, we have ensured that the AQM dataset exhibits the following characteristics:

* Combining the CREST conformational search workflow with the subsequent DFTB3+MBD geometry optimization, both in gas phase and in implicit water, has provided us with access to a more extensive and reliable exploration of low- and high-energy (compact/extended) conformers of large molecules. To the best of our knowledge, this procedure has not been considered in previous works. Also, notice that the MBD interaction is a key factor in accurately describing and identifying diverse conformations in large molecular complexes, due to their anisotropic shapes.

* The AQM dataset provides a more accurate set of over 40 global (molecular) and local (atom-in-a-molecule) QM properties of gas-phase and solvated structures when compared to already public datasets (see Table 1). These properties can assist in the estimation and comprehension of the impact of molecule-solvent interactions in the structure-property and property-property relationships of large molecules, _e.g_., _via_ a delta learning approach.

* The property data stored in AQM-gas and AQM-sol can potentially be used to develop more robust QM descriptors for large molecules, enabling fast and accurate calculations of their physicochemical or biological properties. Moreover, such quantum property-based molecular descriptors are complementary to the widely used geometric descriptors, and both can be used in synergy for estimating molecular observables measured in experiment.

* The gas-phase and solvated conformations of AQM molecules, along with their highly accurate QM properties, make the AQM dataset a valuable resource for _in silico_ assisted methods.

METHODS

SELECTION OF REPRESENTATIVE CHEMISTRIES

We sought to select a set of compounds from the public domain that approximate a


typical corporate library including H, C, N, O, F, P, Cl, and S atoms. To this end, we sampled 5000 compounds from ChEMBL64 and compared them to the Johnson & Johnson Innovative

Medicines corporate database. Compounds with molecular weights over 1200, more than 30 rotatable bonds, a quantitative estimate of drug-likeness (QED) score65 under 0.4, and a heavy atom count

over 200 were removed. This yielded a reduced ChEMBL set of compounds whose molecular weights, numbers of rotatable bonds, and fractions of sp3 carbons are similar to those of the corporate database (see Fig. S1


of the Supplementary Information (SI)). This subset was subjected to diversity selection, followed by manual inspection to remove molecules containing undesirable or unusual chemical


substructures. As a result, we initially selected SMILES (which stands for Simplified Molecular Input Line Entry System) of molecular building blocks and typical lead-like compounds as well


as a few protein degraders and macrocycles to produce a total of 2,635 unique molecules with up to 60 non-hydrogen atoms (_N_ ≤ 116). To have a more extensive sampling of the chemical space


described by the selected SMILES, we generated all possible stereoisomers for each structure using the RDKit code—an open source toolkit developed for cheminformatics66,67. In our script,


we generate unique stereoisomers with the option to perturb each stereocenter while keeping the same atomic connectivity (_i.e_., tautomers are not considered), yielding a larger number of


isomers whose stability is later checked _via_ quantum mechanical (QM) calculations. Accordingly, the new number of molecular structures considering the stereoisomers is circa 10 k. Initial


3D structures were subsequently generated with RDKit and optimized using the MMFF94 force field68,69,70,71,72.

GENERATION OF MOLECULAR CONFORMERS

Conformational sampling plays a crucial role

in the generation of the AQM dataset. We have meticulously explored different conformational search workflows to identify the most effective approach for comprehensively sampling the potential

energy surface (PES) and the molecular property space of large drug-like molecules (see “Technical Validation” and Sec. 2 of the SI). Based on this analysis, we opted to use the approach implemented in

the CREST56 code, which uses extensive sampling based on the much faster and yet reliable semi-empirical extended tight-binding method (GFN2-xTB)12,13 to generate 3D conformations. The


semi-empirical energies and structures are thought to be more accurate than classical force fields, accounting for electronic effects, rare functional groups, and bond-breaking/formation of


labile bonds13,46,51,73. Moreover, the CREST search algorithm is based on metadynamics (MTD), a well-established thermodynamic sampling approach that can efficiently explore the low-energy


search space. The collective variables used for the MTD sampling are the atomic root-mean-squared deviation (RMSD) values between the previous structures on the PES of a given molecule56.


The atomic RMSD values are introduced into the expression of the bias potential, which is used to compute the guiding forces. These forces are responsible for driving the structure further


away from previous geometries, providing an extensive exploration of the PES. Conformers are thus generated by iterating MTD sampling and GFN2-xTB optimization, where geometries that overcome certain energy (> 12.0 kcal/mol) and root-mean-square deviation (Δ_R_ > 0.1 Å) thresholds with respect to the input structure are added to the conformer-rotamer ensemble (CRE).
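The way an RMSD-based bias pushes the simulation away from previously visited geometries can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the CREST implementation: it assumes a Gaussian-shaped penalty summed over previously visited structures, and the function names `rmsd` and `bias_energy` as well as the values of `k` and `alpha` are hypothetical. For simplicity, the RMSD is computed without rotational alignment, which a real workflow would require.

```python
import numpy as np

def rmsd(a, b):
    """Plain RMSD between two (N, 3) coordinate arrays (no alignment)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt((diff ** 2).sum() / len(diff)))

def bias_energy(current, visited, k=0.02, alpha=1.0):
    """Sum of Gaussian penalties centred on previously visited structures.

    k (height) and alpha (width) are illustrative values, not CREST defaults.
    """
    return sum(k * np.exp(-alpha * rmsd(current, ref) ** 2) for ref in visited)

# Toy example: three atoms and one previously visited structure.
ref = np.zeros((3, 3))
near = ref + 0.05   # small displacement: strongly penalised
far = ref + 2.0     # large displacement: almost no penalty

print(bias_energy(near, [ref]), bias_energy(far, [ref]))
```

In this sketch the penalty is largest when the current geometry coincides with a stored one and decays as the RMSD grows, so the negative gradient of the bias points away from already sampled regions of the PES.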


The procedure is restarted using the conformer as input if a new conformer has a lower energy than the input structure. The three conformers of lowest energy undergo two normal molecular


dynamics (MD) simulations at 400 K and 500 K, which are used to sample low-energy barrier crossings, such as simple torsional motions. Finally, a genetic Z-matrix crossing algorithm is used


and the results are added to the CRE. Then, a normal-type convergence optimization separates the geometries into conformers, rotamers, and duplicates, where duplicates are deleted and


conformers and rotamers added to the CRE. Both geometry optimization and conformational search calculations were carried out considering implicit water described by the Generalized Born (GB)


model augmented with the hydrophobic solvent accessible surface area term (GBSA)45. In total, we obtained 2,242,490 conformers for the initial set of 2,635 molecules. Unlike other already

public datasets of conformers of large molecules (see Table 1), we have here defined a method to select a set of representative conformers per molecule (_i.e_., per SMILES string) instead of


considering all conformers generated by CREST. The purpose of using this method is to filter out conformers that are similar in regions of the chemical space defined by the atomic structure,


total energy _E_tot and many-body dispersion (MBD) energy _E_MBD. We have here considered _E_MBD due to its relevance in the definition of stability rankings in large molecules and


molecular crystals59,60,61,62. Accordingly, our initial step involves determining clusters that consist of conformers exhibiting a root-mean-square deviation (Δ_R_) between their structures


of less than 1.5 Å. Δ_R_ is computed with the help of DockRMSD tool74. After clustering the conformers, we obtained _E_tot and _E_MBD of all conformers per cluster _via_ single-point


calculations using third-order self-consistent charge density functional tight binding (DFTB3)10,57,58 supplemented with a treatment of MBD interactions59,60,61,62, making use of 3ob


parameters75,76. Then, we select the conformers with the most distinct values of both energies (_i.e_., _E_tot > 0.24 eV and _E_MBD > 0.048 eV) per cluster, see Fig. 1. This


exemplifies a new approach for selecting conformers of large molecules within chemical space, utilizing an in-depth analysis of their electronic properties. To showcase its efficacy, we have


considered only 1,653 molecules (≈ 60% of total initial unique molecules) with up to 54 non-hydrogen atoms (_N_ ≤ 92), reducing the number of conformers for these molecules from 280,182

to 59,783. While this method does indeed yield a more diverse set of conformers, it remains essential to confirm the energetic and mechanical stability of these molecular structures,


especially, taking into account the treatment of MBD interactions. To maintain consistency with our earlier publication of the QM7-X dataset for small organic molecules, we have conducted


the geometry optimization calculations utilizing the DFTB3+MBD level of theory. Moreover, to construct a dataset that can be used to understand the influence of molecule-solvent interaction


on the physicochemical properties of large drug-like molecules, our generation procedure considers the optimization of structures in gas phase and in implicit water described by the GBSA


model, as implemented in the DFTB+ code77. We have stored the gas-phase and solvated optimized structures into two subsets named AQM-gas and AQM-sol, respectively. These DFTB calculations


were performed by interfacing the DFTB+ code with the Atomic Simulation Environment (ASE)78. Although the majority of these molecular structures were identified as local minima at the DFTB3+MBD

level, both in gas phase and in implicit water, it is worth mentioning that some of them are situated at saddle points on the respective PES.

CALCULATION OF PHYSICOCHEMICAL PROPERTIES

These

≈ 60 k DFTB-optimized structures were then used for more accurate QM single-point calculations using dispersion-inclusive hybrid DFT. Energies, forces, and several other physicochemical


properties (as detailed in Table 2) were calculated at a higher level of theory that varied depending on the chemical environment used in the structure optimization process. Property


calculations for AQM-gas molecules were performed at the PBE0+MBD59,79,80 level, while, for AQM-sol molecules, the modified Poisson-Boltzmann (MPB)43,81 model of water was also considered. The


MPB model solves the size-modified Poisson-Boltzmann equation for the implicit inclusion of electrolytic solvation effects into DFT calculations. It also includes a model for the well-known


Stern layer that separates the diffusing ions from the solvation cavity by introducing non-mean-field ion-solute interactions. For these calculations, we have used the FHI-aims code82,83


(version 221103) together with “tight” settings for basis functions and integration grids. Energies were converged to \(10^{-6}\) eV and the accuracy of the forces was set to \(10^{-4}\) eV/Å. The

convergence criteria used during self-consistent field (SCF) optimizations were \(10^{-3}\) eV for the sum of eigenvalues and \(10^{-6}\) electrons/Å\(^{3}\) for the charge density. The MBD energies and MBD


atomic forces were here computed using the range-separated self-consistent screening (rsSCS) approach60, while the atomic _C_6 coefficients, isotropic atomic polarizabilities, molecular _C_6


coefficients and molecular polarizabilities (both isotropic and tensor) were obtained _via_ the SCS approach59. Hirshfeld ratios correspond to the Hirshfeld volumes divided by the free atom


volumes. The TS dispersion energy refers to the pairwise Tkatchenko-Scheffler (TS) dispersion energy in conjunction with the PBE0 functional49. The vdW radii were also obtained using the


SCS approach _via_ \(R_{\rm vdW} = \left(\alpha^{\rm SCS}/\alpha^{\rm TS}\right)^{1/3} R_{\rm vdW}^{\rm TS}\), where \(\alpha^{\rm TS}\) and \(R_{\rm vdW}^{\rm TS}\) are the


atomic polarizability and vdW radius computed according to the TS scheme, respectively. Atomization energies were obtained by subtracting the atomic PBE0 energies from the PBE0 total energy


of each gas-phase and solvated molecular conformation (see Table S1 of the SI). The exact exchange energy is the amount of exact (or Hartree-Fock) exchange that has been admixed into the


exchange-correlation energy.

DATA RECORDS

The AQM dataset is provided in two HDF5 files in a Zenodo data repository84. The QM structural and property data of the 59,783 conformations corresponding to 1,653 molecules in gas phase and in implicit water are stored in the AQM-gas.hdf5 and AQM-sol.hdf5 files, respectively. Additionally, we have uploaded the AQM-initial.hdf5 file, which only contains the structural data of the 2,242,490 conformations corresponding to the initial set of 2,635 molecules (obtained by using the CREST code). The repository also contains a README file with technical usage details and an example of how to access the information stored in AQM (see readAQM.py file).

HDF5 FILE FORMAT

Independent of the AQM


subset, the information for each molecular structure is stored in a Python dictionary (dict) type containing all relevant properties and recorded in _groups_ in HDF5 file format30. HDF5 keys


to access the atomic numbers, atomic positions (coordinates), and physicochemical properties in each dictionary are provided in Table 2. The dimension of each array depends on the number of


atoms _N_ and the required property, _e.g_., for a methane (CH4) molecule, 'atNUM' is a 1D array of _N_ = 5 elements ([6, 1, 1, 1, 1]) while 'atXYZ' is a 2D array comprised of _N_ = 5 rows

and three columns (_x_, _y_, _z_ coordinates). All structures are labeled as _Geom-mr-ct_, where _r_ enumerates the SMILES strings and _t_ the considered conformer. Note that the indices _t_

used in the AQM dataset reflect the order in which a given structure was generated and do not correspond to sorted xTB/DFTB (or DFT) total energies.

TECHNICAL VALIDATION

A significant


challenge in simulating the physicochemical properties of large drug-like molecules lies in the fact that, in experiments, their conformations and electronic structures are influenced by


interactions with the surrounding solvent. However, the standard approach in contemporary QM simulations involves running them in gas phase, without accounting for molecule-solvent


interactions. Unlike another recently published dataset of large molecules (see Table 1), the AQM dataset considers the molecule-solvent interactions as well as a treatment of van der Waals


(vdW) interactions in its generation procedure—two important physical and chemical effects in determining structural conformations and stability rankings of molecules of pharmaceutically


relevant size. As mentioned above, the AQM dataset comprises the structural and electronic data of 59,783 gas-phase and solvated (low- and high-energy) conformers of 1,653 molecules with up to 54

non-hydrogen atoms (_N_ ≤ 92), including C, N, O, F, P, S and Cl. The structures of solvated conformers were obtained using the DFTB3+MBD method supplemented with the GBSA implicit model of

water. This model has been successfully used in the study of solvation free energies of neutral/ionic molecules45 and the folding of short peptides85. In contrast, the level of theory selected


to compute the QM properties per conformer was PBE0+MBD supplemented with the MPB implicit model of water. The MPB model has been shown to provide a more accurate description in the study of


diverse electrochemical reactions86,87,88,89. In all calculations, many-body dispersion (MBD) interactions have been included to deal with long-range interactions that are not adequately


represented by the baseline level of theory. These advanced theoretical models have thus generated a more accurate collection of molecular and atom-in-a-molecule (as well as ground state and


response) QM properties of conformers in implicit water, which are stored in AQM-sol. Moreover, when integrated with the property information in AQM-gas, these data can assist in


fine-tuning ML models for the precise estimation of electronic properties of solvated molecules, _e.g_., _via_ a delta learning approach. An essential step in the generation of the AQM dataset


involved the thoughtful selection of the conformational search workflow. Here, we exhaustively analyzed the sampling method implemented in four different codes: CREST56, Maestro, Omega90 and


RDKit66,91. The last three codes are of standard use in cheminformatics, primarily relying on stochastic algorithms for their application. They explore the conformational space of molecules


very sparsely through a combination of pre-defined distances and stochastic samples92 and can miss many low-energy conformations. Moreover, in most standalone applications, conformer


energies are typically computed using classical force fields without the incorporation of solvent models, which makes these values rather inaccurate93 (for more details of these methods see

Sec. 2 of the SI). In contrast, the CREST code utilizes a robust sampling strategy, leveraging the semi-empirical extended tight-binding method (GFN2-xTB)13, supplemented by the GBSA

implicit solvent model of water, in order to generate more reliable 3D conformations compared to those obtained using classical force fields (see “Methods”). The conformational search


workflow implemented in CREST provides access to a more extensive exploration of low- and high-energy conformers, generating molecular structures inaccessible by distance geometry methods


(see Fig. 2)46,51,73. To gain a better understanding of the influence of these sampling methods on the conformational search of large drug-like molecules, we analyzed the structural and


energetic data of conformers corresponding to 18 randomly selected compositions, each containing approximately _N_ = 50 atoms. Fig. 2(a) displays the output for the variation of the averaged


number of clusters, denoted as ⟨_M_⟩, as a function of the root-mean-square deviation, Δ_R_, among conformers that constitute a cluster. This calculation was first done per molecule and


then averaged over the 18 cases. This showed that more diverse conformers become part of the same cluster when Δ_R_ increases, resulting in a reduction of the number of distinct clusters.
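The dependence of the number of clusters on the Δ_R_ threshold can be illustrated with a short Python sketch. It is a toy example under stated assumptions rather than the procedure used for Fig. 2(a): the text does not specify the clustering algorithm, so a simple greedy "leader" scheme is assumed here, and the function name `count_clusters` and the pairwise RMSD matrix are hypothetical.

```python
import numpy as np

def count_clusters(dist, threshold):
    """Greedy leader clustering: a conformer joins the first cluster whose
    leader lies within `threshold`; otherwise it starts a new cluster."""
    leaders = []
    for i in range(len(dist)):
        if not any(dist[i, j] < threshold for j in leaders):
            leaders.append(i)
    return len(leaders)

# Toy symmetric RMSD matrix (in Å) for four conformers of one molecule.
dist = np.array([
    [0.0, 0.3, 1.2, 2.5],
    [0.3, 0.0, 1.0, 2.4],
    [1.2, 1.0, 0.0, 1.8],
    [2.5, 2.4, 1.8, 0.0],
])

for threshold in (0.5, 1.5, 3.0):
    print(threshold, count_clusters(dist, threshold))
# prints: 0.5 3, then 1.5 2, then 3.0 1
```

Repeating this count for each molecule and averaging over the molecules at every threshold gives a curve of the same kind as ⟨_M_⟩ versus Δ_R_.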


The decrease in ⟨_M_⟩ for Maestro and Omega is faster compared to CREST, which may indicate that conformational search workflows based on distance geometry methods are insufficient for


probing the PES of more flexible molecules. However, thanks to the random ensemble option for conformer generation, RDKit produces a larger ⟨_M_⟩ than CREST for Δ_R_ > 1.0 Å. To examine

this result further, we computed _E_AT and _E_MBD for a set of representative conformers per molecule (see selection criterion in “Methods”) using the DFTB3+MBD level of theory. Notice that the


resulting total number of conformers depends on the code employed for their generation, _i.e_., CREST → 3747 conformers, Maestro → 100 conformers, Omega → 204 conformers and RDKit → 1872


conformers. Thus, we have found that, compared to other methods, the conformers generated by the CREST code show a more organized coverage of the energetic space defined by DFTB-_E_AT and

DFTB-_E_MBD (see the well-defined cluster in Fig. 2(b)). Moreover, the larger coverage of DFTB-_E_MBD values is a clear indicator that the CREST sampling method can generate conformations of


increased complexity, characterized by a more folded structure and a stronger dispersion interaction, as illustrated with the inserted structures in Fig. 2(b). This underscores the relevance


of taking vdW interactions into account when generating and identifying conformers of large drug-like molecules. Similarly, this provides compelling evidence of the efficient conformational


search workflow implemented in the CREST code, which improves coverage of both conformational and property molecular space. It is noteworthy that the calculations executed by the CREST code

incurred higher computational expenses compared to those carried out by cheminformatics codes. After generating all conformers and selecting the representative conformers for each molecule,


we proceed to optimize the molecular structures. This optimization is carried out using the DFTB3+MBD level of theory in the gas phase as well as in implicit water described by the GBSA


model. Once this optimization step is complete, we obtain the final set of molecular structures for AQM-gas (gas phase) and AQM-sol (implicit water) subsets. To quantitatively analyze the


impact of molecule-solvent interaction on the structure of AQM molecules, we have computed Δ_R_ between the molecular structures stored in AQM-gas and AQM-sol. Indeed, Fig. 3(a) displays the


size dependence of the averaged Δ_R_ (blue dots) together with the total range of Δ_R_ values spanned at different _N_ (blue shadow). The results show that the structures of small molecules


(_N_ ≤ 20 atoms) present ⟨Δ_R_⟩ < 0.1 Å, _i.e_., they are minimally affected by the interaction with the solvent. In contrast, when dealing with molecules with _N_ > 40 atoms, solvent


effects become significantly more pronounced, and, as a result, we observe greater deviations in the Δ_R_ values (> 2.0 Å). A similar effect can be observed when comparing the gyration


radius _R_g for gas-phase and solvated molecular structures, see Fig. 3(b). Note that there are also large compounds (_N_ ≈ 90) characterized by extensively constrained structures, which


remain unaffected by the interaction with implicit water. These findings hold significant relevance in the context of advancing QM-based pipelines for the creation of datasets of molecules


of pharmaceutically relevant size, particularly in cases where the research objectives encompass the generation of non-equilibrium structures for training ML force fields, since the PES of


these molecules will be largely modified by solvent effects, a phenomenon that can be inferred from the outcomes of the geometry optimization process. The AQM dataset considers an extensive


array of more than 40 distinct molecular (global) and atom-in-a-molecule (local) QM properties, which were computed to gain insights into the effect of molecule-solvent interactions on


structure-property and property-property relationships of large molecules. In Table 2, we list the properties of gas-phase and solvated molecules, derived from QM calculations performed at


the PBE0+MBD level (AQM-gas) and further enhanced with the MPB implicit solvent model of water (AQM-sol), respectively. PBE0+MBD has been chosen as our baseline level of theory for property


calculations due to its well-established accuracy and reliability, demonstrated in the description of intramolecular degrees of freedom as well as intermolecular interactions in organic


molecular dimers, supramolecular complexes, and molecular crystals59,79,80,94,95,96,97. The use of these DFT methods also provides interesting insights into the effect of molecule-solvent


interactions on the potential energy surface of molecules, highlighting the importance of the dataset generation procedure. In Fig. S2 of the SI, one can see that the energy range and the


energetic ranking of molecules in AQM-gas and AQM-sol are different, but further analysis is required. We therefore consider that our QM calculations are suitable to validate the quality of


future research utilizing the AQM dataset. To understand the relevance of accessing QM data for solvated molecules, we first discuss the influence of implicit water on extensive and


intensive molecular QM properties. In doing so, as an illustrative example, we have analyzed the 2D property space defined by two contrasting properties98,99 such as the isotropic molecular


polarizability _α_ and the HOMO-LUMO gap _E_gap (_i.e_., \(\left(\alpha ,{E}_{{\rm{gap}}}\right)\)-space) for the 59,783 conformations in AQM-gas and AQM-sol as well as the set of the most


stable conformers per molecule in AQM-sol (only 1,653 conformations), see Fig. 4(a). For comparison, the values corresponding to QM7-X equilibrium molecules are also plotted (green circles).
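As a minimal sketch of how such a per-molecule subset could be assembled, the following Python snippet picks the lowest-energy conformer of each molecule from flat arrays and collects the corresponding (_α_, _E_gap) pairs. The function name, the array names and the toy values are hypothetical and are not taken from the AQM files; the actual storage layout of the dataset may differ.

```python
import numpy as np

def most_stable_conformers(mol_ids, energies):
    """Map each molecule ID to the index of its lowest-energy conformer."""
    best = {}
    for idx, mol_id in enumerate(mol_ids):
        # Keep the first conformer seen, then replace it if a lower energy appears.
        if mol_id not in best or energies[idx] < energies[best[mol_id]]:
            best[mol_id] = idx
    return best

# Toy data (hypothetical): three molecules with 2, 3 and 1 conformers.
mol_ids = [10, 10, 11, 11, 11, 12]
energies = np.array([-5.20, -5.35, -8.10, -8.05, -8.30, -3.00])  # eV
alpha = np.array([55.0, 57.5, 90.2, 88.9, 93.1, 40.3])           # a0^3
e_gap = np.array([5.1, 4.8, 4.2, 4.4, 3.9, 6.0])                 # eV

selected = most_stable_conformers(mol_ids, energies)
rows = sorted(selected.values())

# One (alpha, E_gap) point per molecule, i.e. the reduced property space.
subset = np.column_stack([alpha[rows], e_gap[rows]])
print(rows)
print(subset)
```

The same index list can be reused to slice any other per-conformer property, so the reduced set stays consistent across properties.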


Our findings reveal that AQM molecules exhibit a significantly broader coverage of the _α_ range, surpassing QM7-X molecules by a factor of 6. This expanded coverage is attributed to the


inherently extensive character of _α_. In contrast, the _E_gap range now covers molecules with an energy gap of circa 2.5 eV, and the mean value for the entire dataset is reduced from 7.0 eV to 4.5 eV,


see distribution plots on the top panel of Fig. 4(a). The slight differences between the \(\left(\alpha ,{E}_{{\rm{gap}}}\right)\)-space covered by AQM-gas and AQM-sol may be attributed to


compensation between the pronounced fluctuations in _α_, which are predominantly observed in molecules with \(\alpha > 300\,{a}_{0}^{3}\), and the more sensitive behavior of _E_gap to the


presence of implicit water, as displayed in the correlation plots in Fig. 4(b). Accordingly, these findings could carry crucial implications in the “freedom of design” when searching for


large drug-like molecules with targeted \(\left(\alpha ,{E}_{{\rm{gap}}}\right)\) values98,99,100. Notice that the conformational sampling per molecule largely improved the coverage of both


properties, connecting isolated regions associated with a single molecular structure with specific size and chemical composition. Fig. 4(b) also shows the correlation plots between HOMO


energy _E_HOMO and total dipole moment _D__s_ of the 59,783 conformers contained in AQM-gas and AQM-sol. Thus, it becomes evident that intensive properties are more sensitive to the


incorporation of molecule-solvent interactions in the QM calculations when contrasted with extensive properties. On the other hand, atom-in-a-molecule QM properties can provide important


insight into the distinct chemical environments within large drug-like molecules. To illustrate this, Fig. 5(a) shows the 2D property space defined by Hirshfeld charges _q_H and atomic


polarizabilities \({\widetilde{\alpha }}_{{\rm{s}}}\) (_i.e_., \(\left({q}_{{\rm{H}}},{\widetilde{\alpha }}_{{\rm{s}}}\right)\)-space) for the 59,783 conformations in AQM-gas and AQM-sol.
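One simple way to inspect such a per-atom property space is to group the atomic values by chemical element and look at the mean position of each group. The sketch below does this for a handful of atoms; the function name and all numerical values are hypothetical placeholders, not AQM data.

```python
from collections import defaultdict

import numpy as np

def centroids_by_element(symbols, charges, polarizabilities):
    """Return the mean (charge, polarizability) point for every element."""
    groups = defaultdict(list)
    for symbol, q, a in zip(symbols, charges, polarizabilities):
        groups[symbol].append((q, a))
    return {s: np.mean(np.array(points), axis=0) for s, points in groups.items()}

# Toy per-atom data (hypothetical values).
symbols = ["C", "C", "O", "H", "H", "S"]
q_h = [0.02, -0.04, -0.25, 0.05, 0.03, -0.10]   # Hirshfeld charges (e)
alpha_s = [10.0, 11.0, 5.0, 2.0, 3.0, 18.0]     # atomic polarizabilities (a0^3)

centroids = centroids_by_element(symbols, q_h, alpha_s)
for element, point in centroids.items():
    print(element, point)
```

Comparing centroids computed separately for gas-phase and solvated data would give a first, coarse indication of how much each element's region of the space shifts.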


For comparison, the values corresponding to QM7-X equilibrium molecules are also plotted (green circles). Fig. 5(a) shows the existence of well-defined clusters that are mostly related to a


specific atom type. The slight overlap between these clusters is a clear example of the need to develop more robust geometric and electronic descriptors capable of effectively representing


intricate chemical environments in large drug-like molecules for ML applications. Furthermore, our calculations have revealed that implicit solvation has a pronounced influence on _q_H


values, particularly for heavier atoms such as P, S, and Cl—relevant atoms in the design of pharmaceutical compounds as well as in the determination of their physicochemical and biological


properties. Certainly, the molecule-solvent interaction has a stronger effect on the local properties compared to global ones, as illustrated by the correlation plots in Fig. 5(b). This


becomes more evident by observing the atomic forces _F_tot, where the significant variations in values can strongly affect the accuracy of ML force fields when applied to run the dynamics of


large molecules. So far, we have focused on the impact of including molecule-solvent interactions when computing molecular/atom-in-a-molecule QM properties of AQM molecules.
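The kind of gas-versus-solvent comparison discussed above can be quantified with a few lines of Python. The sketch below reports the mean absolute deviation and the Pearson correlation for a paired scalar property, and a root-mean-square deviation for atomic force vectors. It assumes that the conformers and atoms are stored in the same order in both subsets; the function names and toy numbers are hypothetical.

```python
import numpy as np

def solvent_sensitivity(prop_gas, prop_sol):
    """Mean absolute deviation and Pearson correlation of a paired property."""
    gas = np.asarray(prop_gas, dtype=float)
    sol = np.asarray(prop_sol, dtype=float)
    mad = np.mean(np.abs(sol - gas))
    pearson_r = np.corrcoef(gas, sol)[0, 1]
    return mad, pearson_r

def force_deviation(forces_gas, forces_sol):
    """RMS length of the per-atom force difference vectors, shape (n_atoms, 3)."""
    diff = np.asarray(forces_sol, dtype=float) - np.asarray(forces_gas, dtype=float)
    return np.sqrt(np.mean(np.sum(diff**2, axis=1)))

# Toy data (hypothetical): HOMO-LUMO gaps in eV for three conformers.
gap_gas = [4.0, 5.0, 6.0]
gap_sol = [4.2, 5.2, 6.2]
mad, pearson_r = solvent_sensitivity(gap_gas, gap_sol)

# Toy forces for a two-atom fragment (arbitrary units).
f_gas = np.zeros((2, 3))
f_sol = np.array([[0.3, 0.0, 0.4], [0.0, 0.0, 0.0]])
rms_force = force_deviation(f_gas, f_sol)

print(mad, pearson_r, rms_force)
```

A uniform shift, as in the toy gaps, leaves the correlation at one while the deviation is non-zero, which is why both numbers are reported together.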


However, the non-electrostatic part of the solvation energy due to the molecule-solvent interaction, _E_nelec, of the conformers contained in AQM-sol can also be crucial for a better


understanding of the solvation effect on structure-property and property-property relationships of large molecules. As an example, Fig. 6(a) shows the correlation plot between _E_nelec and


dispersion interaction energy _E_disp calculated using two well-established methods: many-body dispersion (MBD) and Tkatchenko-Scheffler (TS). The datapoints are colored according to the


gyration radius _R_g of each solvated structure. The high degree of correlation between these properties underscores the importance of considering both molecule-solvent and dispersion


interactions when investigating large molecules, as we did to generate the AQM dataset. Moreover, the growing difference in energies obtained by the MBD and TS methods, particularly as the system


size increases, highlights the significant influence of many-body interactions in the energetic description of these compounds. To further elucidate the role of both interactions in the


generation of AQM, we have selected the molecule C29H39N5O3S2 (_N_ = 78 atoms, ID in dataset: 2070) with 720 conformers and then examined their respective energy values. These conformers


show a difference in dispersion energies Δ_E_disp of up to ≈1.0 eV, where the smallest Δ_E_disp values correspond to more compact molecular structures while the largest Δ_E_disp values are


observed for more extended ones (see Fig. 6(b)). Besides presenting a size dependence, the data plotted in Fig. 6(c) demonstrate that, similar to _E_MBD, _E_nelec also depends on the


structural conformation of molecules. These findings show the significance of both interactions in the generation of a robust QM dataset comprising large and more flexible molecules that


bear pharmaceutical relevance. In summary, we demonstrated that the extensive structural and property data contained in the AQM dataset hold the potential to enhance the understanding of how


molecule-solvent interactions influence both structure-property and property-property relationships of large drug-like molecules. As such, AQM may in some cases be employed as a benchmark


dataset for direct/delta learning and generative methods, estimating the properties of pharmaceutical compounds in solution from their gas-phase counterparts. CODE AVAILABILITY The initial


structure generation was carried out using RDKit 2020.09.5 (refs. 66,67). Further structure optimization and the creation of conformers were performed using the CREST (refs. 13,56) and DFTB+ (refs. 10,77) codes


together with ASE (ref. 78). Note that all necessary features regarding the utilized DFTB3+MBD (with and without GBSA implicit solvent) approach are available in the current DFTB+ version (ref. 11). All DFT


calculations were carried out using FHI-aims (ref. 82; version 221103). Additional conformer generation experiments were performed with RDKit 2020.09.5, OpenEye’s Omega 4.0.0.4 (ref. 90) and Schrödinger’s


Maestro suite v2020-4 (see SI for detailed procedures). REFERENCES
* Friesner, R. A. Ab initio quantum chemistry: Methodology and applications. _Proc. Natl. Acad. Sci._ 102, 6648–6653 (2005).
* Marzari, N., Ferretti, A. & Wolverton, C. Electronic-structure methods for materials design. _Nat. Mater._ 20, 736–749 (2021).
* Palazzesi, F., Grundl, M. A., Pautsch, A., Weber, A. & Tautermann, C. S. A fast ab initio predictor tool for covalent reactivity estimation of acrylamides. _J. Chem. Inf. Model._ 59, 3565–3571 (2019).
* Mihalovits, L. M., Ferenczy, G. G. & Keserű, G. M. Affinity and selectivity assessment of covalent inhibitors by free energy calculations. _J. Chem. Inf. Model._ 60, 6579–6594 (2020).
* Hofmans, S. _et al_. Tozasertib analogues as inhibitors of necroptotic cell death. _J. Medicinal Chem._ 61, 1895–1920 (2018).
* Prasad, S., Huang, J., Zeng, Q. & Brooks, B. R. An explicit-solvent hybrid QM and MM approach for predicting pKa of small molecules in SAMPL6 challenge. _J. Comput. Mol. Des._ 32, 1191–1201 (2018).
* Raghavachari, K. & Saha, A. Accurate composite and fragment-based quantum chemical models for large molecules. _Chem. Rev._ 115, 5643–5677 (2015).
* Pruitt, S. R., Bertoni, C., Brorsen, K. R. & Gordon, M. S. Efficient and accurate fragmentation methods. _Acc. Chem. Res._ 47, 2786–2794 (2014).
* Stewart, J. J. P. Optimization of parameters for semiempirical methods II. applications. _J. Comput. Chem._ 10, 221–264 (1989).
* Seifert, G., Porezag, D. & Frauenheim, T. Calculations of molecules, clusters, and solids with a simplified LCAO-DFT-LDA scheme. _Int. J. Quantum Chem._ 58, 185–192 (1996).
* Hourahine, B. _et al_. DFTB+, a software package for efficient approximate density functional theory based atomistic simulations. _J. Chem. Phys._ 152, 124101 (2020).
* Bannwarth, C. _et al_. Extended tight-binding quantum chemistry methods. _WIREs Comput. Mol. Sci._ 11, e1493 (2021).
* Bannwarth, C., Ehlert, S. & Grimme, S. GFN2-xTB – an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions. _J. Chem. Theory Comput._ 15, 1652–1671 (2019).
* Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1: An extensible neural network potential with DFT accuracy at force field computational cost. _Chem. Sci._ 8, 3192–3203 (2017).
* Chmiela, S., Sauceda, H. E., Müller, K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. _Nat. Commun._ 9, 3887 (2018).
* Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. _Nat. Commun._ 8, 13890 (2017).
* Unke, O. T. _et al_. SpookyNet: Learning force fields with electronic degrees of freedom and nonlocal effects. _Nat. Commun._ 12, 7273 (2021).
* Batzner, S. _et al_. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. _Nat. Commun._ 13, 2453 (2022).
* Musaelian, A. _et al_. Learning local equivariant representations for large-scale atomistic dynamics. _Nat. Commun._ 14, 579 (2023).
* Batatia, I. _et al_. (eds.) Advances in Neural Information Processing Systems, vol. 35, 11423–11436 (Curran Associates, Inc., 2022).
* Huang, B., von Rudorff, G. F. & von Lilienfeld, O. A. The central role of density functional theory in the AI age. _Science_ 381, 170–175 (2023).
* Kulik, H. J. _et al_. Roadmap on machine learning in electronic structure. _Electron. Struct._ 4, 023004 (2022).
* Stöhr, M., Medrano Sandonas, L. & Tkatchenko, A. Accurate many-body repulsive potentials for density-functional tight binding from deep tensor neural networks. _J. Phys. Chem. Lett._ 11, 6835–6843 (2020).
* Qiao, Z., Welborn, M., Anandkumar, A., Manby, F. R. & Miller, T. F. OrbNet: Deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. _J. Chem. Phys._ 153, 124111 (2020).
* Blum, L. C. & Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. _J. Am. Chem. Soc._ 131, 8732–8733 (2009).
* Montavon, G. _et al_. Machine learning of molecular electronic properties in chemical compound space. _New J. Phys._ 15, 095003 (2013).
* Yang, Y. _et al_. Quantum mechanical static dipole polarizabilities in the QM7b and AlphaML showcase databases. _Sci. Data_ 6, 152 (2019).
* Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. _J. Chem. Inf. Model._ 52, 2864–2875 (2012).
* Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. _Sci. Data_ 1, 140022 (2014).
* Hoja, J. _et al_. QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules. _Sci. Data_ 8, 43 (2021).
* Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet – a deep learning architecture for molecules and materials. _J. Chem. Phys._ 148, 241722 (2018).
* Chmiela, S. _et al_. Accurate global machine learning force fields for molecules with hundreds of atoms. _Sci. Adv._ 9, eadf0873 (2023).
* Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1, A data set of 20 million calculated off-equilibrium conformations for organic molecules. _Sci. Data_ 4, 170193 (2017).
* Smith, J. S. _et al_. The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules. _Sci. Data_ 7, 134 (2020).
* Zubatyuk, R., Smith, J. S., Nebgen, B. T., Tretiak, S. & Isayev, O. Teaching a neural network to attach and detach electrons from molecules. _Nat. Commun._ 12, 4870 (2021).
* Decherchi, S. & Cavalli, A. Thermodynamics and kinetics of drug-target binding by molecular simulation. _Chem. Rev._ 120, 12788–12833 (2020).
* Hirata, F. Molecular theory of solvation, vol. 24 (Springer Science & Business Media, 2003).
* Gorges, J., Grimme, S., Hansen, A. & Pracht, P. Towards understanding solvation effects on the conformational entropy of non-rigid molecules. _Phys. Chem. Chem. Phys._ 24, 12249–12259 (2022).
* Matczak, P. & Domagała, M. Heteroatom and solvent effects on molecular properties of formaldehyde and thioformaldehyde symmetrically disubstituted with heterocyclic groups C4H3Y (where Y= O–Po). _J. Mol. Model._ 23, 1–11 (2017).
* Odey, M. O. _et al_. Unraveling the impact of polar solvation on the molecular geometry, spectroscopy (FT-IR, UV, NMR), reactivity (ELF, NBO, HOMO-LUMO) and antiviral inhibitory potential of cissampeline by molecular docking approach. _Chem. Phys. Impact_ 7, 100346 (2023).
* Ensing, B., Meijer, E. J., Blöchl, P. & Baerends, E. J. Solvation effects on the SN2 reaction between CH3Cl and Cl- in water. _J. Phys. Chem. A_ 105, 3300–3310 (2001).
* Klamt, A. Conductor-like screening model for real solvents: A new approach to the quantitative calculation of solvation phenomena. _J. Phys. Chem._ 99, 2224–2235 (1995).
* Ringe, S., Oberhofer, H., Hille, C., Matera, S. & Reuter, K. Function-space-based solution scheme for the size-modified Poisson–Boltzmann equation in full-potential DFT. _J. Chem. Theory Comput._ 12, 4052–4066 (2016).
* Onufriev, A. V. & Case, D. A. Generalized Born implicit solvent models for biomolecules. _Annu. Rev. Biophys._ 48, 275–296 (2019).
* Xie, L. & Liu, H. The treatment of solvation by a generalized Born model and a self-consistent charge-density functional theory-based tight-binding method. _J. Comput. Chem._ 23, 1404–1415 (2002).
* Isert, C., Atz, K., Jiménez-Luna, J. & Schneider, G. QMugs, quantum mechanical properties of drug-like molecules. _Sci. Data_ 9, 273 (2022).
* Chai, J.-D. & Head-Gordon, M. Systematic optimization of long-range corrected hybrid density functionals. _J. Chem. Phys._ 128, 084106 (2008).
* Stuke, A. _et al_. Atomic structures and orbital energies of 61,489 crystal-forming organic molecules. _Sci. Data_ 7, 58 (2020).
* Tkatchenko, A. & Scheffler, M. Accurate molecular van der Waals interactions from ground-state electron density and free-atom reference data. _Phys. Rev. Lett._ 102, 073005 (2009).
* Sinstein, M. _et al_. Efficient implicit solvation method for full potential DFT. _J. Chem. Theory Comput._ 13, 5582–5603 (2017).
* Axelrod, S. & Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. _Sci. Data_ 9, 185 (2022).
* Subramanian, G., Ramsundar, B., Pande, V. & Denny, R. A. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. _J. Chem. Inf. Model._ 56, 1936–1949 (2016).
* Ehlert, S., Stahn, M., Spicher, S. & Grimme, S. Robust and efficient implicit solvation model for fast semiempirical methods. _J. Chem. Theory Comput._ 17, 4250–4261 (2021).
* Barone, V. & Cossi, M. Quantum calculation of molecular energies and energy gradients in solution by a conductor solvent model. _J. Phys. Chem. A_ 102, 1995–2001 (1998).
* Eastman, P. _et al_. SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials. _Sci. Data_ 10, 11 (2023).
* Pracht, P., Bohle, F. & Grimme, S. Automated exploration of the low-energy chemical space with fast quantum chemical methods. _Phys. Chem. Chem. Phys._ 22, 7169–7192 (2020).
* Elstner, M. _et al_. Self-consistent-charge density-functional tight-binding method for simulations of complex materials properties. _Phys. Rev. B_ 58, 7260–7268 (1998).
* Gaus, M., Cui, Q. & Elstner, M. DFTB3: Extension of the self-consistent-charge density-functional tight-binding method (SCC-DFTB). _J. Chem. Theory Comput._ 7, 931–948 (2011).
* Tkatchenko, A., DiStasio, R. A. Jr, Car, R. & Scheffler, M. Accurate and efficient method for many-body van der Waals interactions. _Phys. Rev. Lett._ 108, 236402 (2012).
* Ambrosetti, A., Reilly, A. M., DiStasio, R. A. Jr & Tkatchenko, A. Long-range correlation energy calculated from coupled atomic response functions. _J. Chem. Phys._ 140, 18A508 (2014).
* Stöhr, M., Michelitsch, G. S., Tully, J. C., Reuter, K. & Maurer, R. J. Communication: Charge-population based dispersion interactions for molecules and materials. _J. Chem. Phys._ 144, 151101 (2016).
* Mortazavi, M., Brandenburg, J. G., Maurer, R. J. & Tkatchenko, A. Structure and stability of molecular crystals with many-body dispersion-inclusive density functional tight binding. _J. Phys. Chem. Lett._ 9, 399–405 (2018).
* Havu, V., Blum, V., Havu, P. & Scheffler, M. Efficient O(N) integration for all-electron electronic structure calculation using numeric basis functions. _J. Comput. Phys._ 228, 8367–8379 (2009).
* Gaulton, A. _et al_. ChEMBL: a large-scale bioactivity database for drug discovery. _Nucleic Acids Res._ 40, D1100–D1107 (2012).
* Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. _Nat. Chem._ 4, 90–98 (2012).
* Landrum, G. _et al_. RDKit: Open-source cheminformatics. https://www.rdkit.org (2020).
* Landrum, G. _et al_. rdkit/rdkit: 2020_03_1 (q1 2020) release https://doi.org/10.5281/zenodo.3732262 (2020).
* Halgren, T. A. Merck molecular force field. I. basis, form, scope, parameterization, and performance of MMFF94. _J. Comput. Chem._ 17, 490–519 (1996).
* Halgren, T. A. Merck molecular force field. II. MMFF94 van der Waals and electrostatic parameters for intermolecular interactions. _J. Comput. Chem._ 17, 520–552 (1996).
* Halgren, T. A. Merck molecular force field. III. molecular geometries and vibrational frequencies for MMFF94. _J. Comput. Chem._ 17, 553–586 (1996).
* Halgren, T. A. & Nachbar, R. B. Merck molecular force field. IV. conformational energies and geometries for MMFF94. _J. Comput. Chem._ 17, 587–615 (1996).
* Halgren, T. A. Merck molecular force field. V. extension of MMFF94 using experimental data, additional computational data, and empirical rules. _J. Comput. Chem._ 17, 616–641 (1996).
* Cremer, J., Medrano Sandonas, L., Tkatchenko, A., Clevert, D.-A. & De Fabritiis, G. Equivariant graph neural networks for toxicity prediction. _Chem. Res. Toxicol._ 36, 1561–1573 (2023).
* Bell, E. W. & Zhang, Y. DockRMSD: an open-source tool for atom mapping and RMSD calculation of symmetric molecules through graph isomorphism. _J. Cheminformatics_ 11, 40 (2019).
* Gaus, M., Goez, A. & Elstner, M. Parametrization and benchmark of DFTB3 for organic molecules. _J. Chem. Theory Comput._ 9, 338–354 (2013).
* Gaus, M., Lu, X., Elstner, M. & Cui, Q. Parameterization of DFTB3/3OB for sulfur and phosphorus for chemical and biological applications. _J. Chem. Theory Comput._ 10, 1518–1537 (2014).
* Aradi, B., Hourahine, B. & Frauenheim, T. DFTB+, a sparse matrix-based implementation of the DFTB method. _J. Phys. Chem. A_ 111, 5678–5684 (2007).
* Larsen, A. H. _et al_. The atomic simulation environment – a Python library for working with atoms. _J. Phys. Condens. Matter_ 29, 273002 (2017).
* Perdew, J. P., Ernzerhof, M. & Burke, K. Rationale for mixing exact exchange with density functional approximations. _J. Chem. Phys._ 105, 9982–9985 (1996).
* Adamo, C. & Barone, V. Toward reliable density functional methods without adjustable parameters: The PBE0 model. _J. Chem. Phys._ 110, 6158–6170 (1999).
* Ringe, S., Oberhofer, H. & Reuter, K. Transferable ionic parameters for first-principles Poisson-Boltzmann solvation calculations: Neutral solutes in aqueous monovalent salt solutions. _J. Chem. Phys._ 146, 134103 (2017).
* Blum, V. _et al_. Ab initio molecular simulations with numeric atom-centered orbitals. _Comp. Phys. Commun._ 180, 2175–2196 (2009).
* Ren, X. _et al_. Resolution-of-identity approach to Hartree–Fock, hybrid density functionals, RPA, MP2 and GW with numeric atom-centered orbital basis functions. _New J. Phys._ 14, 053020 (2012).
* Medrano Sandonas, L. _et al_. Aquamarine: Quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules. _ZENODO_ https://doi.org/10.5281/zenodo.10208010 (2024).
* Ho, B. K. & Dill, K. A. Folding very short peptides using molecular dynamics. _PLOS Comput. Biol._ 2, 1–10 (2006).
* Ringe, S. _et al_. Understanding cation effects in electrochemical CO2 reduction. _Energy Environ. Sci._ 12, 3001–3014 (2019).
* Abidi, N., Lim, K. R. G., Seh, Z. W. & Steinmann, S. N. Atomistic modeling of electrocatalysis: Are we there yet? _WIREs Comput. Mol. Sci._ 11, e1499 (2021).
* Gauthier, J. A. _et al_. Unified approach to implicit and explicit solvent simulations of electrochemical reaction energetics. _J. Chem. Theory Comput._ 15, 6895–6906 (2019).
* Ringe, S., Hörmann, N. G., Oberhofer, H. & Reuter, K. Implicit solvation methods for catalysis at electrified interfaces. _Chem. Rev._ 122, 10777–10820 (2022).
* Hawkins, P. C., Skillman, A. G., Warren, G. L., Ellingson, B. A. & Stahl, M. T. Conformer generation with OMEGA: algorithm and validation using high quality structures from the Protein Databank and Cambridge Structural Database. _J. Chem. Inf. Model._ 50, 572–584 (2010).
* Wang, S., Witek, J., Landrum, G. A. & Riniker, S. Improving conformer generation for small rings and macrocycles based on distance geometry and experimental torsional-angle preferences. _J. Chem. Inf. Model._ 60, 2044–2058 (2020).
* Spellmeyer, D. C., Wong, A. K., Bower, M. J. & Blaney, J. M. Conformational analysis using distance geometry methods. _J. Mol. Graph. Model._ 15, 18–36 (1997).
* Kanal, I. Y., Keith, J. A. & Hutchison, G. R. A sobering assessment of small-molecule force field methods for low energy conformer predictions. _Int. J. Quantum Chem._ 118, e25512 (2018).
* Ernzerhof, M. & Scuseria, G. E. Assessment of the Perdew–Burke–Ernzerhof exchange-correlation functional. _J. Chem. Phys._ 110, 5029–5036 (1999).
* Lynch, B. J. & Truhlar, D. G. Robust and affordable multicoefficient methods for thermochemistry and thermochemical kinetics: the MCCM/3 suite and SAC/3. _J. Phys. Chem. A_ 107, 3898–3906 (2003).
* Reilly, A. M. & Tkatchenko, A. Understanding the role of vibrations, exact exchange, and many-body van der Waals interactions in the cohesive properties of molecular crystals. _J. Chem. Phys._ 139, 024705 (2013).
* Hoja, J. _et al_. Reliable and practical computational description of molecular crystal polymorphs. _Sci. Adv._ 5, eaau3338 (2019).
* Góger, S., Medrano Sandonas, L., Müller, C. & Tkatchenko, A. Data-driven tailoring of molecular dipole polarizability and frontier orbital energies in chemical compound space. _Phys. Chem. Chem. Phys._ 25, 22211–22222 (2023).
* Medrano Sandonas, L. _et al_. “Freedom of design” in chemical compound space: towards rational in silico design of molecules with targeted quantum-mechanical properties. _Chem. Sci._ 14, 10702–10717 (2023).
* Fallani, A., Medrano Sandonas, L. & Tkatchenko, A. Enabling inverse design in chemical compound space: Mapping quantum properties to structures for small organic molecules. _ArXiv_ https://doi.org/10.48550/arXiv.2309.00506 (2023).

ACKNOWLEDGEMENTS LMS and AT acknowledge financial support from Janssen Pharmaceuticals


(Aquamarine project). AF and MH are grateful for financial support from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Innovative Training


Network - European Industrial Doctorate grant agreement No 956832, “Advanced Machine Learning for Innovative Drug Discovery” (AIDD). The results presented in this publication have been


partially obtained using the HPC facilities of the University of Luxembourg and Meluxina supercomputer (PoC project). This research also used computational resources provided by the Center


for Information Services and High-Performance Computing (ZIH) at TU Dresden. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Department of Physics and Materials Science, University of


Luxembourg, L-1511, Luxembourg City, Luxembourg Leonardo Medrano Sandonas, Alessio Fallani, Mathias Hilfiker & Alexandre Tkatchenko * Institute for Materials Science and Max Bergmann


Center of Biomaterials, TU Dresden, 01062, Dresden, Germany Leonardo Medrano Sandonas * Drug Discovery Data Sciences (D3S), Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium


Dries Van Rompaey, Alessio Fallani, Jonas Verhoeven, Joerg Kurt Wegner & Hugo Ceulemans * Computational Chemistry, Janssen Pharmaceutica NV, Turnhoutseweg 30, 2340, Beerse, Belgium David


Hahn, Laura Perez-Benito & Gary Tresadern * Drug Discovery Data Sciences (D3S), Johnson & Johnson Innovative Medicine, 301 Binney Street, MA 02142, Cambridge, USA Joerg Kurt Wegner


Authors * Leonardo Medrano Sandonas * Dries Van Rompaey * Alessio Fallani * Mathias Hilfiker * David Hahn * Laura Perez-Benito * Jonas Verhoeven * Gary Tresadern * Joerg Kurt Wegner * Hugo


Ceulemans * Alexandre Tkatchenko CONTRIBUTIONS D.V.R. and J.V. selected relevant compounds from public datasets to include in the dataset, with input from G.T. L.M.S. generated the 3D


molecular structures with CREST/xTB and DFTB3+MBD. D.V.R., L.P.B., and D.H. generated the additional molecular structures with RDKit, Maestro, and Omega. L.M.S. performed the PBE0+MBD


calculations in gas phase and implicit water for all structures. L.M.S. and D.V.R. designed and wrote the manuscript. A.F. and M.H. contributed to the curation and technical validation of


the dataset. A.T., J.K.W., and H.C. supervised and revised all stages of the work. All authors discussed the results and contributed to the final manuscript. CORRESPONDING AUTHORS


Correspondence to Leonardo Medrano Sandonas, Dries Van Rompaey or Alexandre Tkatchenko. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL


INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION


RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution


and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if


changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the


material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to


obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS


ARTICLE Medrano Sandonas, L., Van Rompaey, D., Fallani, A. _et al._ Dataset for quantum-mechanical exploration of conformers and solvent effects in large drug-like molecules. _Sci Data_ 11,


742 (2024). https://doi.org/10.1038/s41597-024-03521-8 * Received: 18 March 2024 * Accepted: 13 June 2024 * Published: 07 July 2024 * DOI:


https://doi.org/10.1038/s41597-024-03521-8