Outbreak.info Research Library: a standardized, searchable platform to discover and explore COVID-19 resources


ABSTRACT

Outbreak.info Research Library is a standardized, searchable interface of coronavirus disease 2019 (COVID-19) and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)


publications, clinical trials, datasets, protocols and other resources, built with a reusable framework. We developed a rigorous schema to enforce consistency across different sources and


resource types and linked related resources. Researchers can quickly search the latest research across data repositories, regardless of resource type or repository location, via a search


interface, public application programming interface (API) and R package.

MAIN

In January 2020, SARS-CoV-2 was


identified as the virus responsible for a series of pneumonia cases with unknown origin1. As the virus spread globally, the scientific community rapidly released research outputs (such as


publications, clinical trials and datasets) and resources (websites, portals and more). The frequently uncoordinated generation and curation of resources exacerbated four challenges in


finding and using them: volume, fragmentation, variety and standardization (Supplementary Fig. 1). While many specialized websites were developed independently2,3,4,5,6,7, the lack of a centralized and standardized repository for finding COVID-19 research has limited researchers’ ability to discover these resources and translate them into insights about the virus. To address the fragmented


research landscape, individual and community efforts created shared Google spreadsheets8,9,10 to aid in discoverability, but these efforts were not scalable and often lacked metadata to


promote findability (aside from Navarro and Capdarest-Arest10). Several projects attempted to address the volume and fragmentation issues through large-scale aggregation but failed to tackle


variety, focusing on a single resource type such as publications11,12. Even within a particular type of resource, standardization issues abound. Repositories pivoted quickly to curate


COVID-19 content from their collections using pre-existing metadata standards but were often not interoperable with other sources. For example, PubMed created LitCovid13 based on their


MEDLINE standards, and the National Clinical Trials Registry cataloged COVID-19 clinical trials using their schema14, but the World Health Organization (WHO) International Clinical Trials


Registry uses different conventions. Similarly, Zenodo15 and Figshare16 do not agree on the marginality, cardinality and property names17,18, despite compatibility with the standards of


https://schema.org. We address issues in metadata volume, variety, standardization and fragmentation by creating a single searchable index of COVID-19 publications, clinical trials, datasets


and more: the outbreak.info Research Library. To address variety and standardization, we developed a harmonized schema based on https://schema.org, a framework standardizing metadata across


the internet. Using this schema, we harvested and harmonized metadata from 16 resources (Fig. 1a). Daily updates ensure that site users have up-to-date information, essential amid a


constantly changing research landscape. Next, to address volume and fragmentation, we developed a web-based search portal for researchers to browse across the centralized and standardized


resources (https://outbreak.info/resources) and an API to access and analyze information en masse (https://api.outbreak.info). Within the search interface, users can search, filter and view


related records and share the associated metadata to easily query across resource repositories and types. For instance, a single query (for example, ‘Delta variant’) to our API can return


relevant publications, datasets, clinical trials and more (Fig. 1b), and the Research Library summarizes the search results in visualizations to promote exploration.
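As a rough illustration of this kind of cross-resource query, the following Python sketch calls the public resources API. It assumes a BioThings-style query endpoint under api.outbreak.info/resources (here /resources/query) and standard parameters, so the endpoint name and parameters should be treated as illustrative rather than a documented client.

    # Hedged sketch: search the Research Library for 'Delta variant' across all
    # resource types via the public API; endpoint and parameters are assumptions.
    import requests

    def search_resources(query, size=10):
        """Return the top `size` harmonized records matching `query`."""
        resp = requests.get(
            "https://api.outbreak.info/resources/query",
            params={"q": query, "size": size},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("hits", [])

    for hit in search_resources("Delta variant"):
        print(hit.get("@type"), "|", hit.get("name"))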


For instance, the histogram in Fig. 1b indicates that the number of resources mentioning ‘Delta variant’ began growing in mid-2021 and declined in the summer of 2022, and the donut charts show that LitCovid


is the dominant source. To ensure ease of use of our Research Library, we conducted usability studies and iteratively improved our site (Supplementary Fig. 2). To further address


fragmentation and maintenance issues, we use modular infrastructure, allowing easy addition of new data sources, including community contributions. Citizen scientists have played an active


role in data collection19 (https://covidsample.org/) and accessibility12,20 throughout the pandemic. Given the highly fragmented, diffuse and frequently changing nature inherent to


biomedical research, we built in three mechanisms to expand the Research Library through community participation (Supplementary Fig. 3a). First, contributors can submit individual or


multiple datasets via an online form that ensures that the curated metadata conform to our schema. Second, leveraging the benefits of human curation, metadata contributed by the community via


the form can be exhaustively detailed (Supplementary Fig. 3b) and can further be augmented through pull requests on GitHub. Lastly, anyone with Python coding skills can submit collections


of standardized datasets, publications and other resources to the Outbreak Resources API by contributing a resource parser. Our community-contribution pipeline allows us to integrate the


uncoordinated data-curation efforts, which were particularly apparent at the start of the pandemic, quickly and flexibly (Supplementary Fig. 4). To support resource exploration and interpretation, we


added properties (value-added metadata) to every class in our schema that would support searching, filtering and browsing (topicCategories, Supplementary Fig. 5a); linkage and exploration


(correction, citedBy, isBasedOn, isRelatedTo; Supplementary Fig. 5b); and interpretation (qualitative evaluations) of resources. We selected these properties based on pre-existing citizen


science- and resource-curation activities, suggesting their value in promoting discoverability. For example, citizen scientists categorized resources in their lists or collections by type


(Dataset, ClinicalTrials, etc.)10 or by area of research (epidemiological, prevention, etc.)20, as they found these classifications helpful for searching, filtering and browsing


their lists or collections. Given the ability of citizen scientists to perform information extraction21 and their immense contributions to classification tasks22, we incorporated citizen


science contributions into the training data for classifying resources into topic categories. Citizen scientists also provided Oxford 2011 Levels of Evidence annotations to improve resource


interpretability (that is, understanding the credibility or quality of the resource)20. To further enable assessment of the quality of a resource, we leveraged Digital Science’s Altmetric


ratings23. Finally, we integrated resources with the analyses that we developed to track SARS-CoV-2 variants of concern (VOCs)24, sets of mutations within the virus associated with increased


transmissibility, virulence and/or immune evasion. Researchers can seamlessly traverse from a specific variant report such as Omicron to resources in the Research Library that help


understand its behavior (Supplementary Fig. 5c), and variant searches are among our most commonly queried terms (Supplementary Table 1a). Without a centralized search interface with linked


records such as outbreak.info, a similar attempt to explore resources would require extensive manual searching from multiple different sites (Supplementary Fig. 6), each with their own


interfaces and corresponding search capabilities. To demonstrate the unique features of the outbreak.info Research Library, we explored the dynamics of research into SARS-CoV-2 variants over


time to address two key questions: (1) how has the research community responded to the emergence of new variants and (2) how has that response changed over time? We extracted research


related to variants in the Research Library using the query ‘variant OR lineage’, allowing us to query metadata from 16 sources of different research types simultaneously (Fig. 2a). Over


10,000 separate entries about variants are within the Library as of October 2022, including publications, datasets, clinical trials, protocols and more. Using filters and the quality metrics


provided through Altmetric badges, we quickly identified which results have been recognized by the community via Altmetric scores, such as a quantitative PCR protocol with reverse


transcription (RT–qPCR) to screen VOCs (Fig. 2b). Clearly, variants are an active area of research, but has this enthusiasm changed over time? Using the outbreak.info R package, we accessed


the harmonized metadata to examine the proportion of research related to variants in the Research Library over time. We observed an increase in research on variants following the first


identification of VOCs such as Alpha (B.1.1.7*) and Beta (B.1.351*) (Fig. 2c). This increase was even more prominent for the Omicron (B.1.1.529*) variant in late 2021; we hypothesize that


this increase was due to the heightened awareness of the value in studying variants among the scientific community, and early indications that the variant could be of global concern (high


growth rate of Omicron and the presence of many mutations in important sites). To examine how research differed by VOC over time, we constructed queries for each VOC, including its Pango


lineage name and associated sublineages. With the three VOCs that became the dominant worldwide form of SARS-CoV-2 (Alpha, Delta and Omicron), we find that the increase in research on these


VOCs mirrors the rise in worldwide prevalence for each variant, with the research output roughly proportional to global prevalence (Fig. 2d). With Alpha and Delta, there was a slight lag in


research publications that was not observed with Omicron, and research on Omicron over the last 10 months has dwarfed that for the other VOCs. Lastly, research on previously circulating


variants (Alpha, Beta, Gamma, Delta) continues, even though these variants are rarely detected presently, and focuses on retrospective analyses, fundamental studies on mechanisms of action,


Omicron comparisons and studies of recombinant variants. In sum, the research community’s response to the emergence of new variants has been robust, has become a greater focus of overall


research effort over the last year and quickly pivots to studying the dominant variant. The outbreak.info Research Library and resources API have been widely used by the external community,


including journalists, members of the medical and public health communities, students and biomedical researchers25. For instance, the RADx-Rad Data Coordination Center created the


SearchOutbreak app (https://searchoutbreak.netlify.app), which uses the Outbreak API to collect articles for customized research digests for its partners26. On average, the Research Library


receives nearly 3,000 pageviews per month, of which 85% are unique visitors (Supplementary Table 1b). The Research Library site has been used for over 11,000 unique searches, and the


Research Library API receives an average of nearly 63,000 unique hits per month (including web traffic and programmatic access). Some limitations of the Research Library include incomplete


or unstructured metadata descriptions provided by the sources and the difficulty of optimally querying these descriptions, which often include acronyms and synonyms. Future work will focus on augmenting the


harvested metadata and optimizing search results to provide the most salient results to users. While the unprecedented amount of research on COVID-19 offers new opportunities to accelerate


the pace of research, the difficulty in finding research amid this ‘infodemic’ remains a fundamental challenge. In the outbreak.info Research Library, we address many of these challenges to


assemble a collection of heterogeneous research outputs and data from distributed data sources into a searchable platform. Our metadata-processing platform is modular, allowing easy


extension to add new metadata sources, including contributions from the community, so that the Research Library can grow with the pandemic as research changes. To enable further analysis, we


enable programmatic access to the standardized library. Lastly, as open science is increasingly embraced and research outputs are stored in decentralized sources, quickly finding information will be critical for the next


pandemic. Our approach to unify metadata across repositories will serve as a template for rapidly creating a unified search interface to aggregate research outputs for any pathogen or any


research domain.

METHODS

SCHEMA DEVELOPMENT

The development of the schema for standardizing our collection of resources is as previously described27. Briefly, we prioritized six classes of


resources that had seen a rapid expansion at the start of the pandemic due to their importance to the research community: publications, datasets, clinical trials, analyses, protocols and


computational tools. We identified the most closely related classes from https://schema.org and mapped their properties to available metadata from two to five of the most prolific sources.


Additionally, we identified subclasses that were needed to support the aforementioned six classes and standardized the properties within each class. In addition to standardizing


ready-to-harvest metadata, we created new properties that would support the linkage, exploration and evaluation of our resources. Our schema was then refined as we iterated through the


available metadata when assembling COVID-19 resources. For example, publication providers such as PubMed typically use the ‘author’ property in their metadata, while dataset providers such


as Figshare and Zenodo are compliant with the DataCite schema and typically prefer ‘creator’. Although both properties are valid for their respective https://schema.org classes, we


normalized our schema to use ‘author’ for all six of our classes (Dataset, ClinicalTrial, Analysis, Protocol, Publication, ComputationalTool), because we expected the volume of publications


to dwarf that of all other classes of resources.
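As a small illustration of this normalization step, the sketch below maps DataCite-style ‘creator’ entries onto the schema’s ‘author’ property; the record structure and field contents are assumptions for illustration, not the exact harvested payloads.

    # Hedged sketch of normalizing DataCite-style 'creator' metadata (e.g., from
    # Figshare or Zenodo) onto the 'author' property used across all six classes.
    def normalize_authors(record):
        """Rename 'creator' to 'author', coercing entries to Person objects."""
        if "creator" in record and "author" not in record:
            creators = record.pop("creator")
            record["author"] = [
                {"@type": "Person", "name": c.get("name", "") if isinstance(c, dict) else c}
                for c in creators
            ]
        return record

    print(normalize_authors({"creator": [{"name": "Doe, J."}], "name": "Example dataset"}))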


We added this schema to the Schema Registry of the Data Discovery Engine (DDE)27, a project to share and reuse schemas and register datasets according to a particular schema. The Outbreak schema is available at https://discovery.biothings.io/view/outbreak.

ASSEMBLY OF COVID-19 RESOURCES

The resource metadata pipeline for


outbreak.info includes two ways to ingest metadata (Supplementary Fig. 7). First, metadata can be ingested from other resource repositories or collections using the BioThings SDK28 data


plugins. By leveraging the BioThings SDK, we developed a technology stack that addresses the fragmentation issue by easily integrating metadata from different pre-existing resources. For


each resource repository or collection, a parser or data plugin enables automated import and updates from that resource. To import the data, the metadata is harvested from the source using


API calls (if available), HTML web scraping or .CSV or .TXT tables of metadata. All structured metadata provided by the sources is compiled and mapped to our schema using custom Python


scripts. The harmonized metadata is dumped into a JSON output.
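The sketch below shows the general shape of such a parser: a generator that reads a source dump and yields one schema-conformant JSON document per record. The function name, source file and column names are hypothetical; the real data plugins live in the outbreak-info GitHub organization.

    # Hedged sketch of a BioThings SDK-style data plugin parser that yields one
    # harmonized document per record; file and column names are hypothetical.
    import csv
    import os

    def load_annotations(data_folder):
        """Yield schema-conformant documents from a hypothetical CSV metadata dump."""
        source_file = os.path.join(data_folder, "records.csv")
        with open(source_file, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                yield {
                    "_id": row["identifier"],   # unique key used by the merged index
                    "@type": "Dataset",         # one of the six schema classes
                    "name": row["title"],
                    "description": row.get("abstract", ""),
                    "author": [
                        {"@type": "Person", "name": a.strip()}
                        for a in row.get("authors", "").split(";")
                        if a.strip()
                    ],
                    "curatedBy": {"name": "example-source", "url": row.get("url")},
                }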


Supplementary Fig. 8 shows the completeness of each metadata property within our schema, broken down by resource type (data are provided in Supplementary Table 2). Data plugin code for the sources is available at https://github.com/outbreak-info (Code availability). In the second mechanism, metadata for individually


curated resources can be submitted via an online form through the DDE Metadata Registry27. To assemble the outbreak.info collection of resources, we collected a list of over a hundred


separate resources on COVID-19 and SARS-CoV-2. This list (Supplementary Table 3) included generalist open data repositories, biomedical-specific data projects including those recommended by


the NIH29 and the NSF30 to house open data and individual websites that we came across through search engines and other COVID-19 publications. Prioritizing those resources that had a large


number of resources related to COVID-19, we selected an initial set of two to three sources per resource type to import into our collection. Given the lack of widespread repositories for


analysis resources, only one source was included in our initial import (Imperial College London31). An analysis resource is defined as a frequently updated, web-based data


visualization, data interpretation and/or data analysis resource.

CREATION OF THE RESEARCH LIBRARY API AND QUERY INTERFACE

To accommodate a large number of heterogeneous data sources, each


of which is independently harvested, we used the BioThings SDK framework to merge the data sources into a single, publicly searchable index (Supplementary Fig. 7). The JSON outputs of our


data plugins are ingested by the BioThings framework and merged into an intermediary MongoDB database, and the processed data are indexed in an Elasticsearch index that can be accessed


through our public API (api.outbreak.info). The BioThings SDK plugin architecture handles errors in individual parsers without affecting the availability of the API itself. Errors thrown by


individual parsers may result in a lack of updates of an individual resource until the error is resolved, but the API will serve the latest version of data from the broken parser and


up-to-date data from all functional parsers, which will continue to be updated independently. Using the plugin architecture also allows the creation and maintenance of individual resource


parsers to be crowdsourced to anyone with basic Python knowledge and a GitHub account. Although resource plugins allow outbreak.info to ingest large amounts of standardized metadata, there


are still many individual datasets and research outputs scattered throughout the web that are not located in large repositories. As it is not feasible for one team to locate, identify and


collect standardized metadata from these individual datasets and research outputs, we leveraged the DDE27 to enable crowdsourcing and citizen science participation in the curation of


individual resource metadata. A Tornado server is used to create an API endpoint, api.outbreak.info/resources, that leverages the search capabilities of Elasticsearch to efficiently query


data. Elasticsearch sorts search results by relevance based on Lucene’s Practical Scoring Function32, which combines the query normalization factor, coordination factor,


term frequency, inverse document frequency and any custom query-boosting fields selected by the user33. To adjust this behavior based on common search patterns, we upweighted queries for


which the search term occurs in the name field and/or the name of a clinical trial therapeutic intervention (for example, ‘remdesivir’) with the following parameters: weight of 4 for ‘name’


and 3 for ‘interventions.name’. We continue to monitor common query patterns using our analytics to refine the scoring algorithm to improve the list of results for the user.
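To make the boosting concrete, the sketch below expresses those weights in the standard Elasticsearch query DSL and posts the query to a hypothetical local index. The Tornado handler may construct the production query differently, so this illustrates the ‘field^boost’ idea rather than the exact server-side query.

    # Hedged sketch: Elasticsearch query with the field boosts described above
    # (name^4, interventions.name^3), posted to a hypothetical local index.
    import requests

    query_body = {
        "query": {
            "query_string": {
                "query": "remdesivir",
                "fields": ["name^4", "interventions.name^3", "description"],
                "default_operator": "AND",  # terms are combined with AND by default
            }
        }
    }
    resp = requests.post(
        "http://localhost:9200/outbreak-resources/_search",  # hypothetical index name
        json=query_body,
        timeout=30,
    )
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("name"))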


Within the web interface, the user has the option to sort by best match (relevance score), by the document’s update date or alphabetically by name. Within search queries, terms are automatically combined


by ‘AND’. For instance, the search ‘long COVID’ will be interpreted as ‘long AND COVID’. This search will find resources containing both terms, although not necessarily together; the


Elasticsearch default scoring function will first list resources that contain both words together and that frequently mention the terms. Exact phrases can be explicitly declared by


encapsulating the terms in quotes (for example, ‘long COVID’ to search only for the phrase ‘long COVID’). Additionally, terms can be combined by the term ‘OR’ (for example, (Moderna OR


Pfizer) AND (‘side effects’ OR ‘adverse effects’)). Further details on advanced searching behavior are provided in our guide to the outbreak.info R package at


https://outbreak-info.github.io/R-outbreak-info/articles/researchlibrary.html#some-notes-on-constructing-queries. Further optimization will be the subject of future work, based on continuing


analysis of analytic patterns for the most common search queries and filters to promote user-driven design. Additional work will also focus on creating an advanced query builder to make it


easier to combine terms by any combination of ‘AND’, ‘OR’ and ‘NOT’ and to help the user search for exact phrases. To update the API with new data provided by the data sources, the BioThings


Hub schedules daily updates to pull data from upstream sources and add them to the existing index. The BioThings Hub maintains each data source independently, isolating failures if an individual data source pipeline breaks, and it maintains historical data by default, creating automated backups. The code for the server-side application is available at


https://github.com/outbreak-info/outbreak.api (https://doi.org/10.5281/zenodo.7343503).

OUTBREAK.INFO RESEARCH LIBRARY WEB APPLICATION AND METADATA ACCESS

The web application was built using


Vue.js, a model–view–viewmodel JavaScript framework that enables the two-way binding of user interface elements and the underlying data, allowing the user interface to reflect any changes in


underlying data and vice versa. The client-side application uses the high-performance API to interactively perform operations on the database. To iteratively improve the interface, we


conducted usability studies as described in Supplementary Fig. 2. The code for the client-side application is available at https://github.com/outbreak-info/outbreak.info


(https://doi.org/10.5281/zenodo.7343497). To enable programmatic access to all our harmonized metadata collection, all data are available in our API (api.outbreak.info) and can be accessed


through an R package as described by Gangavarapu et al.24 (package website, https://outbreak-info.github.io/R-outbreak-info/; code, https://github.com/outbreak-info/R-outbreak-info,


https://doi.org/10.5281/zenodo.7343501).

COMMUNITY CURATION OF RESOURCE METADATA

Resource plugins such as those used in the assembly of COVID-19 resources do not necessarily have to be built


by our own team. We used the BioThings SDK28 and the DDE27 so that individual resource collections can be added by writing BioThings plugins that conform to our schema. Expanding available


classes of resources can be easily carried out by extending other classes from https://schema.org via the DDE Schema Playground at https://discovery.biothings.io/schema-playground. Community


contributions of resource plugins can be carried out via GitHub. In addition to contributing resource plugins for collections or repositories of metadata, users can enter metadata for


individual resources via the automatic guides created by the DDE. To investigate potential areas of community contribution, we asked two volunteers to inspect 30 individual datasets


sprinkled around the web and collect the metadata for these datasets. We compared the results between the two volunteers, and their combined results were subsequently submitted into the


collection via the DDE’s Outbreak Data Portal Guide at https://discovery.biothings.io/guide/outbreak/dataset. Although limited by the original submission form (Google forms), the raw and


merged responses illustrating the thoroughness of the submissions from the two volunteers can be found at


https://docs.google.com/spreadsheets/d/1q1c400UFIOyXedFf2L81zROVkXi3BWBhU46Ic0cMYsI/edit?usp=sharing. Although both of our volunteers provided values for many of the available metadata


properties (name, description, topicCategories, keywords, etc.), one provided an extensive list of authors. Using the BioThings SDK in conjunction with the DDE allows us to centralize and


leverage individualized curation efforts that often occur at the start of a pandemic. Improvements or updates for manually curated metadata can be submitted via GitHub pull requests.


COMMUNITY CURATION OF SEARCHING, LINKAGE AND EVALUATION METADATA AND SCALING WITH MACHINE LEARNING

In an effort to enable improved searching and filtering, we developed a nested list of


thematic or topic-based categories based on an initial list developed by LitCovid13 with input from the infectious disease research community and volunteer curators. The list consists of 11


broad categories and 24 specific child categories. LitCovid organized publications into eight research areas such as treatments or prevention, but these classifications are not available in


the actual metadata records for each publication. To obtain these classifications from LitCovid, subsetted exports of identifiers were downloaded from LitCovid and then mapped to the


metadata records from PubMed. Whenever possible, sources with thematic categories were mapped to our list of categories to develop a training set for basic binary (in-group–out-group)


classifications based on required metadata fields (title, abstract and/or description). If an already curated training set could not be found for a broad category, it was created


using an iterative process involving term–phrase searching on LitCovid, evaluating the specificity of the results, identifying new search terms by keyword frequency and repeating the


process. To generate training data for classifying resources into specific topic categories, the results from several approaches were combined. These approaches include direct mapping from


LitCovid research areas, keyword mapping from LitCovid, logical mapping from NCT ClinicalTrials metadata, the aforementioned term search iteration and citizen science curation of Zenodo and


Figshare datasets. Details on the logical mapping from NCT ClinicalTrials metadata can be found at https://github.com/gtsueng/outbreak_CT_classifier (https://doi.org/10.5281/zenodo.7442988).


The keyword mapping from LitCovid can be found at https://github.com/outbreak-info/topic_classifier/tree/main/data/keyword and


https://github.com/outbreak-info/topic_classifier/tree/main/data/subtopics/keywords. While positive categorical data were identified via the aforementioned methods, negative controls were


generated by randomly selecting from alternative topics and ensuring no overlap. The categorical data were randomly split into training (80%) and test (20%) sets per test, and five tests


were performed per topic by applying out-of-the-box logistic regression, multinomial naive Bayes and random forest classifiers from scikit-learn. These three algorithms were found to


perform best on this binary classification task using out-of-the-box tests. Topics were only added to the record if all three methods agreed on the classification. The set size and test


results using scikit-learn defaults for each algorithm, topic and subtopic across the five test runs can be found at


https://github.com/outbreak-info/topic_classifier/blob/main/results/in_depth_classifier_test.tsv.
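A minimal sketch of this per-topic procedure is shown below: three out-of-the-box scikit-learn models are trained on an 80/20 split, and the topic is assigned only when all three agree. The TF-IDF features and variable names are assumptions for illustration; the repository’s exact feature extraction may differ.

    # Hedged sketch of the per-topic binary (in-group/out-group) classification:
    # three out-of-the-box models must agree before a topic is assigned.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def assign_topic(texts, labels, new_texts):
        """Train three models on an 80/20 split; label new_texts only on unanimous agreement (label 1 = in-group)."""
        X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)
        models = [
            make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
            make_pipeline(TfidfVectorizer(), MultinomialNB()),
            make_pipeline(TfidfVectorizer(), RandomForestClassifier()),
        ]
        for model in models:
            model.fit(X_train, y_train)
            print("held-out accuracy:", model.score(X_test, y_test))
        votes = [model.predict(new_texts) for model in models]
        return [all(v[i] == 1 for v in votes) for i in range(len(new_texts))]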


The efforts of our two volunteers suggested that non-experts were capable of thematically categorizing datasets; therefore, we built a simple interface to allow citizen scientists to thematically classify the datasets that were available in our collection at that point in time.


Each dataset was assigned up to five topics by at least three different citizen scientists to ensure quality of the results. Citizen scientists were asked to prioritize specific topic


categories over broader ones. Ninety citizen scientists, recruited via either participation in the Mark2Cure project34 or a Scripps Research summer program, classified 530 datasets pulled from Figshare and Zenodo; recruiting through these channels increased the likelihood of quality submissions and decreased the likelihood of abuse and false information. The citizen science-curation site was


originally hosted at https://curate.outbreak.info. The code for the site can be found at https://github.com/outbreak-info/outbreak.info-resources/tree/master/citsciclassify. The citizen


science classifications can be found at https://github.com/outbreak-info/topic_classifier/blob/main/data/subtopics/curated_training_df.pickle. To evaluate the quality of the citizen


scientist classifications, we first filtered for classifications where at least two or three of the three to five curators agreed on the topic category. We then compared the results of their


classification with predictions by an out-of-the-box algorithm that was trained on LitCovid-classified abstracts. A total of 186 of 530 classifications did not agree and were manually


inspected; in only about 10% of cases (54 of 530) were the citizen scientists’ categorizations worse than the predictions, and, in many cases, the curators provided more precise categorization. Full


details of the evaluation are available at https://github.com/gtsueng/curate_outbreak_data (https://doi.org/10.5281/zenodo.7442949). These classifications have been incorporated into the


appropriate datasets in our collection and have been used to build our models for topic categorization. Basic in-group–out-group classification models were developed for each category using


out-of-the-box logistic regression, multinomial naive Bayes and random forest classifiers available from scikit-learn. The topic classifier can be found at


https://github.com/outbreak-info/topic_classifier (https://doi.org/10.5281/zenodo.7439573). In addition to community curation of topic categorizations, we identified a citizen science


effort, the COVID-19 Literature Surveillance Team (COVID-19 LST), that was evaluating the quality of COVID-19-related literature. The COVID-19 LST consists of medical students (many of whom were in their third or fourth year), practitioners and researchers who evaluate publications on COVID-19 based on the Oxford Levels of Evidence criteria and write bottom-line, up-front


summaries20. With their permission, we integrated their outputs (daily reports or summaries and Levels of Evidence evaluations) into our collection. Although the project has since ended, the


valuable work by this team was integrated without further evaluation due to their background and training. We further integrated our publications by adding structured linkage metadata,


connecting preprints and their peer-reviewed versions. We performed separate Jaccard similarity calculations on the title and/or text and on the authors for preprint (bioRxiv or medRxiv)35 versus


LitCovid publications. We identified thresholds with high precision and low sensitivity and binned the matches into two groups: matched preprint or peer-reviewed publication versus ‘needs


review’. We also leveraged NLM’s pilot preprint program to identify and incorporate additional matches.
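The following sketch shows the basic token-set Jaccard comparison behind that matching; the tokenization and the 0.8 threshold are illustrative assumptions rather than the pipeline’s tuned cutoffs.

    # Hedged sketch of token-set Jaccard similarity used to match preprints to
    # their peer-reviewed versions; the threshold shown is illustrative.
    def jaccard(a, b):
        """Jaccard similarity between the token sets of two strings."""
        set_a, set_b = set(a.lower().split()), set(b.lower().split())
        if not set_a and not set_b:
            return 0.0
        return len(set_a & set_b) / len(set_a | set_b)

    def classify_pair(preprint, pubmed_record, threshold=0.8):
        """Flag a preprint/peer-reviewed pair as matched or needing manual review."""
        title_sim = jaccard(preprint["title"], pubmed_record["title"])
        author_sim = jaccard(preprint["authors"], pubmed_record["authors"])
        return "matched" if min(title_sim, author_sim) >= threshold else "needs review"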


The code used for the preprint matching and the .XLSX file detailing the semi-automated and manual inspection of a sample of 1,500 matches from the results can be found at https://github.com/outbreak-info/outbreak_preprint_matcher


(https://doi.org/10.5281/zenodo.7439581). Briefly, a subsample of 1,500 preprint or peer-reviewed matches were inspected and confirmed to match via the preprint listed within the PubMed


record in the correction field (1,158 matches); manual inspection of preprint records, which listed the peer-reviewed publication (290 matches); and manual inspection of preprint and the


corresponding PubMed record and publication content (52 matches). The inspection confirmed that our threshold cutoff for preprint matching ensured the inclusion of a limited number of the


most accurate matches at the cost of many more potential but lower-quality matches. Expected matches were linked via the correction property in our schema.

CASE STUDY ON VARIANT RESEARCH

To


identify research about variants, we used the keyword phrase ‘variant OR lineage’ in the Research Library and within the R package outbreakinfo. For Fig. 2a, resources were counted by @type


(Publication, Dataset, ComputationalTool, ClinicalTrial, Protocol, Analysis). The number of resources was aggregated to the weekly level by the date of the latest update and normalized to


all resources within the Library for that week, creating a proportion of the Library for that week (Fig. 2c).
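The weekly normalization can be sketched as below with pandas on records pulled from the API; the ‘date’ field name follows the schema’s update date, but the retrieval details are assumptions.

    # Hedged sketch of the weekly aggregation behind Fig. 2c: count variant hits
    # per week and normalize by all Library records for that week.
    import pandas as pd

    def weekly_proportion(variant_hits, all_hits):
        """Proportion of Library records per week that match the variant query."""
        variant = pd.DataFrame(variant_hits)
        total = pd.DataFrame(all_hits)
        for df in (variant, total):
            df["week"] = pd.to_datetime(df["date"]).dt.to_period("W")
        counts = variant.groupby("week").size()
        totals = total.groupby("week").size()
        return (counts / totals).fillna(0)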


For variant-specific queries, the WHO-designated name was combined with its Pango lineage36 plus all descendants, as specified by the Pango team in October 2022 (https://raw.githubusercontent.com/cov-lineages/lineages-website/master/data/lineages.yml). To decrease


the likelihood of a spurious hit for the resource (for instance, a publication mentioning Alpha in the description but focusing only on Omicron), we used fielded queries to only search by


the name of the resource. For instance, for Gamma, the following query was used: name:Gamma OR name:‘P.1’ OR name:‘P.1.2’.
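A small helper for assembling those fielded queries might look like the sketch below; the lineage list is truncated for illustration.

    # Hedged sketch of building the fielded, name-only variant queries described above.
    def variant_query(who_name, lineages):
        """Build a fielded Lucene-style query restricted to the resource name."""
        terms = [f"name:{who_name}"] + [f'name:"{lin}"' for lin in lineages]
        return " OR ".join(terms)

    print(variant_query("Gamma", ["P.1", "P.1.2"]))
    # name:Gamma OR name:"P.1" OR name:"P.1.2"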


Code to replicate the analysis and visualizations is available at https://github.com/outbreak-info/outbreak-resources-paper/blob/main/Figure%204%20-%20Variant%20analysis.R.

HARMONIZATION AND INTEGRATION OF RESOURCES AND GENOMIC DATA

The integration of


genomic data from GISAID is discussed by Gangavarapu et al.24. We built separate API endpoints for our resources (metadata resource API) and genomics (genomic data API) using the BioThings


SDK28. Data are available via our API at http://api.outbreak.info and through our R package as described by Gangavarapu et al.24.

LIMITATIONS

While we have developed a framework for


addressing resource volume, fragmentation and variety that can be applicable to future pandemics, our efforts in building this framework exposed additional limitations in how data and metadata


are currently collected and shared. Researchers have embraced preprints, but resources (especially datasets and computational tools) needed to replicate and extend research results are not


linked in ways that are discoverable. Although many journals and funders have embraced dataset and source-code submission requirements, the publication of datasets and


software code is still heavily based in publications instead of in community repositories with well-described metadata to promote discoverability and reuse. In the outbreak.info Research


Library, the largest research output by far is publications, while dataset submission lags in standardized repositories encouraged by the NIH such as ImmPort, Figshare and Zenodo. We


hypothesize that this disparity between preprint and data sharing reflects the existing incentive structure, in which researchers are rewarded for writing papers and less for providing good,


reusable datasets. Ongoing efforts to improve metadata standardization and encourage schema adoption (such as the efforts in the Bioschemas community) will help make resources more


discoverable in the future, provided researchers adopt and use them. For this uptake to happen, fundamental changes in the incentive structure for sharing research outputs may be necessary.


As with many web-based, open-source resource sites, bugs and browser-compatibility issues may arise without notice for less-popular browsers. Users can bring these issues to our attention by


submitting them to our issue tracker on GitHub (https://github.com/outbreak-info/outbreak.info/issues).

COMPARISON OF THE OUTBREAK.INFO RESEARCH LIBRARY WITH OTHER RESOURCES

To illustrate


how our resource fits into the COVID-19 resource landscape, we compare features from our Research Library with other COVID-19 multisource aggregation efforts (Supplementary Table 4) and


provide a list of terms and features in Supplementary Table 5. We provide the most commonly searched sources (that is, filter by source) and resource types (that is, filter by resource type)


(Supplementary Table 1a). Usage statistics for record views and filtering by source are available in Supplementary Table 1b. Filtering was the most popular feature added to the Library,


with over a quarter of all queries using some sort of filtering (Supplementary Table 1c). Users were most likely to filter results by resource type, followed by keywords and source.


REPORTING SUMMARY

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

DATA AVAILABILITY

All metadata harvested and


harmonized in the outbreak.info Research Library is freely available through an API (http://api.outbreak.info/) and in an associated R package


(https://outbreak-info.github.io/R-outbreak-info/).

CODE AVAILABILITY

All code used to generate the outbreak.info Research Library is freely available on GitHub


(https://github.com/outbreak-info) under open-source licenses. The outbreak.info web application is available at https://github.com/outbreak-info/outbreak.info (version of the code used in


this paper is available at https://doi.org/10.5281/zenodo.7343497). The outbreak.info R package to access all the genomics and epidemiology data and Research Library metadata compiled and


standardized on outbreak.info is available at https://github.com/outbreak-info/R-outbreak-info (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7343501). The code to create the API (https://api.outbreak.info) to access Research Library metadata and case and death data is available at


https://github.com/outbreak-info/outbreak.api (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7343503). The harvester of bioRxiv and medRxiv preprint


publications is available at https://github.com/outbreak-info/biorxiv (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439483). The harvester of


clinical trials from https://clinicaltrials.gov is available at https://github.com/outbreak-info/clinical_trials (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439505). The harvester of COVID-19 LST level of evidence ratings is available at https://github.com/outbreak-info/covid19_LST_reports (version of the code


used in this paper is available at https://doi.org/10.5281/zenodo.7439527). The COVID-19 LST annotations code is available at https://github.com/outbreak-info/covid19_LST_annotations


(version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439515).The COVID-19 LST report data are available at


https://github.com/outbreak-info/covid19_LST_report_data (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439521). The harvester for manually curated


metadata from the DDE is available at https://github.com/biothings/discovery-app/blob/master/scripts/outbreak.py) (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439590). The harvester from Figshare COVID-19 is available at https://github.com/outbreak-info/covid_figshare (version of the code used in this paper is


available at https://doi.org/10.5281/zenodo.7439543). The harvester for COVID-19 collection of Harvard Dataverse is available at https://github.com/outbreak-info/dataverses (version of the


code used in this paper is available at https://doi.org/10.5281/zenodo.7439563). The harvester for analyses by Imperial College London is available at


https://github.com/outbreak-info/covid_imperial_college (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439545). The LitCovid publication harvester is


available at https://github.com/outbreak-info/litcovid (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439565). The harvester of metadata for


SARS-CoV-2 structures from the Protein Data Bank is available at https://github.com/outbreak-info/covid_pdb_datasets (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439549). The harvester of protocol metadata from protocols.io is available at https://github.com/outbreak-info/protocolsio (version of the code used in this


paper is available at https://doi.org/10.5281/zenodo.7439579). The harvester of clinical trials from WHO ICTR is available at


https://github.com/outbreak-info/covid_who_clinical_trials/blob/master/parser.py (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439553). The reusable


Research Library schemas for publications, datasets, clinical trials, protocols and analyses and associated data mappings are available at


https://github.com/outbreak-info/outbreak.info-resources (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439569). The reusable Research Library tools


for parsers are available at https://github.com/outbreak-info/outbreak_parser_tools (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439577). The code


to look up Altmetric ratings for outbreak.info resources is available at https://github.com/outbreak-info/covid_altmetrics (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439533). The code to match preprints to their peer-reviewed publications is available at https://github.com/outbreak-info/outbreak_preprint_matcher (version


of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439581). The machine learning topic classification of categories within the Research Library is available at


https://github.com/outbreak-info/topic_classifier (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439573). The mapping logic used to classify clinical


trial records using clinical trial-specific metadata is available at https://github.com/gtsueng/outbreak_CT_classifier (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7442988). The evaluation of citizen scientist efforts is available at https://github.com/gtsueng/curate_outbreak_data (version of the code used in this paper


is available at https://doi.org/10.5281/zenodo.7442949). The code to generate the figures within this text, including for the case study, is available at


https://github.com/outbreak-info/outbreak-resources-paper (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439567).

REFERENCES

* _Novel Coronavirus (2019-nCoV): Situation Report, 1_ (WHO, 2020); https://apps.who.int/iris/handle/10665/330760
* Dong, E. et al. An interactive web-based dashboard to track COVID-19 in real time. _Lancet Infect. Dis._ 20, 533–534 (2020).
* Kaiser, J. ‘Every day is a new surprise.’ Inside the effort to produce the world’s most popular coronavirus tracker. _Science_ https://doi.org/10.1126/science.abc1085 (2020).
* Noren, L. E. et al. _Institutional Response to COVID_ https://docs.google.com/spreadsheets/d/1IbF_wlmldVssG5spcmNE82nR9btcbF7rUlEqtcXW03o/edit#gid=0 (2020).
* Morris, A. & citizen scientists. _USA COVID-19 K-12 School Closures, Quarantines, and/or Deaths_ https://docs.google.com/spreadsheets/d/e/2PACX-1vQSD9mm5HTXhxAiHabZA6BPUByWBlP5HZ2jfOPEeGZkMB0ZFsmFBL5orqjIq22mjFNZ7n-11ObCylGn/pubhtml?fbclid=IwAR2tJ8yDVehGpxoP97Cco5HYAxoN014opwwm6uYt4s3E2xDr_8u9KF_LlgI# (2020).
* James, P. & citizen scientists. Staying home club. _GitHub_ https://github.com/phildini/stayinghomeclub (2020).
* Pogkas, D. et al. The airlines halting flights as virus outbreak spreads. _Bloomberg_ https://www.bloomberg.com/graphics/2020-china-coronavirus-airlines-business-effects/ (2020).
* Joachimiak, M. et al. _SARS-COV-2 and COVID-19 Datasets_ https://docs.google.com/spreadsheets/d/1eMhot7MjusyM7_2IBnzqi7RlzWWoYnfheWhMgDlPToQ/edit#gid=0 (2020).
* Skenderi, J. et al. _COVID-19 Resource Library_ https://docs.google.com/spreadsheets/u/2/d/1cqxDAg4jMHXI6gHOnoV8HqDdRHnmxEJRl-bhhpe1HEo/htmlview# (2020).
* Navarro, C. & Capdarest-Arest, N. _COVID-19 Open Dataset Sources_ https://docs.google.com/spreadsheets/d/10t3vtULr3nTz7mrlKj0rldUys47wsIfOVReHnx3Xu18/edit#gid=0 (2020).
* NIH OPA. _iSearch COVID-19 Portfolio_ (NIH, 2020); https://icite.od.nih.gov/covid19/search
* Allen Institute for AI. COVID-19 Open Research Dataset Challenge (CORD-19). _Kaggle_ https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge (2020).
* Chen, Q. et al. LitCovid: an open database of COVID-19 literature. _Nucleic Acids Res._ 49, D1534–D1540 (2020).
* ClinicalTrials.gov. _Protocol Record Schema—XML Schema for Electronic Transfer of Protocol Information into the ClinicalTrials.gov Protocol Registration System_ (National Library of Medicine, 2018); https://prsinfo.clinicaltrials.gov/ProtocolRecordSchema.xsd
* Fava, I. et al. Coronavirus disease research community—COVID-19. _Zenodo_ https://zenodo.org/communities/covid-19/?page=1&size=20 (2020).
* Hyndman, A. A Figshare COVID-19 research publishing portal. _Figshare_ https://figshare.com/blog/A_Figshare_COVID-19_Research_Publishing_Portal/558 (2020).
* European Organization for Nuclear Research. Zenodo FAIR principles. _Zenodo_ https://about.zenodo.org/principles/ (2013).
* Hahnel, M. What Google dataset search means for academia. _Figshare_ https://figshare.com/blog/What_Google_Dataset_Search_means_for_academia/422 (2018).
* Birkin, L. J. et al. Citizen science in the time of COVID-19. _Thorax_ 76, 636–637 (2021).
* Rah, J. et al. COVID-19 Literature Surveillance Team. _Internet Archive_ https://web.archive.org/web/20211020140102;https://www.covid19lst.org/copy-of-about (2020).
* Tsueng, G. et al. Applying citizen science to gene, drug and disease relationship extraction from biomedical abstracts. _Bioinformatics_ 36, 1226–1233 (2020).
* Blickhan, S. et al. Transforming research (and public engagement) through citizen science. _Proc. Int. Astron. Union_ 14, 518–523 (2018).
* Digital Science. About us. _Altmetric_ https://www.altmetric.com/about-us/ (2022).
* Gangavarapu, K. et al. Outbreak.info: real-time surveillance of SARS-CoV-2 mutations and variants. _Nat. Methods_ https://doi.org/10.1038/s41592-023-01769-3 (2023).
* Haag, E. User stories. Outbreak.info blog. _Sulab_ https://blog.outbreak.info/?tag=user_stories (2022).
* Valentine, D. & RADx. SearchOutbreak. Radical data coordination center. _Netlify_ https://searchoutbreak.netlify.app (2021).
* Cano, M. et al. Schema Playground: a tool for authoring, extending, and using metadata schemas to improve FAIRness of biomedical data. Preprint at _bioRxiv_ https://doi.org/10.1101/2021.09.02.458726 (2021).
* Lelong, S. et al. BioThings SDK: a toolkit for building high-performance data APIs in biomedical research. _Bioinformatics_ 38, 2077–2079 (2021).
* BioMedical Informatics Coordinating Committee. _Data Sharing Resources_ (NIH, 2020); https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
* _Open Data at NSF_ (National Science Foundation, 2013); https://www.nsf.gov/data/
* Imperial College COVID-19 Response Team. _ONS Excess Deaths_ (Imperial College London, 2021); http://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/covid-19/covid-19-reports/
* _Controlling Relevance. Elasticsearch: the Definitive Guide [2.x]_ (Elasticsearch B.V., 2023); https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-relevance.html
* _Lucene’s Practical Scoring Function. Elasticsearch: the Definitive Guide [2.x]_ (Elasticsearch B.V., 2023); https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html
* Tsueng, G. et al. Citizen science for mining the biomedical literature. _Citiz. Sci._ 1, 14 (2016).
* _COVID-19 SARS-CoV-2_ (medRxiv and bioRxiv, 2021); https://connect.biorxiv.org/relate/content/181
* Rambaut, A. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. _Nat. Microbiol._ 5, 1403–1407 (2020).

ACKNOWLEDGEMENTS

We thank J. Rah, B.J. Enright, J. Doroshenko, T. Nishath and the rest of the COVID-19 LST


for allowing us to share their work. We thank T. Adams and C. Lazarchick for their work in identifying metadata from various individual datasets and their extensive feedback. We thank S.


Andarmani for her suggestions and feedback on dataset categories. We thank all Outbreak Curators contributors found at https://blog.outbreak.info/dataset-topic-category-contributors for


taking the time to categorize datasets. We thank S. Ul-Hasan for their feedback on the R package. We thank D. Valentine for sharing details about his netlify app as part of the RADx-Rad Data


Coordination Center, which is funded by the NIH (U24LM013755). Work on outbreak.info was supported by the National Institute for Allergy and Infectious Diseases (5 U19 AI135995: G.T.,


J.L.M., M.A., M.C., E. Haag, A.A.L., E. Hufbauer, M.Z., K.G.A., C.W., A.I.S., K.G., L.D.H.; 3 U19 AI135995-04S3: G.T., J.L.M., E. Haag, E. Hufbauer, K.G.A., C.W., A.I.S., K.G., L.D.H.; 3 U19


AI135995-03S2: G.T., J.L.M., E. Haag, E. Hufbauer, K.G.A., C.W., A.I.S., K.G., L.D.H.; 75N91019D00024: G.T., E. Haag, J.L., D.J.W., C.W., A.I.S., L.D.H.), the National Center for Advancing


Translational Sciences (5 U24 TR002306: G.T., J.L.M., M.C., C.W., A.I.S., L.D.H.), the Centers for Disease Control and Prevention (75D30120C09795: M.A., A.A.L., M.Z., K.G.A., K.G.) and the


National Institute of General Medical Sciences (R01GM083924: G.T., M.C., X.Z., Z.Q., C.W., A.I.S.).

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Department of Integrative, Structural and


Computational Biology, the Scripps Research Institute, La Jolla, CA, USA Ginger Tsueng, Julia L. Mullen, Marco Cano, Emily Haag, Jason Lin, Dylan J. Welzel, Xinghua Zhou, Zhongchao Qian, 


Chunlei Wu, Andrew I. Su & Laura D. Hughes * Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA Manar Alkuzweny * Department of Immunology and Microbiology,


the Scripps Research Institute, La Jolla, CA, USA Manar Alkuzweny, Alaa Abdel Latif, Emory Hufbauer, Mark Zeller, Kristian G. Andersen & Karthik Gangavarapu * Ocuvera, Lincoln, NE, USA


Benjamin Rush * Scripps Research Translational Institute, La Jolla, CA, USA Kristian G. Andersen, Chunlei Wu & Andrew I. Su * Department of Molecular Medicine, the Scripps Research


Institute, La Jolla, CA, USA Chunlei Wu & Andrew I. Su * Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA Karthik


Gangavarapu

AUTHORS

Ginger Tsueng, Julia L. Mullen, Manar Alkuzweny, Marco Cano, Benjamin Rush, Emily Haag, Jason Lin, Dylan J. Welzel, Xinghua Zhou, Zhongchao Qian, Alaa Abdel Latif, Emory Hufbauer, Mark Zeller, Kristian G. Andersen, Chunlei Wu, Andrew I. Su, Karthik Gangavarapu & Laura D. Hughes

CONTRIBUTIONS

L.D.H., K.G., M.C., E.


Haag, J.L.M., X.Z., Z.Q., E. Hufbauer, C.W., A.I.S., K.G.A., A.A.L., M.Z., G.T., J.L. and D.J.W. contributed to the design, construction and/or maintenance of the outbreak.info website and


data pipelines. K.G., M.A. and L.D.H. designed and built the R outbreak.info package. M.A., A.A.L., K.G., E. Haag, E. Hufbauer, M.Z., K.G.A. and L.D.H. designed and linked the variant


reports. L.D.H., J.L.M., G.T. and M.C. developed the schemas. E. Haag performed the usability studies. B.R. developed the curation app. L.D.H., G.T., E. Haag, K.G., M.Z. and J.L.M.


contributed to writing and editing the manuscript.

CORRESPONDING AUTHORS

Correspondence to Ginger Tsueng or Laura D. Hughes.

ETHICS DECLARATIONS

COMPETING INTERESTS

K.G.A. has received


consulting fees and/or compensated expert testimony on SARS-CoV-2 and the COVID-19 pandemic. The remaining authors declare no competing interests.

PEER REVIEW

PEER REVIEW INFORMATION

_Nature


Methods_ thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the _Nature Methods_ team. Peer


reviewer reports are available.

ADDITIONAL INFORMATION

PUBLISHER’S NOTE

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


SUPPLEMENTARY INFORMATION

Supplementary Information (Supplementary Figs. 1–8), Reporting Summary, Peer Review File and Supplementary Tables 1–5.

RIGHTS AND PERMISSIONS


Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author


self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

ABOUT THIS ARTICLE


CITE THIS ARTICLE

Tsueng, G., Mullen, J.L., Alkuzweny, M. _et al._ Outbreak.info Research Library: a standardized, searchable platform to discover and explore COVID-19 resources. _Nat


Methods_ 20, 536–540 (2023). https://doi.org/10.1038/s41592-023-01770-w

Received: 03 June 2022. Accepted: 17 January 2023. Published: 23 February 2023. Issue Date: April 2023. DOI: https://doi.org/10.1038/s41592-023-01770-w