Outbreak.info Research Library: a standardized, searchable platform to discover and explore COVID-19 resources


ABSTRACT

Outbreak.info Research Library is a standardized, searchable interface of coronavirus disease 2019 (COVID-19) and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)


publications, clinical trials, datasets, protocols and other resources, built with a reusable framework. We developed a rigorous schema to enforce consistency across different sources and


resource types and linked related resources. Researchers can quickly search the latest research across data repositories, regardless of resource type or repository location, via a search


interface, public application programming interface (API) and R package.

MAIN

In January 2020, SARS-CoV-2 was


identified as the virus responsible for a series of pneumonia cases with unknown origin1. As the virus spread globally, the scientific community rapidly released research outputs (such as


publications, clinical trials and datasets) and resources (websites, portals and more). The frequently uncoordinated generation and curation of resources exacerbated four challenges in


finding and using them: volume, fragmentation, variety and standardization (Supplementary Fig. 1). While many specialized websites were developed independently2,3,4,5,6,7, the lack of a centralized and standardized repository for finding COVID-19 research has limited researchers’ ability to discover these resources and translate them into insights about the virus. To address the fragmented


research landscape, individual and community efforts created shared Google spreadsheets8,9,10 to aid in discoverability, but these efforts were not scalable and often lacked metadata to


promote findability (aside from Navarro and Capdarest-Arest10). Several projects attempted to address the volume and fragmentation issues through large-scale aggregation but failed to tackle


variety, focusing on a single resource type such as publications11,12. Even within a particular type of resource, standardization issues abound. Repositories pivoted quickly to curate


COVID-19 content from their collections using pre-existing metadata standards but were often not interoperable with other sources. For example, PubMed created LitCovid13 based on their


MEDLINE standards, and the National Clinical Trials Registry cataloged COVID-19 clinical trials using their schema14, but the World Health Organization (WHO) International Clinical Trials


Registry uses different conventions. Similarly, Zenodo15 and Figshare16 do not agree on the marginality, cardinality and property names17,18, despite compatibility with the standards of


https://schema.org. We address issues in metadata volume, variety, standardization and fragmentation by creating a single searchable index of COVID-19 publications, clinical trials, datasets


and more: the outbreak.info Research Library. To address variety and standardization, we developed a harmonized schema based on https://schema.org, a framework standardizing metadata across


the internet. Using this schema, we harvested and harmonized metadata from 16 resources (Fig. 1a). Daily updates ensure that site users have up-to-date information, essential amid a


constantly changing research landscape. Next, to address volume and fragmentation, we developed a web-based search portal for researchers to browse across the centralized and standardized


resources (https://outbreak.info/resources) and an API to access and analyze information en masse (https://api.outbreak.info). Within the search interface, users can search, filter and view


related records and share the associated metadata to easily query across resource repositories and types. For instance, a single query (for example, ‘Delta variant’) to our API can return


relevant publications, datasets, clinical trials and more (Fig. 1b), and the Research Library summarizes the search results in visualizations to promote exploration.
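As a rough illustration of this kind of cross-resource query, the following Python sketch calls the public resources API. It assumes a BioThings-style query endpoint under api.outbreak.info/resources (here /resources/query) and standard parameters, so the endpoint name and parameters should be treated as illustrative rather than a documented client.

    # Hedged sketch: search the Research Library for 'Delta variant' across all
    # resource types via the public API; endpoint and parameters are assumptions.
    import requests

    def search_resources(query, size=10):
        """Return the top `size` harmonized records matching `query`."""
        resp = requests.get(
            "https://api.outbreak.info/resources/query",
            params={"q": query, "size": size},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json().get("hits", [])

    for hit in search_resources("Delta variant"):
        print(hit.get("@type"), "|", hit.get("name"))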


For instance, the histogram in Fig. 1b indicates that the number of resources mentioning ‘Delta variant’ began growing in mid-2021 and declined in the summer of 2022, and the donut charts show that LitCovid


is the dominant source. To ensure ease of use of our Research Library, we conducted usability studies and iteratively improved our site (Supplementary Fig. 2). To further address


fragmentation and maintenance issues, we use modular infrastructure, allowing easy addition of new data sources, including community contributions. Citizen scientists have played an active


role in data collection19 (https://covidsample.org/) and accessibility12,20 throughout the pandemic. Given the highly fragmented, diffuse and frequently changing nature inherent to


biomedical research, we built in three mechanisms to expand the Research Library through community participation (Supplementary Fig. 3a). First, contributors can submit individual or


multiple datasets via an online form that ensures that the curated metadata conform to our schema. Second, leveraging the benefits of human curation, metadata contributed by the community via


the form can be exhaustively detailed (Supplementary Fig. 3b) and can further be augmented through pull requests on GitHub. Lastly, anyone with Python coding skills can submit collections


of standardized datasets, publications and other resources to the Outbreak Resources API by contributing a resource parser. Our community-contribution pipeline allows us to integrate the


uncoordinated data-curation efforts, which were particularly apparent at the start of the pandemic, quickly and flexibly (Supplementary Fig. 4). To support resource exploration and interpretation, we


added properties (value-added metadata) to every class in our schema that would support searching, filtering and browsing (topicCategories, Supplementary Fig. 5a); linkage and exploration


(correction, citedBy, isBasedOn, isRelatedTo; Supplementary Fig. 5b); and interpretation (qualitative evaluations) of resources. We selected these properties based on pre-existing citizen


science- and resource-curation activities, suggesting their value in promoting discoverability. For example, citizen scientists categorized resources in their lists or collections by type


(Dataset, ClinicalTrials, etc.)10 or by area of research (epidemiological, prevention, etc.)20, as they found these classifications helpful for searching, filtering and browsing


their lists or collections. Given the ability of citizen scientists to perform information extraction21 and their immense contributions to classification tasks22, we incorporated citizen


science contributions into the training data for classifying resources into topic categories. Citizen scientists also provided Oxford 2011 Levels of Evidence annotations to improve resource


interpretability (that is, understanding the credibility or quality of the resource)20. To further enable assessment of the quality of a resource, we leveraged Digital Science’s Altmetric


ratings23. Finally, we integrated resources with the analyses that we developed to track SARS-CoV-2 variants of concern (VOCs)24, sets of mutations within the virus associated with increased


transmissibility, virulence and/or immune evasion. Researchers can seamlessly traverse from a specific variant report such as Omicron to resources in the Research Library that help


understand its behavior (Supplementary Fig. 5c), and variant searches are among our most commonly queried terms (Supplementary Table 1a). Without a centralized search interface with linked


records such as outbreak.info, a similar attempt to explore resources would require extensive manual searching from multiple different sites (Supplementary Fig. 6), each with their own


interfaces and corresponding search capabilities. To demonstrate the unique features of the outbreak.info Research Library, we explored the dynamics of research into SARS-CoV-2 variants over


time to address two key questions: (1) how has the research community responded to the emergence of new variants and (2) how has that response changed over time? We extracted research


related to variants in the Research Library using the query ‘variant OR lineage’, allowing us to query metadata from 16 sources of different research types simultaneously (Fig. 2a). Over


10,000 separate entries about variants are within the Library as of October 2022, including publications, datasets, clinical trials, protocols and more. Using filters and the quality metrics


provided through Altmetric badges, we quickly identified which results have been recognized by the community via Altmetric scores, such as a quantitative PCR protocol with reverse


transcription (RT–qPCR) to screen VOCs (Fig. 2b). Clearly, variants are an active area of research, but has this enthusiasm changed over time? Using the outbreak.info R package, we accessed


the harmonized metadata to examine the proportion of research related to variants in the Research Library over time. We observed an increase in research on variants following the first


identification of VOCs such as Alpha (B.1.1.7*) and Beta (B.1.351*) (Fig. 2c). This increase was even more prominent for the Omicron (B.1.1.529*) variant in late 2021; we hypothesize that


this increase was due to the heightened awareness of the value in studying variants among the scientific community, and early indications that the variant could be of global concern (high


growth rate of Omicron and the presence of many mutations in important sites). To examine how research differed by VOC over time, we constructed queries for each VOC, including its Pango


lineage name and associated sublineages. With the three VOCs that became the dominant worldwide form of SARS-CoV-2 (Alpha, Delta and Omicron), we find that the increase in research on these


VOCs mirrors the rise in worldwide prevalence for each variant, with the research output roughly proportional to global prevalence (Fig. 2d). With Alpha and Delta, there was a slight lag in


research publications that was not observed with Omicron, and research on Omicron over the last 10 months has dwarfed that for the other VOCs. Lastly, research on previously circulating


variants (Alpha, Beta, Gamma, Delta) continues, even though these variants are rarely detected presently, and focuses on retrospective analyses, fundamental studies on mechanisms of action,


Omicron comparisons and studies of recombinant variants. In sum, the research community’s response to the emergence of new variants has been robust, has become a greater focus of overall


research effort over the last year and quickly pivots to studying the dominant variant. The outbreak.info Research Library and resources API have been widely used by the external community,


including journalists, members of the medical and public health communities, students and biomedical researchers25. For instance, the RADx-Rad Data Coordination Center created the


SearchOutbreak app (https://searchoutbreak.netlify.app), which uses the Outbreak API to collect articles for customized research digests for its partners26. On average, the Research Library


receives nearly 3,000 pageviews per month, of which 85% are unique visitors (Supplementary Table 1b). The Research Library site has been used for over 11,000 unique searches, and the


Research Library API receives an average of nearly 63,000 unique hits per month (including web traffic and programmatic access). Some limitations of the Research Library include incomplete


or unstructured metadata descriptions provided by the sources and the difficulty of optimally querying these descriptions, which often include acronyms and synonyms. Future work will focus on augmenting the


harvested metadata and optimizing search results to provide the most salient results to users. While the unprecedented amount of research on COVID-19 offers new opportunities to accelerate


the pace of research, the difficulty in finding research amid this ‘infodemic’ remains a fundamental challenge. In the outbreak.info Research Library, we address many of these challenges to


assemble a collection of heterogeneous research outputs and data from distributed data sources into a searchable platform. Our metadata-processing platform is modular, allowing easy


extension to add new metadata sources, including contributions from the community, so that the Research Library can grow with the pandemic as research changes. To enable further analysis, we


enable programmatic access to the standardized library. Lastly, as open science is increasingly embraced and research outputs are stored in decentralized sources, quickly finding information will be critical for the next


pandemic. Our approach to unify metadata across repositories will serve as a template for rapidly creating a unified search interface to aggregate research outputs for any pathogen or any


research domain.

METHODS

SCHEMA DEVELOPMENT

The development of the schema for standardizing our collection of resources is as previously described27. Briefly, we prioritized six classes of


resources that had seen a rapid expansion at the start of the pandemic due to their importance to the research community: publications, datasets, clinical trials, analyses, protocols and


computational tools. We identified the most closely related classes from https://schema.org and mapped their properties to available metadata from two to five of the most prolific sources.


Additionally, we identified subclasses that were needed to support the aforementioned six classes and standardized the properties within each class. In addition to standardizing


ready-to-harvest metadata, we created new properties that would support the linkage, exploration and evaluation of our resources. Our schema was then refined as we iterated through the


available metadata when assembling COVID-19 resources. For example, publication providers such as PubMed typically use the ‘author’ property in their metadata, while dataset providers such


as Figshare and Zenodo are compliant with the DataCite schema and typically prefer ‘creator’. Although both properties are valid for their respective https://schema.org classes, we


normalized our schema to use ‘author’ for all six of our classes (Dataset, ClinicalTrial, Analysis, Protocol, Publication, ComputationalTool), because we expected the volume of publications


to dwarf that of all other classes of resources.
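As a small illustration of this normalization step, the sketch below maps DataCite-style ‘creator’ entries onto the schema’s ‘author’ property; the record structure and field contents are assumptions for illustration, not the exact harvested payloads.

    # Hedged sketch of normalizing DataCite-style 'creator' metadata (e.g., from
    # Figshare or Zenodo) onto the 'author' property used across all six classes.
    def normalize_authors(record):
        """Rename 'creator' to 'author', coercing entries to Person objects."""
        if "creator" in record and "author" not in record:
            creators = record.pop("creator")
            record["author"] = [
                {"@type": "Person", "name": c.get("name", "") if isinstance(c, dict) else c}
                for c in creators
            ]
        return record

    print(normalize_authors({"creator": [{"name": "Doe, J."}], "name": "Example dataset"}))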


We added this schema to the Schema Registry of the Data Discovery Engine (DDE)27, a project to share and reuse schemas and register datasets according to a particular schema. The Outbreak schema is available at https://discovery.biothings.io/view/outbreak.

ASSEMBLY OF COVID-19 RESOURCES

The resource metadata pipeline for


outbreak.info includes two ways to ingest metadata (Supplementary Fig. 7). First, metadata can be ingested from other resource repositories or collections using the BioThings SDK28 data


plugins. By leveraging the BioThings SDK, we developed a technology stack that addresses the fragmentation issue by easily integrating metadata from different pre-existing resources. For


each resource repository or collection, a parser or data plugin enables automated import and updates from that resource. To import the data, the metadata is harvested from the source using


API calls (if available), HTML web scraping or .CSV or .TXT tables of metadata. All structured metadata provided by the sources is compiled and mapped to our schema using custom Python


scripts. The harmonized metadata is dumped into a JSON output.
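The sketch below shows the general shape of such a parser: a generator that reads a source dump and yields one schema-conformant JSON document per record. The function name, source file and column names are hypothetical; the real data plugins live in the outbreak-info GitHub organization.

    # Hedged sketch of a BioThings SDK-style data plugin parser that yields one
    # harmonized document per record; file and column names are hypothetical.
    import csv
    import os

    def load_annotations(data_folder):
        """Yield schema-conformant documents from a hypothetical CSV metadata dump."""
        source_file = os.path.join(data_folder, "records.csv")
        with open(source_file, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                yield {
                    "_id": row["identifier"],   # unique key used by the merged index
                    "@type": "Dataset",         # one of the six schema classes
                    "name": row["title"],
                    "description": row.get("abstract", ""),
                    "author": [
                        {"@type": "Person", "name": a.strip()}
                        for a in row.get("authors", "").split(";")
                        if a.strip()
                    ],
                    "curatedBy": {"name": "example-source", "url": row.get("url")},
                }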


Supplementary Fig. 8 shows the completeness of each metadata property within our schema, broken down by resource type (data are provided in Supplementary Table 2). Data plugin code for the sources is available at https://github.com/outbreak-info (Code availability). In the second mechanism, metadata for individually


curated resources can be submitted via an online form through the DDE Metadata Registry27. To assemble the outbreak.info collection of resources, we collected a list of over a hundred


separate resources on COVID-19 and SARS-CoV-2. This list (Supplementary Table 3) included generalist open data repositories, biomedical-specific data projects including those recommended by


the NIH29 and the NSF30 to house open data and individual websites that we came across through search engines and other COVID-19 publications. Prioritizing those resources that had a large


number of resources related to COVID-19, we selected an initial set of two to three sources per resource type to import into our collection. Given the lack of widespread repositories for


analysis resources, only one source was included in our initial import (Imperial College London31). An analysis resource is defined as a frequently updated, web-based data


visualization, data interpretation and/or data analysis resource.

CREATION OF THE RESEARCH LIBRARY API AND QUERY INTERFACE

To accommodate a large number of heterogeneous data sources, each


of which is independently harvested, we used the BioThings SDK framework to merge the data sources into a single, publicly searchable index (Supplementary Fig. 7). The JSON outputs of our


data plugins are ingested by the BioThings framework and merged into an intermediary MongoDB database, and the processed data are indexed in an Elasticsearch index that can be accessed


through our public API (api.outbreak.info). The BioThings SDK plugin architecture handles errors in individual parsers without affecting the availability of the API itself. Errors thrown by


individual parsers may result in a lack of updates of an individual resource until the error is resolved, but the API will serve the latest version of data from the broken parser and


up-to-date data from all functional parsers, which will continue to be updated independently. Using the plugin architecture also allows the creation and maintenance of individual resource


parsers to be crowdsourced to anyone with basic Python knowledge and a GitHub account. Although resource plugins allow outbreak.info to ingest large amounts of standardized metadata, there


are still many individual datasets and research outputs scattered throughout the web that are not located in large repositories. As it is not feasible for one team to locate, identify and


collect standardized metadata from these individual datasets and research outputs, we leveraged the DDE27 to enable crowdsourcing and citizen science participation in the curation of


individual resource metadata. A Tornado server is used to create an API endpoint, api.outbreak.info/resources, that leverages the search capabilities of Elasticsearch to efficiently query


data. Elasticsearch sorts search results by relevance based on Lucene’s Practical Scoring Function32, which combines the query normalization factor, coordination factor,


term frequency, inverse document frequency and any custom query-boosting fields selected by the user33. To adjust this behavior based on common search patterns, we upweighted queries for


which the search term occurs in the name field and/or the name of a clinical trial therapeutic intervention (for example, ‘remdesivir’) with the following parameters: weight of 4 for ‘name’


and 3 for ‘interventions.name’. We continue to monitor common query patterns using our analytics to refine the scoring algorithm to improve the list of results for the user.
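To make the boosting concrete, the sketch below expresses those weights in the standard Elasticsearch query DSL and posts the query to a hypothetical local index. The Tornado handler may construct the production query differently, so this illustrates the ‘field^boost’ idea rather than the exact server-side query.

    # Hedged sketch: Elasticsearch query with the field boosts described above
    # (name^4, interventions.name^3), posted to a hypothetical local index.
    import requests

    query_body = {
        "query": {
            "query_string": {
                "query": "remdesivir",
                "fields": ["name^4", "interventions.name^3", "description"],
                "default_operator": "AND",  # terms are combined with AND by default
            }
        }
    }
    resp = requests.post(
        "http://localhost:9200/outbreak-resources/_search",  # hypothetical index name
        json=query_body,
        timeout=30,
    )
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("name"))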


Within the web interface, the user has the option to sort by best match (relevance score), by the document’s update date or alphabetically by name. Within search queries, terms are automatically combined


by ‘AND’. For instance, the search ‘long COVID’ will be interpreted as ‘long AND COVID’. This search will find resources containing both terms, although not necessarily together; the


Elasticsearch default scoring function will first list resources that contain both words together and that frequently mention the terms. Exact phrases can be explicitly declared by


encapsulating the terms in quotes (for example, ‘long COVID’ to search only for the phrase ‘long COVID’). Additionally, terms can be combined by the term ‘OR’ (for example, (Moderna OR


Pfizer) AND (‘side effects’ OR ‘adverse effects’)). Further details on advanced searching behavior are provided in our guide to the outbreak.info R package at


https://outbreak-info.github.io/R-outbreak-info/articles/researchlibrary.html#some-notes-on-constructing-queries. Further optimization will be the subject of future work, based on continuing


analysis of analytic patterns for the most common search queries and filters to promote user-driven design. Additional work will also focus on creating an advanced query builder to make it


easier to combine terms by any combination of ‘AND’, ‘OR’ and ‘NOT’ and to help the user search for exact phrases. To update the API with new data provided by the data sources, the BioThings


Hub schedules daily updates to pull data from upstream sources and add them to the existing index. The BioThings Hub maintains each data source independently, isolating failures if an individual data source pipeline breaks, and it maintains historical data by default, creating automated backups. The code for the server-side application is available at


https://github.com/outbreak-info/outbreak.api (https://doi.org/10.5281/zenodo.7343503).

OUTBREAK.INFO RESEARCH LIBRARY WEB APPLICATION AND METADATA ACCESS

The web application was built using


Vue.js, a model–view–viewmodel JavaScript framework that enables the two-way binding of user interface elements and the underlying data, allowing the user interface to reflect any changes in


underlying data and vice versa. The client-side application uses the high-performance API to interactively perform operations on the database. To iteratively improve the interface, we


conducted usability studies as described in Supplementary Fig. 2. The code for the client-side application is available at https://github.com/outbreak-info/outbreak.info


(https://doi.org/10.5281/zenodo.7343497). To enable programmatic access to all our harmonized metadata collection, all data are available in our API (api.outbreak.info) and can be accessed


through an R package as described by Gangavarapu et al.24 (package website, https://outbreak-info.github.io/R-outbreak-info/; code, https://github.com/outbreak-info/R-outbreak-info,


https://doi.org/10.5281/zenodo.7343501).

COMMUNITY CURATION OF RESOURCE METADATA

Resource plugins such as those used in the assembly of COVID-19 resources do not necessarily have to be built


by our own team. We used the BioThings SDK28 and the DDE27 so that individual resource collections can be added by writing BioThings plugins that conform to our schema. Expanding available


classes of resources can be easily carried out by extending other classes from https://schema.org via the DDE Schema Playground at https://discovery.biothings.io/schema-playground. Community


contributions of resource plugins can be carried out via GitHub. In addition to contributing resource plugins for collections or repositories of metadata, users can enter metadata for


individual resources via the automatic guides created by the DDE. To investigate potential areas of community contribution, we asked two volunteers to inspect 30 individual datasets


sprinkled around the web and collect the metadata for these datasets. We compared the results between the two volunteers, and their combined results were subsequently submitted into the


collection via the DDE’s Outbreak Data Portal Guide at https://discovery.biothings.io/guide/outbreak/dataset. Although limited by the original submission form (Google forms), the raw and


merged responses illustrating the thoroughness of the submissions from the two volunteers can be found at


https://docs.google.com/spreadsheets/d/1q1c400UFIOyXedFf2L81zROVkXi3BWBhU46Ic0cMYsI/edit?usp=sharing. Although both of our volunteers provided values for many of the available metadata


properties (name, description, topicCategories, keywords, etc.), one provided an extensive list of authors. Using the BioThings SDK in conjunction with the DDE allows us to centralize and


leverage individualized curation efforts that often occur at the start of a pandemic. Improvements or updates for manually curated metadata can be submitted via GitHub pull requests.


COMMUNITY CURATION OF SEARCHING, LINKAGE AND EVALUATION METADATA AND SCALING WITH MACHINE LEARNING

In an effort to enable improved searching and filtering, we developed a nested list of


thematic or topic-based categories based on an initial list developed by LitCovid13 with input from the infectious disease research community and volunteer curators. The list consists of 11


broad categories and 24 specific child categories. LitCovid organized publications into eight research areas such as treatments or prevention, but these classifications are not available in


the actual metadata records for each publication. To obtain these classifications from LitCovid, subsetted exports of identifiers were downloaded from LitCovid and then mapped to the


metadata records from PubMed. Whenever possible, sources with thematic categories were mapped to our list of categories to develop a training set for basic binary (in-group–out-group)


classifications based on required metadata fields (title, abstract and/or description). If an already curated training set could not be found for a broad category, it was created


using an iterative process involving term–phrase searching on LitCovid, evaluating the specificity of the results, identifying new search terms by keyword frequency and repeating the


process. To generate training data for classifying resources into specific topic categories, the results from several approaches were combined. These approaches include direct mapping from


LitCovid research areas, keyword mapping from LitCovid, logical mapping from NCT ClinicalTrials metadata, the aforementioned term search iteration and citizen science curation of Zenodo and


Figshare datasets. Details on the logical mapping from NCT ClinicalTrials metadata can be found at https://github.com/gtsueng/outbreak_CT_classifier (https://doi.org/10.5281/zenodo.7442988).


The keyword mapping from LitCovid can be found at https://github.com/outbreak-info/topic_classifier/tree/main/data/keyword and


https://github.com/outbreak-info/topic_classifier/tree/main/data/subtopics/keywords. While positive categorical data were identified via the aforementioned methods, negative controls were


generated by randomly selecting from alternative topics and ensuring no overlap. The categorical data were randomly split into training (80%) and test (20%) sets per test, and five tests


were performed per topic by applying out-of-the-box logistic regression, multinomial naive Bayes and random forest classifiers from scikit-learn. These three algorithms were found to


perform best on this binary classification task using out-of-the-box tests. Topics were only added to the record if all three methods agreed on the classification. The set size and test


results using scikit-learn defaults for each algorithm, topic and subtopic across the five test runs can be found at


https://github.com/outbreak-info/topic_classifier/blob/main/results/in_depth_classifier_test.tsv.
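A minimal sketch of this per-topic procedure is shown below: three out-of-the-box scikit-learn models are trained on an 80/20 split, and the topic is assigned only when all three agree. The TF-IDF features and variable names are assumptions for illustration; the repository’s exact feature extraction may differ.

    # Hedged sketch of the per-topic binary (in-group/out-group) classification:
    # three out-of-the-box models must agree before a topic is assigned.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def assign_topic(texts, labels, new_texts):
        """Train three models on an 80/20 split; label new_texts only on unanimous agreement (label 1 = in-group)."""
        X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2)
        models = [
            make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
            make_pipeline(TfidfVectorizer(), MultinomialNB()),
            make_pipeline(TfidfVectorizer(), RandomForestClassifier()),
        ]
        for model in models:
            model.fit(X_train, y_train)
            print("held-out accuracy:", model.score(X_test, y_test))
        votes = [model.predict(new_texts) for model in models]
        return [all(v[i] == 1 for v in votes) for i in range(len(new_texts))]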


The efforts of our two volunteers suggested that non-experts were capable of thematically categorizing datasets; therefore, we built a simple interface to allow citizen scientists to thematically classify the datasets that were available in our collection at that point in time.


Each dataset was assigned up to five topics by at least three different citizen scientists to ensure quality of the results. Citizen scientists were asked to prioritize specific topic


categories over broader ones. Ninety citizen scientists, recruited via either participation in the Mark2Cure project34 or a Scripps Research summer program, classified 530 datasets pulled from Figshare and Zenodo; recruiting through these channels increased the likelihood of quality submissions and decreased the likelihood of abuse and false information. The citizen science-curation site was


originally hosted at https://curate.outbreak.info. The code for the site can be found at https://github.com/outbreak-info/outbreak.info-resources/tree/master/citsciclassify. The citizen


science classifications can be found at https://github.com/outbreak-info/topic_classifier/blob/main/data/subtopics/curated_training_df.pickle. To evaluate the quality of the citizen


scientist classifications, we first filtered for classifications where at least two or three of the three to five curators agreed on the topic category. We then compared the results of their


classification with predictions by an out-of-the-box algorithm that was trained on LitCovid-classified abstracts. A total of 186 of 530 classifications did not agree and were manually


inspected; in only about 10% of cases (54 of 530) were the citizen scientists’ categorizations worse than the predictions, and, in many cases, the curators provided more precise categorization. Full


details of the evaluation are available at https://github.com/gtsueng/curate_outbreak_data (https://doi.org/10.5281/zenodo.7442949). These classifications have been incorporated into the


appropriate datasets in our collection and have been used to build our models for topic categorization. Basic in-group–out-group classification models were developed for each category using


out-of-the-box logistic regression, multinomial naive Bayes and random forest classifiers available from scikit-learn. The topic classifier can be found at


https://github.com/outbreak-info/topic_classifier (https://doi.org/10.5281/zenodo.7439573). In addition to community curation of topic categorizations, we identified a citizen science


effort, the COVID-19 Literature Surveillance Team (COVID-19 LST), that was evaluating the quality of COVID-19-related literature. The COVID-19 LST consists of medical students (many of whom were in their third or fourth year), practitioners and researchers who evaluate publications on COVID-19 based on the Oxford Levels of Evidence criteria and write bottom-line, up-front


summaries20. With their permission, we integrated their outputs (daily reports or summaries and Levels of Evidence evaluations) into our collection. Although the project has since ended, the


valuable work by this team was integrated without further evaluation due to their background and training. We further integrated our publications by adding structured linkage metadata,


connecting preprints and their peer-reviewed versions. We performed separate Jaccard similarity calculations on the title and/or text and on the authors for preprint (bioRxiv or medRxiv)35 versus


LitCovid publications. We identified thresholds with high precision and low sensitivity and binned the matches into two groups: matched preprint or peer-reviewed publication versus ‘needs


review’. We also leveraged NLM’s pilot preprint program to identify and incorporate additional matches.
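The following sketch shows the basic token-set Jaccard comparison behind that matching; the tokenization and the 0.8 threshold are illustrative assumptions rather than the pipeline’s tuned cutoffs.

    # Hedged sketch of token-set Jaccard similarity used to match preprints to
    # their peer-reviewed versions; the threshold shown is illustrative.
    def jaccard(a, b):
        """Jaccard similarity between the token sets of two strings."""
        set_a, set_b = set(a.lower().split()), set(b.lower().split())
        if not set_a and not set_b:
            return 0.0
        return len(set_a & set_b) / len(set_a | set_b)

    def classify_pair(preprint, pubmed_record, threshold=0.8):
        """Flag a preprint/peer-reviewed pair as matched or needing manual review."""
        title_sim = jaccard(preprint["title"], pubmed_record["title"])
        author_sim = jaccard(preprint["authors"], pubmed_record["authors"])
        return "matched" if min(title_sim, author_sim) >= threshold else "needs review"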


The code used for the preprint matching and the .XLSX file detailing the semi-automated and manual inspection of a sample of 1,500 matches from the results can be found at https://github.com/outbreak-info/outbreak_preprint_matcher


(https://doi.org/10.5281/zenodo.7439581). Briefly, a subsample of 1,500 preprint or peer-reviewed matches were inspected and confirmed to match via the preprint listed within the PubMed


record in the correction field (1,158 matches); manual inspection of preprint records, which listed the peer-reviewed publication (290 matches); and manual inspection of preprint and the


corresponding PubMed record and publication content (52 matches). The inspection confirmed that our threshold cutoff for preprint matching ensured the inclusion of a limited number of the


most accurate matches at the cost of many more potential but lower-quality matches. Expected matches were linked via the correction property in our schema.

CASE STUDY ON VARIANT RESEARCH

To


identify research about variants, we used the keyword phrase ‘variant OR lineage’ in the Research Library and within the R package outbreakinfo. For Fig. 2a, resources were counted by @type


(Publication, Dataset, ComputationalTool, ClinicalTrial, Protocol, Analysis). The number of resources was aggregated to the weekly level by the date of the latest update and normalized to


all resources within the Library for that week, creating a proportion of the Library for that week (Fig. 2c).
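The weekly normalization can be sketched as below with pandas on records pulled from the API; the ‘date’ field name follows the schema’s update date, but the retrieval details are assumptions.

    # Hedged sketch of the weekly aggregation behind Fig. 2c: count variant hits
    # per week and normalize by all Library records for that week.
    import pandas as pd

    def weekly_proportion(variant_hits, all_hits):
        """Proportion of Library records per week that match the variant query."""
        variant = pd.DataFrame(variant_hits)
        total = pd.DataFrame(all_hits)
        for df in (variant, total):
            df["week"] = pd.to_datetime(df["date"]).dt.to_period("W")
        counts = variant.groupby("week").size()
        totals = total.groupby("week").size()
        return (counts / totals).fillna(0)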


For variant-specific queries, the WHO-designated name was combined with its Pango lineage36 plus all descendants, as specified by the Pango team in October 2022 (https://raw.githubusercontent.com/cov-lineages/lineages-website/master/data/lineages.yml). To decrease


the likelihood of a spurious hit for the resource (for instance, a publication mentioning Alpha in the description but focusing only on Omicron), we used fielded queries to only search by


the name of the resource. For instance, for Gamma, the following query was used: name:Gamma OR name:‘P.1’ OR name:‘P.1.2’.
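A small helper for assembling those fielded queries might look like the sketch below; the lineage list is truncated for illustration.

    # Hedged sketch of building the fielded, name-only variant queries described above.
    def variant_query(who_name, lineages):
        """Build a fielded Lucene-style query restricted to the resource name."""
        terms = [f"name:{who_name}"] + [f'name:"{lin}"' for lin in lineages]
        return " OR ".join(terms)

    print(variant_query("Gamma", ["P.1", "P.1.2"]))
    # name:Gamma OR name:"P.1" OR name:"P.1.2"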


Code to replicate the analysis and visualizations is available at https://github.com/outbreak-info/outbreak-resources-paper/blob/main/Figure%204%20-%20Variant%20analysis.R.

HARMONIZATION AND INTEGRATION OF RESOURCES AND GENOMIC DATA

The integration of


genomic data from GISAID is discussed by Gangavarapu et al.24. We built separate API endpoints for our resources (metadata resource API) and genomics (genomic data API) using the BioThings


SDK28. Data are available via our API at http://api.outbreak.info and through our R package as described by Gangavarapu et al.24.

LIMITATIONS

While we have developed a framework for


addressing resource volume, fragmentation and variety that can be applicable to future pandemics, our efforts in building this framework exposed additional limitations in how data and metadata


are currently collected and shared. Researchers have embraced preprints, but resources (especially datasets and computational tools) needed to replicate and extend research results are not


linked in ways that are discoverable. Although many journals and funders have embraced dataset and source-code submission requirements, the publication of datasets and


software code is still heavily based in publications instead of in community repositories with well-described metadata to promote discoverability and reuse. In the outbreak.info Research


Library, the largest research output by far is publications, while dataset submission lags in standardized repositories encouraged by the NIH such as ImmPort, Figshare and Zenodo. We


hypothesize that this disparity between preprint and data sharing reflects the existing incentive structure, in which researchers are rewarded for writing papers and less for providing good,


reusable datasets. Ongoing efforts to improve metadata standardization and encourage schema adoption (such as the efforts in the Bioschemas community) will help make resources more


discoverable in the future, provided researchers adopt and use them. For this uptake to happen, fundamental changes in the incentive structure for sharing research outputs may be necessary.


As with many web-based, open-source resource sites, bugs and browser-compatibility issues may arise without notice for less-popular browsers. Users can bring these issues to our attention by


submitting them to our issue tracker on GitHub (https://github.com/outbreak-info/outbreak.info/issues).

COMPARISON OF THE OUTBREAK.INFO RESEARCH LIBRARY WITH OTHER RESOURCES

To illustrate


how our resource fits into the COVID-19 resource landscape, we compare features from our Research Library with other COVID-19 multisource aggregation efforts (Supplementary Table 4) and


provide a list of terms and features in Supplementary Table 5. We provide the most commonly searched sources (that is, filter by source) and resource types (that is, filter by resource type)


(Supplementary Table 1a). Usage statistics for record views and filtering by source are available in Supplementary Table 1b. Filtering was the most popular feature added to the Library,


with over a quarter of all queries using some sort of filtering (Supplementary Table 1c). Users were most likely to filter results by resource type, followed by keywords and source.


REPORTING SUMMARY

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

DATA AVAILABILITY

All metadata harvested and


harmonized in the outbreak.info Research Library is freely available through an API (http://api.outbreak.info/) and in an associated R package


(https://outbreak-info.github.io/R-outbreak-info/).

CODE AVAILABILITY

All code used to generate the outbreak.info Research Library is freely available on GitHub


(https://github.com/outbreak-info) under open-source licenses. The outbreak.info web application is available at https://github.com/outbreak-info/outbreak.info (version of the code used in


this paper is available at https://doi.org/10.5281/zenodo.7343497). The outbreak.info R package to access all the genomics and epidemiology data and Research Library metadata compiled and


standardized on outbreak.info is available at https://github.com/outbreak-info/R-outbreak-info (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7343501). The code to create the API (https://api.outbreak.info) to access Research Library metadata and case and death data is available at


https://github.com/outbreak-info/outbreak.api (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7343503). The harvester of bioRxiv and medRxiv preprint


publications is available at https://github.com/outbreak-info/biorxiv (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439483). The harvester of


clinical trials from https://clinicaltrials.gov is available at https://github.com/outbreak-info/clinical_trials (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439505). The harvester of COVID-19 LST level of evidence ratings is available at https://github.com/outbreak-info/covid19_LST_reports (version of the code


used in this paper is available at https://doi.org/10.5281/zenodo.7439527). The COVID-19 LST annotations code is available at https://github.com/outbreak-info/covid19_LST_annotations


(version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439515).The COVID-19 LST report data are available at


https://github.com/outbreak-info/covid19_LST_report_data (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439521). The harvester for manually curated


metadata from the DDE is available at https://github.com/biothings/discovery-app/blob/master/scripts/outbreak.py) (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439590). The harvester from Figshare COVID-19 is available at https://github.com/outbreak-info/covid_figshare (version of the code used in this paper is


available at https://doi.org/10.5281/zenodo.7439543). The harvester for COVID-19 collection of Harvard Dataverse is available at https://github.com/outbreak-info/dataverses (version of the


code used in this paper is available at https://doi.org/10.5281/zenodo.7439563). The harvester for analyses by Imperial College London is available at


https://github.com/outbreak-info/covid_imperial_college (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439545). The LitCovid publication harvester is


available at https://github.com/outbreak-info/litcovid (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439565). The harvester of metadata for


SARS-CoV-2 structures from the Protein Data Bank is available at https://github.com/outbreak-info/covid_pdb_datasets (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439549). The harvester of protocol metadata from protocols.io is available at https://github.com/outbreak-info/protocolsio (version of the code used in this


paper is available at https://doi.org/10.5281/zenodo.7439579). The harvester of clinical trials from WHO ICTR is available at


https://github.com/outbreak-info/covid_who_clinical_trials/blob/master/parser.py (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439553). The reusable


Research Library schemas for publications, datasets, clinical trials, protocols and analyses and associated data mappings are available at


https://github.com/outbreak-info/outbreak.info-resources (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439569). The reusable Research Library tools


for parsers are available at https://github.com/outbreak-info/outbreak_parser_tools (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439577). The code


to look up Altmetric ratings for outbreak.info resources is available at https://github.com/outbreak-info/covid_altmetrics (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7439533). The code to match preprints to their peer-reviewed publications is available at https://github.com/outbreak-info/outbreak_preprint_matcher (version


of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439581). The machine learning topic classification of categories within the Research Library is available at


https://github.com/outbreak-info/topic_classifier (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439573). The mapping logic used to classify clinical


trial records using clinical trial-specific metadata is available at https://github.com/gtsueng/outbreak_CT_classifier (version of the code used in this paper is available at


https://doi.org/10.5281/zenodo.7442988). The evaluation of citizen scientist efforts is available at https://github.com/gtsueng/curate_outbreak_data (version of the code used in this paper


is available at https://doi.org/10.5281/zenodo.7442949). The code to generate the figures within this text, including for the case study, is available at


https://github.com/outbreak-info/outbreak-resources-paper (version of the code used in this paper is available at https://doi.org/10.5281/zenodo.7439567).

REFERENCES

* _Novel Coronavirus (2019-nCoV): Situation Report, 1_ (WHO, 2020); https://apps.who.int/iris/handle/10665/330760
* Dong, E. et al. An interactive web-based dashboard to track COVID-19 in real time. _Lancet Infect. Dis._ 20, 533–534 (2020).
* Kaiser, J. ‘Every day is a new surprise.’ Inside the effort to produce the world’s most popular coronavirus tracker. _Science_ https://doi.org/10.1126/science.abc1085 (2020).
* Noren, L. E. et al. _Institutional Response to COVID_ https://docs.google.com/spreadsheets/d/1IbF_wlmldVssG5spcmNE82nR9btcbF7rUlEqtcXW03o/edit#gid=0 (2020).
* Morris, A. & citizen scientists. _USA COVID-19 K-12 School Closures, Quarantines, and/or Deaths_ https://docs.google.com/spreadsheets/d/e/2PACX-1vQSD9mm5HTXhxAiHabZA6BPUByWBlP5HZ2jfOPEeGZkMB0ZFsmFBL5orqjIq22mjFNZ7n-11ObCylGn/pubhtml?fbclid=IwAR2tJ8yDVehGpxoP97Cco5HYAxoN014opwwm6uYt4s3E2xDr_8u9KF_LlgI# (2020).
* James, P. & citizen scientists. Staying home club. _GitHub_ https://github.com/phildini/stayinghomeclub (2020).
* Pogkas, D. et al. The airlines halting flights as virus outbreak spreads. _Bloomberg_ https://www.bloomberg.com/graphics/2020-china-coronavirus-airlines-business-effects/ (2020).
* Joachimiak, M. et al. _SARS-COV-2 and COVID-19 Datasets_ https://docs.google.com/spreadsheets/d/1eMhot7MjusyM7_2IBnzqi7RlzWWoYnfheWhMgDlPToQ/edit#gid=0 (2020).
* Skenderi, J. et al. _COVID-19 Resource Library_ https://docs.google.com/spreadsheets/u/2/d/1cqxDAg4jMHXI6gHOnoV8HqDdRHnmxEJRl-bhhpe1HEo/htmlview# (2020).
* Navarro, C. & Capdarest-Arest, N. _COVID-19 Open Dataset Sources_ https://docs.google.com/spreadsheets/d/10t3vtULr3nTz7mrlKj0rldUys47wsIfOVReHnx3Xu18/edit#gid=0 (2020).
* NIH OPA. _iSearch COVID-19 Portfolio_ (NIH, 2020); https://icite.od.nih.gov/covid19/search
* Allen Institute for AI. COVID-19 Open Research Dataset Challenge (CORD-19). _Kaggle_ https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge (2020).
* Chen, Q. et al. LitCovid: an open database of COVID-19 literature. _Nucleic Acids Res._ 49, D1534–D1540 (2020).
* ClinicalTrials.gov. _Protocol Record Schema—XML Schema for Electronic Transfer of Protocol Information into the ClinicalTrials.gov Protocol Registration System_ (National Library of Medicine, 2018); https://prsinfo.clinicaltrials.gov/ProtocolRecordSchema.xsd
* Fava, I. et al. Coronavirus disease research community—COVID-19. _Zenodo_ https://zenodo.org/communities/covid-19/?page=1&size=20 (2020).
* Hyndman, A. A Figshare COVID-19 research publishing portal. _Figshare_ https://figshare.com/blog/A_Figshare_COVID-19_Research_Publishing_Portal/558 (2020).
* European Organization for Nuclear Research. Zenodo FAIR principles. _Zenodo_ https://about.zenodo.org/principles/ (2013).
* Hahnel, M. What Google dataset search means for academia. _Figshare_ https://figshare.com/blog/What_Google_Dataset_Search_means_for_academia/422 (2018).
* Birkin, L. J. et al. Citizen science in the time of COVID-19. _Thorax_ 76, 636–637 (2021).
* Rah, J. et al. COVID-19 Literature Surveillance Team. _Internet Archive_ https://web.archive.org/web/20211020140102;https://www.covid19lst.org/copy-of-about (2020).
* Tsueng, G. et al. Applying citizen science to gene, drug and disease relationship extraction from biomedical abstracts. _Bioinformatics_ 36, 1226–1233 (2020).
* Blickhan, S. et al. Transforming research (and public engagement) through citizen science. _Proc. Int. Astron. Union_ 14, 518–523 (2018).
* Digital Science. About us. _Altmetric_ https://www.altmetric.com/about-us/ (2022).
* Gangavarapu, K. et al. Outbreak.info: real-time surveillance of SARS-CoV-2 mutations and variants. _Nat. Methods_ https://doi.org/10.1038/s41592-023-01769-3 (2023).
* Haag, E. User stories. Outbreak.info blog. _Sulab_ https://blog.outbreak.info/?tag=user_stories (2022).
* Valentine, D. & RADx. SearchOutbreak. Radical data coordination center. _Netlify_ https://searchoutbreak.netlify.app (2021).
* Cano, M. et al. Schema Playground: a tool for authoring, extending, and using metadata schemas to improve FAIRness of biomedical data. Preprint at _bioRxiv_ https://doi.org/10.1101/2021.09.02.458726 (2021).
* Lelong, S. et al. BioThings SDK: a toolkit for building high-performance data APIs in biomedical research. _Bioinformatics_ 38, 2077–2079 (2021).
* BioMedical Informatics Coordinating Committee. _Data Sharing Resources_ (NIH, 2020); https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
* _Open Data at NSF_ (National Science Foundation, 2013); https://www.nsf.gov/data/
* Imperial College COVID-19 Response Team. _ONS Excess Deaths_ (Imperial College London, 2021); http://www.imperial.ac.uk/mrc-global-infectious-disease-analysis/covid-19/covid-19-reports/
* _Controlling Relevance. Elasticsearch: the Definitive Guide [2.x]_ (Elasticsearch B.V., 2023); https://www.elastic.co/guide/en/elasticsearch/guide/current/controlling-relevance.html
* _Lucene’s Practical Scoring Function. Elasticsearch: the Definitive Guide [2.x]_ (Elasticsearch B.V., 2023); https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html
* Tsueng, G. et al. Citizen science for mining the biomedical literature. _Citiz. Sci._ 1, 14 (2016).
* _COVID-19 SARS-CoV-2_ (medRxiv and bioRxiv, 2021); https://connect.biorxiv.org/relate/content/181
* Rambaut, A. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. _Nat. Microbiol._ 5, 1403–1407 (2020).

ACKNOWLEDGEMENTS

We thank J. Rah, B.J. Enright, J. Doroshenko, T. Nishath and the rest of the COVID-19 LST


for allowing us to share their work. We thank T. Adams and C. Lazarchick for their work in identifying metadata from various individual datasets and their extensive feedback. We thank S.


Andarmani for her suggestions and feedback on dataset categories. We thank all Outbreak Curators contributors found at https://blog.outbreak.info/dataset-topic-category-contributors for


taking the time to categorize datasets. We thank S. Ul-Hasan for their feedback on the R package. We thank D. Valentine for sharing details about his netlify app as part of the RADx-Rad Data


Coordination Center, which is funded by the NIH (U24LM013755). Work on outbreak.info was supported by the National Institute for Allergy and Infectious Diseases (5 U19 AI135995: G.T.,


J.L.M., M.A., M.C., E. Haag, A.A.L., E. Hufbauer, M.Z., K.G.A., C.W., A.I.S., K.G., L.D.H.; 3 U19 AI135995-04S3: G.T., J.L.M., E. Haag, E. Hufbauer, K.G.A., C.W., A.I.S., K.G., L.D.H.; 3 U19


AI135995-03S2: G.T., J.L.M., E. Haag, E. Hufbauer, K.G.A., C.W., A.I.S., K.G., L.D.H.; 75N91019D00024: G.T., E. Haag, J.L., D.J.W., C.W., A.I.S., L.D.H.), the National Center for Advancing


Translational Sciences (5 U24 TR002306: G.T., J.L.M., M.C., C.W., A.I.S., L.D.H.), the Centers for Disease Control and Prevention (75D30120C09795: M.A., A.A.L., M.Z., K.G.A., K.G.) and the


National Institute of General Medical Sciences (R01GM083924: G.T., M.C., X.Z., Z.Q., C.W., A.I.S.).

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Department of Integrative, Structural and


Computational Biology, the Scripps Research Institute, La Jolla, CA, USA Ginger Tsueng, Julia L. Mullen, Marco Cano, Emily Haag, Jason Lin, Dylan J. Welzel, Xinghua Zhou, Zhongchao Qian, 


Chunlei Wu, Andrew I. Su & Laura D. Hughes * Department of Biological Sciences, University of Notre Dame, Notre Dame, IN, USA Manar Alkuzweny * Department of Immunology and Microbiology,


the Scripps Research Institute, La Jolla, CA, USA Manar Alkuzweny, Alaa Abdel Latif, Emory Hufbauer, Mark Zeller, Kristian G. Andersen & Karthik Gangavarapu * Ocuvera, Lincoln, NE, USA


Benjamin Rush * Scripps Research Translational Institute, La Jolla, CA, USA Kristian G. Andersen, Chunlei Wu & Andrew I. Su * Department of Molecular Medicine, the Scripps Research


Institute, La Jolla, CA, USA Chunlei Wu & Andrew I. Su * Department of Human Genetics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, USA Karthik


Gangavarapu

AUTHORS

Ginger Tsueng, Julia L. Mullen, Manar Alkuzweny, Marco Cano, Benjamin Rush, Emily Haag, Jason Lin, Dylan J. Welzel, Xinghua Zhou, Zhongchao Qian, Alaa Abdel Latif, Emory Hufbauer, Mark Zeller, Kristian G. Andersen, Chunlei Wu, Andrew I. Su, Karthik Gangavarapu & Laura D. Hughes

CONTRIBUTIONS

L.D.H., K.G., M.C., E.


Haag, J.L.M., X.Z., Z.Q., E. Hufbauer, C.W., A.I.S., K.G.A., A.A.L., M.Z., G.T., J.L. and D.J.W. contributed to the design, construction and/or maintenance of the outbreak.info website and


data pipelines. K.G., M.A. and L.D.H. designed and built the R outbreak.info package. M.A., A.A.L., K.G., E. Haag, E. Hufbauer, M.Z., K.G.A. and L.D.H. designed and linked the variant


reports. L.D.H., J.L.M., G.T. and M.C. developed the schemas. E. Haag performed the usability studies. B.R. developed the curation app. L.D.H., G.T., E. Haag, K.G., M.Z. and J.L.M.


contributed to writing and editing the manuscript.

CORRESPONDING AUTHORS

Correspondence to Ginger Tsueng or Laura D. Hughes.

ETHICS DECLARATIONS

COMPETING INTERESTS

K.G.A. has received


consulting fees and/or compensated expert testimony on SARS-CoV-2 and the COVID-19 pandemic. The remaining authors declare no competing interests.

PEER REVIEW

PEER REVIEW INFORMATION

_Nature


Methods_ thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the _Nature Methods_ team. Peer


reviewer reports are available.

ADDITIONAL INFORMATION

PUBLISHER’S NOTE

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


SUPPLEMENTARY INFORMATION

Supplementary Information (Supplementary Figs. 1–8), Reporting Summary, Peer Review File and Supplementary Tables 1–5.

RIGHTS AND PERMISSIONS


Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author


self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

ABOUT THIS ARTICLE


CITE THIS ARTICLE

Tsueng, G., Mullen, J.L., Alkuzweny, M. _et al._ Outbreak.info Research Library: a standardized, searchable platform to discover and explore COVID-19 resources. _Nat


Methods_ 20, 536–540 (2023). https://doi.org/10.1038/s41592-023-01770-w

Received: 03 June 2022. Accepted: 17 January 2023. Published: 23 February 2023. Issue Date: April 2023. DOI: https://doi.org/10.1038/s41592-023-01770-w