Rummagene: massive mining of gene sets from supporting materials of biomedical research publications

ABSTRACT Many biomedical research publications contain gene sets in their supporting tables, and these sets are currently not available for search and reuse. By crawling PubMed Central, the

Rummagene server provides access to hundreds of thousands of such mammalian gene sets. So far, we scanned 5,448,589 articles to find 121,237 articles that contain 642,389 gene sets. These

sets are served for enrichment analysis, free text, and table title search. Investigating statistical patterns within the Rummagene database, we demonstrate that Rummagene can be used for

transcription factor and kinase enrichment analyses, and for gene function predictions. By combining gene set similarity with abstract similarity, Rummagene can find surprising relationships

between biological processes, concepts, and named entities. Overall, Rummagene makes it possible to search a massive collection of published biomedical datasets that are

currently buried and inaccessible. The Rummagene web application is available at https://rummagene.com.

INTRODUCTION The introduction of omics technologies has gradually moved biological and biomedical research from studying single genes and proteins towards studying gene sets, clusters of

genes, molecular complexes, and gene expression modules1. Many biomedical and biological research studies produce and publish gene and protein sets. For example, differentially expressed

genes and proteins from transcriptomics and proteomics assays, genes associated with genomic variants identified to be relevant to a phenotype, gene knockouts associated with a cellular or

an organismal phenotype, target genes of transcription factors as determined by ChIP-seq experiments, proteins identified in differential phosphoproteomics, proteins identified in a complex

from immunoprecipitation followed by mass-spectrometry studies, genes associated with a cellular phenotype from CRISPR screens, and many more types of sets can be generated. These gene sets

are highly valuable but not often reused. This lack of reuse is partially because there are no standards for submitting gene sets in publications, and there are no centralized community

repositories for depositing gene and protein sets. As a result, the potentially useful information about gene sets is buried in supporting material tables stored as PDF, Excel, CSV, or Word

file formats. Since general and domain-specific search engines do not index the contents of such supporting materials, there is no way to search through these tables. These supporting tables

are not indexed by search engines because most search engines can only deal with free text and are not capable of parsing data tables. Named entity recognition methods have been widely

applied to biomedical and biological publication text, but not yet to extract gene sets from supporting tables. Manual annotation and extraction of gene sets from publications have

been achieved, but these are time-consuming, labor-intensive, and require domain expertise. Most such efforts miss many relevant studies. For example, to create the ChIP-x Enrichment Analysis

(ChEA) resource we manually extracted gene sets from supporting materials of ChIP-seq studies2,3. While the ChEA database achieved great success, it is difficult to maintain. Efforts such as

ReMap4, Recount5, and ARCHS46 aim to address this challenge by uniformly reprocessing all the raw data available from community repositories to recompute gene sets from published studies,

but such efforts rely on the existence of community repositories and uniform data collection standards. Another effort to automate the extraction of gene sets from publications is Pathway

Figure Optical Character Recognition (PFOCR)7. PFOCR automatically extracts pathways from publications by scanning pathway diagrams. However, surprisingly, as far as we know there are no

publications, databases, or community repositories that contain extracted gene sets from supporting materials of scientific biomedical research publications. Rummagene is a web-based

software application that serves hundreds of thousands of gene sets extracted from publications listed on PubMed Central (PMC). It contains a softbot that scans supporting materials of

publications listed on PMC to keep the resource consistently updated. The Rummagene website provides the ability to search the corpus of gene sets by an input gene set query, a PMC free text

search, and a table title search. To understand the statistical patterns within the Rummagene corpus, we performed various exploratory analyses and demonstrated how this rich

resource of organized biological knowledge can be used for specific applications. RESULTS DESCRIPTIVE STATISTICS The initial version of Rummagene contains 642,389 gene sets extracted from

121,237 articles. These 121,237 articles are identified as containing gene sets from 5,448,589 scanned PMC articles. The distribution of the occurrence of genes in gene sets is not even.

Some genes are found in many sets, but most genes are members of few sets (Fig. 1a). At the same time, most identified gene sets have fewer than one hundred genes in each set (Fig. 1b). While

most publications contributed only one or two gene sets to the Rummagene collection, a few publications contributed a few hundred sets (Fig. 1c). Over the years, more and more

gene sets are found in publications (Fig. 1d). In fact, in the past four years, publications included many more sets compared to sets identified in the 30 years between 1988 and 2018. Since

2005, the average length of gene sets jumped from fewer than 20 genes in each set to ~150 genes in each set (Fig. 1e). This is likely due to the introduction of omics technologies and

publications reporting gene sets identified from such studies. By projecting the gene set content into two dimensions with UMAP8, we see that, on average, short gene sets contain genes that

are more commonly studied (Fig. 1f, g). While this is a general trend, some genes occur in many sets but are less commonly studied (Fig. 1h, i). Specifically, we identified 604 gene sets

that are enriched in understudied genes. Understudied gene sets are defined as gene sets where the median citations per gene is more than 3 standard deviations below the median for citations

observed for randomly assembled gene sets of similar size (Supplementary Data 1). Many of these gene sets are made up of orphan GPCRs and Znf family members. Other sets are

mainly modules of differentially expressed genes. These modules likely serve critical biological roles but are less explored. Next, we noticed that the Rummagene collection of gene

sets has many duplicate entries. In fact, duplicated gene sets make up approximately 15% of the Rummagene gene sets. Many of these duplicate sets are found in the same publication.

Publications with multiple tables often list the same sets but with different measurements or statistics, for example, measuring the expression of a set of genes under different

conditions. We found fewer duplicate gene sets across multiple papers (Fig. 1j, k). ANNOTATED COLLECTIONS OF THEMED GENE SET LIBRARIES From the collection of 642,389 gene sets we can identify

subsets of gene sets associated with specific biological themes such as sets related to kinases, transcription factors, cell types, cell lines and tissues. Such themed gene sets can be used

for specific enrichment analysis tasks such as kinase enrichment analysis9 and transcription factor enrichment analysis2. Producing such subsets of gene sets can be done by simply searching

the table titles for terms that match named entities such as protein kinases or transcription factors. Indeed, we identified 4525 gene sets that contain named human kinases, and 8078 gene

sets that contain named transcription factors in the table titles. These collections of gene sets cover 444 unique kinase names and 1121 unique transcription factor names (Fig. 2a, Supplementary

Data 2, 3). Similarly, we identified 4443 gene sets that contain named cell lines, and 6268 gene sets that contain cell types or tissues in table titles, with 450 and 670 unique terms,

respectively (Fig. 2b). In addition, 5560 sets had the term “down” and 6677 had the term “up” in their table titles (Fig. 2c). These sets likely contain up- and down-regulated genes from

gene expression signatures. A large portion of the identified gene sets contain gene names in their titles. Specifically, 97,478 table titles contain human gene symbols or synonyms (Fig.

2c). For the subset of gene sets containing known transcription factors in their titles, Uniform Manifold Approximation and Projection (UMAP) plots were generated from the inverse document

frequency (IDF) vectors for all gene sets in the subset. Points representing different gene sets are colored by both the PubMed Central ID (PMCID) of the original publication (Fig. 2f), and

by the associated transcription factor (Fig. 2e). We found that these gene sets tend to cluster by transcription factor even when they are derived from different publications. This was

further confirmed to be statistically significant (_t_-test; _p_ < 0.0001) by comparing the average and distribution of the Jaccard index similarities between gene sets mentioning the

same transcription factor from different publications with those of gene sets not mentioning the same TF (Fig. 2d). We also applied the same process to generate UMAP plots for the subset of terms

containing known kinases, and similarly saw that these gene sets clustered by kinase (Fig. 2g) although originating from different PMCIDs (Fig. 2h). This trend was also confirmed

statistically (Fig. 2d). Next, we aimed to assess whether kinase and transcription factor gene set libraries created from Rummagene contain useful information for performing gene set

enrichment analysis. To achieve such an assessment, we queried each gene set from the Rummagene kinase and transcription factor libraries against corresponding kinase and transcription

factor libraries created from multiple sources2,9. We observe a significant recovery of the correct kinases and transcription factors with all libraries, with best agreement observed for

KEA9 for kinases, and ChEA 202210 for transcription factors (Fig. 3a–f). This is likely because these two resources are manual efforts of extracting gene and protein sets from publications,

including data from supporting tables. Compared to KEA and ChEA, the kinase and transcription factor Rummagene libraries are likely more comprehensive and up to date, but less

accurate. TOPIC MODELING To obtain a global view of the contents of the gene sets in Rummagene, we performed latent Dirichlet allocation (LDA) analysis11 on all abstracts from publications

containing at least one extracted gene set. Nine topics were identified and subsequently manually labeled based on the most common terms and their relative weights (Fig. 4a). Some of the

most frequently appearing terms across all topics included gene, cell, expression, DNA, patient, cancer, and analysis. The largest share of abstracts relates to mutations and

variants in diseases, protein-protein interactions, and mechanisms, while the topics with the fewest abstracts are related to immune functions and genome-wide associations and risks. The

visualization of abstracts in topic space also reveals the relation and similarity between topics (Fig. 4b). For instance, the topic mutations and variants in disease borders DNA

transcription and methylation. Additionally, the genome-wide association and risk topic is isolated from the other topic clusters. The data and modeling topic is located adjacent to most of

the other topics, suggesting that abstracts with this topic may be related to a variety of other topics, as expected. Overall, the topic analysis reveals the predominant categories of gene

sets in Rummagene, specifically those concerning mutations and variants in diseases and those concerning protein interactions and functional mechanisms. SIMILAR GENE SET PAIRS THAT ARE

DISTANT IN ABSTRACT SPACE Next, we asked whether the knowledge embedded in Rummagene can lead to the construction of hypotheses by identifying gene sets with high similarity in gene set

space while being completely disjoint in the publication abstract text space. The rationale is that in this way we can identify undiscovered associations between named entities such as

genes and diseases. Surprisingly, we first observed that the pairs of gene sets with the highest similarity at the gene set level, with no similarity at the abstract level, are highly

enriched in proteins that are commonly detected in mass-spectrometry proteomics studies (Fig. 5a), highly expressed in RNA-seq assays (Fig. 5b), but less widely studied (Fig. 5c). This is

likely because proteomics studies tend to commonly report the same abundant, large-size, and “sticky” proteins, transcriptomics studies tend to detect highly expressed

genes as differentially expressed, and gene sets in publications commonly report overlapping genes in pathways and ontology terms containing highly studied genes. After filtering out pairs of gene sets that are proteomics

rich, contain highly expressed genes, or are composed of highly studied genes, we identified a few pairs of sets that contain a gene name in the table title of one set, and a disease name in

the table title of the second set (Supplementary Note). For example, some of the top identified pairs highlight a possible relationship between the proteins identified to interact with

_CLUH_12, and gene sets identified in hypoxia13, melanoma14, and glioma15. This connection is plausible because _CLUH_ was found to be critical to mitochondrial function, which is altered in

these conditions. Similarly, other top overlapping pairs include the _TOPBP1_ interactome16 and a potential relationship to melanoma14, hypoxia17, and teratomas18. To assist in possibly

explaining these connections, we utilized the GPT-4 API, a large language model, to compose hypotheses that suggest how such seemingly unrelated named entities might be in fact related by

giving GPT-4 the two abstracts. For example, when asked about the connection between the gene _CLUH_ and the disease hypoxia, prompted with the abstracts and gene set terms, the LLM

responded with a plausible explanation concerning mitochondrial function, specifically: “Therefore, it is plausible that the _CLUH_ gene may be involved in the adaptive response of SKOV-3

ovarian cancer cells to hypoxia, possibly by regulating the translation and stability of mitochondrial proteins. This could explain the high overlap between the two gene sets. Further

experimental studies would be needed to confirm this hypothesis.” The model successfully determined the cell line used to produce the gene set concerning hypoxia from the abstract provided

and it made a reasonable hypothesis about the relationship between the two gene sets given the dissimilar context of the abstracts. Additionally, when asking the LLM about the connection

between the gene sets with _TOPBP1_ and teratomas in their column names, using the two abstracts associated with these gene sets, the LLM produced a plausible explanation about their

similarity after stating a hypothesis and reiterating information from the abstracts: “Given the role of _TOPBP1_ in DNA repair and the importance of gene mutations in the development of

teratomas, it is plausible that mutations or dysregulation of TOPBP1 could contribute to the development or progression of teratomas. This could explain the high overlap between the two gene

sets. Further research would be needed to confirm this hypothesis and elucidate the exact mechanisms involved”. The summaries produced by the GPT-4 LLM are mostly helpful and logical but

should be manually verified, as the model itself states. GENE FUNCTION PREDICTIONS Large collections of gene sets can be used to effectively predict gene functions with semi-supervised

learning19. The first step to produce such predictions is to construct a gene-gene similarity matrix from the Rummagene database of gene sets. This can be done with different algorithms.
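
As a rough illustration of this first step, the sketch below builds a sparse gene-gene similarity table from a collection of gene sets. It assumes the gene sets are available as Python sets of gene symbols, and it uses the Jaccard index of set memberships as the co-occurrence measure. This measure is an assumption chosen for simplicity and is not necessarily one of the three co-occurrence algorithms evaluated here; the function name and the toy gene sets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(gene_sets):
    """Return {(gene_a, gene_b): similarity} for genes that co-occur.

    Similarity is the Jaccard index of the two genes' set memberships:
    (# sets containing both) / (# sets containing either).
    This is an illustrative choice, not the published method.
    """
    gene_counts = Counter()  # number of sets containing each gene
    pair_counts = Counter()  # number of sets containing both genes of a pair
    for genes in gene_sets:
        unique = sorted(set(genes))  # sorted so each pair has one key order
        gene_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    similarity = {}
    for (a, b), both in pair_counts.items():
        either = gene_counts[a] + gene_counts[b] - both
        similarity[(a, b)] = both / either
    return similarity


# Toy gene sets, for illustration only
toy_sets = [
    {"SLC30A8", "MTNR1B", "PAX4"},
    {"SLC30A8", "MTNR1B"},
    {"MTNR1B", "KLF14"},
]
sim = cooccurrence_similarity(toy_sets)
```

Pairs that never co-occur are simply absent from the returned dictionary, which keeps the table sparse. For a corpus of this size, a sparse matrix representation would likely be preferable to a dictionary.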

Here we tested the ability of three previously published co-occurrence algorithms20 to make such predictions, and compared the quality of such predictions to predictions made with a similar

method that utilizes gene-gene co-expression correlations from thousands of RNA-seq samples6. The gene-gene similarity matrices from Rummagene were able to predict with high accuracy and

precision the gene membership for functional terms created from the Gene Ontology (GO) Biological Process21, GWAS Catalog22, Mouse Genome Informatics (MGI) Mammalian Phenotypes (MP)23, and

WikiPathways24 (Fig. 6a). To illustrate an example for one term, the term “Fasting Plasma Glucose” from GWAS Catalog was selected. The top 10 genes that are closest to the genes known to be

associated with this phenotype are _SLCO1B3-SLCO1B7_, _P3R3URF-PIK3R3_, _SLC30A8_, _FAM240B_, _MTNR1B_, _PERCC1_, _EEF1AKMT4-ECE2_, _KLF14_, _CCDC201_, and _PAX4_; and the ROC curve to

assess the quality of the predictions has a 0.75 area under the curve (Fig. 6b). The top 10 predicted genes for each term from these three gene set libraries are provided as a supporting

table (Supplementary Data 4). THE KNOWLEDGE SPACE THAT IS COVERED BY RUMMAGENE COMPARED WITH ENRICHR To assess the breadth and coverage of the automatically curated Rummagene gene set space,

we contrasted it against the Enrichr10 gene set space. Enrichr is a large-scale curated database of gene sets that is similar in size to Rummagene. UMAP25 was applied to project over

1 million gene sets into two dimensions for the purpose of data visualization where each point represents a gene set from either Rummagene or Enrichr. Gene sets are colored by whether they

originate from Rummagene or Enrichr’s gene set library categorization: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Miscellaneous, Legacy, Crowd (Fig. 7a, b). We observe

that Rummagene gene sets cluster into many punctate clusters that likely represent themed gene sets (Fig. 7a). Also, Enrichr’s gene sets are clustered by category (Fig. 7b). When overlaying

the Rummagene gene sets on the Enrichr gene sets, most categories are covered with a few exceptions. We observe that some gene set libraries are not covered by Rummagene, while a few areas

in gene set space are much more common in Rummagene compared with Enrichr. To quantitatively verify the presence of these unique clusters, UMAP enhanced clustering was employed with a UMAP

projection with min_dist of 0 followed by HDBSCAN clustering26. Clusters were labeled with an Enrichr gene set library when 25% of the gene sets within that cluster were from that

library; otherwise, they were labeled by a cluster number. Mostly-Enrichr and mostly-Rummagene clusters, in which one source makes up 90% of the gene sets in the cluster, are visible across the projection (Fig.

7c). The largest clusters that are mostly from Enrichr are from gene set libraries that were created from unique sources, for example, the LINCS L1000 data27,28, single cell

transcriptomics29, virus-host protein-protein interactions30, pathways extracted from figures31, and gene sets related to NIH funded investigators32 (Fig. 7d). On the other hand, several

clusters were unique to Rummagene (Supplementary Data 5–8). One of these clusters, namely cluster 81, contains gene sets that consist exclusively of transcription factors. This is likely because

there are specific assays and studies that focus on profiling these genes exclusively. THE RUMMAGENE WEBSITE The Rummagene data is served on the website https://rummagene.com with three

search engines. The first search engine accepts gene sets as the input query and then returns matching gene sets based on the overlap between the input gene set and the unique sets in the

Rummagene database. The results are ranked by Fisher’s exact test _p_-value, and to optimize responsiveness, a fast in-memory algorithm is implemented. The results are presented to the user in

paginated tables with hyperlinks to the original publication and the supplemental material from which the gene sets were extracted, the genes in the matching sets, the overlapping genes, the

_p_-values and the Benjamini-Hochberg corrected _p_-values of the overlap, and the odds ratios. When clicking the overlap numbers, a popup screen shows the overlapping genes with the

ability to copy them to the clipboard, submit them to Rummagene, or submit them to Enrichr10. Similarly, the original gene set can be accessed by clicking on the column name. The second

search engine facilitates a free-text PMC search. This search engine queries PMC with the entered terms to receive PMCIDs that match the query. It then compares the returned PMCIDs to the

PMCIDs in Rummagene to identify matching PMCIDs. Once such matches are detected, the gene sets in the Rummagene database are returned to the user as a paginated table with hyperlinks to the

original publications and the matching gene sets. The third search engine queries the table titles from which gene sets were extracted. Table titles that match the inputted search terms are

displayed in a paginated table with hyperlinks to the matching publications and gene sets. All search engine results can be filtered, shared by URL, and downloaded. The entire database is

available for download as a text file, and access to the data is provided via a GraphQL API. Importantly, the Rummagene resource is updated automatically once a week. DISCUSSION By crawling

through full articles and supporting materials from over five million research publications available from PMC, we were able to identify over 120,000 publications that contain over 600,000

mammalian gene sets of various lengths. Smaller gene sets are enriched for widely studied genes, while longer lists contain less studied genes, likely due to their origin from omics studies.

Interestingly, in the past five years, the publication of gene sets in articles has been increasing exponentially. Hence, most gene sets in the Rummagene database are from this period. Here

we demonstrated how the Rummagene resource can be used for various applications. Specifically, we showed how a subset of the extracted sets can be used for transcription factor and kinase

enrichment analyses. We also showed how the rich knowledge in Rummagene can be used for gene function predictions. In addition, we demonstrated how we can form hypotheses by identifying gene

set pairs with high similarity in gene set space and low similarity in abstract space. However, many additional applications are possible. For example, Rummagene can be used to produce

textual descriptions for gene sets using large language models (LLMs). Given the large collection of Rummagene gene sets, as well as the fast enrichment search engine that is implemented, we

could provide the Rummagene API to an LLM to act as a chatbot that searches for relevant papers that are related to a given gene set and then summarizes the collective functions identified

in these papers. This is different from just giving an LLM a gene set because it adds focus to the search by utilizing the Rummagene API. The LLM use case currently implemented in Rummagene

is forming hypotheses about two highly overlapping gene sets with dissimilar abstracts. We show that when the two abstracts are submitted to an LLM with a request to explain why the

seemingly unrelated abstracts might have highly overlapping gene sets, the LLM is constrained to provide a plausible explanation. Although such an explanation is at times trivial, in all

cases that we tested, it was based on correct facts. Hence, the prompt is detailed and constrained enough to produce high-quality responses from the LLM. One of the opportunities provided by

Rummagene is its integration with other resources that contain large collections of gene sets and signatures, for example, Enrichr10, ARCHS46, and SigCom LINCS27. Biomedical research has

been traditionally communicated via hardcopy printed paper journals. With the transition into fully digital research communication, and with the introduction of omics technologies, increased

efforts are placed on better annotation and standardization of published research data including the publication of gene sets and data tables. During this transition period toward such

improved annotations, Rummagene plays an important role in making previously published data, buried in supplemental materials of publications, more findable, accessible, interoperable, and

reusable (FAIR)33. METHODS CRAWLER TO EXTRACT GENE SETS FROM PUBLICATIONS LISTED IN PMC The PMC Open Access Subset34 contains millions of journal articles available under license terms that

permit reuse. Additionally, PMC provides uniformly structured bundles that can be retrieved in bulk over FTP. An index file contains a tabular listing of all PMCIDs represented with a

pointer to the compressed bundle corresponding to that PMCID. Each bundle has a PDF of the paper, an XML document containing structured metadata about the paper, figures, and supplemental

material files. First, the index file is downloaded, and a job is then submitted for each paper. The job downloads and extracts the archive and processes the XML structured paper by loading the

tables from the paper and all supplemental files. Both the tables from the main paper, and the tables in the supplemental files may have captions or labels. These captions or labels are

saved. Additionally, places in the text that mention the table, or the supplemental file, are identified when they are linked in the markup; at most, 15 words before such a call to the

tables are saved. Every supplemental file is processed by one of several table-extractor-functions, selected based on the file extension. These extractor functions include support for Excel,

CSV, TSV, and inferred separator loading of TXT files, as well as a PDF table extractor based on Tabula-Py. For each supporting materials table that is extracted, every column in the table

is considered. The extractor function attempts to map all unique strings to gene symbols. Mapping may be direct, through some synonym, or identifier. Any column where more than half of the

strings can be successfully mapped to a valid human gene symbol using NCBI’s Gene Info35 file for _Homo sapiens_ is retained. In other words, every column passing this filter becomes a gene

set in the Rummagene gene set library. This approach aims to capture human gene sets, but also captures gene sets from other mammalian organisms such as mouse or rat because of the high

overlap in gene symbols. Hence, we consider the overall collection of gene sets in Rummagene as mammalian. The term describing the gene set is made of the PMCID, the file name in the bundle,

the Excel spreadsheet name or the XML table label, the column’s first cell, and additional sequential numbers that are added to the term to make it unique if needed. The description field

is constructed by concatenating any available caption, label, and text mention. The original items in each table column that pass the filter are preserved, but genes are included only if

they can be mapped to official symbols. In addition to filtering out columns with too few mapped genes (< 5), columns with too many mapped genes (> 2500) are ignored. This is because

these are likely to contain gene sets that cover all measured genes and not a subset of identified genes with a potentially unique function. This pipeline produces a large gene matrix

transpose (GMT) file which can be added to incrementally. The pipeline is designed to continue where it left off when it is re-run. It is set to run weekly to extend the database with any

new publications that are added to the PMC Open Access database. The new entries to the GMT are stored in the Rummagene database to be accessed from the web-based application. By extracting

gene sets from supporting material of published research articles we can make these more accessible for search and reuse. SEARCH ENGINE IMPLEMENTATION The large size of the Rummagene gene

set library requires special implementation of an algorithm that can quickly compare the input gene set to all the gene sets in the Rummagene database. Besides a fast algorithm that can

compare the input set to all other sets, efficient storage of the gene sets is needed as well as sufficient hardware. To enable a fast gene set search, a Rust-powered REST API was

implemented. The algorithm first initializes several in-memory data structures: 1) a background sorted set of all genes across all gene sets in the database; 2) a hashmap

in which each gene is mapped to a 32-bit unsigned integer (U32) index; 3) the gene set IDs and unique hashes stored as UUIDs; and 4) a hashset of mapped genes using the

Fowler–Noll–Vo (FNV) hash function on each gene for each unique gene set. FNV is known to perform well when dealing with small keys. This is the case in our implementation which uses 32-bit

unsigned integer keys. In our tests, FNV performed much faster than the default hasher. These data structures are created by querying the database with Rust. When the user presses the search

button, the queried gene sets are forwarded to the API. After ensuring that the index is initialized, the code maps the user submitted gene set to a U32 hash set. It then computes the

intersections between the user’s gene set and the gene sets in memory and performs the Fisher’s exact test using the identified overlap. Parallel processing with Rayon36 is employed to

further speed up this process. Once completed, Benjamini-Hochberg adjusted _p_-values are computed. Next, the results are sorted by _p_-value, temporarily cached, and returned. The gene sets

in Rummagene are stored in a Postgres database37. A function in the Postgres database is responsible for mapping the gene symbols to UUIDs before passing them to the Rust API to obtain

results. These returned results can be joined by ID with the gene sets and genes in the database to facilitate further filtering. In this way, the use of an API is transparent to the

front-end which queries the database with PostGraphile powered GraphQL. By implementing an advanced fast search engine, we can offer an interactive real-time service to users of the

Rummagene application. The Rummagene database is automatically updated once a week by processing all the new articles added to PMC in the past week to identify new gene sets in the

supporting materials of these articles. When a batch of new gene sets is added to the database, a new reference of valid gene names is constructed with the complete set of genes in the

database. At that time, the API is called to prepare the new gene name reference prior to removing the old reference. By automatically updating the database, we ensure that it will remain

relevant and current long term with minimal effort. EXTRACTING FUNCTIONAL TERMS FROM COLUMN TITLES To assess the contents of the extracted gene sets, the column titles for each table were

examined to identify a variety of functional terms. Supplementary table titles often include DOI and other identification information, thus these were ignored when conducting this analysis.
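
The title-processing rules of this section (splitting on dashes, underscores, and periods, ignoring purely numeric strings and supplemental-table labels, mapping synonyms to official symbols, and flagging 'up', 'down', and 'dn') can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, the small `symbol_lookup` dictionary stands in for the NCBI gene symbol and synonym table, tokens are upper-cased before lookup, and an "S followed by an integer" label is taken to mean a whole token such as `S2`.

```python
import re

SPLIT_PATTERN = re.compile(r"[-_.]")   # dashes, underscores, periods
NUMERIC_ONLY = re.compile(r"^\d+$")    # strings containing only numbers
SUPP_TABLE = re.compile(r"^S\d+$")     # assumed form of table labels, e.g. "S2"
DIRECTION_TOKENS = {"up", "down", "dn"}


def tokenize_title(title):
    """Split a column title and drop numeric and supplemental-label tokens."""
    tokens = [t.strip() for t in SPLIT_PATTERN.split(title)]
    return [
        t for t in tokens
        if t and not NUMERIC_ONLY.match(t) and not SUPP_TABLE.match(t)
    ]


def annotate_title(title, symbol_lookup):
    """Return official gene symbols and direction terms found in a title.

    symbol_lookup maps upper-cased symbols and synonyms to official symbols
    (a hypothetical stand-in for the NCBI-derived table).
    """
    tokens = tokenize_title(title)
    genes = [
        symbol_lookup[t.upper()] for t in tokens if t.upper() in symbol_lookup
    ]
    directions = [t.lower() for t in tokens if t.lower() in DIRECTION_TOKENS]
    return {"genes": genes, "directions": directions}


# Toy lookup table and title, for illustration only
toy_lookup = {"TP53": "TP53", "P53": "TP53", "STAT3": "STAT3"}
tokens = tokenize_title("Table_S2.p53_KO_dn-2021")
annotation = annotate_title("Table_S2.p53_KO_dn-2021", toy_lookup)
```

Transcription factors and kinases could then be identified by intersecting the returned gene symbols with reference lists of each family.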

After separating column titles in each gene set, column titles were split on dashes, underscores, and periods. To identify gene sets in each column, each resulting string was examined to

assess if it was an NCBI Entrez38 approved human gene symbol or a listed synonym. All gene synonyms were subsequently converted to their official symbol. Although genes can be represented

with integer identifiers, strings containing only digits were ignored because manual examination showed that many of these were artifacts. Additionally, strings containing ‘S’ followed by an integer were ignored because the vast majority of these refer to the supplemental table number. Transcription factors and kinases were subsequently identified from the

extracted gene symbols. To identify gene sets that may represent signatures, the strings ‘up’, ‘down’, and ‘dn’ were searched for in the split column titles. To identify tissues, cell types

and cell lines present in the column titles, the Brenda Tissue Ontology (BTO)39 official terms and synonyms were extracted, and exact matches were identified. For gene sets containing

multiple BTO terms, the terms were hyphenated to capture, for instance, a cell type from a specific tissue.

VISUALIZATION OF THE KINASE AND TF GENE SET LIBRARIES

For each extracted gene set, IDF vectors were computed with the Scikit-learn40 Python package, using the set of all included genes as the corpus. Using the Scanpy41 Python package, Uniform Manifold Approximation and

Projection (UMAP)8 plots for different categories of gene sets were then generated from the IDF vectors and clusters were automatically computed using the Leiden algorithm42. To visualize

broad patterns across the data, each point representing a gene set was colored based on the cluster, associated PMCID, and associated kinase or transcription factor, if applicable. By

visualizing the kinase and TF gene set libraries, we can observe higher-level functional clusters of related kinases and TFs.

BENCHMARKING TRANSCRIPTION FACTOR AND KINASE ENRICHMENT ANALYSES

Consensus transcription factor and kinase gene set libraries were created by performing a metadata search of the Rummagene database by submitting the kinase or transcription factor named

entities as the search term. Returned entries are matches where the transcription factor or kinase terms appear in the gene set’s table title, table legend, or column legend. The gene set

for each transcription factor and kinase is composed from the union of all identified gene sets corresponding to the given transcription factor or kinase. Benchmarking datasets were sourced

from ChEA32 for transcription factors and from KEA39 for kinases. To benchmark enrichment analysis performed with the constructed consensus gene set libraries, the rank of each transcription

factor/kinase was identified using the Fisher’s exact test _p_-value for each matching gene set in each benchmarking dataset. To generate ROC curves, we downsampled the negative class to

the same size as the positive class to achieve class balance. ROC curves were then bootstrapped over 5000 iterations, and the mean ROC curves and AUCs were reported. Because the negative class is randomly downsampled, bootstrapping over several thousand iterations gives a more stable estimate of how well the Rummagene transcription factor and kinase gene set libraries recover the perturbed transcription factor or kinase. The NumPy interp function was used to linearly interpolate between all points from the 5000 ROC curves to generate a composite ROC curve for each benchmarking library.

TOPIC MODELING

To identify the predominant topics associated with gene sets in the Rummagene database, the abstracts of each paper

contributing at least one gene set were assembled from the PMC bulk download. The text contained within the _<abstract>_ tags was concatenated. Papers containing no abstracts were

excluded from the analysis. Each abstract was then tokenized, stripped of stop words, and lemmatized using the Python package Natural Language Toolkit (NLTK)43. The LdaModel class of the Python

package Gensim44 was then used to identify nine topics with a chunksize of 100 over 10 passes. The number of topics was chosen manually by observing the separation of topics given different

sets of parameters. Word counts and word importance were extracted from the model for each of the nine topics. The abstracts were visualized in topic space with t-SNE25, using the vectors produced by the latent Dirichlet allocation (LDA) model11 that describe the adherence of each paper to each topic.

SIMILAR GENE SET PAIRS THAT ARE DISTANT IN ABSTRACT SPACE

The preprocessing of

publications’ abstracts followed the same procedure as in topic modeling where abstracts were first extracted from the PMC bulk download, then cleaned of stopwords and lemmatized using the

NLTK43 Python package. Abstracts were then converted to word counts using the count vectorizer and subsequently transformed into term frequency-inverse document frequency (TF-IDF) vectors using the

Scikit-learn40 Python package. The cosine similarity of each paper abstract to all other abstracts was then assessed using the Scikit-learn pairwise linear kernel metric based on the

computed TF-IDF vectors. Only pairs of gene sets from different publications with zero cosine similarity of their abstracts were retained. For each pair of such gene sets, Fisher’s exact

test was performed to assess the significance of the overlap among the genes within these two sets. Only pairs with _p_ < 0.05 were retained for further analysis. Pairs with identical

gene sets were excluded. Pairs were further filtered to include only those with overlaps of more than 50 genes. Additionally, to assess the novelty of the recovered pairs, we computed the percentage of their overlapping genes that are ‘sticky proteins’ identified in an analysis of protein-protein interactions45 (Supplementary Data 9). In the analysis of gene set pairs including a gene or a disease in the table or column title and legend, only the top 10,000 most significant pairs with <10% ‘sticky proteins’ were included. To assess the number of highly cited genes present in the overlapping genes of gene set pairs, the top 500 most cited genes according to GeneRIF38 were used (Supplementary Data 9). Additionally, to determine the number of highly

expressed genes present in the overlapping genes of gene set pairs, the top 500 most highly expressed protein coding genes were sourced based on mean expression across 5000 random samples

from ARCHS46 (Supplementary Data 9). To identify disease names in column titles of the gene set pairs, DisGeNet46 disease terms were used and gene names were identified using NCBI gene38

mappings. The OpenAI API chat completion module using the GPT-4 model was utilized to hypothesize about the connection between the remaining top pairs of gene sets that passed the filtering steps described above. When prompting the model, we provide it with the gene set terms, the abstracts of both papers, and any identified disease or gene extracted from the gene set term column title in the following format: “Based on the pair of extracted gene sets from two research publications, hypothesize why there might be a

connection between these gene sets based on the two abstracts, and the provided gene and disease terms: Gene set term 1: [term1], disease from gene set 1 term: [disease], abstract of

publication for gene set term 1: [term1_abstract], Gene set term 2: [term2], gene(s) from gene set 2 term: [gene], abstract of publication for gene set term 2: [term2_abstract].” Additionally,

the system message explains the task as follows: “You are a biologist who attempts to generate a hypothesis about why two gene sets, which are lists of genes, may have a high overlap despite

being extracted from two publications that have dissimilar abstracts. The gene set/paper pairs you will be given have one gene set with a disease term and the other with a gene name, so you

should include reasoning as to a possible connection between the disease and the gene and explain this possible connection. Such a connection should be related to the abstracts.” The

response from the model, along with statistics about the significance of the overlap and a PubMed query with the disease and the gene, is provided to help determine whether this association is already published in the literature.

GENE FUNCTION PREDICTIONS

A total of 50,000 gene sets were randomly selected from Rummagene and filtered for sets with fewer than 2000 genes. For all human genes, we formed a matrix $A$ where $A(i,j)=1$ if gene $i$ is a member of gene set $j$ and $0$ otherwise. The co-occurrence matrix is then $\varPhi =A\cdot {A}^{T}$. As previously described20,

the co-occurrence probability between two genes $\alpha$ and $\beta$ is

$$P(\alpha ,\beta )=\frac{\varPhi (\alpha ,\beta )}{\phi_{0}},$$

where $\phi_{0}$ is the total number of co-occurrences, and the marginal probability is $P(\alpha )=\frac{1}{\phi_{0}}\sum_{\beta \ne \alpha }\varPhi (\alpha ,\beta )$. The cosine similarity, Jaccard index, and

normalized pointwise mutual information (NPWMI) for each pair of genes were then calculated as follows:

$$\mathrm{Cosine}(\alpha ,\beta )=\frac{P(\alpha ,\beta )}{\sqrt{P(\alpha )P(\beta )}}$$

$$\mathrm{Jaccard}(\alpha ,\beta )=\frac{P(\alpha ,\beta )}{P(\alpha )+P(\beta )-P(\alpha ,\beta )}$$

$$\mathrm{NPWMI}(\alpha ,\beta )=\frac{-1}{\ln P(\alpha ,\beta )}\cdot \max \left\{0,\ln \left(\frac{P(\alpha ,\beta )}{P(\alpha )P(\beta )}\right)\right\}$$

The NPWMI is a value between 0 and 1, where a larger value indicates the two genes co-occur

with greater probability than expected by random chance48. Four gene set libraries were used to benchmark gene function prediction: GO Biological Process (2023), GWAS Catalog (2023), MGI

Mammalian Phenotypes (2021), and Human WikiPathways (2021). To predict the likelihood that a gene belongs to a gene set, we scored each gene against each gene set in each library by computing the average similarity between the gene and the genes in that gene set. Suppose $L$ is a matrix where $L(i,j)=1$ if gene $i$ is a member of gene set $j$ in

the library $L$, and $0$ otherwise. Let $D$ be the similarity matrix as described above, where the diagonal is set to $0$. The gene/gene-set association matrix is $G=\frac{D\cdot L}{L\cdot {1}^{T}}$, where the division is elementwise. Each entry $G(i,j)$ is then the mean similarity of gene $i$ to all the genes in gene set $j$. The matrix $G$ can then be used

to predict membership of gene $i$ in any gene set. ROC curves and AUC values for each term in the library were computed using the Python sklearn.metrics module40.

COMPARING THE RUMMAGENE GENE SET SPACE TO THE ENRICHR GENE SET SPACE

All the gene set libraries in Enrichr were assembled and processed together with the Rummagene gene sets so they can be projected into the same

two-dimensional space. First, all genes were mapped to their official NCBI gene symbols for _Homo sapiens_ or filtered out. Gene sets were then converted into vectors with values

corresponding to the inverse document frequency (IDF)49. Truncated Singular Value Decomposition (Truncated SVD)50 was then used to reduce the dimensionality of the IDF vectors to the 50

largest singular values. UMAP25 with the default settings was then used to embed all samples into two dimensions. Finally, to better frame the visualization, we computed the mean and standard deviation along each embedding axis and show only the bulk of the samples, those within 1.68 standard deviations of the mean.

DATA AVAILABILITY

The Rummagene dataset version

analyzed here is available for download from: https://rummagene.com/download and from Figshare51. The most recently updated version of the Rummagene dataset is also available from https://rummagene.com/download. This dataset is updated weekly on Mondays. Additional files needed to reproduce the results are provided as Supplementary Data files.

CODE AVAILABILITY

The

Rummagene web server application is available from: https://rummagene.com/. The Rummagene source code is available from: https://github.com/MaayanLab/rummagene and a snapshot of the source

code was deposited in Figshare52. The code and files needed to reproduce the figures are available from: https://github.com/MaayanLab/rummagene/tree/main/figures.

REFERENCES

1. Manzoni, C. et al. Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. _Brief. Bioinform._ 19, 286–302 (2018).
2. Keenan, A. B. et al. ChEA3: transcription factor enrichment analysis by orthogonal omics integration. _Nucleic Acids Res._ 47, W212–W224 (2019).
3. Lachmann, A. et al. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. _Bioinformatics_ 26, 2438–2444 (2010).
4. Hammal, F., de Langen, P., Bergon, A., Lopez, F. & Ballester, B. ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. _Nucleic Acids Res._ 50, D316–D325 (2022).
5. Wilks, C. et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. _Genome Biol._ 22, 323 (2021).
6. Lachmann, A. et al. Massive mining of publicly available RNA-seq data from human and mouse. _Nat. Commun._ 9, 1366 (2018).
7. Shin, M.-G. & Pico, A. Using Published Pathway Figures in Enrichment Analysis and Machine Learning. _bioRxiv_ https://doi.org/10.1101/2023.07.06.548037 (2023).
8. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. _arXiv [stat.ML]_ (2018).
9. Kuleshov, M. V. et al. KEA3: improved kinase enrichment analysis via data integration. _Nucleic Acids Res._ 49, W304–W316 (2021).
10. Kuleshov, M. V. et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. _Nucleic Acids Res._ 44, W90–W97 (2016).
11. Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. _Genetics_ 155, 945–959 (2000).
12. Hémono, M., Haller, A., Chicher, J., Duchêne, A.-M. & Ngondo, R. P. The interactome of CLUH reveals its association to SPAG5 and its co-translational proximity to mitochondrial proteins. _BMC Biol._ 20, 13 (2022).
13. Bileck, A. et al. Inward Outward Signaling in Ovarian Cancer: Morpho-Phospho-Proteomic Profiling Upon Application of Hypoxia and Shear Stress Characterizes the Adaptive Plasticity of OVCAR-3 and SKOV-3 Cells. _Front. Oncol._ 11, 746411 (2021).
14. Rolfs, F., Piersma, S. R., Dias, M. P., Jonkers, J. & Jimenez, C. R. Feasibility of Phosphoproteomics on Leftover Samples After RNA Extraction With Guanidinium Thiocyanate. _Mol. Cell. Proteom._ 20, 100078 (2021).
15. Monsivais, D. et al. Mass-spectrometry-based proteomic correlates of grade and stage reveal pathways and kinases associated with aggressive human cancers. _Oncogene_ 40, 2081–2095 (2021).
16. Mooser, C. et al. Treacle controls the nucleolar response to rDNA breaks via TOPBP1 recruitment and ATR activation. _Nat. Commun._ 11, 123 (2020).
17. Salaverry, L. S. et al. Metabolic plasticity in blast crisis-chronic myeloid leukaemia cells under hypoxia reduces the cytotoxic potency of drugs targeting mitochondria. _Discov. Oncol._ 13, 60 (2022).
18. Shen, H. et al. Integrated Molecular Characterization of Testicular Germ Cell Tumors. _Cell Rep._ 23, 3392–3406 (2018).
19. Lachmann, A. et al. Geneshot: search engine for ranking genes from arbitrary text queries. _Nucleic Acids Res._ 47, W571–W577 (2019).
20. Ma’ayan, A. & Clark, N. R. Large Collection of Diverse Gene Set Search Queries Recapitulate Known Protein-Protein Interactions and Gene-Gene Functional Associations. _arXiv [q-bio.MN]_ (2016).
21. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. _Nat. Genet._ 25, 25–29 (2000).
22. MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). _Nucleic Acids Res._ 45, D896–D901 (2017).
23. Smith, C. L., Goldsmith, C.-A. W. & Eppig, J. T. The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. _Genome Biol._ 6, R7 (2005).
24. Pico, A. R. et al. WikiPathways: pathway editing for the people. _PLoS Biol._ 6, e184 (2008).
25. Van Der Maaten, L., Postma, E. O., van den Herik, H. J. & Others. Dimensionality reduction: A comparative review. _J. Mach. Learn. Res._ 10, 13 (2009).
26. Campello, R. J. G. B., Moulavi, D. & Sander, J. Density-based clustering based on hierarchical density estimates. in _Advances in Knowledge Discovery and Data Mining_ 160–172 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2013).
27. Evangelista, J. E. et al. SigCom LINCS: data and metadata search engine for a million gene expression signatures. _Nucleic Acids Res._ 50, W697–W709 (2022).
28. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. _Cell_ 171, 1437–1452.e17 (2017).
29. Tabula Sapiens Consortium et al. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. _Science_ 376, eabl4896 (2022).
30. Lasso, G. et al. A Structure-Informed Atlas of Human-Virus Interactions. _Cell_ 178, 1526–1541.e16 (2019).
31. Hanspers, K., Riutta, A., Summer-Kutmon, M. & Pico, A. R. Pathway information extracted from 25 years of pathway figures. _Genome Biol._ 21, 273 (2020).
32. Talley, E. M. et al. Database of NIH grants using machine-learned categories and graphical clustering. _Nat. Methods_ 8, 443–444 (2011).
33. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. _Sci. Data_ 3, 160018 (2016).
34. Gamble, A. PubMed Central (PMC). _Charlest. Advisor_ 19, 48–54 (2017).
35. Brown, G. R. et al. Gene: a gene-centered information resource at NCBI. _Nucleic Acids Res._ 43, D36–D42 (2015).
36. Pieper, R., Löff, J., Hoffmann, R. B., Griebler, D. & Fernandes, L. G. High-level and efficient structured stream parallelism for rust on multi-cores. _J. Computer Lang._ 65, 101054 (2021).
37. Obe, R. O. & Hsu, L. S. _PostgreSQL: Up and Running: A Practical Guide to the Advanced Open Source Database_. (O’Reilly Media, Inc., 2017).
38. Maglott, D., Ostell, J., Pruitt, K. D. & Tatusova, T. Entrez Gene: gene-centered information at NCBI. _Nucleic Acids Res._ 33, D54–D58 (2005).
39. Gremse, M. et al. The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources. _Nucleic Acids Res._ 39, D507–D513 (2011).
40. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. _J. Mach. Learn. Res._ 12, 2825–2830 (2011).
41. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. _Genome Biol._ 19, 15 (2018).
42. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. _Sci. Rep._ 9, 5233 (2019).
43. Bird, S., Klein, E. & Loper, E. _Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit_. (O’Reilly Media, Inc., 2009).
44. Rehurek, R. & Sojka, P. Gensim–python framework for vector space modelling. _NLP Centre, Faculty of Informatics, Masaryk University_ (2011).
45. Mazloom, A. R. et al. Recovering protein-protein and domain-domain interactions from aggregation of IP-MS proteomics of coregulator complexes. _PLoS Comput. Biol._ 7, e1002319 (2011).
46. Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. _Nucleic Acids Res._ 45, D833–D839 (2017).
47. Sun, B. B. et al. Genetic regulation of the human plasma proteome in 54,306 UK Biobank participants. _bioRxiv_ 2022.06.17.496443 (2022) https://doi.org/10.1101/2022.06.17.496443.
48. Chiarcos, C., de Castilho, R. E. & Stede, M. _Von Der Form Zur Bedeutung: Texte Automatisch Verarbeiten: From Form to Meaning: Processing Texts Automatically. Proceedings of the Biennial GSCL Conference 2009_. (Narr Francke Attempto Verlag, 2009).
49. Karen, S. J. A statistical interpretation of term specificity and its application in retrieval. _J. Documentation_ 28, 11–21 (1972).
50. Chicco, D. & Masseroli, M. Software Suite for Gene and Protein Annotation Prediction and Similarity Search. _IEEE/ACM Trans. Comput. Biol. Bioinform._ 12, 837–843 (2015).
51. Clarke, D. J. B. et al. Rummagene gene sets with descriptions 01172024. figshare. Dataset. https://doi.org/10.6084/m9.figshare.25017023.v3 (2024).
52. Clarke, D. J. B. et al. Rummagene source code snapshot from 03132024. figshare. Software. https://doi.org/10.6084/m9.figshare.25404637.v1 (2024).

ACKNOWLEDGEMENTS

This study is partially supported by NIH grants OT2OD030160, U24CA264250, RC2DK131995, and

U24CA224260.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York,

NY, 10029, USA Daniel J. B. Clarke, Giacomo B. Marino, Eden Z. Deng, Zhuorui Xie, John Erol Evangelista & Avi Ma’ayan

CONTRIBUTIONS

D.J.B.C. and G.B.M. developed the website, wrote the manuscript, performed data analysis, and produced figures. D.J.B.C. wrote the crawler and developed the code for

the fast search engine. Z.X. and E.Z.D. wrote the manuscript and performed data analysis. J.E.E. contributed to the data analysis. A.M. conceived the project, wrote the manuscript, managed

the project, and was responsible for funding the project.

CORRESPONDING AUTHOR

Correspondence to Avi Ma’ayan.

ETHICS DECLARATIONS

COMPETING INTERESTS

The authors declare competing interests.

PEER REVIEW

PEER REVIEW INFORMATION

_Communications Biology_ thanks Alexander R. Pico and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Chien-Yu Chen and Tobias Goris.

ADDITIONAL INFORMATION

PUBLISHER’S NOTE

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION

* SUPPLEMENTARY NOTE
* DESCRIPTION OF SUPPLEMENTARY MATERIALS
* SUPPLEMENTARY-DATA-1
* SUPPLEMENTARY-DATA-2
* SUPPLEMENTARY-DATA-3
* SUPPLEMENTARY-DATA-4
* SUPPLEMENTARY-DATA-5
* SUPPLEMENTARY-DATA-6
* SUPPLEMENTARY-DATA-7
* SUPPLEMENTARY-DATA-8
* SUPPLEMENTARY-DATA-9

RIGHTS AND PERMISSIONS

OPEN ACCESS This article is licensed

under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate

credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article

are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and

your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this

licence, visit http://creativecommons.org/licenses/by/4.0/.

ABOUT THIS ARTICLE

CITE THIS ARTICLE

Clarke, D.J.B., Marino, G.B., Deng, E.Z. _et al._ Rummagene: massive mining of gene sets from supporting materials of biomedical research publications. _Commun Biol_ 7, 482 (2024). https://doi.org/10.1038/s42003-024-06177-7

* Received: 11 October 2023
* Accepted: 10 April 2024
* Published: 20 April 2024
* DOI: https://doi.org/10.1038/s42003-024-06177-7