Play all audios:
ABSTRACT Histone modifications can regulate transcription epigenetically by marking specific genomic loci, which can be mapped using chromatin immunoprecipitation sequencing (ChIP-seq). Here
we present QHistone, a predictive database of 1534 ChIP-seqs from 27 histone modifications in Arabidopsis, offering three key functionalities. Firstly, QHistone employs machine learning to
predict the epigenomic profile of a query protein, characterized by its most associated histone modifications, and uses these modifications to infer the protein’s role in transcriptional
regulation. Secondly, it predicts synergistic regulatory activities between two proteins by comparing their profiles. Lastly, it detects previously unexplored co-regulating protein pairs by
screening all known proteins. QHistone accurately identifies histone modifications associated with specific known proteins, and allows users to computationally validate their results using
gene expression data from various plant tissues. These functions demonstrate an useful approach to utilizing epigenome data for gene regulation analysis, making QHistone a valuable resource
for the scientific community (https://qhistone.paoyang.ipmb.sinica.edu.tw). SIMILAR CONTENT BEING VIEWED BY OTHERS GENOME-WIDE IDENTIFICATION OF ACCESSIBLE CHROMATIN REGIONS BY ATAC-SEQ UPON
INDUCTION OF THE TRANSCRIPTION FACTOR BZIP11 IN ARABIDOPSIS Article Open access 27 July 2023 GLOBAL PROFILING OF RNA–CHROMATIN INTERACTIONS REVEALS CO-REGULATORY GENE EXPRESSION NETWORKS IN
_ARABIDOPSIS_ Article 14 October 2021 ACHIP IS AN EFFICIENT AND SENSITIVE CHIP-SEQ TECHNIQUE FOR ECONOMICALLY IMPORTANT PLANT ORGANS Article 23 August 2024 INTRODUCTION Histone
modifications such as methylation, acetylation, phosphorylation, and ubiquitylation are known to affect transcription commonly through altering chromatin status1. For example, acetylated
histones are often associated with activating transcription by reducing the positive charges on histones, thus allowing proteins to gain access to the DNA2,3,4,5. Conversely, methylation can
change the overall charge of the histone proteins, affecting their interaction with the negatively charged DNA, hence influencing the interaction dynamics between histones and DNA, that
ultimately impacts the chromatin compaction2,3,4,5. For example, histone 3 lysine 9 dimethylation (H3K9me2) and histone 3 lysine 27 monomethylation (H3K27me1) are known to be enriched in
constitutive silenced heterochromatin, and repressively regulate transcription due to the dense structure of heterochromatin4,5,6,7,8,9,10 (See Supplementary Table 1 for a list of histone
modifications that are often active or repressive). In plants, histone modifications may affect the strength of histone binding to DNA, and such influence is dependent on the plant
conditions including stresses, developmental stages, tissues and genotypes11,12. Overall, the types of histone modifications and their specific preferences for marking locations on the
genome can provide valuable insights into transcriptional regulation, and these preferences are sensitive to plant conditions. Chromatin immunoprecipitation followed by sequencing (ChIP-seq)
and DNA affinity purification sequencing (DAP-seq) are commonly used high-throughput methods for profiling the genome-wide binding between genomic DNA and proteins of interests13,14. A
couple of databases have been built to collect ChIP-seqs in Arabidopsis (Supplementary table 2). PlantPAN3.0 included 54 transcription factors (TFs) and 19 types of histone modifications and
variants into their database, and majorly investigated _cis-_ and _trans-_regulatory motifs in plant promoters by transcription factor binding site (TFBS)15. ReMap2022 integrated the
ChIP-seq data from 423 TFs and 33 types of histone modifications and histone variants in Arabidopsis16. It enables the identification of histone marks enriched in the genomic binding regions
of a query protein. The enrichment of a specific histone mark to a query protein is estimated by the overlap percentage for each histone mark, which is determined by the ratio of the number
of overlapping peaks to the number of total peaks from the protein binding regions. Higher percentages indicate a stronger association with the protein. These databases did not investigate
the plant conditions in relation to the histone modification data, nor did they use rigorous statistical testing or the latest machine learning techniques in approaching the epigenomic
regulation, with machine learning still limited in predicting histone modifications at specific loci17. The regulatory activities of proteins, such as activation or repression, are often
associated with histone modifications4. For instance, the binding sites of an activator often coincide with the locations marked by histone acetylation. Hence, the set of histone
modifications that typically mark the binding locations of a specific regulatory protein can be used to assess the protein’s regulatory function, and this unique set of histone modifications
can be considered as the protein’s epigenome profile. Depending on the regulatory protein, the associated epigenetic marks in the epigenome profile can be activating, repressive, or a
combination, reflecting the intricate regulatory dynamics involved in the biological process4,18,19. AGENET DOMAIN (AGD)-CONTAINING PROTEIN 1 (ADCP1) targets transposable elements (TEs) with
abundant heterochromatic modifications such as H3K9me2, and DNA methylation, resulting in transcriptional silencing20,21,22. Loss of function of _ADCP1_ gene largely increases the
proportion of decondensed nuclei, confirming its role in heterochromatin maintenance21. LIKE HETEROCHROMATIN PROTEIN 1 (LHP1) mediates the spreading of trimethylation of histone H3 lysine 27
(H3K27me3), and requires for maintaining the repression of vernalization-inducing gene _FLOWERING LOCUS C_ (_FLC_)23,24. Therefore, by examining the set of epigenetic marks frequently
observed at the binding locations of the regulatory proteins (i.e., its epigenome profile), it is possible to predict their regulatory activity. Furthermore, regulatory proteins may
collaborate to accomplish their regulatory functions. They may both carry out similar functions to activate or suppress the expression of target genes. Alternatively, they can have
contrasting roles in their transcriptional control. For example, HISTONE DEACETYLASE 6 (HDA6) collaborates with histone demethylases LYSINE-SPECIFIC DEMETHYLASE 1-LIKE 1 and 2 (LDL1/2) to
jointly suppress the expression of genes and TEs25,26. SET DOMAIN GROUP 2 (SDG2) interacts with SWD2-like b (S2Lb), and together they directly affect the enrichment of histone 3, lysine 4
trimethylation (H3K4me3) on highly transcribed genes27. POLYCOMB REPRESSIVE COMPLEX 2 (PRC2) catalyzes H3K27me3, and maintains repressed state of target genes28,29,30. Conversely, INCREASE
IN BONSAI METHYLATION 1 (IBM1) and LDL2 have opposing functions in regulating transcription. A loss of IBM1 function results in the ectopic accumulation of H3K9me2, a repressive mark, across
gene bodies. On the other hand, the _ldl2_ mutation can mediate the repressive state induced by IBM1 by increasing the presence of histone 3, lysine 4 monomethylation (H3K4me1), an active
mark31. Another example includes RELATIVE OF EARLY FLOWERING 6 (REF6) and LHP1, where REF6 can alter the repressive state maintained by LHP1, enabling gene reactivation through the
repressive modification histone H2A monoubiquitination (H2Aub1)32. Altogether, these examples suggest that by evaluating the similarity in their epigenomic profiles, it is possible to
determine whether two regulatory proteins have similar or opposing functions in regulation. Here, we have developed QHistone, a predictive database for histone modifications in Arabidopsis
(https://qhistone.paoyang.ipmb.sinica.edu.tw). It first predicts a protein’s epigenomic profile, which can be used to assess its regulatory activity. Using a machine learning model trained
on numerous ChIP-seq datasets of various histone marks, QHistone can accurately cluster a regulatory protein with its closely associated histone modifications. In addition, between two
regulatory proteins, QHistone compares epigenome profiles between proteins to offer assessment of their possible synergistic regulation. We also conducted pairwise comparisons of epigenome
profiles for 616 proteins with available ChIP-seq data from NCBI, identifying functionally linked protein pairs that were previously unexplored. Our predictions were validated by comparing
them to known regulatory proteins that associate with specific histone modifications, as well as protein pairs known to have similar or dissimilar activities, as mentioned above. QHistone
offers gene expression analysis tools that allow users to computationally validate their results and provides an overview of histone modifications linked to various tissues, stress
responses, and genotypes. By integrating comprehensive epigenomic datasets with bioinformatic analysis, QHistone offers new perspectives on gene regulation in plant epigenomics research.
RESULTS OVERVIEW OF QHISTONE AS A PREDICTING DATABASE QHistone is a web database of a large number of ChIP-seqs from histone modifications and histone variants in Arabidopsis, accompanied by
analytical functions (see Fig. 1 for an overview of key functionalities, and Supplementary Fig. 1 for methods and analyses). Totally, there are 1534 ChIP-seqs from 27 types of histone
modifications, 11 types of histone variants, and 4 types of histone proteins collected from Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) (see Methods) (Fig. 2A;
Supplementary Fig. 2). The database also includes information of plant conditions (Fig. 2B; Supplementary Table 3). These data were processed through a unified peak-calling pipeline (see
Methods for details of the scripts), and can be browsed, downloaded, and visualized on the website (Supplementary Fig. 3). The sections in Results are organized to first introduce the three
major functionalities, including their concepts, methods and validations, followed by genome-wide analysis of histone modifications, the analysis with plant conditions, and performance
evaluation. EPIGENOME PROFILING OF A QUERY PROTEIN The first functionality in QHistone is to predict the epigenome profile of a protein query (see the left panel of Fig. 1), which allows the
inference of potential transcriptional regulation. To establish the epigenomic profiles of a query protein, we first implemented a Support Vector Machine (SVM) model to recognize all the
ChIP-seq datasets of histone modifications (Supplementary Fig. 1; Supplementary Data 1 for Selection of Machine Learning Model and Supplementary Data 2 for Formulas of Evaluation Metrics).
The final model achieved an accuracy of 90.74% (see Fig. 3A for receiver operating characteristic (ROC) analysis; Fig. 3B shows the predicting accuracy of each modification/variant;
Supplementary Table 4). In other words, QHistone can precisely cluster a query protein to its most associated histone modifications (see Methods for step-by-step procedures of model
training, testing, optimization, and evaluation). Once the epigenome profile is predicted, the regulatory role of the top histone modifications may be indicative of the query protein’s role
in transcriptional regulation. The overlapping percentage method used in ReMap2022 is conceptually similar to our approach16. We referred to the methodology described in ReMap2022 and
evaluated our prediction accuracy against this method. This method first merges all peaks discovered from each histone mark and then sequentially calculates the overlapping percentage
between a query and each histone mark. The mark with the highest overlapping percentage indicates the strongest association with the query. Overall, our method is 14.5% more accurate than
the overlapping percentage method (Supplementary Fig. 4) and demonstrates greater accuracy in most individual modifications/variants (Fig. 3B). Moreover, we identified a small set of genomic
regions that can predict the epigenome profile with similar or higher accuracy than using the entire genomic regions33. This was achieved by adopting feature selection using the recursive
feature elimination (RFE) technique to filter out less informative genomic regions in the prediction. Consequently, a set of 303 specific genomic regions of 200 bp (0.05% of the genome) can
reach 99.43% accuracy in leave-one-out cross-validation (Fig. 3C; Supplementary Table 5 and Supplementary Table 6). These critical regions were enriched in the 5’UTR and exon in gene
structure, where histone marks are frequently targeted and observed, suggesting that these regions are crucial in gene regulation through dynamic marking of histone modifications (Fig. 3D).
In QHistone, users can select either all regions or these 303 critical regions for their prediction. To test our models, we predicted the epigenome profiles of two known regulatory proteins,
ADCP1 and EARLY BOLTING IN SHORT DAY (EBS). ADCP1 is a heterochromatin-binding protein that has been reported to specifically recognize H3K9me2 modification, and positively associates with
two heterochromatic modifications H3K9me2 and H3K27me120,21,22. Both are recognized as repressive marks. Our analysis of ADCP1 indicates that the most closely associated histone marks in its
epigenome profile are also H3K9me2 and H3K27me1. This suggests a strong similarity between ADCP1’s binding preference on the genome and the distribution of these two marks (Fig. 3E), that
our prediction corroborates with the previous reports10. The fact that the top four marks associated with ADCP1 are all repressive further suggests that ADCP1’s regulatory role is
predominantly repressive. On the other hand, EBS is a bivalent histone regulatory protein that respectively recognizes H3K4me3, and H3K27me3. EBS without c-terminal loop (EBSΔC) binds
H3K4me3 with a higher affinity than unmodified EBS protein34. In Fig. 3F, our predictions show that the likelihood of H3K4me3 marking to EBSΔC is 2.3 times greater than to EBS, confirming a
stronger association between EBSΔC and H3K4me3. In both instances, QHistone accurately predicts the primary histone marks associated with the regulatory proteins that are critical to their
functions. COMPARING SVM WITH OTHER SUPERVISED MACHINE LEARNING MODELS We compared the SVM with eight other supervised machine learning algorithms, including k-Nearest Neighbors (KNN),
Gaussian Naive Bayes (GNB), Decision Trees (Trees), Random Forest, Nearest Centroid, Stochastic Gradient Descent (SGD), Multi-Layer Perceptron (MLP), and XGBoost. The goal was to determine
whether SVM is the best model for QHistone (see Supplementary Data 1 for Selection of Machine Learning Model). To ensure optimal performance, the hyperparameters of KNN, GNB, Trees, Random
Forest, and Nearest Centroid were tuned through grid search cross-validation. MLP, SGD, and XGBoost were tuned using randomized search cross-validation, selecting the best fit from 100
randomly chosen hyperparameter sets, given a larger space of parameters these algorithms involve. Overall, our results showed that SVM achieved highest predicting accuracy, F1 score,
precision, and recall rate, hence SVM was selected for QHistone. PREDICTING REGULATORY INTERACTIONS BETWEEN PROTEIN PAIRS THROUGH EPIGENOME PROFILING When two regulatory proteins share very
similar epigenome profiles, they are likely to have similar preferences for the binding locations on the genome, and associate with similar sets of histone marks. Altogether, this may
suggest similar regulatory activities such as activation and repression. Conversely, regulatory proteins with very different epigenome profiles may suggest their different roles in
regulating expression (see the center panel of Fig. 1). To assess the similarity of the epigenome profiles between two proteins, both Kullback–Leibler divergence (KLD) and Mutual information
(MI) (see Methods) are employed. KLD measures the dissimilarity between the two profiles, whereas MI captures the amount of information shared between them. Therefore, two similar epigenome
profiles tend to show a low KLD and a high MI; suggesting likelihood of functional consistency between the two proteins. In fact, some proteins with consistent regulation by similar
epigenomic profiles demonstrate a synergistic effect, while others with opposing regulation may have an antagonistic effect. We validated our prediction of functional consistency on known
protein pairs with specific histone modification associated their complexes. In this case, SDG2 co-regulates with S2Lb, directly influencing the enrichment of H3K4me327. QHistone accurately
predicted the epigenome profiles of these proteins (Fig. 4A), and displayed the probabilities ranked by a slope chart (Fig. 4B show the ranking order of each histone modification/variant
respectively predicted from two proteins). Both proteins exhibited a similar ranking of histone modifications, indicating their possible functional consistency. We further assessed the
differences in epigenome profiles using KLD and MI. Both low KLD and high MI indicated similar epigenome profiles (KLD = 0.026; MI = 1.132; Supplementary Table 7). These proteins also shared
similar binding regions, predominantly at promoters and enhancers, underscoring their similar roles in gene regulation (Fig. 4C, D). Additionally, across multiple tissues in Arabidopsis,
both SDG2 and S2Lb consistently showed a tendency to upregulate gene expression (Fig. 4E), as a function in QHistone to assist computational validation of the regulatory activity (refer to
Methods for gene expression analysis). This further supports the predicted functional consistency. Lastly, Gene Ontology (GO) analysis in QHistone highlighted biological functions; genes in
overlapping regions were enriched in root development functions, aligning with previous phenotypic observations27 (Supplementary Table 8). Lastly, QHistone also successfully validated PRC2
complex with its two known members CURLY LEAF (CLF) and SWINGER (SWN); suggesting our tools could accurately indicate functional consistency (Supplementary Table 7 and Supplementary Table 9;
Supplementary Fig. 5). Conversely, two proteins might be functionally opposite within a regulatory complex. For example, IBM1 is known to activate gene expression by removing H3K9me2, while
LDL2 represses gene expression by removing H3K4me131. Since the ChIP-seq data for LDL2 is not publicly available, we utilized LDL1 data as a proxy because LDL1 and LDL2 are structurally
similar35,36 and have shown functional redundancy in epigenetic regulation25. By analyzing their epigenome profiles, we discovered that IBM1 had a lower association with H3K9me2 (Probability
= 0.01), and LDL1 had a lesser association with H3K4me1 (Probability < 0.01), likely due to their demethylation activities (Supplementary Fig. 6A). Interestingly, the epigenome profiles
of these proteins are distinctly different (KLD = 1.387; MI = 0.000) (Supplementary Table 7; Supplementary Fig. 6), indicating that IBM1 is primarily associated with active modifications,
whereas LDL1 is linked to repressive modifications. Overall, these results support the notion that IBM1 and LDL2 are functionally distinct, consistent with their known opposing roles in gene
regulation. In brief, QHistone is able to predict the regulatory activity among two regulatory proteins and summarize their effect on transcription by comparing their epigenome profiles.
Additionally, these predictions can be further evaluated through gene expression analysis. This functionality offers a novel perspective for studying the regulatory interactions between
protein pairs, beyond just their binding partners. QUERY AGAINST ALL ARABIDOPSIS PROTEINS FOR PUTATIVE PROTEIN PARTNERS OR FUNCTIONALLY ASSOCIATED PAIRS QHistone has developed epigenome
profiles for 616 Arabidopsis proteins, with their ChIP-seq data accessible in the GEO database (Supplementary Fig. 7). Users can upload their own proteins (preprocessed in standard BED
format) to query against these proteins in the database. This facilitates the discovery of previously unknown protein partners and functionally associated protein pairs (see the right panel
of Fig. 1). Amongst these 616 proteins, 1566 ChIP-seq libraries are present. To discover potential protein partners or functionally associated pairs within these libraries, we conducted 2.4
M pairwise comparisons of their epigenome profiles. The outcomes are depicted based on their KLD scores, _MI_ values, and the overlapping percentage of binding locations between two proteins
to aid in evaluating the functional consistency of these pairs (Fig. 5). Since there are no standard cutoffs for these scores, we selected four known protein partners or functionally
associated protein pairs—HDA6/LDL1, SDG2/S2Lb, CLF/SWN (PRC2 complex), and CLF/FERTILIZATION-INDEPENDENT ENDOSPERM (FIE) (PRC2 complex)—as reference points for functional consistency.
Additionally, two pairs, IBM1/LDL1 and REF6/LHP1, were identified as functionally opposite (Fig. 5; Supplementary Table 7). Among the 2.4 million combinations, we easily identified
unexplored protein pairs by comparing to these known protein pairs. For instance, SEPALLATA3 (SEP3) and APETALA1 (AP1) demonstrated strong functional consistency, evidenced by their low KLD
close to 0 and a high MI of 1.229 (Fig. 5; Supplementary Table 7; Supplementary Fig. 8A). Our literature review revealed that both are MADS-domain TFs, co-localizing in the same genomic
region, and they can both influence chromatin accessibility to regulate gene expression37,38. This finding aligns well with our predictions. On the other hand, ADCP1 and SUPPRESSOR OF TY
INSERTION 6 (SPT6) appeared as a pair with functionally opposite roles, exhibiting a high KLD of 1.383 and a low MI close to 0 (Fig. 5; Supplementary Table 7; Supplementary Fig. 8B). ADCP1,
a heterochromatin binding protein, plays a crucial role in transcriptional silencing20,21,22, while SPT6, a transcription elongation factor, co-occupies with RNAPII across highly transcribed
genes39, suggesting these two proteins may actually have opposing functions in transcription regulation. In brief, QHistone allows users upload their own query protein to query against
known proteins to identify regulatory interactions with proteins that have not yet been discovered. The results from the 2.4 M epigenome profile comparison are highly informative, enabling
exploration of previously unexplored regulatory interactions among know proteins and providing new research insights. GENOME-WIDE ANALYSIS OF HISTONE MODIFICATIONS WITH PLANT CONDITIONS We
studied a large number of histone modifications under different plant conditions (like tissues, genotypes, and stresses) using QHistone (see Fig. 2; Supplementary Table 3). Our goal was to
understand how these conditions affect the patterns of marking locations of histone modifications across the plant’s genome. QHistone is able to employ three visualization
methods—t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and multidimensional scaling (MDS)—to show these patterns (Fig. 6A; Supplementary Fig. 9). As
shown in the t-SNE plot, we discovered that these patterns mainly group together based on the type of histone modification, which suggests that the types of modifications are key in
controlling how histones attach to the genome. In the same t-SNE plot (Fig. 6A), plant conditions did not appear to be a primary factor influencing the marking patterns of histone
modifications. When we plotted different tissue types of the same histone modification H3K27me3, we discovered that the marking locations could be clustered based on the tissue of origin
(Fig. 6B). Furthermore, with hierarchical clustering of histone modification libraries under different plant conditions, we observed that within the histone acetylation group, there are
subgroups combining stresses and tissues. This suggests that plant conditions could be a secondary factor (Supplementary Fig. 10) impacting how histone changes are distributed across the
genome. To incorporate the information of plant conditions, QHistone is able to accurately map the binding patterns of query proteins within a t-SNE plot, alongside their corresponding
histone marks from ChIP-seq datasets across various states of plant conditions. For example, LHP1, which interacts with PRC1/2 to regulate H3K27me3 spread23, is known to have differential
H3K27me3 marking on the genome in different tissues40. While in the example the LHP1 protein binding data was originally sampled from the 14 day seedling with non-stress condition, in the
t-SNE plot by QHistone (Supplementary Table 10; Supplementary Fig. 11). It is clear that LHP1’s binding pattern closely resembles that of H3K27me3 in similar conditions for 10–14 day-old
seedlings, diverging significantly from other tissues (e.g., embryo, protoplast, sperm) or in cases of functional loss mutations. This example demonstrates that QHistone is able to precisely
link LHP1’s binding with the specific histone modifications and plant conditions, enabling detailed exploration of plant conditions. META-ANALYSIS OF HISTONE MODIFICATIONS AT VARIOUS
GENOMIC ELEMENTS To gain insight on the marking pattern of active versus repressive histone modifications on specific genomic structures, we profiled the coverage of histone modifications
around genes, TEs, and enhancers using meta-analysis (Fig. 6C; Supplementary Fig. 12; Supplementary Fig. 13; Supplementary Fig. 14). Histone acetylation marks showed similar pattern which
the marking crest was around the transcriptional start site (TSS), whereas most of the repressive modifications were depleted in the 5’UTR (Fig. 6D). Additionally, we calculated the overlap
percentage of marked regions between pairs of histone modifications, as illustrated in a heatmap (Fig. 6E). The results revealed that active and repressive histone modifications tend to
co-localize within the same genomic regions, suggesting their coordinated regulation. PERFORMANCE EVALUATION IN ASSOCIATING HISTONE MODIFICATIONS TO A QUERY We evaluated the performance of
QHistone by assessing the prediction accuracy and the reliability of the method. While there are currently no similar databases in any species that have implemented machine learning to
identify the associated histone marks for a query protein, we used the intuitive calculation of overlapping percentage between the marking locations of histone modifications and the binding
sites of query proteins (from ChIP-seq datasets) implemented in ReMap2022 for comparison. In ReMap2022, the modification/variant with a high overlapping percentage to the query was predicted
to have a high association. As a result, QHistone showed a higher prediction accuracy through the 5-fold cross-validation (QHistone: 90.74%; Overlapping: 76.20%) (Supplementary Fig. 4).
When evaluating the prediction accuracy of each modification/variant, QHistone showed better performance in 29 modifications/variants out of 43 by at least 5% (Fig. 3B). To investigate the
factors contributing to QHistone’s higher accuracy, we focused on the marking properties (ChIP quality, peak number, and peak size) of each ChIP-seq. As shown in Supplementary Fig. 15 and
Supplementary Fig. 16, we found that ChIP-seq data with high quality, more peaks, and larger peak sizes are considered easy-to-predict. Conversely, data with low quality, fewer peaks, and
smaller peak sizes are considered difficult-to-predict. While both QHistone and the Overlapping methods showed high accuracy in predicting easy-to-predict modifications/variants, QHistone
appears to outperform the overlapping method in predicting difficult-to-predict modifications/variants, as indicated by fewer such cases. This suggests that QHistone performs better in
challenging scenarios. Lastly, as some histone modifications appear to have more ChIP-seq datasets, the collections of each modification/variant in QHistone were therefore unbalanced (Fig.
2A). To examine if our predictions are influenced by such unbalanced data, we trained the SVM models with gradually increasing size of libraries of each modification/variant until the
largest number possible (Supplementary Fig. 15E) and monitored the prediction accuracy. As a result, QHistone achieves the highest accuracy level with fewer training libraries compared to
Overlapping (the highest accuracy is achieved with 38 training datasets for QHistone and 45 for Overlapping). Furthermore, as the accuracy level does not significantly change when more
training data is added, adding unbalanced data does not easily influence the model’s performance. Moreover, since the accuracy of our model has already reached a plateau with 30-40 libraries
per modification, suggesting sufficient libraries might have been collected already in the current QHistone to obtain optimal model parameters; therefore, further updates with more
libraries may not be necessary. DISCUSSION In this study, we developed QHistone which is more than a database storing raw sequencing data. QHistone equipped processed resources and devised
analytical tools that could potentially benefit the research community. QHistone has currently the largest ChIP-seq collection of histone modifications and variants in Arabidopsis
(Supplementary Fig. 2). With the user-friendly interface, users can easily search the sequencing data of their interest such as specific genotype, stress, or tissue, intuitively visualize
the data, and directly download them from QHistone (Supplementary Fig. 3). QHistone also provides a peak-calling pipeline, which is available as an easy-to-use Python script, allowing users
to consistently process their new histone data. Additionally, QHistone has the ability to explore the association between queries and histone modifications, predict the regulatory role
between two queries, and identify unexplored protein pairs. These unique features set QHistone apart from other current databases, enabling it to answer novel questions in the field of
epigenetic regulation and uncover synergistic effects. We developed the query function to meet the increasing interest of analyzing the regulatory proteins in the epigenomic community. In
the past, researchers intuitively utilized the overlapping method to detect the associated histone modification to a query protein, including ReMap202216 (Supplementary Table 2). However,
through the performance evaluation, we found that several factors can all affect the predicting accuracy, including the quality, peak number and peak sizes of the modifications/variants
(Supplementary Fig. 15; Supplementary Fig. 16). QHistone outperformed the overlapping method as it adopted a supervised learning model widely applied in decision-making with high accuracy41.
This method considers the weight of each genomic region and gives probabilistic outputs to speculate the association between the query and the modifications/variants42,43. Although SVM is a
common machine learning technique, it is a novel and well-suited approach in predicting the epigenome profile. QHistone learned the genome-wide marking patterns (include marking or no
marking) from the data, whereas the overlapping only considered the intersecting regions, that altogether make QHistone a better predictor. We adopted RFE in this study (Fig. 3C), to discard
less informative genomic regions. We identified regions capable of distinguishing all profiled histone modifications and variants with increased precision (Accuracy with selected regions:
99.43%, all regions: 90.74%). Surprisingly the total size of the chosen 303 regions is extremely small (0.05% of genome) (Supplementary Table 6). Hence, it could be possible to predict the
profile of query just by examining these regions through bio-chip (such as ChIP-on chip technology) with advantages in price and speed. In our study, we had trained the SVM model using these
regions. User can either predict the profile by inputting protein binding regions from ChIP-seq or the binding information gathering from the 303 critical regions. QHistone offered a robust
method in predicting the epigenome profile. Through our analyses, we found the predicting accuracy with the current data had already reached a plateau that QHistone had collected sufficient
datasets in training the model, and may not require more updates in the future. Moreover, our analyses showed that our model with a few ChIP-seq data is enough for predicting the epigenome
profile. It is therefore possible to build the QHistone on other plant species with fewer ChIP-seq data, such as rice, maize, and soybean. Additionally, the types and roles of histone
modifications are generally conserved across different plant species44,45, and more regulatory proteins with conserved domains are being discovered across species6, suggesting that most of
the results from Arabidopsis can be used to infer other plant species already. The method developed by QHistone has a strong potential to be transferred to additional plant species. In
summary, we present a platform with the largest ChIP-seq datasets of histone modifications and variants database in Arabidopsis, along with analytical tools that provide a comprehensive
epigenome profile of the input query. QHistone allows direct analysis and visualization of the action of protein query and the functional association between protein pairs, with various
histone modifications that usually required challenging experimental validations. METHODS DATA COLLECTION All ChIP-seq libraries including histone modifications, histone variants, histone
proteins, TFs, and regulatory proteins were obtained from NCBI GEO database46 with the following keywords, “_Arabidopsis thaliana_” AND “ChIP” OR “DAP”. QHistone was composed of 1534 histone
related ChIP-seq libraries and 1566 protein ChIP-seq libraries. The metadata which describes the information of each sample were collected using web crawling technique with data cleaning.
Due to the irregular or missing metadata descriptions by different researchers on GEO database, we unified the data through data cleaning and manual modification by surveying the original
publication of each sample for the purpose of gaining the most complete information. All the information was organized in the table and performed on the QHistone database
(https://qhistone.paoyang.ipmb.sinica.edu.tw/browser/). The metadata of each sample included histone modification type, age, genotype, ecotype, treatment, publication resource, containing
input control or not, and the sample id from NCBI (GSM id from GEO database and SRR id from SRA database)46,47. The raw sequencing data (sra) of each sample were downloaded from SRA using
prefetch from SRA Toolkit (version 2.11.2)47. These raw binarized sequencing data were further decompressed using pfastq-dump (version 0.1.6, retrieval data: Mar 12, 2019) with “split-files”
options which is the parallel computing version of the original fastq-dump from SRA toolkits. File from pair-end sequencing data was split into two files which namely “_1.fastq” and
“_2.fastq”, we removed the second file and retained single-end file for further ChIP analysis. PROCESSING CHROMATIN IMMUNOPRECIPITATION SEQUENCING (CHIP-SEQ) DATA All decompressed ChIP-seq
data (fastq) mentioned in the previous section were processed by a unified pipeline (https://qhistone.paoyang.ipmb.sinica.edu.tw/PeakCallingPipeline/) (see the left panel of Supplementary
Fig. 1). The adapter sequences and the low-quality reads (_Q_ < 30) in the sequencing data were removed using TrimGalore (version 0.6.4)48 which is the wrapper tool around Cutadapt
(version 2.10)49 and FastQC (version 0.11.8)50. The trimmed reads were further aligned to Arabidopsis genome TAIR10 51 using bowtie2 (version 2.2.6)52 with default parameters. We converted
SAM format file created by bowtie2 into sorted BAM format using samtools (version 1.10)53. Peak calling was performed using MACS2 (version 2.2.7.1)54 with a false discovery rate of (FDR)
< 0.05 as the criteria. The sample containing input control were processed with “-c” option. The Fraction of Reads in Peaks (FRiP) were calculated for evaluating the quality of the
ChIP-seq library. ChIP-seq samples with high quality (FRiP > 0.01) were used for further analyses. This quality cutoff is suggested by previous publication55. See Supplementary Fig. 17
for the FRiP score of histone modifications/variants in QHistone. On the other hand, to visualize the continuous distribution of ChIP reads on genome browser (IGV), BAM files were converted
into coverage files (bigWig format) using bamCoverage function in deepTools (version 3.4.3)56. The workflow of ChIP-seq analysis was wrapped into a python scripting pipeline, and it is
available on QHistone website. The exact commands and parameters are recorded in the section of Exact Commands and Parameters in Supplementary Data 3. PREDICTING THE EPIGENOME PROFILE OF A
PROTEIN QUERY The prediction of an epigenome profile for a query protein using SVM involves three phases (see center panel of Supplementary Fig. 1): In phase 1, we download and preprocess
1534 ChIP-seq datasets of histone modifications/variants from NCBI, resulting in a total of 596,343 bins for the TAIR10 genome. We represent histone modification presence in the genome
(i.e., peak locations) with a Boolean digit system, where “1” indicates a “marking” and “0” indicates “no marking.” Consequently, a matrix is created to store all Boolean digits from the
1534 histone modification datasets (i.e., 1534 ChIP-seq libraries x 598,343 bins). The final matrix of QHistone covered 1534 libraries, significantly more than the sum of the other two major
databases of Arabidopsis histone modifications15,16 (Supplementary Fig. 2). In phase 2, we implemented SVM and searched for the best SVM model with optimal parameters to cluster 1534
ChIP-seq libraries. We used the machine learning Python package scikit-learn (version 0.23.2) to implement SVM, taking the matrix from phase 1 as the input. To determine the optimal
parameters, including regularization parameter, kernel type, and kernel coefficient, we explored various parameter combinations using grid search coupled with 5-fold cross-validation. We
assessed model performance by calculating overall accuracy, F1 score, recall, and precision (Supplementary Fig. 18). In the 5-fold cross-validation, the dataset of 1534 preprocessed
ChIP-seqs is randomly split into 5 groups. The model is trained on 4 groups in each cycle, while the fifth group, containing the actual answers, is used for performance evaluation. This
cycle is repeated five times, with different groups used for testing each time, thereby providing a comprehensive evaluation of the model’s accuracy, F1 score, recall, and precision across
multiple trials. As a result, the linear SVM model with a regularization parameter C = 0.1 demonstrated the best performance in the test datasets (Supplementary Fig. 18). When implementing
SVM for clustering our data, we addressed the multi-class classification challenge of 27 histone modifications and 11 histone variants using the one-versus-rest algorithm. Among the original
1534 libraries, 292 (19.03%) libraries with low FRiP scores (FRiP < 0.01)55,57,58, specific histone proteins (e.g., H3, H1), and modifications/variants with particularly low predictive
accuracy (accuracy < 30%) were excluded. The final model, trained using 1242 eligible libraries, can accurately cluster any of these 1242 ChIP-seq libraries into the correct histone mark
groups with a 90.74% accuracy. In phase 3, to identify the histone marks most closely associated with a query protein (i.e., its epigenomic profile), we use the ChIP-seq data of the protein
(peak-called and boolean-digitized in standard BED format) as input for our final SVM model in QHistone. This model outputs probability scores for various histone modifications (i.e.,
epigenomic profile) specific to the query protein42. The probability score estimates the similarity between the peak distribution (binding pattern) of the query and the marking preferences
of the histone mark. ASSESSING SIMILARITY BETWEEN EPIGENOME PROFILES USING KULLBACK–LEIBLER DIVERGENCE AND MUTUAL INFORMATION To gain insight into the regulatory role among two queries
(proteins), we plotted their epigenome profiles together using slope chart which arranged the order of the histone modifications by ranking with probabilistic scores. The difference between
the two probabilistic distributions were calculated using Kullback–Leibler divergence (KLD), the formula of KLD is as below, $$D(P||Q)={\sum}_{x \in X}P(x){\log }_{2} \frac{P(X)}{Q(X)}$$ (1)
Where _P(x)_ is the probabilistic outputs of the first query, and Q(x) is the second query59. In addition, Mutual information (MI) is another score for estimating the amount of information
shared between the two epigenome profiles. The formula of MI is as below, $$I(X{;}Y)={\sum}_{x\in Y}{\sum}_{x\in X}{P}_{\left(X,Y\right)}(x,y)\log
\frac{{P}_{\left(X,Y\right)}(x,y)}{{P}_{\left(X\right)}(x){P}_{\left(Y\right)}(y)}$$ (2) Where X is ranking order of first query, Y is ranking order of second query, and P(X,Y) is the joint
probability mass function60. We evaluated these scores from known partners or functionally associated pairs of functional consistency and functionally opposite (Supplementary Table 7). The
functionally consistent pairs would display a low KLD score and a high MI value. In contrast, the opposite pairs would exhibit a high KLD score and a low MI value. GENE EXPRESSION ANALYSIS
As part of validating the use of an epigenome profile to predict the regulatory activity of a query protein, a gene expression analysis is implemented. By comparing the overall expression
levels of promoters (the genomic regions between the TSS and 2 kilobases upstream) overlapping with peaks from ChIP-seq of the query protein with those that do not overlap, the result can be
used to assess whether the query protein is likely to function as an activator or repressor. This analysis is performed across 42 plant tissues, with results detailed in Supplementary Table
11 along with the RNA-seq data sources. These results can be used as a validation for predictions based on epigenome profiling (See Fig. 4E SDG2 and S2Lb for examples of activating
transcription, and Supplementary Fig. 5E CLF and SWN for examples of repressive transcription). Moreover, there might be cases when a regulatory protein may change its regulatory functions
by tissues (Supplementary Fig. 6C IBM1) and may function differently between a protein pair (Supplementary Fig. 6C IBM1 and LDL1). GO ENRICHMENT ANALYSIS The protein binding regions were
functionally classified via Gene Ontology enrichment analysis (GOEA) using GOATOOLS61 with FDR < 0.05 as the threshold. In “Epigenome Profiling “, it is performed on the genes overlapping
with peaks from ChIP-seq of the query protein. In “Regulatory Activity “, which involves two query proteins, it is done on the genes overlapping with specific peaks unique to each query
protein from ChIP-seq data. It is also done on the genes overlapping with the shared peaks between the two proteins. META-ANALYSIS Meta-analyses were used for profiling the coverage of
different histone modifications and variants on specific genomic structures, e.g., protein-coding genes, TEs and enhancer regions. The marking regions of histone modification/variant from
multiple samples were merge using the merge function from BEDOPS (Version 2.4.38)62. We further calculated the peak density among the rescaled genomic structure, by normalizing the average
values of peak occurrence on 10 bp non-overlapping bins through computeMatrix function from deepTools56. The peak occurrence was binarized. In this study, “1” was defined as “marking”
whereas “0” was defined as “no marking”. These peak density files were plotted using Python script with bokeh package. The exact command of computeMatrix was included in Supplementary Data
3. ENRICHMENT ANALYSIS To assess the enrichment or depletion of histone modifications and variants on specific genomic structures, we utilized genomic structures extracted from the GTF file
of TAIR10. The specific genomic structures consist of promoter (the genomic regions between the TSS and 2 kilobases upstream), 5 prime untranslated regions (5’UTR), 3 prime untranslated
regions (3’UTR), intron, intergenic regions, enhancer regions and TEs. When considering alternative splicing for determining first exons, the initial type of splicing in the GTF is taken
into account. We compared the proportion of the specific genomic structure on the marking regions of a histone modification with the proportion of the structure among the whole genome with
the following formula. $${enrichmentscore}={\log }_{2}\left(\frac{A/B}{C/D}\right)$$ (3) where A is the length of overlapping between the specific structure and the protein binding regions,
B is the total length of protein binding regions, C is the total length of the genomic regions containing the genomic structure, and D is the genome size. The enrichment score measures the
level of enrichment of a specific histone modification in a genomic structure, normalized by the overall distribution of this structure throughout the genome. The score is log2 scaled, where
a score above zero suggests a higher presence of the histone mark at this structure compared to random, and a score below zero suggests a lower presence. REPORTING SUMMARY Further
information on research design is available in the Nature Portfolio Reporting Summary linked to this article. DATA AVAILABILITY The data that support the findings of this study are openly
available at Zenodo: https://doi.org/10.5281/zenodo.11228366. CODE AVAILABILITY Source codes are publicly available at:Zenodo: https://doi.org/10.5281/zenodo.10649327. Docker images of
peak-calling pipeline is available at: https://hub.docker.com/r/dppss90008/qhistone-pipeline. REFERENCES * Bannister, A. J. & Kouzarides, T. Regulation of chromatin by histone
modifications. _Cell Res._ 21, 381–395 (2011). Article CAS PubMed PubMed Central Google Scholar * Lusser, A., Kolle, D. & Loidl, P. Histone acetylation: lessons from the plant
kingdom. _Trends Plant Sci._ 6, 59–65 (2001). Article ADS CAS PubMed Google Scholar * Chen, M., Lv, S. & Meng, Y. Epigenetic performers in plants. _Dev. Growth Differ._ 52, 555–566
(2010). Article CAS PubMed Google Scholar * Ueda, M. & Seki, M. Histone modifications form epigenetic regulatory networks to regulate abiotic stress response. _Plant Physiol._ 182,
15–26 (2020). Article CAS PubMed Google Scholar * Liu, Y. et al. H3K4me2 functions as a repressive epigenetic mark in plants. _Epigenet. Chromatin_ 12, 40 (2019). Article Google Scholar
* Liu, C., Lu, F., Cui, X. & Cao, X. Histone methylation in higher plants. _Annu Rev. Plant Biol._ 61, 395–420 (2010). Article CAS PubMed Google Scholar * Yin X. C. et al.
H2AK121ub in Arabidopsis associates with a less accessible chromatin state at transcriptional regulation hotspots. _Nat. Commun._ 12, 315 (2021). * Xu, L. & Jiang, H. Writing and reading
histone H3 lysine 9 methylation in Arabidopsis. _Front Plant Sci._ 11, 452 (2020). Article PubMed PubMed Central Google Scholar * Jacob, Y. et al. ATXR5 and ATXR6 are H3K27
monomethyltransferases required for chromatin structure and gene silencing. _Nat. Struct. Mol. Biol._ 16, 763–768 (2009). Article CAS PubMed PubMed Central Google Scholar *
Trejo-Arellano, M. S. et al. H3K23me1 is an evolutionarily conserved histone modification associated with CG DNA methylation in Arabidopsis. _Plant J._ 90, 293–303 (2017). Article CAS
PubMed Google Scholar * Zhao, N. et al. Systematic analysis of differential H3K27me3 and H3K4me3 deposition in callus and seedling reveals the epigenetic regulatory mechanisms involved in
callus formation in rice. _Front Genet._ 11, 766 (2020). Article CAS PubMed PubMed Central Google Scholar * Kim, J. M., Sasaki, T., Ueda, M., Sako, K. & Seki, M. Chromatin changes
in response to drought, salinity, heat, and cold stresses in plants. _Front Plant Sci._ 6, 114 (2015). Article PubMed PubMed Central Google Scholar * Bartlett, A. et al. Mapping
genome-wide transcription-factor binding sites using DAP-seq. _Nat. Protoc._ 12, 1659–1672 (2017). Article CAS PubMed PubMed Central Google Scholar * Nakato, R. & Sakata, T. Methods
for ChIP-seq analysis: a practical workflow and advanced applications. _Methods_ 187, 44–53 (2021). Article CAS PubMed Google Scholar * Chow, C. N. et al. PlantPAN3.0: a new and updated
resource for reconstructing transcriptional regulatory networks from ChIP-seq experiments in plants. _Nucleic Acids Res._ 47, D1155–D1163 (2019). Article PubMed Google Scholar * Hammal,
F., de Langen, P., Bergon, A., Lopez, F. & Ballester, B. ReMap 2022: a database of human, Mouse, drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding
sequencing experiments. _Nucleic Acids Res._ 50, D316–D325 (2022). Article CAS PubMed Google Scholar * Yin, Q., Wu, M., Liu, Q., Lv, H. & Jiang, R. DeepHistone: a deep learning
approach to predicting histone modifications. _BMC Genom._ 20, 193 (2019). Article CAS Google Scholar * Kang, H., Fan, T., Wu, J., Zhu, Y. & Shen, W. H. Histone modification and
chromatin remodeling in plant response to pathogens. _Front Plant Sci._ 13, 986940 (2022). Article PubMed PubMed Central Google Scholar * Nunez-Vazquez, R., Desvoyes, B. & Gutierrez,
C. Histone variants and modifications during abiotic stress response. _Front Plant Sci._ 13, 984702 (2022). Article PubMed PubMed Central Google Scholar * Harris, C. J. & Jacobsen,
S. E. ADCP1: a novel plant H3K9me2 reader. _Cell Res._ 29, 6–7 (2019). Article PubMed Google Scholar * Zhao, S. et al. Plant HP1 protein ADCP1 links multivalent H3K9 methylation readout
to heterochromatin formation. _Cell Res._ 29, 54–66 (2019). Article CAS PubMed Google Scholar * Zhang, C. et al. Arabidopsis AGDP1 links H3K9me2 to DNA methylation in heterochromatin.
_Nat. Commun._ 9, 4547 (2018). Article ADS PubMed PubMed Central Google Scholar * Veluchamy, A. et al. LHP1 regulates H3K27me3 spreading and shapes the three-dimensional conformation of
the Arabidopsis genome. _PLoS One_ 11, e0158936 (2016). Article PubMed PubMed Central Google Scholar * Mylne, J. S. et al. LHP1, the Arabidopsis homologue of HETEROCHROMATIN PROTEIN1,
is required for epigenetic silencing of FLC. _P Natl Acad. Sci. USA_ 103, 5012–5017 (2006). Article ADS CAS Google Scholar * Hung, F. Y. et al. The Arabidopsis LDL1/2-HDA6 histone
modification complex is functionally associated with CCA1/LHY in regulation of circadian clock genes. _Nucleic Acids Res._ 46, 10669–10681 (2018). CAS PubMed PubMed Central Google Scholar
* Hung, F. Y. et al. The LDL1/2-HDA6 histone modification complex interacts with TOC1 and regulates the core circadian clock components in Arabidopsis. _Front Plant Sci._ 10, 233 (2019).
Article PubMed PubMed Central Google Scholar * Fiorucci, A. S. et al. Arabidopsis S2Lb links AtCOMPASS-like and SDG2 activity in H3K4me3 independently from histone H2B
monoubiquitination. _Genome. Biol._ 20, 100 (2019). Article PubMed PubMed Central Google Scholar * Deng, W. et al. Arabidopsis polycomb repressive complex 2 binding sites contain
putative GAGA factor binding motifs within coding regions of genes. _BMC Genom._ 14, 593 (2013). Article CAS Google Scholar * Shu, J. et al. Genome-wide occupancy of histone H3K27
methyltransferases CURLY LEAF and SWINGER in Arabidopsis seedlings. _Plant Direct_ 3, e00100 (2019). Article PubMed PubMed Central Google Scholar * Jiang, D., Wang, Y., Wang, Y. &
He, Y. Repression of FLOWERING LOCUS C and FLOWERING LOCUS T by the Arabidopsis Polycomb repressive complex 2 components. _PLoS One_ 3, e3404 (2008). Article ADS PubMed PubMed Central
Google Scholar * Inagaki, S. et al. Gene-body chromatin modification dynamics mediate epigenome differentiation in Arabidopsis. _EMBO J._ 36, 970–980 (2017). Article CAS PubMed PubMed
Central Google Scholar * Kralemann, L. E. M. et al. Removal of H2Aub1 by ubiquitin-specific proteases 12 and 13 is required for stable Polycomb-mediated gene repression in Arabidopsis.
_Genome. Biol._ 21, 144 (2020). Article CAS PubMed PubMed Central Google Scholar * Ding, Y. & Wilkins, D. Improving the performance of SVM-RFE to select genes in microarray data.
_BMC Bioinforma._ 7, S12 (2006). Article Google Scholar * Yang, Z. et al. EBS is a bivalent histone reader that regulates floral phase transition in Arabidopsis. _Nat. Genet_. 50,
1247–1253 (2018). Article CAS PubMed PubMed Central Google Scholar * Martignago, D. et al. The four FAD-dependent histone demethylases of Arabidopsis are differently involved in the
control of flowering time. _Front Plant Sci._ 10, 669 (2019). Article PubMed PubMed Central Google Scholar * Spedaletti, V. et al. Characterization of a lysine-specific histone
demethylase from Arabidopsis thaliana. _Biochemistry_ 47, 4936–4947 (2008). Article CAS PubMed Google Scholar * Engelhorn J. et al. Dynamics of H3K4me3 chromatin marks prevails over
H3K27me3 for gene regulation during flower morphogenesis in Arabidopsis thaliana. _Epigenomes_ 1, 8 (2017). * Pajoro, A. et al. Dynamics of chromatin accessibility and gene regulation by
MADS-domain transcription factors in flower development. _Genome. Biol._ 15, R41 (2014). Article PubMed PubMed Central Google Scholar * Chen, C. et al. RNA polymerase II-independent
recruitment of SPT6L at transcription start sites in Arabidopsis. _Nucleic Acids Res._ 47, 6714–6725 (2019). Article ADS CAS PubMed PubMed Central Google Scholar * Lu, P. et al. Genome
encode analyses reveal the basis of convergent evolution of fleshy fruit ripening. _Nat. Plants_ 4, 784–791 (2018). Article CAS PubMed Google Scholar * Uddin, S., Khan, A., Hossain, M.
E. & Moni, M. A. Comparing different supervised machine learning algorithms for disease prediction. _BMC Med Inf. Decis. Mak._ 19, 281 (2019). Article Google Scholar * Platt J. C.
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In, _Advances in Large Margin Classifiers_. (ed. Alexander, J. S.) 412 (MIT Press,1999).
* Wu, T.-F., Lin, C.-J. & Weng, R. C. Probability estimates for multi-class classification by pairwise coupling. _J. Mach. Learn Res_. 5, 975–1005 (2004). MathSciNet Google Scholar *
Deal, R. B. & Henikoff, S. Histone variants and modifications in plant gene regulation. _Curr. Opin. Plant Biol._ 14, 116–122 (2011). Article CAS PubMed Google Scholar * Fuchs, J.,
Demidov, D., Houben, A. & Schubert, I. Chromosomal histone modification patterns–from conservation to diversity. _Trends Plant Sci._ 11, 199–208 (2006). Article CAS PubMed Google
Scholar * Barrett, T. et al. NCBI GEO: archive for functional genomics data sets–update. _Nucleic Acids Res._ 41, D991–D995 (2013). Article CAS PubMed Google Scholar * Leinonen, R.,
Sugawara, H. & Shumway, M. International nucleotide sequence database C. The sequence read archive. _Nucleic Acids Res._ 39, D19–D21 (2011). Article CAS PubMed Google Scholar *
Krueger F. _Trim Galore._ https://github.com/FelixKrueger/TrimGalore (2015). * Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. _EMBnet J._ 17, 10–12
(2011). Article Google Scholar * Andrews S. _FastQC: a Quality Control Tool for High Throughput Sequence Dat_a. http://www.bioinformatics.babraham.ac.uk/projects/fastqc (2010). *
Berardini, T. Z. et al. The Arabidopsis information resource: making and mining the “gold standard” annotated reference plant genome. _Genesis_ 53, 474–485 (2015). Article CAS PubMed
PubMed Central Google Scholar * Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with bowtie 2. _Nat. Methods_ 9, 357–359 (2012). Article CAS PubMed PubMed Central Google
Scholar * Li, H. et al. The sequence alignment/Map format and SAMtools. _Bioinformatics_ 25, 2078–2079 (2009). Article PubMed PubMed Central Google Scholar * Zhang, Y. et al.
Model-based analysis of ChIP-Seq (MACS). _Genome Biol._ 9, R137 (2008). Article PubMed PubMed Central Google Scholar * Qin, Q. et al. ChiLin: a comprehensive ChIP-seq and DNase-seq
quality control and analysis pipeline. _BMC Bioinforma._ 17, 404 (2016). Article Google Scholar * Ramirez, F. et al. deepTools2: a next generation web server for deep-sequencing data
analysis. _Nucleic Acids Res._ 44, W160–W165 (2016). Article CAS PubMed PubMed Central Google Scholar * Landt, S. G. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE
consortia. _Genome Res._ 22, 1813–1831 (2012). Article CAS PubMed PubMed Central Google Scholar * Keller C. A. et al. Effects of sheared chromatin length on ChIP-seq quality and
sensitivity. _G3 (Bethesda__)_ 11, jkab101 (2021). * Ji, S. et al. Kullback-leibler divergence metric learning. _IEEE Trans. Cyber._ 52, 2047–2058 (2022). Article Google Scholar * Ross, B.
C. Mutual information between discrete and continuous data sets. _PLoS One_ 9, e87357 (2014). Article ADS PubMed PubMed Central Google Scholar * Klopfenstein, D. V. et al. GOATOOLS: A
python library for gene ontology analyses. _Sci. Rep._ 8, 10872 (2018). Article ADS CAS PubMed PubMed Central Google Scholar * Neph, S. et al. BEDOPS: high-performance genomic feature
operations. _Bioinformatics_ 28, 1919–1920 (2012). Article CAS PubMed PubMed Central Google Scholar Download references ACKNOWLEDGEMENTS We thank Chiao-Yu Lyra Sheu for the preliminary
data analysis, Nien Yang for the art painting of Arabidopsis. We also thank Zheng-Zhong Huang and Jimmy Lin from the IPMB IT/Network service for the IT support. This work was supported by
grants from Academia Sinica (NTU-AS Innovative Joint Program (AS-NTU-112-12)), and National Science and Technology Council, Taiwan 109-2313-B-001-009-MY3 and 111 −2927-I-001 −505 and
112-2311-B-001 −007 to P.-Y.C. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Institute of Plant and Microbial Biology, Academia Sinica, Taipei, 115201, Taiwan Chih-Hung Hsieh, Ya-Ting
Sabrina Chang, Ming-Ren Yen, Jo-Wei Allison Hsieh & Pao-Yang Chen Authors * Chih-Hung Hsieh View author publications You can also search for this author inPubMed Google Scholar * Ya-Ting
Sabrina Chang View author publications You can also search for this author inPubMed Google Scholar * Ming-Ren Yen View author publications You can also search for this author inPubMed
Google Scholar * Jo-Wei Allison Hsieh View author publications You can also search for this author inPubMed Google Scholar * Pao-Yang Chen View author publications You can also search for
this author inPubMed Google Scholar CONTRIBUTIONS P.-Y.C. conceived the project. C.-H.H. majorly carried out the data analysis and developed the website. Y.-T.S.C. and M.-R.Y. also
contribute to the data analysis. C.-H.H., J.-W.A.H., and P.-Y.C. wrote and edited the paper with inputs from all authors. CORRESPONDING AUTHOR Correspondence to Pao-Yang Chen. ETHICS
DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. PEER REVIEW PEER REVIEW INFORMATION _Nature Communications_ thanks Michihiro Araki, Marek Mutwil and the other,
anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with
regard to jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION PEER REVIEW FILE DESCRIPTION OF ADDITIONAL SUPPLEMENTARY
FILES SUPPLEMENTARY DATA 1 SUPPLEMENTARY DATA 2 SUPPLEMENTARY DATA 3 REPORTING SUMMARY RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission
under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by-nc-nd/4.0/. Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Hsieh, CH., Chang, YT.S., Yen, MR. _et al._ Predicting protein synergistic
effect in Arabidopsis using epigenome profiling. _Nat Commun_ 15, 9160 (2024). https://doi.org/10.1038/s41467-024-53565-y Download citation * Received: 05 September 2023 * Accepted: 15
October 2024 * Published: 24 October 2024 * DOI: https://doi.org/10.1038/s41467-024-53565-y SHARE THIS ARTICLE Anyone you share the following link with will be able to read this content: Get
shareable link Sorry, a shareable link is not currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative