Predicting protein synergistic effect in arabidopsis using epigenome profiling

Predicting protein synergistic effect in arabidopsis using epigenome profiling

Play all audios:

Loading...

ABSTRACT Histone modifications can regulate transcription epigenetically by marking specific genomic loci, which can be mapped using chromatin immunoprecipitation sequencing (ChIP-seq). Here


we present QHistone, a predictive database of 1534 ChIP-seqs from 27 histone modifications in Arabidopsis, offering three key functionalities. Firstly, QHistone employs machine learning to


predict the epigenomic profile of a query protein, characterized by its most associated histone modifications, and uses these modifications to infer the protein’s role in transcriptional


regulation. Secondly, it predicts synergistic regulatory activities between two proteins by comparing their profiles. Lastly, it detects previously unexplored co-regulating protein pairs by


screening all known proteins. QHistone accurately identifies histone modifications associated with specific known proteins, and allows users to computationally validate their results using


gene expression data from various plant tissues. These functions demonstrate an useful approach to utilizing epigenome data for gene regulation analysis, making QHistone a valuable resource


for the scientific community (https://qhistone.paoyang.ipmb.sinica.edu.tw). SIMILAR CONTENT BEING VIEWED BY OTHERS GENOME-WIDE IDENTIFICATION OF ACCESSIBLE CHROMATIN REGIONS BY ATAC-SEQ UPON


INDUCTION OF THE TRANSCRIPTION FACTOR BZIP11 IN ARABIDOPSIS Article Open access 27 July 2023 GLOBAL PROFILING OF RNA–CHROMATIN INTERACTIONS REVEALS CO-REGULATORY GENE EXPRESSION NETWORKS IN


_ARABIDOPSIS_ Article 14 October 2021 ACHIP IS AN EFFICIENT AND SENSITIVE CHIP-SEQ TECHNIQUE FOR ECONOMICALLY IMPORTANT PLANT ORGANS Article 23 August 2024 INTRODUCTION Histone


modifications such as methylation, acetylation, phosphorylation, and ubiquitylation are known to affect transcription commonly through altering chromatin status1. For example, acetylated


histones are often associated with activating transcription by reducing the positive charges on histones, thus allowing proteins to gain access to the DNA2,3,4,5. Conversely, methylation can


change the overall charge of the histone proteins, affecting their interaction with the negatively charged DNA, hence influencing the interaction dynamics between histones and DNA, that


ultimately impacts the chromatin compaction2,3,4,5. For example, histone 3 lysine 9 dimethylation (H3K9me2) and histone 3 lysine 27 monomethylation (H3K27me1) are known to be enriched in


constitutive silenced heterochromatin, and repressively regulate transcription due to the dense structure of heterochromatin4,5,6,7,8,9,10 (See Supplementary Table 1 for a list of histone


modifications that are often active or repressive). In plants, histone modifications may affect the strength of histone binding to DNA, and such influence is dependent on the plant


conditions including stresses, developmental stages, tissues and genotypes11,12. Overall, the types of histone modifications and their specific preferences for marking locations on the


genome can provide valuable insights into transcriptional regulation, and these preferences are sensitive to plant conditions. Chromatin immunoprecipitation followed by sequencing (ChIP-seq)


and DNA affinity purification sequencing (DAP-seq) are commonly used high-throughput methods for profiling the genome-wide binding between genomic DNA and proteins of interests13,14. A


couple of databases have been built to collect ChIP-seqs in Arabidopsis (Supplementary table 2). PlantPAN3.0 included 54 transcription factors (TFs) and 19 types of histone modifications and


variants into their database, and majorly investigated _cis-_ and _trans-_regulatory motifs in plant promoters by transcription factor binding site (TFBS)15. ReMap2022 integrated the


ChIP-seq data from 423 TFs and 33 types of histone modifications and histone variants in Arabidopsis16. It enables the identification of histone marks enriched in the genomic binding regions


of a query protein. The enrichment of a specific histone mark to a query protein is estimated by the overlap percentage for each histone mark, which is determined by the ratio of the number


of overlapping peaks to the number of total peaks from the protein binding regions. Higher percentages indicate a stronger association with the protein. These databases did not investigate


the plant conditions in relation to the histone modification data, nor did they use rigorous statistical testing or the latest machine learning techniques in approaching the epigenomic


regulation, with machine learning still limited in predicting histone modifications at specific loci17. The regulatory activities of proteins, such as activation or repression, are often


associated with histone modifications4. For instance, the binding sites of an activator often coincide with the locations marked by histone acetylation. Hence, the set of histone


modifications that typically mark the binding locations of a specific regulatory protein can be used to assess the protein’s regulatory function, and this unique set of histone modifications


can be considered as the protein’s epigenome profile. Depending on the regulatory protein, the associated epigenetic marks in the epigenome profile can be activating, repressive, or a


combination, reflecting the intricate regulatory dynamics involved in the biological process4,18,19. AGENET DOMAIN (AGD)-CONTAINING PROTEIN 1 (ADCP1) targets transposable elements (TEs) with


abundant heterochromatic modifications such as H3K9me2, and DNA methylation, resulting in transcriptional silencing20,21,22. Loss of function of _ADCP1_ gene largely increases the


proportion of decondensed nuclei, confirming its role in heterochromatin maintenance21. LIKE HETEROCHROMATIN PROTEIN 1 (LHP1) mediates the spreading of trimethylation of histone H3 lysine 27


(H3K27me3), and requires for maintaining the repression of vernalization-inducing gene _FLOWERING LOCUS C_ (_FLC_)23,24. Therefore, by examining the set of epigenetic marks frequently


observed at the binding locations of the regulatory proteins (i.e., its epigenome profile), it is possible to predict their regulatory activity. Furthermore, regulatory proteins may


collaborate to accomplish their regulatory functions. They may both carry out similar functions to activate or suppress the expression of target genes. Alternatively, they can have


contrasting roles in their transcriptional control. For example, HISTONE DEACETYLASE 6 (HDA6) collaborates with histone demethylases LYSINE-SPECIFIC DEMETHYLASE 1-LIKE 1 and 2 (LDL1/2) to


jointly suppress the expression of genes and TEs25,26. SET DOMAIN GROUP 2 (SDG2) interacts with SWD2-like b (S2Lb), and together they directly affect the enrichment of histone 3, lysine 4


trimethylation (H3K4me3) on highly transcribed genes27. POLYCOMB REPRESSIVE COMPLEX 2 (PRC2) catalyzes H3K27me3, and maintains repressed state of target genes28,29,30. Conversely, INCREASE


IN BONSAI METHYLATION 1 (IBM1) and LDL2 have opposing functions in regulating transcription. A loss of IBM1 function results in the ectopic accumulation of H3K9me2, a repressive mark, across


gene bodies. On the other hand, the _ldl2_ mutation can mediate the repressive state induced by IBM1 by increasing the presence of histone 3, lysine 4 monomethylation (H3K4me1), an active


mark31. Another example includes RELATIVE OF EARLY FLOWERING 6 (REF6) and LHP1, where REF6 can alter the repressive state maintained by LHP1, enabling gene reactivation through the


repressive modification histone H2A monoubiquitination (H2Aub1)32. Altogether, these examples suggest that by evaluating the similarity in their epigenomic profiles, it is possible to


determine whether two regulatory proteins have similar or opposing functions in regulation. Here, we have developed QHistone, a predictive database for histone modifications in Arabidopsis


(https://qhistone.paoyang.ipmb.sinica.edu.tw). It first predicts a protein’s epigenomic profile, which can be used to assess its regulatory activity. Using a machine learning model trained


on numerous ChIP-seq datasets of various histone marks, QHistone can accurately cluster a regulatory protein with its closely associated histone modifications. In addition, between two


regulatory proteins, QHistone compares epigenome profiles between proteins to offer assessment of their possible synergistic regulation. We also conducted pairwise comparisons of epigenome


profiles for 616 proteins with available ChIP-seq data from NCBI, identifying functionally linked protein pairs that were previously unexplored. Our predictions were validated by comparing


them to known regulatory proteins that associate with specific histone modifications, as well as protein pairs known to have similar or dissimilar activities, as mentioned above. QHistone


offers gene expression analysis tools that allow users to computationally validate their results and provides an overview of histone modifications linked to various tissues, stress


responses, and genotypes. By integrating comprehensive epigenomic datasets with bioinformatic analysis, QHistone offers new perspectives on gene regulation in plant epigenomics research.


RESULTS OVERVIEW OF QHISTONE AS A PREDICTING DATABASE QHistone is a web database of a large number of ChIP-seqs from histone modifications and histone variants in Arabidopsis, accompanied by


analytical functions (see Fig. 1 for an overview of key functionalities, and Supplementary Fig. 1 for methods and analyses). Totally, there are 1534 ChIP-seqs from 27 types of histone


modifications, 11 types of histone variants, and 4 types of histone proteins collected from Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) (see Methods) (Fig. 2A;


Supplementary Fig. 2). The database also includes information of plant conditions (Fig. 2B; Supplementary Table 3). These data were processed through a unified peak-calling pipeline (see


Methods for details of the scripts), and can be browsed, downloaded, and visualized on the website (Supplementary Fig. 3). The sections in Results are organized to first introduce the three


major functionalities, including their concepts, methods and validations, followed by genome-wide analysis of histone modifications, the analysis with plant conditions, and performance


evaluation. EPIGENOME PROFILING OF A QUERY PROTEIN The first functionality in QHistone is to predict the epigenome profile of a protein query (see the left panel of Fig. 1), which allows the


inference of potential transcriptional regulation. To establish the epigenomic profiles of a query protein, we first implemented a Support Vector Machine (SVM) model to recognize all the


ChIP-seq datasets of histone modifications (Supplementary Fig. 1; Supplementary Data 1 for Selection of Machine Learning Model and Supplementary Data 2 for Formulas of Evaluation Metrics).


The final model achieved an accuracy of 90.74% (see Fig. 3A for receiver operating characteristic (ROC) analysis; Fig. 3B shows the predicting accuracy of each modification/variant;


Supplementary Table 4). In other words, QHistone can precisely cluster a query protein to its most associated histone modifications (see Methods for step-by-step procedures of model


training, testing, optimization, and evaluation). Once the epigenome profile is predicted, the regulatory role of the top histone modifications may be indicative of the query protein’s role


in transcriptional regulation. The overlapping percentage method used in ReMap2022 is conceptually similar to our approach16. We referred to the methodology described in ReMap2022 and


evaluated our prediction accuracy against this method. This method first merges all peaks discovered from each histone mark and then sequentially calculates the overlapping percentage


between a query and each histone mark. The mark with the highest overlapping percentage indicates the strongest association with the query. Overall, our method is 14.5% more accurate than


the overlapping percentage method (Supplementary Fig. 4) and demonstrates greater accuracy in most individual modifications/variants (Fig. 3B). Moreover, we identified a small set of genomic


regions that can predict the epigenome profile with similar or higher accuracy than using the entire genomic regions33. This was achieved by adopting feature selection using the recursive


feature elimination (RFE) technique to filter out less informative genomic regions in the prediction. Consequently, a set of 303 specific genomic regions of 200 bp (0.05% of the genome) can


reach 99.43% accuracy in leave-one-out cross-validation (Fig. 3C; Supplementary Table 5 and Supplementary Table 6). These critical regions were enriched in the 5’UTR and exon in gene


structure, where histone marks are frequently targeted and observed, suggesting that these regions are crucial in gene regulation through dynamic marking of histone modifications (Fig. 3D).


In QHistone, users can select either all regions or these 303 critical regions for their prediction. To test our models, we predicted the epigenome profiles of two known regulatory proteins,


ADCP1 and EARLY BOLTING IN SHORT DAY (EBS). ADCP1 is a heterochromatin-binding protein that has been reported to specifically recognize H3K9me2 modification, and positively associates with


two heterochromatic modifications H3K9me2 and H3K27me120,21,22. Both are recognized as repressive marks. Our analysis of ADCP1 indicates that the most closely associated histone marks in its


epigenome profile are also H3K9me2 and H3K27me1. This suggests a strong similarity between ADCP1’s binding preference on the genome and the distribution of these two marks (Fig. 3E), that


our prediction corroborates with the previous reports10. The fact that the top four marks associated with ADCP1 are all repressive further suggests that ADCP1’s regulatory role is


predominantly repressive. On the other hand, EBS is a bivalent histone regulatory protein that respectively recognizes H3K4me3, and H3K27me3. EBS without c-terminal loop (EBSΔC) binds


H3K4me3 with a higher affinity than unmodified EBS protein34. In Fig. 3F, our predictions show that the likelihood of H3K4me3 marking to EBSΔC is 2.3 times greater than to EBS, confirming a


stronger association between EBSΔC and H3K4me3. In both instances, QHistone accurately predicts the primary histone marks associated with the regulatory proteins that are critical to their


functions. COMPARING SVM WITH OTHER SUPERVISED MACHINE LEARNING MODELS We compared the SVM with eight other supervised machine learning algorithms, including k-Nearest Neighbors (KNN),


Gaussian Naive Bayes (GNB), Decision Trees (Trees), Random Forest, Nearest Centroid, Stochastic Gradient Descent (SGD), Multi-Layer Perceptron (MLP), and XGBoost. The goal was to determine


whether SVM is the best model for QHistone (see Supplementary Data 1 for Selection of Machine Learning Model). To ensure optimal performance, the hyperparameters of KNN, GNB, Trees, Random


Forest, and Nearest Centroid were tuned through grid search cross-validation. MLP, SGD, and XGBoost were tuned using randomized search cross-validation, selecting the best fit from 100


randomly chosen hyperparameter sets, given a larger space of parameters these algorithms involve. Overall, our results showed that SVM achieved highest predicting accuracy, F1 score,


precision, and recall rate, hence SVM was selected for QHistone. PREDICTING REGULATORY INTERACTIONS BETWEEN PROTEIN PAIRS THROUGH EPIGENOME PROFILING When two regulatory proteins share very


similar epigenome profiles, they are likely to have similar preferences for the binding locations on the genome, and associate with similar sets of histone marks. Altogether, this may


suggest similar regulatory activities such as activation and repression. Conversely, regulatory proteins with very different epigenome profiles may suggest their different roles in


regulating expression (see the center panel of Fig. 1). To assess the similarity of the epigenome profiles between two proteins, both Kullback–Leibler divergence (KLD) and Mutual information


(MI) (see Methods) are employed. KLD measures the dissimilarity between the two profiles, whereas MI captures the amount of information shared between them. Therefore, two similar epigenome


profiles tend to show a low KLD and a high MI; suggesting likelihood of functional consistency between the two proteins. In fact, some proteins with consistent regulation by similar


epigenomic profiles demonstrate a synergistic effect, while others with opposing regulation may have an antagonistic effect. We validated our prediction of functional consistency on known


protein pairs with specific histone modification associated their complexes. In this case, SDG2 co-regulates with S2Lb, directly influencing the enrichment of H3K4me327. QHistone accurately


predicted the epigenome profiles of these proteins (Fig. 4A), and displayed the probabilities ranked by a slope chart (Fig. 4B show the ranking order of each histone modification/variant


respectively predicted from two proteins). Both proteins exhibited a similar ranking of histone modifications, indicating their possible functional consistency. We further assessed the


differences in epigenome profiles using KLD and MI. Both low KLD and high MI indicated similar epigenome profiles (KLD = 0.026; MI = 1.132; Supplementary Table 7). These proteins also shared


similar binding regions, predominantly at promoters and enhancers, underscoring their similar roles in gene regulation (Fig. 4C, D). Additionally, across multiple tissues in Arabidopsis,


both SDG2 and S2Lb consistently showed a tendency to upregulate gene expression (Fig. 4E), as a function in QHistone to assist computational validation of the regulatory activity (refer to


Methods for gene expression analysis). This further supports the predicted functional consistency. Lastly, Gene Ontology (GO) analysis in QHistone highlighted biological functions; genes in


overlapping regions were enriched in root development functions, aligning with previous phenotypic observations27 (Supplementary Table 8). Lastly, QHistone also successfully validated PRC2


complex with its two known members CURLY LEAF (CLF) and SWINGER (SWN); suggesting our tools could accurately indicate functional consistency (Supplementary Table 7 and Supplementary Table 9;


Supplementary Fig. 5). Conversely, two proteins might be functionally opposite within a regulatory complex. For example, IBM1 is known to activate gene expression by removing H3K9me2, while


LDL2 represses gene expression by removing H3K4me131. Since the ChIP-seq data for LDL2 is not publicly available, we utilized LDL1 data as a proxy because LDL1 and LDL2 are structurally


similar35,36 and have shown functional redundancy in epigenetic regulation25. By analyzing their epigenome profiles, we discovered that IBM1 had a lower association with H3K9me2 (Probability


 = 0.01), and LDL1 had a lesser association with H3K4me1 (Probability < 0.01), likely due to their demethylation activities (Supplementary Fig. 6A). Interestingly, the epigenome profiles


of these proteins are distinctly different (KLD = 1.387; MI = 0.000) (Supplementary Table 7; Supplementary Fig. 6), indicating that IBM1 is primarily associated with active modifications,


whereas LDL1 is linked to repressive modifications. Overall, these results support the notion that IBM1 and LDL2 are functionally distinct, consistent with their known opposing roles in gene


regulation. In brief, QHistone is able to predict the regulatory activity among two regulatory proteins and summarize their effect on transcription by comparing their epigenome profiles.


Additionally, these predictions can be further evaluated through gene expression analysis. This functionality offers a novel perspective for studying the regulatory interactions between


protein pairs, beyond just their binding partners. QUERY AGAINST ALL ARABIDOPSIS PROTEINS FOR PUTATIVE PROTEIN PARTNERS OR FUNCTIONALLY ASSOCIATED PAIRS QHistone has developed epigenome


profiles for 616 Arabidopsis proteins, with their ChIP-seq data accessible in the GEO database (Supplementary Fig. 7). Users can upload their own proteins (preprocessed in standard BED


format) to query against these proteins in the database. This facilitates the discovery of previously unknown protein partners and functionally associated protein pairs (see the right panel


of Fig. 1). Amongst these 616 proteins, 1566 ChIP-seq libraries are present. To discover potential protein partners or functionally associated pairs within these libraries, we conducted 2.4 


M pairwise comparisons of their epigenome profiles. The outcomes are depicted based on their KLD scores, _MI_ values, and the overlapping percentage of binding locations between two proteins


to aid in evaluating the functional consistency of these pairs (Fig. 5). Since there are no standard cutoffs for these scores, we selected four known protein partners or functionally


associated protein pairs—HDA6/LDL1, SDG2/S2Lb, CLF/SWN (PRC2 complex), and CLF/FERTILIZATION-INDEPENDENT ENDOSPERM (FIE) (PRC2 complex)—as reference points for functional consistency.


Additionally, two pairs, IBM1/LDL1 and REF6/LHP1, were identified as functionally opposite (Fig. 5; Supplementary Table 7). Among the 2.4 million combinations, we easily identified


unexplored protein pairs by comparing to these known protein pairs. For instance, SEPALLATA3 (SEP3) and APETALA1 (AP1) demonstrated strong functional consistency, evidenced by their low KLD


close to 0 and a high MI of 1.229 (Fig. 5; Supplementary Table 7; Supplementary Fig. 8A). Our literature review revealed that both are MADS-domain TFs, co-localizing in the same genomic


region, and they can both influence chromatin accessibility to regulate gene expression37,38. This finding aligns well with our predictions. On the other hand, ADCP1 and SUPPRESSOR OF TY


INSERTION 6 (SPT6) appeared as a pair with functionally opposite roles, exhibiting a high KLD of 1.383 and a low MI close to 0 (Fig. 5; Supplementary Table 7; Supplementary Fig. 8B). ADCP1,


a heterochromatin binding protein, plays a crucial role in transcriptional silencing20,21,22, while SPT6, a transcription elongation factor, co-occupies with RNAPII across highly transcribed


genes39, suggesting these two proteins may actually have opposing functions in transcription regulation. In brief, QHistone allows users upload their own query protein to query against


known proteins to identify regulatory interactions with proteins that have not yet been discovered. The results from the 2.4 M epigenome profile comparison are highly informative, enabling


exploration of previously unexplored regulatory interactions among know proteins and providing new research insights. GENOME-WIDE ANALYSIS OF HISTONE MODIFICATIONS WITH PLANT CONDITIONS We


studied a large number of histone modifications under different plant conditions (like tissues, genotypes, and stresses) using QHistone (see Fig. 2; Supplementary Table 3). Our goal was to


understand how these conditions affect the patterns of marking locations of histone modifications across the plant’s genome. QHistone is able to employ three visualization


methods—t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and multidimensional scaling (MDS)—to show these patterns (Fig. 6A; Supplementary Fig. 9). As


shown in the t-SNE plot, we discovered that these patterns mainly group together based on the type of histone modification, which suggests that the types of modifications are key in


controlling how histones attach to the genome. In the same t-SNE plot (Fig. 6A), plant conditions did not appear to be a primary factor influencing the marking patterns of histone


modifications. When we plotted different tissue types of the same histone modification H3K27me3, we discovered that the marking locations could be clustered based on the tissue of origin


(Fig. 6B). Furthermore, with hierarchical clustering of histone modification libraries under different plant conditions, we observed that within the histone acetylation group, there are


subgroups combining stresses and tissues. This suggests that plant conditions could be a secondary factor (Supplementary Fig. 10) impacting how histone changes are distributed across the


genome. To incorporate the information of plant conditions, QHistone is able to accurately map the binding patterns of query proteins within a t-SNE plot, alongside their corresponding


histone marks from ChIP-seq datasets across various states of plant conditions. For example, LHP1, which interacts with PRC1/2 to regulate H3K27me3 spread23, is known to have differential


H3K27me3 marking on the genome in different tissues40. While in the example the LHP1 protein binding data was originally sampled from the 14 day seedling with non-stress condition, in the


t-SNE plot by QHistone (Supplementary Table 10; Supplementary Fig. 11). It is clear that LHP1’s binding pattern closely resembles that of H3K27me3 in similar conditions for 10–14 day-old


seedlings, diverging significantly from other tissues (e.g., embryo, protoplast, sperm) or in cases of functional loss mutations. This example demonstrates that QHistone is able to precisely


link LHP1’s binding with the specific histone modifications and plant conditions, enabling detailed exploration of plant conditions. META-ANALYSIS OF HISTONE MODIFICATIONS AT VARIOUS


GENOMIC ELEMENTS To gain insight on the marking pattern of active versus repressive histone modifications on specific genomic structures, we profiled the coverage of histone modifications


around genes, TEs, and enhancers using meta-analysis (Fig. 6C; Supplementary Fig. 12; Supplementary Fig. 13; Supplementary Fig. 14). Histone acetylation marks showed similar pattern which


the marking crest was around the transcriptional start site (TSS), whereas most of the repressive modifications were depleted in the 5’UTR (Fig. 6D). Additionally, we calculated the overlap


percentage of marked regions between pairs of histone modifications, as illustrated in a heatmap (Fig. 6E). The results revealed that active and repressive histone modifications tend to


co-localize within the same genomic regions, suggesting their coordinated regulation. PERFORMANCE EVALUATION IN ASSOCIATING HISTONE MODIFICATIONS TO A QUERY We evaluated the performance of


QHistone by assessing the prediction accuracy and the reliability of the method. While there are currently no similar databases in any species that have implemented machine learning to


identify the associated histone marks for a query protein, we used the intuitive calculation of overlapping percentage between the marking locations of histone modifications and the binding


sites of query proteins (from ChIP-seq datasets) implemented in ReMap2022 for comparison. In ReMap2022, the modification/variant with a high overlapping percentage to the query was predicted


to have a high association. As a result, QHistone showed a higher prediction accuracy through the 5-fold cross-validation (QHistone: 90.74%; Overlapping: 76.20%) (Supplementary Fig. 4).


When evaluating the prediction accuracy of each modification/variant, QHistone showed better performance in 29 modifications/variants out of 43 by at least 5% (Fig. 3B). To investigate the


factors contributing to QHistone’s higher accuracy, we focused on the marking properties (ChIP quality, peak number, and peak size) of each ChIP-seq. As shown in Supplementary Fig. 15 and


Supplementary Fig. 16, we found that ChIP-seq data with high quality, more peaks, and larger peak sizes are considered easy-to-predict. Conversely, data with low quality, fewer peaks, and


smaller peak sizes are considered difficult-to-predict. While both QHistone and the Overlapping methods showed high accuracy in predicting easy-to-predict modifications/variants, QHistone


appears to outperform the overlapping method in predicting difficult-to-predict modifications/variants, as indicated by fewer such cases. This suggests that QHistone performs better in


challenging scenarios. Lastly, as some histone modifications appear to have more ChIP-seq datasets, the collections of each modification/variant in QHistone were therefore unbalanced (Fig. 


2A). To examine if our predictions are influenced by such unbalanced data, we trained the SVM models with gradually increasing size of libraries of each modification/variant until the


largest number possible (Supplementary Fig. 15E) and monitored the prediction accuracy. As a result, QHistone achieves the highest accuracy level with fewer training libraries compared to


Overlapping (the highest accuracy is achieved with 38 training datasets for QHistone and 45 for Overlapping). Furthermore, as the accuracy level does not significantly change when more


training data is added, adding unbalanced data does not easily influence the model’s performance. Moreover, since the accuracy of our model has already reached a plateau with 30-40 libraries


per modification, suggesting sufficient libraries might have been collected already in the current QHistone to obtain optimal model parameters; therefore, further updates with more


libraries may not be necessary. DISCUSSION In this study, we developed QHistone which is more than a database storing raw sequencing data. QHistone equipped processed resources and devised


analytical tools that could potentially benefit the research community. QHistone has currently the largest ChIP-seq collection of histone modifications and variants in Arabidopsis


(Supplementary Fig. 2). With the user-friendly interface, users can easily search the sequencing data of their interest such as specific genotype, stress, or tissue, intuitively visualize


the data, and directly download them from QHistone (Supplementary Fig. 3). QHistone also provides a peak-calling pipeline, which is available as an easy-to-use Python script, allowing users


to consistently process their new histone data. Additionally, QHistone has the ability to explore the association between queries and histone modifications, predict the regulatory role


between two queries, and identify unexplored protein pairs. These unique features set QHistone apart from other current databases, enabling it to answer novel questions in the field of


epigenetic regulation and uncover synergistic effects. We developed the query function to meet the increasing interest of analyzing the regulatory proteins in the epigenomic community. In


the past, researchers intuitively utilized the overlapping method to detect the associated histone modification to a query protein, including ReMap202216 (Supplementary Table 2). However,


through the performance evaluation, we found that several factors can all affect the predicting accuracy, including the quality, peak number and peak sizes of the modifications/variants


(Supplementary Fig. 15; Supplementary Fig. 16). QHistone outperformed the overlapping method as it adopted a supervised learning model widely applied in decision-making with high accuracy41.


This method considers the weight of each genomic region and gives probabilistic outputs to speculate the association between the query and the modifications/variants42,43. Although SVM is a


common machine learning technique, it is a novel and well-suited approach in predicting the epigenome profile. QHistone learned the genome-wide marking patterns (include marking or no


marking) from the data, whereas the overlapping only considered the intersecting regions, that altogether make QHistone a better predictor. We adopted RFE in this study (Fig. 3C), to discard


less informative genomic regions. We identified regions capable of distinguishing all profiled histone modifications and variants with increased precision (Accuracy with selected regions:


99.43%, all regions: 90.74%). Surprisingly the total size of the chosen 303 regions is extremely small (0.05% of genome) (Supplementary Table 6). Hence, it could be possible to predict the


profile of query just by examining these regions through bio-chip (such as ChIP-on chip technology) with advantages in price and speed. In our study, we had trained the SVM model using these


regions. User can either predict the profile by inputting protein binding regions from ChIP-seq or the binding information gathering from the 303 critical regions. QHistone offered a robust


method in predicting the epigenome profile. Through our analyses, we found the predicting accuracy with the current data had already reached a plateau that QHistone had collected sufficient


datasets in training the model, and may not require more updates in the future. Moreover, our analyses showed that our model with a few ChIP-seq data is enough for predicting the epigenome


profile. It is therefore possible to build the QHistone on other plant species with fewer ChIP-seq data, such as rice, maize, and soybean. Additionally, the types and roles of histone


modifications are generally conserved across different plant species44,45, and more regulatory proteins with conserved domains are being discovered across species6, suggesting that most of


the results from Arabidopsis can be used to infer other plant species already. The method developed by QHistone has a strong potential to be transferred to additional plant species. In


summary, we present a platform with the largest ChIP-seq datasets of histone modifications and variants database in Arabidopsis, along with analytical tools that provide a comprehensive


epigenome profile of the input query. QHistone allows direct analysis and visualization of the action of protein query and the functional association between protein pairs, with various


histone modifications that usually required challenging experimental validations. METHODS DATA COLLECTION All ChIP-seq libraries including histone modifications, histone variants, histone


proteins, TFs, and regulatory proteins were obtained from NCBI GEO database46 with the following keywords, “_Arabidopsis thaliana_” AND “ChIP” OR “DAP”. QHistone was composed of 1534 histone


related ChIP-seq libraries and 1566 protein ChIP-seq libraries. The metadata which describes the information of each sample were collected using web crawling technique with data cleaning.


Due to the irregular or missing metadata descriptions by different researchers on GEO database, we unified the data through data cleaning and manual modification by surveying the original


publication of each sample for the purpose of gaining the most complete information. All the information was organized in the table and performed on the QHistone database


(https://qhistone.paoyang.ipmb.sinica.edu.tw/browser/). The metadata of each sample included histone modification type, age, genotype, ecotype, treatment, publication resource, containing


input control or not, and the sample id from NCBI (GSM id from GEO database and SRR id from SRA database)46,47. The raw sequencing data (sra) of each sample were downloaded from SRA using


prefetch from SRA Toolkit (version 2.11.2)47. These raw binarized sequencing data were further decompressed using pfastq-dump (version 0.1.6, retrieval data: Mar 12, 2019) with “split-files”


options which is the parallel computing version of the original fastq-dump from SRA toolkits. File from pair-end sequencing data was split into two files which namely “_1.fastq” and


“_2.fastq”, we removed the second file and retained single-end file for further ChIP analysis. PROCESSING CHROMATIN IMMUNOPRECIPITATION SEQUENCING (CHIP-SEQ) DATA All decompressed ChIP-seq


data (fastq) mentioned in the previous section were processed by a unified pipeline (https://qhistone.paoyang.ipmb.sinica.edu.tw/PeakCallingPipeline/) (see the left panel of Supplementary


Fig. 1). The adapter sequences and the low-quality reads (_Q_ < 30) in the sequencing data were removed using TrimGalore (version 0.6.4)48 which is the wrapper tool around Cutadapt


(version 2.10)49 and FastQC (version 0.11.8)50. The trimmed reads were further aligned to Arabidopsis genome TAIR10 51 using bowtie2 (version 2.2.6)52 with default parameters. We converted


SAM format file created by bowtie2 into sorted BAM format using samtools (version 1.10)53. Peak calling was performed using MACS2 (version 2.2.7.1)54 with a false discovery rate of (FDR) 


< 0.05 as the criteria. The sample containing input control were processed with “-c” option. The Fraction of Reads in Peaks (FRiP) were calculated for evaluating the quality of the


ChIP-seq library. ChIP-seq samples with high quality (FRiP > 0.01) were used for further analyses. This quality cutoff is suggested by previous publication55. See Supplementary Fig. 17


for the FRiP score of histone modifications/variants in QHistone. On the other hand, to visualize the continuous distribution of ChIP reads on genome browser (IGV), BAM files were converted


into coverage files (bigWig format) using bamCoverage function in deepTools (version 3.4.3)56. The workflow of ChIP-seq analysis was wrapped into a python scripting pipeline, and it is


available on QHistone website. The exact commands and parameters are recorded in the section of Exact Commands and Parameters in Supplementary Data 3. PREDICTING THE EPIGENOME PROFILE OF A


PROTEIN QUERY The prediction of an epigenome profile for a query protein using SVM involves three phases (see center panel of Supplementary Fig. 1): In phase 1, we download and preprocess


1534 ChIP-seq datasets of histone modifications/variants from NCBI, resulting in a total of 596,343 bins for the TAIR10 genome. We represent histone modification presence in the genome


(i.e., peak locations) with a Boolean digit system, where “1” indicates a “marking” and “0” indicates “no marking.” Consequently, a matrix is created to store all Boolean digits from the


1534 histone modification datasets (i.e., 1534 ChIP-seq libraries x 598,343 bins). The final matrix of QHistone covered 1534 libraries, significantly more than the sum of the other two major


databases of Arabidopsis histone modifications15,16 (Supplementary Fig. 2). In phase 2, we implemented SVM and searched for the best SVM model with optimal parameters to cluster 1534


ChIP-seq libraries. We used the machine learning Python package scikit-learn (version 0.23.2) to implement SVM, taking the matrix from phase 1 as the input. To determine the optimal


parameters, including regularization parameter, kernel type, and kernel coefficient, we explored various parameter combinations using grid search coupled with 5-fold cross-validation. We


assessed model performance by calculating overall accuracy, F1 score, recall, and precision (Supplementary Fig. 18). In the 5-fold cross-validation, the dataset of 1534 preprocessed


ChIP-seqs is randomly split into 5 groups. The model is trained on 4 groups in each cycle, while the fifth group, containing the actual answers, is used for performance evaluation. This


cycle is repeated five times, with different groups used for testing each time, thereby providing a comprehensive evaluation of the model’s accuracy, F1 score, recall, and precision across


multiple trials. As a result, the linear SVM model with a regularization parameter C = 0.1 demonstrated the best performance in the test datasets (Supplementary Fig. 18). When implementing


SVM for clustering our data, we addressed the multi-class classification challenge of 27 histone modifications and 11 histone variants using the one-versus-rest algorithm. Among the original


1534 libraries, 292 (19.03%) libraries with low FRiP scores (FRiP < 0.01)55,57,58, specific histone proteins (e.g., H3, H1), and modifications/variants with particularly low predictive


accuracy (accuracy < 30%) were excluded. The final model, trained using 1242 eligible libraries, can accurately cluster any of these 1242 ChIP-seq libraries into the correct histone mark


groups with a 90.74% accuracy. In phase 3, to identify the histone marks most closely associated with a query protein (i.e., its epigenomic profile), we use the ChIP-seq data of the protein


(peak-called and boolean-digitized in standard BED format) as input for our final SVM model in QHistone. This model outputs probability scores for various histone modifications (i.e.,


epigenomic profile) specific to the query protein42. The probability score estimates the similarity between the peak distribution (binding pattern) of the query and the marking preferences


of the histone mark. ASSESSING SIMILARITY BETWEEN EPIGENOME PROFILES USING KULLBACK–LEIBLER DIVERGENCE AND MUTUAL INFORMATION To gain insight into the regulatory role among two queries


(proteins), we plotted their epigenome profiles together using slope chart which arranged the order of the histone modifications by ranking with probabilistic scores. The difference between


the two probabilistic distributions were calculated using Kullback–Leibler divergence (KLD), the formula of KLD is as below, $$D(P||Q)={\sum}_{x \in X}P(x){\log }_{2} \frac{P(X)}{Q(X)}$$ (1)


Where _P(x)_ is the probabilistic outputs of the first query, and Q(x) is the second query59. In addition, Mutual information (MI) is another score for estimating the amount of information


shared between the two epigenome profiles. The formula of MI is as below, $$I(X{;}Y)={\sum}_{x\in Y}{\sum}_{x\in X}{P}_{\left(X,Y\right)}(x,y)\log


\frac{{P}_{\left(X,Y\right)}(x,y)}{{P}_{\left(X\right)}(x){P}_{\left(Y\right)}(y)}$$ (2) Where X is ranking order of first query, Y is ranking order of second query, and P(X,Y) is the joint


probability mass function60. We evaluated these scores from known partners or functionally associated pairs of functional consistency and functionally opposite (Supplementary Table 7). The


functionally consistent pairs would display a low KLD score and a high MI value. In contrast, the opposite pairs would exhibit a high KLD score and a low MI value. GENE EXPRESSION ANALYSIS


As part of validating the use of an epigenome profile to predict the regulatory activity of a query protein, a gene expression analysis is implemented. By comparing the overall expression


levels of promoters (the genomic regions between the TSS and 2 kilobases upstream) overlapping with peaks from ChIP-seq of the query protein with those that do not overlap, the result can be


used to assess whether the query protein is likely to function as an activator or repressor. This analysis is performed across 42 plant tissues, with results detailed in Supplementary Table


 11 along with the RNA-seq data sources. These results can be used as a validation for predictions based on epigenome profiling (See Fig. 4E SDG2 and S2Lb for examples of activating


transcription, and Supplementary Fig. 5E CLF and SWN for examples of repressive transcription). Moreover, there might be cases when a regulatory protein may change its regulatory functions


by tissues (Supplementary Fig. 6C IBM1) and may function differently between a protein pair (Supplementary Fig. 6C IBM1 and LDL1). GO ENRICHMENT ANALYSIS The protein binding regions were


functionally classified via Gene Ontology enrichment analysis (GOEA) using GOATOOLS61 with FDR < 0.05 as the threshold. In “Epigenome Profiling “, it is performed on the genes overlapping


with peaks from ChIP-seq of the query protein. In “Regulatory Activity “, which involves two query proteins, it is done on the genes overlapping with specific peaks unique to each query


protein from ChIP-seq data. It is also done on the genes overlapping with the shared peaks between the two proteins. META-ANALYSIS Meta-analyses were used for profiling the coverage of


different histone modifications and variants on specific genomic structures, e.g., protein-coding genes, TEs and enhancer regions. The marking regions of histone modification/variant from


multiple samples were merge using the merge function from BEDOPS (Version 2.4.38)62. We further calculated the peak density among the rescaled genomic structure, by normalizing the average


values of peak occurrence on 10 bp non-overlapping bins through computeMatrix function from deepTools56. The peak occurrence was binarized. In this study, “1” was defined as “marking”


whereas “0” was defined as “no marking”. These peak density files were plotted using Python script with bokeh package. The exact command of computeMatrix was included in Supplementary Data 


3. ENRICHMENT ANALYSIS To assess the enrichment or depletion of histone modifications and variants on specific genomic structures, we utilized genomic structures extracted from the GTF file


of TAIR10. The specific genomic structures consist of promoter (the genomic regions between the TSS and 2 kilobases upstream), 5 prime untranslated regions (5’UTR), 3 prime untranslated


regions (3’UTR), intron, intergenic regions, enhancer regions and TEs. When considering alternative splicing for determining first exons, the initial type of splicing in the GTF is taken


into account. We compared the proportion of the specific genomic structure on the marking regions of a histone modification with the proportion of the structure among the whole genome with


the following formula. $${enrichmentscore}={\log }_{2}\left(\frac{A/B}{C/D}\right)$$ (3) where A is the length of overlapping between the specific structure and the protein binding regions,


B is the total length of protein binding regions, C is the total length of the genomic regions containing the genomic structure, and D is the genome size. The enrichment score measures the


level of enrichment of a specific histone modification in a genomic structure, normalized by the overall distribution of this structure throughout the genome. The score is log2 scaled, where


a score above zero suggests a higher presence of the histone mark at this structure compared to random, and a score below zero suggests a lower presence. REPORTING SUMMARY Further


information on research design is available in the Nature Portfolio Reporting Summary linked to this article. DATA AVAILABILITY The data that support the findings of this study are openly


available at Zenodo: https://doi.org/10.5281/zenodo.11228366. CODE AVAILABILITY Source codes are publicly available at:Zenodo: https://doi.org/10.5281/zenodo.10649327. Docker images of


peak-calling pipeline is available at: https://hub.docker.com/r/dppss90008/qhistone-pipeline. REFERENCES * Bannister, A. J. & Kouzarides, T. Regulation of chromatin by histone


modifications. _Cell Res._ 21, 381–395 (2011). Article  CAS  PubMed  PubMed Central  Google Scholar  * Lusser, A., Kolle, D. & Loidl, P. Histone acetylation: lessons from the plant


kingdom. _Trends Plant Sci._ 6, 59–65 (2001). Article  ADS  CAS  PubMed  Google Scholar  * Chen, M., Lv, S. & Meng, Y. Epigenetic performers in plants. _Dev. Growth Differ._ 52, 555–566


(2010). Article  CAS  PubMed  Google Scholar  * Ueda, M. & Seki, M. Histone modifications form epigenetic regulatory networks to regulate abiotic stress response. _Plant Physiol._ 182,


15–26 (2020). Article  CAS  PubMed  Google Scholar  * Liu, Y. et al. H3K4me2 functions as a repressive epigenetic mark in plants. _Epigenet. Chromatin_ 12, 40 (2019). Article  Google Scholar


  * Liu, C., Lu, F., Cui, X. & Cao, X. Histone methylation in higher plants. _Annu Rev. Plant Biol._ 61, 395–420 (2010). Article  CAS  PubMed  Google Scholar  * Yin X. C. et al.


H2AK121ub in Arabidopsis associates with a less accessible chromatin state at transcriptional regulation hotspots. _Nat. Commun._ 12, 315 (2021). * Xu, L. & Jiang, H. Writing and reading


histone H3 lysine 9 methylation in Arabidopsis. _Front Plant Sci._ 11, 452 (2020). Article  PubMed  PubMed Central  Google Scholar  * Jacob, Y. et al. ATXR5 and ATXR6 are H3K27


monomethyltransferases required for chromatin structure and gene silencing. _Nat. Struct. Mol. Biol._ 16, 763–768 (2009). Article  CAS  PubMed  PubMed Central  Google Scholar  *


Trejo-Arellano, M. S. et al. H3K23me1 is an evolutionarily conserved histone modification associated with CG DNA methylation in Arabidopsis. _Plant J._ 90, 293–303 (2017). Article  CAS 


PubMed  Google Scholar  * Zhao, N. et al. Systematic analysis of differential H3K27me3 and H3K4me3 deposition in callus and seedling reveals the epigenetic regulatory mechanisms involved in


callus formation in rice. _Front Genet._ 11, 766 (2020). Article  CAS  PubMed  PubMed Central  Google Scholar  * Kim, J. M., Sasaki, T., Ueda, M., Sako, K. & Seki, M. Chromatin changes


in response to drought, salinity, heat, and cold stresses in plants. _Front Plant Sci._ 6, 114 (2015). Article  PubMed  PubMed Central  Google Scholar  * Bartlett, A. et al. Mapping


genome-wide transcription-factor binding sites using DAP-seq. _Nat. Protoc._ 12, 1659–1672 (2017). Article  CAS  PubMed  PubMed Central  Google Scholar  * Nakato, R. & Sakata, T. Methods


for ChIP-seq analysis: a practical workflow and advanced applications. _Methods_ 187, 44–53 (2021). Article  CAS  PubMed  Google Scholar  * Chow, C. N. et al. PlantPAN3.0: a new and updated


resource for reconstructing transcriptional regulatory networks from ChIP-seq experiments in plants. _Nucleic Acids Res._ 47, D1155–D1163 (2019). Article  PubMed  Google Scholar  * Hammal,


F., de Langen, P., Bergon, A., Lopez, F. & Ballester, B. ReMap 2022: a database of human, Mouse, drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding


sequencing experiments. _Nucleic Acids Res._ 50, D316–D325 (2022). Article  CAS  PubMed  Google Scholar  * Yin, Q., Wu, M., Liu, Q., Lv, H. & Jiang, R. DeepHistone: a deep learning


approach to predicting histone modifications. _BMC Genom._ 20, 193 (2019). Article  CAS  Google Scholar  * Kang, H., Fan, T., Wu, J., Zhu, Y. & Shen, W. H. Histone modification and


chromatin remodeling in plant response to pathogens. _Front Plant Sci._ 13, 986940 (2022). Article  PubMed  PubMed Central  Google Scholar  * Nunez-Vazquez, R., Desvoyes, B. & Gutierrez,


C. Histone variants and modifications during abiotic stress response. _Front Plant Sci._ 13, 984702 (2022). Article  PubMed  PubMed Central  Google Scholar  * Harris, C. J. & Jacobsen,


S. E. ADCP1: a novel plant H3K9me2 reader. _Cell Res._ 29, 6–7 (2019). Article  PubMed  Google Scholar  * Zhao, S. et al. Plant HP1 protein ADCP1 links multivalent H3K9 methylation readout


to heterochromatin formation. _Cell Res._ 29, 54–66 (2019). Article  CAS  PubMed  Google Scholar  * Zhang, C. et al. Arabidopsis AGDP1 links H3K9me2 to DNA methylation in heterochromatin.


_Nat. Commun._ 9, 4547 (2018). Article  ADS  PubMed  PubMed Central  Google Scholar  * Veluchamy, A. et al. LHP1 regulates H3K27me3 spreading and shapes the three-dimensional conformation of


the Arabidopsis genome. _PLoS One_ 11, e0158936 (2016). Article  PubMed  PubMed Central  Google Scholar  * Mylne, J. S. et al. LHP1, the Arabidopsis homologue of HETEROCHROMATIN PROTEIN1,


is required for epigenetic silencing of FLC. _P Natl Acad. Sci. USA_ 103, 5012–5017 (2006). Article  ADS  CAS  Google Scholar  * Hung, F. Y. et al. The Arabidopsis LDL1/2-HDA6 histone


modification complex is functionally associated with CCA1/LHY in regulation of circadian clock genes. _Nucleic Acids Res._ 46, 10669–10681 (2018). CAS  PubMed  PubMed Central  Google Scholar


  * Hung, F. Y. et al. The LDL1/2-HDA6 histone modification complex interacts with TOC1 and regulates the core circadian clock components in Arabidopsis. _Front Plant Sci._ 10, 233 (2019).


Article  PubMed  PubMed Central  Google Scholar  * Fiorucci, A. S. et al. Arabidopsis S2Lb links AtCOMPASS-like and SDG2 activity in H3K4me3 independently from histone H2B


monoubiquitination. _Genome. Biol._ 20, 100 (2019). Article  PubMed  PubMed Central  Google Scholar  * Deng, W. et al. Arabidopsis polycomb repressive complex 2 binding sites contain


putative GAGA factor binding motifs within coding regions of genes. _BMC Genom._ 14, 593 (2013). Article  CAS  Google Scholar  * Shu, J. et al. Genome-wide occupancy of histone H3K27


methyltransferases CURLY LEAF and SWINGER in Arabidopsis seedlings. _Plant Direct_ 3, e00100 (2019). Article  PubMed  PubMed Central  Google Scholar  * Jiang, D., Wang, Y., Wang, Y. &


He, Y. Repression of FLOWERING LOCUS C and FLOWERING LOCUS T by the Arabidopsis Polycomb repressive complex 2 components. _PLoS One_ 3, e3404 (2008). Article  ADS  PubMed  PubMed Central 


Google Scholar  * Inagaki, S. et al. Gene-body chromatin modification dynamics mediate epigenome differentiation in Arabidopsis. _EMBO J._ 36, 970–980 (2017). Article  CAS  PubMed  PubMed


Central  Google Scholar  * Kralemann, L. E. M. et al. Removal of H2Aub1 by ubiquitin-specific proteases 12 and 13 is required for stable Polycomb-mediated gene repression in Arabidopsis.


_Genome. Biol._ 21, 144 (2020). Article  CAS  PubMed  PubMed Central  Google Scholar  * Ding, Y. & Wilkins, D. Improving the performance of SVM-RFE to select genes in microarray data.


_BMC Bioinforma._ 7, S12 (2006). Article  Google Scholar  * Yang, Z. et al. EBS is a bivalent histone reader that regulates floral phase transition in Arabidopsis. _Nat. Genet_. 50,


1247–1253 (2018). Article  CAS  PubMed  PubMed Central  Google Scholar  * Martignago, D. et al. The four FAD-dependent histone demethylases of Arabidopsis are differently involved in the


control of flowering time. _Front Plant Sci._ 10, 669 (2019). Article  PubMed  PubMed Central  Google Scholar  * Spedaletti, V. et al. Characterization of a lysine-specific histone


demethylase from Arabidopsis thaliana. _Biochemistry_ 47, 4936–4947 (2008). Article  CAS  PubMed  Google Scholar  * Engelhorn J. et al. Dynamics of H3K4me3 chromatin marks prevails over


H3K27me3 for gene regulation during flower morphogenesis in Arabidopsis thaliana. _Epigenomes_ 1, 8 (2017). * Pajoro, A. et al. Dynamics of chromatin accessibility and gene regulation by


MADS-domain transcription factors in flower development. _Genome. Biol._ 15, R41 (2014). Article  PubMed  PubMed Central  Google Scholar  * Chen, C. et al. RNA polymerase II-independent


recruitment of SPT6L at transcription start sites in Arabidopsis. _Nucleic Acids Res._ 47, 6714–6725 (2019). Article  ADS  CAS  PubMed  PubMed Central  Google Scholar  * Lu, P. et al. Genome


encode analyses reveal the basis of convergent evolution of fleshy fruit ripening. _Nat. Plants_ 4, 784–791 (2018). Article  CAS  PubMed  Google Scholar  * Uddin, S., Khan, A., Hossain, M.


E. & Moni, M. A. Comparing different supervised machine learning algorithms for disease prediction. _BMC Med Inf. Decis. Mak._ 19, 281 (2019). Article  Google Scholar  * Platt J. C.


Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In, _Advances in Large Margin Classifiers_. (ed. Alexander, J. S.) 412 (MIT Press,1999).


* Wu, T.-F., Lin, C.-J. & Weng, R. C. Probability estimates for multi-class classification by pairwise coupling. _J. Mach. Learn Res_. 5, 975–1005 (2004). MathSciNet  Google Scholar  *


Deal, R. B. & Henikoff, S. Histone variants and modifications in plant gene regulation. _Curr. Opin. Plant Biol._ 14, 116–122 (2011). Article  CAS  PubMed  Google Scholar  * Fuchs, J.,


Demidov, D., Houben, A. & Schubert, I. Chromosomal histone modification patterns–from conservation to diversity. _Trends Plant Sci._ 11, 199–208 (2006). Article  CAS  PubMed  Google


Scholar  * Barrett, T. et al. NCBI GEO: archive for functional genomics data sets–update. _Nucleic Acids Res._ 41, D991–D995 (2013). Article  CAS  PubMed  Google Scholar  * Leinonen, R.,


Sugawara, H. & Shumway, M. International nucleotide sequence database C. The sequence read archive. _Nucleic Acids Res._ 39, D19–D21 (2011). Article  CAS  PubMed  Google Scholar  *


Krueger F. _Trim Galore._ https://github.com/FelixKrueger/TrimGalore (2015). * Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. _EMBnet J._ 17, 10–12


(2011). Article  Google Scholar  * Andrews S. _FastQC: a Quality Control Tool for High Throughput Sequence Dat_a. http://www.bioinformatics.babraham.ac.uk/projects/fastqc (2010). *


Berardini, T. Z. et al. The Arabidopsis information resource: making and mining the “gold standard” annotated reference plant genome. _Genesis_ 53, 474–485 (2015). Article  CAS  PubMed 


PubMed Central  Google Scholar  * Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with bowtie 2. _Nat. Methods_ 9, 357–359 (2012). Article  CAS  PubMed  PubMed Central  Google


Scholar  * Li, H. et al. The sequence alignment/Map format and SAMtools. _Bioinformatics_ 25, 2078–2079 (2009). Article  PubMed  PubMed Central  Google Scholar  * Zhang, Y. et al.


Model-based analysis of ChIP-Seq (MACS). _Genome Biol._ 9, R137 (2008). Article  PubMed  PubMed Central  Google Scholar  * Qin, Q. et al. ChiLin: a comprehensive ChIP-seq and DNase-seq


quality control and analysis pipeline. _BMC Bioinforma._ 17, 404 (2016). Article  Google Scholar  * Ramirez, F. et al. deepTools2: a next generation web server for deep-sequencing data


analysis. _Nucleic Acids Res._ 44, W160–W165 (2016). Article  CAS  PubMed  PubMed Central  Google Scholar  * Landt, S. G. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE


consortia. _Genome Res._ 22, 1813–1831 (2012). Article  CAS  PubMed  PubMed Central  Google Scholar  * Keller C. A. et al. Effects of sheared chromatin length on ChIP-seq quality and


sensitivity. _G3 (Bethesda__)_ 11, jkab101 (2021). * Ji, S. et al. Kullback-leibler divergence metric learning. _IEEE Trans. Cyber._ 52, 2047–2058 (2022). Article  Google Scholar  * Ross, B.


C. Mutual information between discrete and continuous data sets. _PLoS One_ 9, e87357 (2014). Article  ADS  PubMed  PubMed Central  Google Scholar  * Klopfenstein, D. V. et al. GOATOOLS: A


python library for gene ontology analyses. _Sci. Rep._ 8, 10872 (2018). Article  ADS  CAS  PubMed  PubMed Central  Google Scholar  * Neph, S. et al. BEDOPS: high-performance genomic feature


operations. _Bioinformatics_ 28, 1919–1920 (2012). Article  CAS  PubMed  PubMed Central  Google Scholar  Download references ACKNOWLEDGEMENTS We thank Chiao-Yu Lyra Sheu for the preliminary


data analysis, Nien Yang for the art painting of Arabidopsis. We also thank Zheng-Zhong Huang and Jimmy Lin from the IPMB IT/Network service for the IT support. This work was supported by


grants from Academia Sinica (NTU-AS Innovative Joint Program (AS-NTU-112-12)), and National Science and Technology Council, Taiwan 109-2313-B-001-009-MY3 and 111 −2927-I-001 −505 and


112-2311-B-001 −007 to P.-Y.C. AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Institute of Plant and Microbial Biology, Academia Sinica, Taipei, 115201, Taiwan Chih-Hung Hsieh, Ya-Ting


Sabrina Chang, Ming-Ren Yen, Jo-Wei Allison Hsieh & Pao-Yang Chen Authors * Chih-Hung Hsieh View author publications You can also search for this author inPubMed Google Scholar * Ya-Ting


Sabrina Chang View author publications You can also search for this author inPubMed Google Scholar * Ming-Ren Yen View author publications You can also search for this author inPubMed 


Google Scholar * Jo-Wei Allison Hsieh View author publications You can also search for this author inPubMed Google Scholar * Pao-Yang Chen View author publications You can also search for


this author inPubMed Google Scholar CONTRIBUTIONS P.-Y.C. conceived the project. C.-H.H. majorly carried out the data analysis and developed the website. Y.-T.S.C. and M.-R.Y. also


contribute to the data analysis. C.-H.H., J.-W.A.H., and P.-Y.C. wrote and edited the paper with inputs from all authors. CORRESPONDING AUTHOR Correspondence to Pao-Yang Chen. ETHICS


DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. PEER REVIEW PEER REVIEW INFORMATION _Nature Communications_ thanks Michihiro Araki, Marek Mutwil and the other,


anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available. ADDITIONAL INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with


regard to jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION PEER REVIEW FILE DESCRIPTION OF ADDITIONAL SUPPLEMENTARY


FILES SUPPLEMENTARY DATA 1 SUPPLEMENTARY DATA 2 SUPPLEMENTARY DATA 3 REPORTING SUMMARY RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons


Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give


appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission


under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons


licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by


statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


http://creativecommons.org/licenses/by-nc-nd/4.0/. Reprints and permissions ABOUT THIS ARTICLE CITE THIS ARTICLE Hsieh, CH., Chang, YT.S., Yen, MR. _et al._ Predicting protein synergistic


effect in Arabidopsis using epigenome profiling. _Nat Commun_ 15, 9160 (2024). https://doi.org/10.1038/s41467-024-53565-y Download citation * Received: 05 September 2023 * Accepted: 15


October 2024 * Published: 24 October 2024 * DOI: https://doi.org/10.1038/s41467-024-53565-y SHARE THIS ARTICLE Anyone you share the following link with will be able to read this content: Get


shareable link Sorry, a shareable link is not currently available for this article. Copy to clipboard Provided by the Springer Nature SharedIt content-sharing initiative