Genome-wide analysis of WD40 protein family in human

ABSTRACT The WD40 proteins, often acting as scaffolds to form functional complexes in fundamental cellular processes, are one of the largest families encoded by the eukaryotic genomes.

Systematic genome-scale studies of this family are needed to understand their detailed functions, but are currently lacking in the animal lineage. Here we present a

comprehensive _in silico_ study of the human WD40 family. We have identified 262 non-redundant WD40 proteins, and grouped them into 21 classes according to their domain architectures. Among

them, 11 animal-specific domain architectures have been recognized. Sequence alignment indicates complicated duplication and recombination events in the evolution of this family. Through

further phylogenetic analysis, we reveal that the WD40 family underwent more expansion than the overall average in the early stage of evolution, and that the early-emerged WD40 proteins

tend to have domain architectures with fundamental cellular roles and more interactions. While most widely and highly expressed human WD40 genes originated early, the tissue-specific ones

often have a late origin. These results provide a landscape of the human WD40 family concerning their classification, evolution, and expression, serving as a valuable complement to the

previous studies in the plant lineage.

INTRODUCTION The WD40 domains, as special

cases of the β-propeller domains, are abundant in eukaryotic proteomes. It was estimated that WD40 domain-containing proteins (WD40 protein family) account for about 1% of the human

proteome1. A canonical WD40 domain comprises 7 blades or repeats, each of which contains 40–60 residues with a motif of WD (tryptophan and aspartic acid). The blades then fold into a

propeller, exposing the top, bottom, and side surfaces, which are believed to be involved in molecular recognition and interaction2. The WD40 domains often act as scaffolds to recruit other

molecules, forming functional complexes or protein-protein interactions3,4,5. WD40 proteins play important roles in many fundamental biological processes such as signal transduction6,

histone modification7, DNA damage response8, transcription regulation9,10, RNA processing11, protein degradation12, and apoptosis13. Consistent with their essential roles, many are involved

in various diseases. For example, FBXW7 is a well-known tumour suppressor and is implicated in several cancers14,15. TLE1 is also a well-studied tumour suppressor gene16. Beyond cancer,

WD40 genes are involved in other diseases as well: WDR45 is associated with neurodegeneration through autophagy17, and WDR62 was found mutated in human microcephaly18. Given their important roles

in basic biological processes and their abundance, it is valuable to perform a genome-wide computational analysis of this family of proteins. Several genome-wide studies have

already identified and analysed the WD40 protein family in plants, including Arabidopsis, rice, foxtail millet, and cucumber19,20,21,22. These studies found variation in the

number of WD40 genes in different plants, suggesting a history of gene expansion during evolution. While all of these studies speculated that most WD40 genes in plants are conserved across

eukaryotes, the family members are functionally diverse. In rice and foxtail millet, the authors classified WD40 proteins into 11 and 12 classes based on their domain

architectures, respectively20,22. Evolutionary analysis showed that both tandem duplication and segmental duplication contributed to the expansion of WD40 gene family, and revealed that

plant-specific domain architectures and functions emerged with the family expansion in the plant lineage. In a study of the tomato genome, the authors specifically analysed the DDB1-binding WD40

proteins, a subfamily presumably serving as substrate recognition components of CUL4 E3 ligases, and experimentally confirmed 14 DDB1-interacting proteins23. Studies of this kind

provide a global landscape of the characteristics of the WD40 family, including classification, evolution, expression, and functions in plants. However, a systematic study of the WD40 protein

family in the animal lineage is lacking. Since genome evolution in the plant kingdom has differed significantly from that in the animal kingdom after their divergence, a genome-wide

analysis of the WD40 protein family in animals should yield insights beyond those from plant research, and will thus contribute to a more comprehensive landscape.

In this work, we chose human as a representative of the animal lineage for a genome-wide computational analysis. First, a reliable set of human WD40 proteins was carefully identified.

Second, we outlined their domain architectures and classified them accordingly, then inspected the functional annotations. A detailed sequence comparison at the level of domain and

repeat was then performed. Third, their phylogenetic relationships were analysed and evolutionary implications were proposed. Fourth, WD40 genes with different expression profiles and their

relationship with phylogenetic patterns were studied as well. This analysis provides a broad understanding of the WD40 protein family in the animal lineage, and offers a good basis for

further investigation of the biological functions and evolution of animal WD40 proteins. More specifically, the study of human WD40 proteins will hopefully provide crucial clues for research

on health and disease. RESULTS 262 NON-REDUNDANT HUMAN WD40 PROTEINS ARE IDENTIFIED We utilized the WDSP tool to identify WD40 proteins in the human reference proteome24,25. The careful

curation pipeline (Supplementary Fig. S1) resulted in 262 non-redundant human WD40 (_hs_WD40) proteins, each of which represents the typical protein product of an _hs_WD40 gene (Table 1,

Supplementary Table S1). In brief, these _hs_WD40 proteins contain more than 300 WD40 domains, which are composed of 2188 WD40 repeats. Among them, 167 out of 262 (63.7%) _hs_WD40 proteins

have exactly 7 repeats, indicating that more than half of the proteins likely contain the canonical form of WD40 domains. In addition, a small fraction of _hs_WD40 proteins are composed of more

than 10 repeats, suggesting the existence of multiple WD40 domains within the same protein. In extreme cases, some _hs_WD40 proteins even contain 20 or more repeats, such as WDR6, EML5,

and EML6, which contain 20, 33, and 35 repeats, respectively. DOMAIN ARCHITECTURES CAN DEFINE 21 CLASSES OF _HS_WD40S It is known that many WD40 proteins may contain other types of domains

to form complicated domain architectures, and this may endow the WD40 family with complicated functions. To obtain an overview, we annotated the domain architectures of the _hs_WD40 proteins,

and subsequently inspected their functions in the literature. Based on their domain architectures, we grouped _hs_WD40s into 21 classes (Fig. 1). One hundred and sixty-three _hs_WD40 proteins

containing only WD40 domains were grouped into Class 1, and the remaining 99 _hs_WD40 proteins containing additional domains were grouped into Classes 2 to 21. For example, 10 _hs_WD40s with

F-box and WD40 domains were classified as Class 2, and 7 _hs_WD40 proteins in Class 3 contain a LisH domain. For the sake of simplicity, we put all domain architectures with only one member

into Class 21 (details in Supplementary Table S1). The grouping of domain architectures provided crucial information concerning their subfamily classification. Inspecting their functional

information from the literature revealed that _hs_WD40 proteins with the same domain architecture generally function in a similar way or are involved in similar functional modules

(Supplementary Table S2). When comparing between human and plants (Arabidopsis, rice, foxtail millet, and cucumber), we found that many domain architectures are conserved. In detail, Class 2

(F-box + WD40), Class 3 (LisH + WD40), Class 4 (BEACH + WD40), Class 6 (WD40 + Utp), Class 8 (WD40 + Bromodomain), Class 10 (NLE + WD40), Class 13 (ATG16 + WD40), Class 15 (RING finger +

WD40), Class 16 (WD40 + Lgl_C), and Class 18 (TFIID_90 kDa + WD40) are present in human and at least one plant species20,22. In addition to these conserved domain architectures, we also

noticed 11 potential animal-specific ones (marked with red stars in Fig. 1). They are Class 5 (HELP + WD40), Class 7 (TLE_N + WD40), Class 9 (Striatin N-terminal + WD40), Class 11 (NACHT +

WD40), Class 12 (BTB/POZ-like + WD40), Class 14 (Dynein_IC2 + WD40), Class 17 (Kinesin motor + WD40), Class 19 (WD40 + SOCS box), and at least another three architectures in Class 21 (WD40 +

U box + SAM, WD40 + RWD, CARD + NB-ARC + WD40), none of which was reported in the previous plant studies. Proteins with these domain architectures may have specifically emerged in the

animal lineage, and it is reasonable to speculate that they may carry out animal-specific functions. To test this, we performed functional enrichment analysis for proteins belonging to

these architectures, and identified six significantly enriched Gene Ontology (GO) biological processes (_p_-value < 0.05): β-catenin-TCF complex assembly, Wnt signalling

pathway, microtubule-based movement, microtubule cytoskeleton organization, animal organ morphogenesis, and protein homooligomerization (Supplementary Table S3). Among them, both β-catenin-TCF

complex assembly and Wnt signalling are important for embryonic development in animals rather than in plants. The two microtubule-related processes in animals were reported to be distinct from

those in plants. The fifth enriched GO term (animal organ morphogenesis), as its name implies, is evidently animal-specific. These results are consistent with our speculation. SEQUENCE ALIGNMENT SUGGESTED FURTHER

SUBFAMILY CLASSIFICATION, AND DUPLICATION AFTER RECOMBINATION EVENTS For such a large family, it is necessary to investigate their relationships with each other and the subfamily

classifications. The domain architectures, serving as a coarse sequence feature, were analysed and grouped in the previous section. To obtain more detailed relationships

among _hs_WD40 family members, we explored the pairwise alignments of domain sequences. Although WD40 domain sequences are very diverse in general1, we identified 71 pairs (0.16% of all

pairwise comparisons, involving 86 different domain sequences) of highly similar domains, and a considerable number of them are connected into clusters (Fig. 2, Supplementary Table S4),

suggesting that members of the same cluster can reasonably be classified into a subfamily. For example, the first 2 clusters in Fig. 2, although both belong to Class 1 according to

the domain architecture, further define the subfamilies of GNB and PPP2R2, respectively. Many other clusters also fit this pattern, so the WD40 domain sequence alignment indeed provides

more details concerning the subfamily classification. It is well known that gene families have evolved through complicated gene duplication events26. Since the sequence divergences within

each cluster are less than those between clusters, the domains within each cluster in Fig. 2 may have evolved by duplication events more recent than those between clusters. As for proteins

with multiple domain types, it is accepted that domain recombination events also occurred during evolution in addition to duplication27. Distinguishing

the earlier events from the later ones will deepen our understanding. Interestingly, we noticed an evident consistency between WD40 domain sequence similarities and the overall domain architectures of the

proteins. That is, when the WD40 domain sequence similarity of two proteins is high (connected in Fig. 2), the two proteins almost always belong to the same class of domain architecture

(Fig. 2, rounded rectangles). For example, the WD40 domain sequences of TLE1-4 are highly similar to each other, and all of them belong to Class 7 (TLE_N + WD40); the same holds for BRWD1,

BRWD3, and PHIP, which belong to Class 8 (WD40 + Bromodomain). These results suggest that whole-gene duplication events occurred pervasively after domain recombination in the

evolutionary history of the multi-domain WD40 proteins. If this were not the case, we should have detected highly similar domain pairs coming from different domain architecture classes. Since each

WD40 domain contains multiple repeats, we further performed pairwise sequence alignment at the repeat level, and found 596 pairs of highly similar repeats (0.025% of all pairwise

comparisons, involving 655 different repeats and covering 121 different proteins). More than 75% of highly similar repeat pairs came from highly similar domain pairs. Moreover, we noticed that

7 pairs of highly similar repeats came from within-domain repeat alignments, namely in FBXW7, DAW1, and WDR5 (Supplementary Table S5). These data suggest that WD40 domains also evolved at the

repeat level through recent repeat duplication in addition to domain-level duplication, although the latter should be the dominant mode28. _HS_WD40 GENES ARE WIDELY DISTRIBUTED ON ALL

CHROMOSOMES Sketching a “WD40 map” by plotting all the _hs_WD40 genes according to their chromosomal locations provides an overall picture and further evolutionary implications. We

thus extracted their chromosomal coordinates from the Ensembl website and made a circular map (Supplementary Fig. S2). Overall, the “WD40 map” offers a concise overview for quickly

browsing their genomic locations and contexts. As shown in the “WD40 map”, _hs_WD40 genes are widely distributed on all chromosomes. With the number of protein-coding genes on each

chromosome as a denominator, the percentage of _hs_WD40 genes ranges from 0.36% on chromosome 20 to 2.11% on chromosome 9 (Supplementary Table S6). Overall, the number of _hs_WD40 genes on

each chromosome is roughly proportional to that of all protein-coding genes on it, though several clearly biased cases exist. Specifically, the percentages of _hs_WD40 genes on chromosomes

9, 3, and 2 are 2.11%, 2.04%, and 1.87%, respectively, which are significantly higher than the overall average, _i.e._, 1.29% (_p_-values: 0.032, 0.023, and 0.043, respectively). In

contrast, the percentages of _hs_WD40 genes on chromosomes 20 and 11 are 0.36% and 0.76%, respectively, which are significantly lower than the overall average (_p_-values: 0.023 and 0.043).
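The per-chromosome comparison above can be sketched in a few lines. This is only an illustration, assuming a one-sided exact binomial test against the genome-wide fraction of 1.29%; the statistical test actually used is not stated in this section, and the gene counts in the example are hypothetical placeholders rather than values from Supplementary Table S6.

```python
import math

GENOME_WIDE_FRACTION = 0.0129  # overall average reported in the text (1.29%)


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def enrichment_p_value(k, n, p):
    """One-sided P(X >= k): is a chromosome richer in WD40 genes than expected?"""
    return min(1.0, sum(binom_pmf(i, n, p) for i in range(k, n + 1)))


def depletion_p_value(k, n, p):
    """One-sided P(X <= k): is a chromosome poorer in WD40 genes than expected?"""
    return min(1.0, sum(binom_pmf(i, n, p) for i in range(0, k + 1)))


def wd40_percentage(wd40_genes, coding_genes):
    """Percentage of protein-coding genes on a chromosome that are WD40 genes."""
    return 100.0 * wd40_genes / coding_genes


# Hypothetical chromosome: 17 WD40 genes among 800 protein-coding genes.
example_wd40, example_coding = 17, 800
print(wd40_percentage(example_wd40, example_coding))
print(enrichment_p_value(example_wd40, example_coding, GENOME_WIDE_FRACTION))
```

Using the number of protein-coding genes as the number of trials mirrors the denominator chosen in the text; a different null model (for example, one based on chromosome length) would give different p-values.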

Genome segmental duplication and tandem duplication play important roles in the evolution of a gene family29,30. The genomic locations and pairwise sequence similarities illustrated in the

“WD40 map” revealed that pervasive segmental duplication events have contributed to the expansion history of the WD40 gene family. In addition, we identified 4 pairs of tandemly arrayed genes

(TAGs), _i.e._, TLE1 and TLE4, DCAF8L1 and DCAF8L2, DCAF12L1 and DCAF12L2, and ARPC1A and ARPC1B. These TAGs have presumably arisen from tandem duplication events (red gene symbols in

Supplementary Fig. S2, and yellow shading in Fig. 2). WD40 FAMILY UNDERWENT MORE EXPANSION THAN OVERALL AVERAGE IN EVOLUTIONARY EARLY STAGE, BUT LESS IN LATE STAGE The analyses in the previous

sections offered glimpses of the evolution of _hs_WD40s, and more insights can be gained by studying them in the context of an evolutionary tree with pivotal time

points, as different members of the _hs_WD40 protein family should have emerged at different evolutionary stages, and may thus be implicated in different functions. We performed a

phylogenetic analysis based on the presence or absence of orthologs (referred to as the phylogenetic pattern) in three model organisms, _i.e._, yeast, Arabidopsis, and Drosophila.

Together with human, these organisms represent single-celled eukaryotes, plants, invertebrates, and vertebrates, respectively, whose speciation events define several key time points in

the evolutionary tree. The human genes with orthologs in all three other species (70 in total, labelled as ‘+++’ in Fig. 3) may have emerged as early as the origin of

eukaryotes. Besides these 70 _hs_WD40s, there are 45 _hs_WD40 genes with orthologs only in Arabidopsis and Drosophila (labelled as ‘++−’), suggesting that these WD40s might have emerged

before the separation of plants and animals. And 54 _hs_WD40 genes have orthologs only in Drosophila (labelled as ‘+−−’), indicating that they might have emerged before the separation of

invertebrates and vertebrates. Another 54 _hs_WD40 genes without orthologs in any of the other 3 species should have originated after the separation of vertebrates from invertebrates

(labelled as ‘−−−’). When comparing _hs_WD40s with all human protein-coding genes, we can infer that a larger proportion of _hs_WD40s than that of all genes (26.72% _vs_. 11.27%) should have

originated at the very early stage of eukaryotes (Fig. 3(a,b), Supplementary Tables S7 and S8). A similar inference can be made for the genes that originated before the separation of

animals and plants (17.18% _vs_. 8.24%). However, there is no such tendency before the separation of invertebrates and vertebrates (20.61% _vs_. 19.13%). Furthermore, the tendency

is inverted after the separation of vertebrates from invertebrates (20.61% _vs_. 50.34%, Fig. 3(a,b), Supplementary Tables S7 and S8). Studies of the human genome showed that many human gene

families have largely expanded in the late stage of evolution31. In contrast, our observations indicate that the WD40 family has undergone more expansion than the overall average of

all genes during the early evolutionary period, which echoes the fundamental cellular functions (_i.e._, house-keeping) enriched in WD40 genes. On the other hand, the WD40 family underwent

less expansion than the overall average after the separation of vertebrates and invertebrates. Though they expanded less, the _hs_WD40s with animal or vertebrate origin may have evolved some

animal or vertebrate-specific functions other than fundamental ones. Further studies on these _hs_WD40s may lead to discoveries concerning their important biological roles. For example,

mutations in AHI1, one of these genes, have been shown to cause Joubert syndrome (JBTS), a human disease characterized by psychomotor delay, cerebellar hypoplasia, consecutive ataxia, and so on32.
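The proportions quoted in this section follow directly from the ortholog counts. The sketch below assumes that each gene's phylogenetic pattern is encoded as a three-character string in the order Drosophila, Arabidopsis, yeast (an ordering inferred from the label descriptions above), with ASCII '-' standing in for '−'. The function names and the toy ortholog table are illustrative and not part of the original analysis.

```python
from collections import Counter

TOTAL_HS_WD40 = 262  # number of hs_WD40 genes reported in the text

# Counts for the four patterns discussed in the text.
PATTERN_COUNTS = {"+++": 70, "++-": 45, "+--": 54, "---": 54}


def phylogenetic_pattern(in_drosophila, in_arabidopsis, in_yeast):
    """Encode ortholog presence/absence as a pattern string such as '++-'."""
    flags = (in_drosophila, in_arabidopsis, in_yeast)
    return "".join("+" if present else "-" for present in flags)


def tally_patterns(ortholog_table):
    """Count genes per pattern; ortholog_table maps gene -> (fly, plant, yeast)."""
    return Counter(phylogenetic_pattern(*flags) for flags in ortholog_table.values())


def percentages(counts, total):
    """Convert per-pattern gene counts to percentages of the whole family."""
    return {pattern: round(100.0 * n / total, 2) for pattern, n in counts.items()}


# Toy table with made-up gene names, only to show the encoding.
toy_table = {
    "geneA": (True, True, True),
    "geneB": (True, True, False),
    "geneC": (True, False, False),
    "geneD": (False, False, False),
}
print(tally_patterns(toy_table))
print(percentages(PATTERN_COUNTS, TOTAL_HS_WD40))
```

The four counts sum to 223, so the remaining 39 genes presumably show ortholog patterns other than the four discussed here.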

DIFFERENT PHYLOGENETIC PATTERNS ARE ASSOCIATED WITH DIFFERENT DOMAIN ARCHITECTURES AND INTERACTION COUNTS Both phylogenetic patterns and domain architectures can be utilized for functional

inference, so we further inspected the domain architectures with different phylogenetic patterns (Fig. 4). We found that the domain architectures of Classes 3, 6, 10, and 18 emerged at the early

stage of eukaryotes (phylogenetic pattern of “+++”). According to their functional annotations (Supplementary Table S2), proteins in these classes are involved in very fundamental functions

such as transcription regulation, histone binding, and rRNA processing. WD40 proteins that emerged at the multicellular stage (phylogenetic pattern of “++−”) began to present domain

architectures of Classes 4, 8, 13, 16, and 20, which endowed proteins with functions in apoptosis, autophagy, cell morphology, and neurotransmitter release. After the divergence of

plants and animals (phylogenetic pattern of “+−−”), more domain architectures emerged, including Classes 2, 5, 7, 9, 11, 12, 14, and 17. Among them, Classes 5, 14, and 17 are related to

microtubule dynamic processes, which differ between animals and plants, and Classes 9 and 11 are implicated in estrogen or androgen receptor binding. The WD40 proteins that emerged after

the separation of vertebrates from invertebrates (phylogenetic pattern of “−−−”) comprise the domain architectures of Classes 15 and 19, and several other architectures that had already emerged

at earlier stages. Classes 15 and 19 in this group, and Classes 2 and 12 in the “+−−” group, are implicated in the E3 ubiquitin ligase system, which may reflect the need for the degradation

system in organisms with more complicated cellular structures to recognize more protein substrates. It is worth noting that almost all of the potential animal-specific domain

architectures consistently belong to the phylogenetic group “+−−” or “−−−” (marked with red stars in Fig. 4), which agrees well with our expectation. Since WD40 proteins are frequently

involved in protein-protein interactions (PPI), we briefly examined their network degrees in a curated human PPI dataset33, in which 174 of the 262 WD40 proteins have interaction data.

In this dataset, about 55% of the interactions involving multi-domain WD40 proteins are estimated to be contributed by WD40 domains, according to a domain-domain prediction

method34. Although there is no evident trend in the degrees across the four groups with different phylogenetic patterns, the degrees of WD40 proteins in group “+++” are significantly higher than

those in group “−−−”, with a fold change of ~2.75 (18.54 _vs_. 6.74, _p_-value = 9.3e-4). This indicates that the late-emerged proteins tend to be involved in fewer interactions than the

early ones, possibly because they have had less evolutionary time. MOST WIDELY AND HIGHLY EXPRESSED _HS_WD40 GENES ORIGINATED EARLY IN EVOLUTION, WHILE MOST TISSUE-SPECIFIC ONES

HAVE LATE ORIGIN Compared to the static features including domain architectures, sequence similarities, genomic locations, and phylogenetic properties, the gene expression profile across

various tissues further presents a more vivid picture concerning the biological activity and functions of a gene. To view the expression patterns of the 262 _hs_WD40 genes, we used the

RNA-seq dataset from the Human Protein Atlas, which contains normalized gene expression levels across 27 tissues35 (Supplementary Table S9). According to the expression profiles, all

_hs_WD40 genes have detectable expression signals in at least one tissue. Overall, the median expression levels of WD40 genes are two times higher than those of all human genes in all

tissues (Supplementary Fig. S3). Since a considerable portion of _hs_WD40 genes may have originated at the early evolutionary stage of eukaryotes and play roles in basic cellular processes (or

“house-keeping” in other words), their overall higher expression levels are to be expected. In addition to the overall expression pattern, the _hs_WD40 genes can be further divided

into several classes according to their distinct expression profiles. According to our definition in Methods, 204 _hs_WD40 genes can be classified as “Expressed in all”, _i.e._, most

_hs_WD40 genes are widely expressed. Furthermore, among them, 52 can be grouped as “Highly expressed in all”, implying that the functions of these genes are likely enriched in

house-keeping roles in fundamental cellular processes (Supplementary Table S9). Apart from the widely expressed genes, we also identified 20 _hs_WD40 genes that showed “Tissue-specific”

expression characteristics (Table 2, Supplementary Table S9). Among them, 17 genes are specifically expressed in testis, while the other 3 are specifically expressed in brain, prostate, and

pancreas, respectively. This small set of _hs_WD40 genes may have evolved specific functions rather than house-keeping roles. Both the expression profile and phylogenetic information

can give us indications about the functions of genes, so integrating them together may present some interesting patterns and provide deeper insights. Bearing this in mind, we combined the

classification of expression and the phylogenetic pattern of _hs_WD40 genes, and found that the “Highly expressed in all” WD40 genes and the “Tissue-specific” genes showed strikingly

different distributions of phylogenetic patterns (Fig. 5). The WD40 genes that are highly expressed in all tissues reside predominantly in the phylogenetic group with very early evolutionary origin

(labelled as ‘+++’ in Fig. 5) among all the four representative groups. Since the _hs_WD40 genes with very ancient origin were supposed to play fundamental roles in basic cellular processes

according to the previous section, and so were the _hs_WD40s with wide and high expression (_i.e._, house-keeping) according to this section, it is reasonable to observe this coincidence. In

contrast, the WD40 genes whose expressions are tissue-specific fall dominantly into the phylogenetic group with late evolutionary origin (labelled as ‘−−−’ in Fig. 5, _i.e._, originated

after the separation of vertebrates and invertebrates). We have speculated that _hs_WD40 genes with late evolutionary origin may have evolved lineage-specific functions, and here the

tissue-specific expression patterns provide supporting evidence, since specialized tissues or organs occur only in specific lineages. Overall, analysing the _hs_WD40 family

along the dimensions of both phylogeny and expression can provide deeper insights, and can further help researchers choose individual WD40 genes for detailed experimental functional

studies. DISCUSSION Due to the low sequence similarity between WD40 repeats, and the variable number of repeats within a single WD40 domain, it is challenging to identify WD40

domains by methods merely based on sequence similarity search and alignment. In this work, we utilized the WDSP24 software, a tool designed specifically for annotating WD40 repeats and domains,

to identify human WD40 domains. Unlike general methods, which can only find typical WD40 repeats, WDSP is capable of detecting non-typical repeats with remote homology.

Steven van Nocker defined a protein with 4 or more WD40 repeats as containing a WD40 domain19, but we found that at least 6 repeats are needed to form a complete WD40 β-propeller according to the

current 3D structures in the PDB database. Hence, we defined a protein with six or more WD40 repeats as a WD40 protein in this work, which should be more reliable. In the classification based

on the domain architectures, the 262 _hs_WD40 proteins were grouped roughly into 21 classes. It is worth noting that proteins in Class 1 are different from each other in sequence lengths,

repeat numbers, and many other features. Further effort is required to refine the classification, and the subsequent domain sequence alignment demonstrated this need. In addition,

Class 21 contains many different domain architectures with only one member identified, so it can actually be divided into many smaller groups. According to our domain annotation

criteria, an F-box domain was not identified in CDRT1, although some other databases with looser criteria do annotate one. This means that the domain architecture classification can be refined

with more comprehensive domain annotations. We identified the potential animal-specific domain architectures by checking the literature on plant studies, an approach that may be improved by a more

comprehensive comparative genomics study. In the domain sequence alignment, defining the WD40 domain boundaries of proteins with multiple WD40 domains is not straightforward. Although

we have considered this problem carefully based on our experience, the analysis will improve as more accurate methods of domain boundary definition become available. In the sequence

comparison, we set 50% sequence identity as the cut-off. This is a strict measure of sequence similarity, so we only considered similar pairs of domains or repeats with high

confidence. In this setting, 214 out of 300 domains were isolated with no similarity to other domains. If the cut-off is lowered, more sequence pairs can be identified. The pervasive but

uneven distribution of _hs_WD40 genes on chromosomes is similar to that in plants20,21,22, which may be correlated with different levels of segmental duplication on different chromosomes.

For example, the high density of _hs_WD40 genes on chromosome 9 may be related to its enrichment in segmental duplications36, but further elucidation of these distribution patterns needs more

detailed investigations. Among those _hs_WD40s with orthologs in other species, we noticed that in some cases several human genes are co-orthologous to a single gene in another species (Supplementary

Table S7), indicating a specific type of gene expansion. For instance, there are 5 human Gβ genes (GNB1–5), which are all orthologous to the same gene in Arabidopsis (GB1) and yeast (STE4).

Another case is protein phosphatase 2, subunit B. There are 4 genes in this group (PPP2R2A, PPP2R2B, PPP2R2C, PPP2R2D), which are all orthologous to the same gene in Drosophila (tws) and

yeast (CDC55). As expected, the 5 Gβ genes and the 4 phosphatase genes all fall within the aforementioned highly similar domain clusters (Fig. 2). Although these phylogenetic data may

reflect within-cluster expansion that happened at a late evolutionary stage, the fact that these genes have orthologs in all other species indicates that their “prototype” originated very

early. Gene expression profiles in 27 normal tissues were used in this study for mining functional implications. There are many gene expression datasets in the public domain with different

levels of quality. Further mining of these data with careful curation and robust algorithms in the future will greatly improve our understanding of this gene family, especially for those

datasets with disease and normal tissue comparisons. Previous studies have reported many WD40 genes involved in different human diseases, such as cancer and neurodegenerative

diseases14,15,17,37,38,39. Studies on their biological roles in disease pathogenesis will be an important direction in the future. Due to the unique structural features of the WD40 family,

further systematic studies of features such as the hydrogen bond network and protein-protein interaction hotspots may be conducted from the perspective of evolution and subfamily

classification. In addition, the analysis of these structural features may also be used to interpret or help discriminate disease-related mutations in WD40 domains, since

next-generation sequencing technologies are identifying more and more variants by re-sequencing different samples. CONCLUSION In this work, we presented a comprehensive characterization of the

human WD40 protein family. 262 _hs_WD40 genes have been identified, and classified into 21 classes based on their domain architectures. Many architecture types were not observed in plants,

and may be animal-specific. The domain sequence alignment provided detailed information regarding further subfamily classification, and indicated duplication and recombination events in

evolution. The WD40 family appears to have undergone more expansion than the overall average in the early evolutionary stage, but less expansion in the late stage. The early-emerged

WD40 proteins generally interact with more proteins, and carry domain architectures that play roles in fundamental cellular processes. As for gene expression, the overall

transcription levels of WD40 genes are much higher than those of all human genes. Fifty-two _hs_WD40 genes are highly expressed in a wide spectrum of tissues, while 20 _hs_WD40 genes are

tissue-specific. After integration of the phylogenetic patterns and expression profiles, we found that most widely and highly expressed _hs_WD40 genes originated early in evolution, while

most tissue-specific ones have late origin. Our work depicted a landscape of the _hs_WD40 protein family, including the subfamily classification, evolution, and gene expression. As the first

systematic study of an animal WD40 protein family, it can serve as an important complement to the published studies in plants, and it has identified animal-specific WD40s. These analyses

provided crucial insights regarding their evolutionary and functional implications, and will thus help us prioritize important ones for further experimental investigations. METHODS

IDENTIFICATION OF WD40 PROTEINS FROM THE HUMAN PROTEOME The sequences of the human reference proteome were downloaded from UniProt in April 2014 (http://www.uniprot.org/proteomes/, UP000005640,

UniProt Release 2014_03)25. WDSP software was adopted in a strict pipeline for the identification of human WD40 proteins (Supplementary Fig. S1). In brief, WDSP predicts the potential

repeats and calculates an average score for them. According to previous experience24, the minimum number of repeats is 6 for WD40 domains with known structures, and a WDSP repeat score

threshold as high as 48 greatly reduces false positive predictions. We therefore set 48 as the cut-off for the average score and 6 as the cut-off for the number of repeats to screen the

potentially reliable human WD40 proteins. The proteins that passed the filter were mapped to Ensembl gene identifiers and gene symbols by BioMart (http://grch37.ensembl.org/index.html,

Genome assembly version GRCh37.p13)40, and only the longest sequence was kept if multiple proteins were mapped to the same gene. Through manual curation, a protein was discarded if

clear annotations indicated that it belongs to another class of β-propeller proteins. This procedure ensured that the final WD40 protein set is reliable and non-redundant. DETERMINATION OF

DOMAIN ARCHITECTURES Domain annotation of WD40 proteins was performed locally using InterProScan 5 (version 5.10-50.0)41, with three domain annotation engines enabled, including

ProDom-2006.1, SMART-6.2, and PfamA-27.0. The WD40 repeats annotated by InterProScan were replaced with annotations by WDSP, since WDSP can provide more complete and precise WD40 repeat

annotations24. Based on the domain annotations, proteins with similar domain architectures were assigned to the same class. A schematic diagram of the domain architectures of _hs_WD40

proteins was drawn using IBS42. Functional enrichment was analysed using DAVID43,44 online, and GO functional terms with _p_-values less than 0.05 were considered significantly

enriched. DOMAIN AND REPEAT SEQUENCE ALIGNMENT Pairwise sequence alignments of WD40 domains and repeats were performed with the BLASTP program using default parameters45. A protein may

contain multiple WD40 domains. For the sake of simplicity, we split the repeats sequentially into groups of seven, and each group of seven repeats was regarded as an individual WD40 domain. If six repeats

were left, we also considered them an individual domain, and if fewer than six repeats were left, they were discarded in this analysis. If a protein contains multiple WD40 domains, each

domain was named after the gene symbol (or the protein ID) with a numeric suffix to avoid confusion. In the repeat alignment, we named each repeat based on the protein ID and a numeric

suffix. If two aligned sequences had an identity greater than 50%, and the average coverage of the two sequences in the aligned region was greater than 90%, they were defined as

a highly similar sequence pair. The graph of highly similar WD40 domain sequence pairs was prepared using Cytoscape46 and manually edited to add details such as the

chromosome numbering and additional domain names. CHROMOSOMAL LOCALIZATION Coordinates of _hs_WD40 genes in the human genome were obtained from the Ensembl website through BioMart

(http://grch37.ensembl.org/index.html, Genome assembly version GRCh37.p13)40. As BOP1 (ENSG00000261236) and CIRH1A (ENSG0000262788) are not located on well-assembled chromosomes, only 260

genes were included in the “WD40 map”, which was built using Circos47. The hypergeometric test was used to detect the chromosomes with biased WD40 abundance. Among the WD40

genes with highly similar domains, we also defined two WD40 genes adjacent to each other on the same chromosome with at most one spacer gene as tandemly arrayed genes (TAGs)48. PHYLOGENETIC

ANALYSIS AND PPI NETWORK STUDY Orthologs of human genes in Drosophila, Arabidopsis, and yeast were obtained from InParanoid 8 (http://inparanoid.sbc.su.se/, Version 8.0)49. According to the

presence or absence of orthologs, the genes were classified into different phylogenetic patterns. Specifically, the presence of orthologs in “Drosophila, Arabidopsis and yeast”, “only

Drosophila and Arabidopsis”, “only Drosophila”, and “none of the other three species” is represented by the symbols “+++”, “++−”, “+−−”, and “−−−”, respectively. The different

phylogenetic patterns can be used to indicate different times of evolutionary origin. The observed number of proteins was counted for each combination of the domain architecture classes

and the phylogenetic patterns, and the expected number for each combination was calculated with the assumption of independent marginal distributions. The ratios of the observed to the

expected counts, _i.e._, a measure of the relative abundance of proteins matching specific domain architecture types and phylogenetic patterns, were subjected to hierarchical clustering

(Euclidean distance and average linkage) to group domain architectures with similar distributions across phylogenetic patterns. The figure was prepared in the R programming

environment, and the colour depth represents the ratio. Classes 1 and 21 were not presented, since Class 1 contains proteins with only WD40 domains and Class 21 contains many kinds of

domain architectures. The human PPI dataset was downloaded from HIPPIE33 (v2.0), and only the PPIs detected by at least two methods and with a score of at least 0.5 were used for further

analysis. Degree of each node in the PPI network was calculated by using Cytoscape46, and the Wilcoxon rank-sum test was performed to test whether there are significant differences between

the degrees of WD40 proteins with different phylogenetic patterns. Domain-domain interaction prediction was performed by a parsimony approach implemented by linear programming34. Because the

large number of PPIs (more than 70,000) in the dataset impeded a thorough computation, we randomly sampled 2,000 PPIs for predicting the DDIs, and repeated the process 1,000 times. For

each run, we calculated the percentage of WD40 domain-mediated PPIs in multi-domain WD40 protein-associated PPIs, to estimate the degree of involvement of WD40 domains in multi-domain WD40

proteins. GENE EXPRESSION ANALYSIS Expression data of WD40 genes were obtained from the RNA-seq dataset in the Human Protein Atlas database, which assayed the expression levels of coding RNAs

from 95 individuals in 27 different human tissues35. The gene expression levels were denoted by FPKM (Fragments Per Kilobase of transcript per Million fragments mapped), and the data were

downloaded from the ArrayExpress website (ID: E-MTAB-1733). For each tissue, the FPKM values of every gene were averaged among all individual samples. Consistent with the original article35,

genes with FPKM less than 1.0 in all 27 tissues were termed “Not detected”, and were treated as 0 in the fold change calculation. Genes with FPKM greater than 1.0 in all 27 tissues were

defined as “Expressed in all”, and if all values were greater than 10, they were termed “Highly expressed in all” or “House-keeping genes”. The “Tissue-specific” WD40 genes were defined as genes

whose FPKM value in a specific tissue is at least 5-fold greater than in every other tissue, which covers both “Tissue-specific” and “Tissue-enriched” in the original article. ADDITIONAL INFORMATION

HOW TO CITE THIS ARTICLE: Zou, X.-D. _et al_. Genome-wide Analysis of WD40 Protein Family in Human. _Sci. Rep._ 6, 39262; doi: 10.1038/srep39262 (2016). PUBLISHER'S NOTE: Springer

Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. REFERENCES

1. Stirnimann, C. U., Petsalaki, E., Russell, R. B. & Muller, C. W. WD40 proteins propel cellular networks. Trends Biochem. Sci. 35, 565–574 (2010).
2. Smith, T. F., Gaitatzes, C., Saxena, K. & Neer, E. J. The WD repeat: a common architecture for diverse functions. Trends Biochem. Sci. 24, 181–185 (1999).
3. Li, D. & Roberts, R. WD-repeat proteins: structure characteristics, biological function, and their involvement in human diseases. Cell Mol. Life Sci. 58, 2085–2097 (2001).
4. Neer, E. J., Schmidt, C. J., Nambudripad, R. & Smith, T. F. The ancient regulatory-protein family of WD-repeat proteins. Nature 371, 297–300 (1994).
5. Xu, C. & Min, J. Structure and function of WD40 domain proteins. Protein Cell 2, 202–214 (2011).
6. Gaudet, R., Bohm, A. & Sigler, P. B. Crystal structure at 2.4 angstroms resolution of the complex of transducin betagamma and its regulator, phosducin. Cell 87, 577–588 (1996).
7. Ruthenburg, A. J. et al. Histone H3 recognition and presentation by the WDR5 module of the MLL1 complex. Nat. Struct. Mol. Biol. 13, 704–712 (2006).
8. Wakasugi, M. et al. DDB accumulates at DNA damage sites immediately after UV irradiation and directly stimulates nucleotide excision repair. J. Biol. Chem. 277, 1637–1640 (2002).
9. Jennings, B. H., Pickles, L. M., Wainwright, S. M., Roe, S. M., Pearl, L. H. & Ish-Horowicz, D. Molecular recognition of transcriptional repressor motifs by the WD domain of the Groucho/TLE corepressor. Mol. Cell 22, 645–655 (2006).
10. Znaidi, S., Pelletier, B., Mukai, Y. & Labbe, S. The Schizosaccharomyces pombe corepressor Tup11 interacts with the iron-responsive transcription factor Fep1. J. Biol. Chem. 279, 9462–9474 (2004).
11. Yan, C., Hang, J., Wan, R., Huang, M., Wong, C. C. & Shi, Y. Structure of a yeast spliceosome at 3.6-angstrom resolution. Science 349, 1182–1191 (2015).
12. Higa, L. A., Wu, M., Ye, T., Kobayashi, R., Sun, H. & Zhang, H. CUL4-DDB1 ubiquitin ligase interacts with multiple WD40-repeat proteins and regulates histone methylation. Nat. Cell Biol. 8, 1277–1283 (2006).
13. Zou, H., Henzel, W. J., Liu, X., Lutschg, A. & Wang, X. Apaf-1, a human protein homologous to C. elegans CED-4, participates in cytochrome c-dependent activation of caspase-3. Cell 90, 405–413 (1997).
14. Zhan, P. et al. FBXW7 negatively regulates ENO1 expression and function in colorectal cancer. Lab Invest. 95, 995–1004 (2015).
15. Wang, X. et al. Fbxw7 regulates hepatocellular carcinoma migration and invasion via Notch1 signaling pathway. Int. J. Oncol. 47, 231–243 (2015).
16. Ramasamy, S. et al. Tle1 tumor suppressor negatively regulates inflammation _in vivo_ and modulates NF-kappaB inflammatory pathway. Proc. Natl. Acad. Sci. USA 113, 1871–1876 (2016).
17. Ozawa, T. et al. A novel WDR45 mutation in a patient with static encephalopathy of childhood with neurodegeneration in adulthood (SENDA). Am. J. Med. Genet. A. 164A, 2388–2390 (2014).
18. Nicholas, A. K. et al. WDR62 is associated with the spindle pole and is mutated in human microcephaly. Nat. Genet. 42, 1010–1014 (2010).
19. van Nocker, S. & Ludwig, P. The WD-repeat protein superfamily in Arabidopsis: conservation and divergence in structure and function. BMC Genomics 4, 50 (2003).
20. Ouyang, Y., Huang, X., Lu, Z. & Yao, J. Genomic survey, expression profile and co-expression network analysis of OsWD40 family in rice. BMC Genomics 13, 100 (2012).
21. Li, Q., Zhao, P., Li, J., Zhang, C., Wang, L. & Ren, Z. Genome-wide analysis of the WD-repeat protein family in cucumber and Arabidopsis. Mol. Genet. Genomics 289, 103–124 (2014).
22. Mishra, A. K., Muthamilarasan, M., Khan, Y., Parida, S. K. & Prasad, M. Genome-wide investigation and expression analyses of WD40 protein family in the model plant foxtail millet (Setaria italica L.). PLoS One 9, e86852 (2014).
23. Zhu, Y. et al. Genome-wide identification, sequence characterization, and protein-protein interaction properties of DDB1 (damaged DNA binding protein-1)-binding WD40-repeat family members in Solanum lycopersicum. Planta 241, 1337–1350 (2015).
24. Wang, Y., Jiang, F., Zhuo, Z., Wu, X. H. & Wu, Y. D. A method for WD40 repeat detection and secondary structure prediction. PLoS One 8, e65705 (2013).
25. Magrane, M. & Consortium, U. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) 2011, bar009 (2011).
26. Zhang, J. Evolution by gene duplication: an update. Trends Ecol. Evol. 18, 292–298 (2003).
27. Vogel, C., Teichmann, S. A. & Pereira-Leal, J. The relationship between domain duplication and recombination. J. Mol. Biol. 346, 355–365 (2005).
28. Chaudhuri, I., Soding, J. & Lupas, A. N. Evolution of the beta-propeller fold. Proteins 71, 795–803 (2008).
29. Zhang, L., Lu, H. H., Chung, W. Y., Yang, J. & Li, W. H. Patterns of segmental duplication in the human genome. Mol. Biol. Evol. 22, 135–141 (2005).
30. Shoja, V. & Zhang, L. A roadmap of tandemly arrayed genes in the genomes of human, mouse, and rat. Mol. Biol. Evol. 23, 2134–2141 (2006).
31. Collins, F., Lander, E., Rogers, J., Waterston, R. & Conso, I. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
32. Elsayed, S. M. et al. Non-manifesting AHI1 truncations indicate localized loss-of-function tolerance in a severe Mendelian disease gene. Hum. Mol. Genet. 24, 2594–2603 (2015).
33. Schaefer, M. H., Fontaine, J. F., Vinayagam, A., Porras, P., Wanker, E. E. & Andrade-Navarro, M. A. HIPPIE: Integrating protein interaction networks with experiment based quality scores. PLoS One 7, e31826 (2012).
34. Guimaraes, K. S., Jothi, R., Zotenko, E. & Przytycka, T. M. Predicting domain-domain interactions using a parsimony approach. Genome Biol. 7, R104 (2006).
35. Fagerberg, L. et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell Proteomics 13, 397–406 (2014).
36. Humphray, S. J. et al. DNA sequence and analysis of human chromosome 9. Nature 429, 369–374 (2004).
37. Saitsu, H. et al. De novo mutations in the autophagy gene WDR45 cause static encephalopathy of childhood with neurodegeneration in adulthood. Nat. Genet. 45, 445–449, 449e441 (2013).
38. Park, J. Y. et al. Breast cancer-associated missense mutants of the PALB2 WD40 domain, which directly binds RAD51C, RAD51 and BRCA2, disrupt DNA repair. Oncogene 33, 4803–4812 (2014).
39. Park, J. Y., Zhang, F. & Andreassen, P. R. PALB2: the hub of a network of tumor suppressors involved in DNA damage responses. BBA-Rev. Cancer 1846, 263–275 (2014).
40. Flicek, P. et al. Ensembl 2013. Nucleic Acids Res. 41, D48–55 (2013).
41. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
42. Liu, W. et al. IBS: an illustrator for the presentation and visualization of biological sequences. Bioinformatics 31, 3359–3361 (2015).
43. Huang da, W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).
44. Huang da, W., Sherman, B. T. & Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13 (2009).
45. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
46. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
47. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
48. Pan, D. & Zhang, L. Tandemly arrayed genes in vertebrate genomes. Comp. Funct. Genomics. 545269 (2008).
49. Sonnhammer, E. L. & Ostlund, G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res. 43, D234–239 (2015).

ACKNOWLEDGEMENTS The authors would like to thank Dr. Xin-Hao Zhang, Dr. Fan Jiang,

Dr. Dong-Yang Li, Dr. Olaf Wiest, and Dr. Ge Gao for their valuable suggestions and discussions. This work was supported by National Natural Science Foundation of China (21133002,

31471243); Shenzhen Program (JCYJ20140509093817689, JCYJ20140509093817686, KQTD201103). AUTHOR INFORMATION AUTHORS AND AFFILIATIONS * Lab of Computational Chemistry and Drug Design,

Laboratory of Chemical Genomics, Peking University Shenzhen Graduate School, Shenzhen, 518055, P.R. China Xu-Dong Zou, Xue-Jia Hu, Jing Ma, Tuan Li, Zhi-Qiang Ye & Yun-Dong Wu * College

of Chemistry, Peking University, Beijing, 100871, P.R. China Yun-Dong Wu CONTRIBUTIONS Y.D.W. and Z.Q.Y. conceived this study; X.D.Z., Z.Q.Y., X.J.H., J.M. and T.L.

analysed the data; X.D.Z., Z.Q.Y. and Y.D.W. wrote the manuscript; all the authors reviewed the manuscript. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing financial

interests. ELECTRONIC SUPPLEMENTARY MATERIAL SUPPLEMENTARY INFORMATION SUPPLEMENTARY DATASET 1 RIGHTS AND PERMISSIONS This work is licensed under a Creative Commons Attribution 4.0

International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the

material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit

http://creativecommons.org/licenses/by/4.0/ ABOUT THIS ARTICLE CITE THIS ARTICLE Zou, XD., Hu, XJ., Ma, J. _et al._ Genome-wide Analysis of WD40 Protein Family in

Human. _Sci Rep_ 6, 39262 (2016). https://doi.org/10.1038/srep39262 * Received: 15 August 2016 * Accepted: 22 November 2016 * Published: 19 December 2016 * DOI:

https://doi.org/10.1038/srep39262