ABSTRACT

Deconvolution is an efficient approach for detecting cell-type-specific (cs) transcriptomic signals without cellular segmentation. However, this type of method may require a
reference profile from the same molecular source and tissue type. Here, we present a method to dissect bulk proteome by leveraging tissue-matched transcriptome and proteome without using a
proteomics reference panel. Our method also selects the proteins contributing to the cellular heterogeneity shared between bulk transcriptome and proteome. The deconvoluted result enables
downstream analyses such as cs-protein Quantitative Trait Loci (cspQTL) mapping. We benchmarked the performance of this multimodal deconvolution approach through CITE-seq pseudo bulk data, a
simulation study, and the bulk multi-omics data from human brain normal tissues and breast cancer tumors, individually, showing robust and accurate cell abundance quantification across
different datasets. This algorithm is implemented in a tool MICSQTL that also provides cspQTL and multi-omics integrative visualization, available at
https://bioconductor.org/packages/MICSQTL.

INTRODUCTION

Proteomics profiling and analysis at cell type level is critical in the study of
complex biological systems with numerous applications in immunology, cancer research, and developmental biology1,2,3. Several technologies have been developed to identify and quantify
proteins at cellular resolution4. For example, the detection of proteins by CyTOF coupled with fluorescence-activated cell sorting (FACS)5 and the single cell multimodal technology
CITE-seq6, a sequencing technique that enables simultaneous profiling of gene expression and up to 300 cell surface protein markers in individual cells, allow the
identification of rare cell types and cells that express low levels of certain genes. However, these technologies measure only a limited number of proteins. The abundance of proteins not
detectable in CyTOF or CITE-seq may be strikingly different from the transcript expression of coding genes and cannot be approximated by scRNA-seq measurement because of RNA/protein
degradation and post-translational modifications. Recent advances in liquid chromatography mass spectrometry (LC-MS)-based proteomics methods have addressed the limitations in sensitivity
and throughput7,8, which accelerates the development of single cell mass spectrometry (scMS) proteomics. One major challenge in scMS proteomics is that the number of unique samples and cells
analyzed in a single day is very limited9. For label-free scMS, samples are analyzed sequentially with analysis time ranging from 35 to 90 min. At this speed, a maximum of 40 single cells
could be analyzed in a day, which is not ideal for population-scale clinical studies due to the burden of time and costs. The scMS technology limitations and costs of cell sorting are the
hurdles for cell-type-specific inference such as differential expression (csDE) or protein quantitative trait loci (cspQTL) mapping, which require medium or large sample sizes. Deconvolution
algorithms are being rapidly developed to measure the molecule proportions (e.g., RNA transcripts) mapped to each cell type, which differ from the cell count composition and vary across
molecular sources. To estimate the cellular composition in the human proteome, a pure cell or single cell reference proteome (i.e., signature matrix) is needed but is lacking in certain tissues
or cell types due to the challenges in cellular dissociation (e.g., astrocytes and excitatory neurons2) and the aforementioned limitations in scMS and CITE-seq. Meanwhile, sample-matched multi-omics
profiling has become popular in the recent decade, enabling integration across data sources and the discovery of multimodal signatures for disease1. Hence, we designed an
algorithm to estimate the proteomics cell fractions by integrating bulk transcriptome-proteome data without a single cell reference proteome, implemented in the R package MICSQTL. Our method enables
the downstream cell-type-specific protein quantitative trait loci mapping (cspQTL) based on the mixed-cell proteomes and pre-estimated proteomics cellular composition, without the need for
large-scale single cell sequencing10 or cell sorting.

RESULTS

MICSQTL ALGORITHMS AND IMPLEMENTATION

Our method quantifies the cell abundances in proteins by jointly deconvoluting matched
bulk transcriptome and proteome, which can be used in downstream cspQTL mapping (Fig. 1). For each tissue sample with bulk transcriptome-proteome, we model the cellular compositions _Θ_(_i_)
in the _i_th modality as a product of tissue-specific cell counts fractions _P_ and molecule source-specific cell size factors _S_(_i_). A Joint Non-negative Matrix Factorization (JNMF)
framework was employed to link the modalities through shared cell counts _P_, allowing individualized multimodal reference panels. We employ a loss function that integrates the observed bulk
RNA and protein expressions to optimize the cell abundances in each molecular source, as described in Methods. The parameters in JNMF are initialized by an RNA signature matrix of similar
tissue type and the RNA proportions pre-estimated by CIBERSORT (CBS)11 with this signature matrix, the first of which can be obtained from scRNA-seq or sorted cell RNA-seq data accessible in
many public repositories. Hence, this joint deconvolution algorithm is semi reference-free without using a single cell or pure cell proteomics reference profile, implemented in the function
_deconv_. In proteomics deconvolution, researchers may not have a priori knowledge about the cell marker proteins to be used for certain cellular subpopulations, but the cell marker genes
in transcriptome have been broadly identified and curated in public databases12. Here, we use the AJIVE framework13 to construct a common space shared across two molecular sources: bulk RNA
expression of cell marker genes and the sample-matched whole proteome, which captures the between-sample heterogeneity caused by cellular abundance variation. Next, the observed whole
proteome is projected onto this shared space by employing the reduced-rank loadings from AJIVE, where protein dimensions and annotations are unchanged. The rank of the loadings is determined by
the Wedin bound procedure built into AJIVE13 (see Methods). The potential cell marker proteins are selected based on the feature-wise Euclidean distance between the projected and observed proteomes. This cross-source feature
selection is similar to ReFACTor14, but ReFACTor is only applicable to a single modality and sets the rank of the loadings by the assumed number of cell types. Hence, we name our signature
selection procedure ‘AJ-RF’ and use the selected proteins in joint deconvolution, implemented in the function _ajive_decomp_.

VALIDATION USING MULTIMODAL EXPRESSION FROM CITE-SEQ

We first
validate our algorithm by using the pseudo bulk multimodal expression profiles built from a public CITE-seq dataset with 161,764 human peripheral blood mononuclear cells (PBMCs). These
samples were collected from eight donors1 and processed by 10× 3′ technology and Seurat v4. For each single cell, the expression of 228 surface proteins and more than
30,000 RNA genes was measured. Hence, we generated pseudo bulk expression data for a list of RNA cell marker genes and 228 surface proteins by aggregating feature-wise abundance or Unique
Molecular Identifier (UMI) counts across the cells per donor. The ground truth of cellular fractions can be obtained from the annotated cell labels, which are identical for RNA and protein in
CITE-seq. We selected four donors (samples P1 and P7 in Fig. 2a) with disparities in B cell, natural killer (NK) cell, dendritic cell (DC), and other T cell abundances to generate the pseudo
signature matrix. The cell count fractions quantified by our algorithm were similar to the true cell count fractions captured and annotated in CITE-seq (i.e., Pearson correlation (r) =
0.91, Lin’s concordance correlation coefficient (CCC) = 0.91), as demonstrated in Fig. 2b, showing an improvement over the CBS method (r = 0.88, CCC = 0.85). The individualized
cell-type-specific expression is exemplified for three cell types: B cell, CD4 T cell, and CD8 cell in Fig. 2c–h. Although the individualized cell-type-specific protein expression displayed
larger variance, it remained significantly correlated with the observed pure cell bulk expression (Fig. 2c–e). This discrepancy may arise from the distinct nature of cell surface protein
measurement through a binding strategy, as opposed to RNA UMI counts. The RNA expression levels resolved by our algorithm were well-aligned near the diagonal line with _r_ > 0.9 across
all three cellular populations (Fig. 2f–h). Further, we examined the impact of varying step sizes on the results obtained from the same input data and different rescaling approaches. The
results in Supplementary Note 1 Figs. S1–S5 suggest that, despite ambiguity in the optimal step size, moderate adjustments in step size and rescaling by log or MinMax do not lead to
substantial change in the deconvoluted cellular composition. Overall, the above results demonstrate the validity and power of our algorithm without reliance on a single cell (type)
proteomics reference.

ASSESSMENT USING SIMULATION DATA

The above pseudo bulk data from CITE-seq only provides the ground truth for cell count fractions rather than the cell proportions in
each molecule source. To mimic the possible differences in RNA vs. protein cellular compositions and compare the performance of distinct methods, we rigorously designed a simulation study to
generate synthetic bulk transcriptome-proteome with ground truth of modality-specific cellular compositions. The statistical models used in data generation are described in Supplementary
Note 2, in which the statistical parameters are extracted from a public scRNA-seq dataset of human brain15, a public single-cell-type mouse brain proteomics dataset16, and a bulk proteomics dataset
of human brain described in the next section. We designed two scenarios to allow relatively low (scenario A) and high (scenario B) correlations between the protein and RNA proportions, as
visualized in Fig. 3a, b. We compared our algorithm to three existing methods: the RNA proportions resolved by CBS (i.e., the initial cell counts fractions used in JNMF) as surrogate protein
proportions, the cellular fractions estimated by TCA with CBS RNA proportions as initial value and the bulk proteomes as target data, and a deep learning-based method scpDeconv17 using
single-cell reference data for training. To the best of our knowledge, high-quality single cell proteomics data from human prefrontal cortex are not publicly available. Therefore, we chose
the scRNA-seq data used in the above data generation as a surrogate single cell reference for scpDeconv. The output from each method was compared to the ground truth protein proportion via
mean absolute error (MAE) in Fig. 3c–d, g–h and CCC in Fig. 3e–f, i–j. To deconvolve the synthetic bulk RNA-seq data with CBS, we constructed _n_ = 50 replicates of pseudo signature matrix
per scenario by introducing small vs. large random noises to the ground truth of subject-specific reference transcriptome. The RNA cell marker genes were chosen based on gene-wise
coefficients of variation from the true reference panel, while 700 out of 1000 proteins were selected with AJ-RF. For each pseudo signature matrix and the corresponding initial CBS estimate,
JNMF significantly improved the accuracy of cellular compositions in proteomes compared to CBS, as validated by a paired _t_ test (Fig. 3c–j). Our method also outperformed TCA and scpDeconv
across the scenarios and pseudo signature matrices, even though a single cell (type) proteomics reference profile was lacking. Notably, deconvolution with the proteins selected by AJ-RF similarly
outperformed the competing methods, implying the efficacy of AJ-RF in capturing the latent cellular heterogeneity. We also benchmarked the robustness of our algorithm via initialization with
Non-Negative Least Squares (NNLS) RNA proportions and assessed the computation power of scpDeconv in bulk RNA deconvolution. Supplementary Fig. S6 presents statistically significant
improvement by JNMF in MAE and CCC, although the NNLS estimate was less accurate than the CBS initial estimate and restricted the absolute performance of our method. The result of the proteomics
deconvolution tool scpDeconv applied to bulk RNA-seq data was even worse than NNLS because of the inherent differences between RNA and protein molecules and the distinct profiling
technologies. Again, our algorithm demonstrated superior accuracy and robustness in predicting proteomics cell proportions compared to alternative approaches.

APPLICATION TO BULK TRANSCRIPTOME-PROTEOME IN BIPOLAR DISORDER, SCHIZOPHRENIA, AND HEALTHY CONTROLS

We used tissue-matched transcriptome-proteome from postmortem human brain prefrontal cortex in a study of (_n_
= 25) bipolar disorder (BP) and (_n_ = 45) schizophrenia (SCZ) cases and (_n_ = 194) healthy controls to demonstrate the performance of MICSQTL. The details of human brain tissues, MS
proteomics and RNA-seq transcriptomics profiling are available in Supplementary Note 3. The signature genes used in competing methods were selected from previous findings16, while the
(initial) signature matrix was generated by scRNA-seq data of healthy human prefrontal cortex15. The protein proportions quantified by JNMF-AJ-RF algorithm (Fig. 4a) showed possible
disparities in astrocyte, microglia, and oligodendrocyte abundances between the (BP and/or SCZ) cases and controls whose age at death was under 70 years. This result aligns
with the expected cell abundance in human prefrontal cortex15,18 and is similar to the previous findings about glia cell abundance in SCZ19,20. However, there are confounding factors not
regressed out in this dataset such as sub-cohorts and aging. The validity of cellular compositions in BP and SCZ compared to controls requires more experiments or an external cohort with
adequate sample size of single cell proteomes or transcriptomes. On the other hand, the initial CBS RNA deconvolution failed to recover the expected mean abundances in inhibitory neuron and
microglia (Fig. 4b). We also applied CBS to the bulk proteomes with a proteomics signature matrix of four cell types profiled from major regions of mouse brain16, which neither decomposed
the neuron cells into excitatory and inhibitory subpopulations nor recovered the expected cell abundances (Fig. 4c). A downstream cspQTL screening was performed by cell-type-specific
differential analysis, implemented in the function _csQTL_. The input data include bulk proteomes, whole genome sequencing genetic variants (SNP), and protein proportions in Fig. 4a and
Supplementary Data 4. We selected several mutation markers reported in previous literature to illustrate the csQTL function in MICSQTL. For each protein of interest, we select the nearby
SNPs within a genomic distance of 1 million bases (Mb). Figure 4d, e shows the SNPs with false discovery rate (FDR) adjusted _p_ value < 0.2 per cell type for the proteins encoded by
genes associated with neurodevelopmental or neuropsychiatric illness: _MAST4_ and _ADCYAP1R1_. These genes or the related gene family were reported as mutations associated with increased
risks for Alzheimer’s disease21, mega-corpus-callosum syndrome22, and post-traumatic stress disorder23. The adjusted _p_ values are listed in Supplementary Data 4, while cspQTL results for
additional genes are shown in Supplementary Fig. S7. Note that the validity of the cspQTL results depends on the study populations for the bulk proteomes and the accuracy of proteomics deconvolution, which
should be investigated through a rigorous single cell study measuring the proteins of interest with adequate sample size. Last but not least, our tool MICSQTL outputs the multivariate Common
Normalized Scores (CNS) from AJIVE, which represent the sample-specific variation shared across transcriptome and proteome and may uncover the heterogeneity across disease phenotypes and
the hidden drivers. The multi-omics human brain tissue samples were jointly visualized in Fig. 4f by using the concatenated CNS, which outperforms the single modality visualization (Fig. 4g,
h).

APPLICATION TO MULTI-OMICS DATA IN BREAST CANCER

To demonstrate our method in different tissue types, we utilized the patient-matched bulk RNA-seq and MS proteomics data from _n_ = 122
breast cancer (BC) tumor tissues24 along with the single cell multi-omics data in an external BC cohort25. We first applied CBS to the bulk RNA-seq data24 and a signature matrix built from
the external scRNA-seq profiles25, which served as the initial values in JNMF algorithm. The aforementioned deep learning deconvolution tool scpDeconv was applied to the bulk MS proteomics
data, using the single cell CyTOF proteomics profiles from external BC tissues25 to train the autoencoder. The input single cell and bulk data were log-transformed for CBS and
JNMF-AJ-RF and z-score standardized for scpDeconv. The deconvolution result of each method was benchmarked against the annotated cell fractions in the scRNA-seq data of external BC tissues25 (Fig. 5a).
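The input transformations mentioned above (log scale and z-score), together with the MinMax rescaling discussed for the CITE-seq experiments, can be sketched in a few lines of Python. This is only an illustration and not the MICSQTL or scpDeconv code: the function names, the pseudocount of 1 in the log transform, and the use of the population standard deviation in the z-score are assumptions made here.

```python
import math

def log_transform(values):
    """Log scale with a pseudocount of 1 (an assumption) so zeros stay finite."""
    return [math.log1p(v) for v in values]

def zscore(values):
    """Center to mean 0 and scale to unit population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0.0:
        return [0.0] * n  # a constant feature has no variation to scale
    return [(v - mean) / sd for v in values]

def minmax(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical abundances of one feature across four samples.
expr = [0.0, 10.0, 100.0, 1000.0]
log_expr = log_transform(expr)
z_expr = zscore(expr)
mm_expr = minmax(expr)
```

The log transform compresses the dynamic range while keeping values non-negative, whereas the z-score produces negative values, which is one reason a non-negative factorization and a neural network may call for different preprocessing.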
The tumor microenvironment composition deconvolved by CBS with bulk transcriptomes (Fig. 5b) was substantially improved by applying JNMF-AJ-RF to the bulk multi-omics profiles (Fig. 5c and
Supplementary Data 5). The proteomics tumor microenvironment composition resolved by a scpDeconv submodule that only uses the (seven) proteins detected in both single cell CyTOF panel and
bulk MS proteomes was the best estimate (Fig. 5d). However, this advantage of scpDeconv may rely on the accurate measurement or low dimension of marker proteins in the CyTOF reference
panel and may not hold for untargeted single cell reference proteomics data (e.g., scMS). To illustrate this caveat, we ran another submodule of scpDeconv that imputed the single
cell reference for additional (20 or 50) highly variable proteins (HVP) only detected in the bulk proteome. For either set of HVP, this module was rerun ten times to assess the variation of
performance. The replicates of deconvoluted proportions in Supplementary Note 1 Figs. S8–S9 demonstrate that predicting the abundance of proteins not available in CyTOF reference panel may
randomly reduce the accuracy of deconvolution. In other words, the performance of scpDeconv depends on the availability and quality of protein markers measured at single cell level.
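The accuracy measures used in the benchmarks above, namely the mean absolute error (MAE), the Pearson correlation (r), and Lin's concordance correlation coefficient (CCC), can be computed as in the following Python sketch. The function names and the toy proportions are hypothetical. The CCC follows the standard definition 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²) with population moments, which may differ slightly from the estimator used to produce the figures.

```python
import math

def mae(estimate, truth):
    """Mean absolute error between estimated and true proportions."""
    return sum(abs(e - t) for e, t in zip(estimate, truth)) / len(truth)

def _moments(x, y):
    """Population means, variances, and covariance of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cov

def pearson(x, y):
    """Pearson correlation: measures linear association only."""
    _, _, vx, vy, cov = _moments(x, y)
    return cov / math.sqrt(vx * vy)

def ccc(x, y):
    """Lin's CCC: also penalizes shifts in location and scale."""
    mx, my, vx, vy, cov = _moments(x, y)
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Hypothetical true and estimated fractions for three cell types.
truth = [0.50, 0.30, 0.20]
estimate = [0.45, 0.35, 0.20]
error = mae(estimate, truth)
r = pearson(estimate, truth)
concordance = ccc(estimate, truth)
```

Because the CCC never exceeds the absolute Pearson correlation, reporting both (as in the CITE-seq validation) shows whether the estimates track the truth only in rank or trend, or also in absolute value.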
DISCUSSION

Our pipeline offers three primary functions to perform multi-omics cell abundance quantification with or without marker protein selection, integrative visualization, and cspQTL
mapping. The semi reference-free joint quantification of cellular compositions in RNA and proteins was benchmarked on multiple datasets: pseudo bulk RNA and protein expression
constructed from CITE-seq single cell data, synthetic bulk multimodal expression generated by statistical models, and real human brain bulk transcriptome-proteome with external snRNA-seq
data. Overall, JNMF coupled with cross-modality signature protein selection significantly improves the cell abundance quantification of MS proteomes compared to CBS, TCA, and scpDeconv with
imputation. Our algorithm also identifies the proteins contributing to the latent cellular heterogeneity in bulk multi-omics profiles, which may elucidate the cell marker proteins for
certain tissue types. This multimodal deconvolution framework is more favorable in population-scale studies compared to the single modal deconvolution or single cell profiling, since it
neither relies on a single cell proteomics reference profile nor requires cell clustering and labeling. The hyperparameter of our algorithm, the Projected Gradient Descent (PGD) step size, has marginal impact on the
estimated proportions and can be adjusted according to the scale of input data. In the current version of MICSQTL, we do not suggest the optimal scaling approach since the accuracy of
purified data varies and may affect the downstream analysis such as integrating purified multi-omics profiles. According to the extensive experiments in CITE-seq pseudo bulk data, the impact
of step size on cellular fractions was reduced by the MinMax rescaling across all the features, but the purified RNA expression was distorted. Another potential utility of our tool is the
high-resolution purification of individualized pure cell expression, which is an essential component in the output of our algorithm and paves the way for deep proteome profiling at single
cell type resolution. However, the current implementation of joint deconvolution algorithm emphasizes the accuracy of cell abundance quantification, which may sacrifice the power of
high-resolution purification. A possible solution to improve individualized multimodal pure cell expression is to train a deep learning model (such as autoencoder26) on the observed bulk
proteomes and the reference panels pre-estimated by the JNMF algorithm. Meanwhile, it is worth employing Stochastic Gradient Descent in the future to reduce the errors in pre-estimated reference
panels. Altogether, the JNMF deconvolution algorithm substantially improves the cell abundance estimation in bulk proteome by integrating modalities and using PGD for high-dimensional
parameter optimization. The impact of PGD optimization on the quantified cell abundances and individualized purification will be assessed extensively in the future. Our tool MICSQTL not only
fills the methodological gap in bulk proteomics deconvolution without using single cell proteomics data, but also sheds light on the design of a comprehensive experiment that profiles single
cell MS proteomes matched to bulk samples to benchmark the performance of different deconvolution tools.

METHODS

MULTIMODAL JOINT DECONVOLUTION

For the tissue biospecimen of individual _i_,
we measure the expressions of protein _g_ (_g_ = 1, …, _G_) and mRNA transcript (or gene) _m_ (_m_ = 1, …, _M_), denoted by \(y_{gi}^{(1)}\) and \(y_{mi}^{(2)}\), respectively. The
unobserved and individualized pure cell expressions are denoted by \(x_{gik}^{(1)}\) and \(x_{mik}^{(2)}\). The molecular source-specific cellular fractions for cell type _k_ (_k_ = 1, …, _K_) are
\(\theta_{ik}^{(1)}\) and \(\theta_{ik}^{(2)}\), determined by the common tissue-specific cell count fractions \(p_{ik}\) and the source-specific cell size factors
\(s_{ik}^{(1)}\) and \(s_{ik}^{(2)}\). That is, \(\theta_{ik}^{(1)} = p_{ik} s_{ik}^{(1)}\) and \(\theta_{ik}^{(2)} = p_{ik} s_{ik}^{(2)}\). Thus, the bulk multimodal expression data
are modeled as

$$E(y_{gi}^{(1)}) = \sum_{k=1}^{K} x_{gik}^{(1)} \theta_{ik}^{(1)} = \sum_{k=1}^{K} x_{gik}^{(1)} p_{ik} s_{ik}^{(1)},$$

$$E(y_{mi}^{(2)}) = \sum_{k=1}^{K} x_{mik}^{(2)} \theta_{ik}^{(2)} = \sum_{k=1}^{K} x_{mik}^{(2)} p_{ik} s_{ik}^{(2)}.$$

We denote \(\boldsymbol{y}_{i}^{(1)} = [y_{gi}^{(1)}]_{G\times 1}\), \(\boldsymbol{X}_{i}^{(1)} = [x_{gik}^{(1)}]_{G\times K}\),
\(\boldsymbol{y}_{i}^{(2)} = [y_{mi}^{(2)}]_{M\times 1}\), \(\boldsymbol{X}_{i}^{(2)} = [x_{mik}^{(2)}]_{M\times K}\),
\(\boldsymbol{P}_{i} = \mathrm{diag}(p_{i1}, \ldots, p_{iK})\), \(\boldsymbol{s}_{i}^{(1)} = [s_{ik}^{(1)}]_{K\times 1}\) and
\(\boldsymbol{s}_{i}^{(2)} = [s_{ik}^{(2)}]_{K\times 1}\). The above source-specific models are linked by the common tissue-specific cell count fractions
\(\boldsymbol{P}_{i}\). Hence, we propose to jointly estimate the high-dimensional non-negative parameters
\(\boldsymbol{\eta}_{i} = \{\boldsymbol{X}_{i}^{(1)}, \boldsymbol{X}_{i}^{(2)}, \boldsymbol{P}_{i}, \boldsymbol{s}_{i}^{(1)}, \boldsymbol{s}_{i}^{(2)}\}\)
by minimizing a loss function that integrates \(\boldsymbol{y}_{i}^{(1)}\) and \(\boldsymbol{y}_{i}^{(2)}\). This is achieved by solving

$$\hat{\boldsymbol{\eta}}_{i} = \arg\min_{\boldsymbol{\eta}_{i}} \left\{ \left\Vert \boldsymbol{y}_{i}^{(1)} - \boldsymbol{X}_{i}^{(1)} \boldsymbol{P}_{i} \boldsymbol{s}_{i}^{(1)} \right\Vert^{2} + \left\Vert \boldsymbol{y}_{i}^{(2)} - \boldsymbol{X}_{i}^{(2)} \boldsymbol{P}_{i} \boldsymbol{s}_{i}^{(2)} \right\Vert^{2} \right\}$$

subject to \(\min\{\boldsymbol{\eta}_{i}\} \ge 0\). This loss function integrates the observed multimodal bulk data through the shared cell counts \(\boldsymbol{P}_{i}\) as an extension to the
Non-negative Matrix Factorization. This algorithm initializes the multi-omics reference panels \(\boldsymbol{X}_{i}^{(1)}, \boldsymbol{X}_{i}^{(2)}\) and the
cell count fractions \(\boldsymbol{P}_{i}\) with an external RNA-seq signature matrix and the RNA proportions pre-estimated by CBS. These sample-wise parameters of multimodal reference panels and cellular
compositions are then optimized by the Projected Gradient Descent algorithm27, being simultaneously updated and adapted to tissue-specific bulk multi-omics profiles.

ALGORITHM

1. The initial values of \(\boldsymbol{X}_{i}^{(1)}, \boldsymbol{X}_{i}^{(2)}\) are the cell-type-specific expression from an external RNA signature matrix, while the initial values of \(s_{ik}^{(1)}\), \(s_{ik}^{(2)}\) are ones. The genes in the initial \(\boldsymbol{X}_{i}^{(1)}\) should be mapped to the features in the bulk proteome \(\boldsymbol{y}_{i}^{(1)}\). Initialize \(\boldsymbol{P}_{i}\) by applying CIBERSORT to the target bulk RNA-seq data with the above (RNA) signature matrix.
2. Evaluate the gradient of the loss function at the current estimates: \(\boldsymbol{l}(\boldsymbol{\eta}_{i}^{(s)})\).
3. Update the parameters with non-negative bounds: \(\boldsymbol{\eta}_{i}^{(s+1)} = [\boldsymbol{\eta}_{i}^{(s)} - \Delta\, \boldsymbol{l}(\boldsymbol{\eta}_{i}^{(s)})]_{+}\), where Δ is the step size.
4. Repeat steps 2-3 until convergence, \(\max\{|\boldsymbol{\eta}_{i}^{(s+1)} - \boldsymbol{\eta}_{i}^{(s)}|\} < \epsilon\), or until an iteration limit (e.g., 1000) is reached.
5. Normalize the cellular fractions as \(\theta_{ik}^{(1)} = \frac{p_{ik} s_{ik}^{(1)}}{\sum_{k=1}^{K} p_{ik} s_{ik}^{(1)}}\) and \(\theta_{ik}^{(2)} = \frac{p_{ik} s_{ik}^{(2)}}{\sum_{k=1}^{K} p_{ik} s_{ik}^{(2)}}\).

INTEGRATIVE SIGNATURE SELECTION

The
decomposition provided by AJIVE enables the identification of underlying biological patterns that are common to all molecular modalities. Suppose there are _J_ molecule sources for the same
sets of _N_ samples but different features from each source. For the bulk expression matrices _Y_(_j_) (_j_ = 1, …, _J_) in distinct modalities, AJIVE integrates _Y_(_j_)’s to reconstruct
each by three components: _Y_(_j_) = _C_(_j_) + _I_(_j_) + _E_(_j_), where _C_(_j_) represents the common variation originating from the _j_th modality, _I_(_j_) and _E_(_j_) are the
source-specific structured variation and the residual noise, respectively. The top proteins contributing to the common variation shared between proteome (_Y_(1)) and signature genes (_Y_(2))
are selected by using the loadings _V_(1) obtained from the singular value decomposition (SVD) of matrix _C_(1) = _V_(1)_D_(1)_U_(1), where _V_(1) is _G_ × _r_ with _r_ as reduced rank
estimated by the Wedin bound procedure13. Next, we compute
\(\tilde{\boldsymbol{Y}}^{(1)} = \boldsymbol{V}^{(1)} {\boldsymbol{V}^{(1)}}^{T} \boldsymbol{Y}^{(1)}\) as a projected
approximation to the observed bulk proteome with rank _r_. For each protein _g_, we calculate the Euclidean distance between \(\boldsymbol{y}_{g\cdot}^{(1)}\) and
\(\tilde{\boldsymbol{y}}_{g\cdot}^{(1)}\), \(d_{g} = \Vert \boldsymbol{y}_{g\cdot}^{(1)} - \tilde{\boldsymbol{y}}_{g\cdot}^{(1)} \Vert\). To select the proteins that contribute most significantly to the shared variation, we
choose the proteins with the smallest distances \(d_{g}\).

CELL-TYPE-SPECIFIC PROTEIN QTL MAPPING

The potential errors produced in sample-wise deconvoluted proteomes may lead to bias in
cell-type-specific QTL mapping. Hence, we apply a published method implemented in TOAST28 to perform cell-type-specific protein differential analysis for genotypes based on the bulk
proteomes, estimated protein cell proportions, and whole genome sequencing variants. This method uses pre-estimated cell proportions and a linear model to describe the cell-type-specific
differential expression pattern in bulk data, and then performs an F-test on the hypothesized cell-type-specific changes across three genotype groups29, with false discovery rate being well
controlled according to the previous simulation studies30,31.

REPORTING SUMMARY

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this
article.

DATA AVAILABILITY

The processed signature matrices derived from both real datasets are available for access on the GitHub repository:
https://github.com/YuePan027/MICSQTL/tree/main/processed_signature. The CITE-seq data is available on GEO under accession number GSE164378. The human brain prefrontal cortex mass
spectrometry proteomics data and RNA-seq data are available in the Synapse database https://www.synapse.org under accession code syn32136022. The breast cancer bulk multi-omics data are
available as supplementary files in25, while the scRNA-seq and single cell CyTOF data of breast cancer are available at GEO: GSE180878 and https://data.mendeley.com/datasets/vs8m5gkyfn/1.
The source data necessary for creating the main figures can be found in Supplementary Data 1-5.

CODE AVAILABILITY

MICSQTL is a Bioconductor package released under the GPL-3 license, available at
https://bioconductor.org/packages/MICSQTL. The other tools used are CIBERSORT (https://cibersortx.stanford.edu/), TOAST (https://bioconductor.org/packages/TOAST), and TCA (https://cran.r-project.org/web/packages/TCA).
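To make the ALGORITHM steps in Methods more concrete, the projected gradient descent update can be sketched in plain Python for a single sample. This is a toy illustration under stated assumptions and not the MICSQTL implementation: all function names, the example numbers, the fixed step size, and `p_init` (standing in for a CIBERSORT estimate) are hypothetical, and the gradients are derived here directly from the squared-error loss.

```python
def predict(X, p, s):
    """Expected bulk expression: sum_k x_gk * p_k * s_k for each feature g."""
    return [sum(x * pk * sk for x, pk, sk in zip(row, p, s)) for row in X]

def joint_loss(y1, y2, X1, X2, p, s1, s2):
    """Squared-error loss summed over the protein (1) and RNA (2) modalities."""
    r1 = [a - b for a, b in zip(y1, predict(X1, p, s1))]
    r2 = [a - b for a, b in zip(y2, predict(X2, p, s2))]
    return sum(r * r for r in r1) + sum(r * r for r in r2)

def modality_grads(y, X, p, s):
    """Gradients of one modality's loss w.r.t. X and w.r.t. theta = p * s."""
    K = len(p)
    theta = [p[k] * s[k] for k in range(K)]
    resid = [a - b for a, b in zip(y, predict(X, p, s))]
    gX = [[-2.0 * r * theta[k] for k in range(K)] for r in resid]
    gtheta = [-2.0 * sum(X[g][k] * resid[g] for g in range(len(y))) for k in range(K)]
    return gX, gtheta

def pgd_step(y1, y2, X1, X2, p, s1, s2, step):
    """One simultaneous update of all parameters, projected onto values >= 0."""
    K = len(p)
    gX1, gt1 = modality_grads(y1, X1, p, s1)
    gX2, gt2 = modality_grads(y2, X2, p, s2)
    clip = lambda v: max(v, 0.0)
    new_X1 = [[clip(x - step * g) for x, g in zip(xr, gr)] for xr, gr in zip(X1, gX1)]
    new_X2 = [[clip(x - step * g) for x, g in zip(xr, gr)] for xr, gr in zip(X2, gX2)]
    # p is shared, so its gradient collects contributions from both modalities.
    new_p = [clip(p[k] - step * (s1[k] * gt1[k] + s2[k] * gt2[k])) for k in range(K)]
    new_s1 = [clip(s1[k] - step * p[k] * gt1[k]) for k in range(K)]
    new_s2 = [clip(s2[k] - step * p[k] * gt2[k]) for k in range(K)]
    return new_X1, new_X2, new_p, new_s1, new_s2

def flatten(X1, X2, p, s1, s2):
    """Collect every parameter into one flat list for the convergence check."""
    return [v for row in X1 + X2 for v in row] + p + s1 + s2

def normalize(p, s):
    """Step 5: rescale p_k * s_k so the fractions sum to one."""
    theta = [a * b for a, b in zip(p, s)]
    total = sum(theta)
    return [t / total for t in theta]

def joint_deconv(y1, y2, signature, p_init, step=0.005, max_iter=1000, eps=1e-8):
    """Steps 1-5 for one sample; both panels start from the same RNA signature."""
    X1 = [row[:] for row in signature]
    X2 = [row[:] for row in signature]
    p = list(p_init)
    s1, s2 = [1.0] * len(p), [1.0] * len(p)
    for _ in range(max_iter):
        old = flatten(X1, X2, p, s1, s2)
        X1, X2, p, s1, s2 = pgd_step(y1, y2, X1, X2, p, s1, s2, step)
        new = flatten(X1, X2, p, s1, s2)
        if max(abs(a - b) for a, b in zip(new, old)) < eps:
            break
    return normalize(p, s1), normalize(p, s2), (X1, X2, p, s1, s2)

# Hypothetical toy data: 3 matched features, 2 cell types.
signature = [[1.7, 0.3], [0.5, 1.1], [1.0, 1.0]]
y_protein = [1.55, 0.66, 1.00]
y_rna = [1.16, 0.72, 0.98]
p_init = [0.5, 0.5]
ones = [1.0, 1.0]
loss_before = joint_loss(y_protein, y_rna, signature, signature, p_init, ones, ones)
theta_protein, theta_rna, fitted = joint_deconv(y_protein, y_rna, signature, p_init)
loss_after = joint_loss(y_protein, y_rna, *fitted)
```

All parameters are updated from the same current estimates before any of them is overwritten, which mirrors the simultaneous update described in Methods; a real application would additionally need feature mapping between the RNA signature and the measured proteins.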
REFERENCES
* Hao, Y. et al. Integrated analysis of multimodal single-cell data. _Cell_ 184, 3573–3587 (2021).
* Kumar, P. et al. Single-cell transcriptomics and surface epitope detection in human brain epileptic lesions identifies pro-inflammatory signaling. _Nat. Neurosci._ 25, 956–966 (2022).
* Sun, X. et al. Deep single-cell-type proteome profiling of mouse brain by nonsurgical AAV-mediated proximity labeling. _Anal. Chem._ 94, 5325–5334 (2022).
* Perkel, J. M. Single-cell proteomics takes centre stage. _Nature_ 597, 580–582 (2021).
* Cheung, R. K. & Utz, P. J. CyTOF—the next generation of cell detection. _Nat. Rev. Rheumatol._ 7, 502–503 (2011).
* Frangieh, C. J. et al. Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion. _Nat. Genet._ 53, 332–341 (2021).
* Zhu, Y. et al. Nanodroplet processing platform for deep and quantitative proteome profiling of 10–100 mammalian cells. _Nat. Commun._ 9, 882 (2018).
* Petelski, A. A. et al. Multiplexed single-cell proteomics using SCoPE2. _Nat. Protocols_ 16, 5398–5425 (2021).
* Bennett, H. M., Stephenson, W., Rose, C. M. & Darmanis, S. Single-cell proteomics enabled by next-generation sequencing or mass spectrometry. _Nat. Methods_ 20, 363–374 (2023).
* Ben-David, E. et al. Whole-organism eQTL mapping at cellular resolution with single-cell sequencing. _Elife_ 10, e65857 (2021).
* Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. _Nat. Methods_ 12, 453–457 (2015).
* Hu, C. et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. _Nucleic Acids Res._ 51, D870–D876 (2023).
* Feng, Q., Jiang, M., Hannig, J. & Marron, J. Angle-based joint and individual variation explained. _J. Multivariate Anal._ 166, 241–265 (2018).
* Rahmani, E. et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. _Nat. Methods_ 13, 443–445 (2016).
* Trobisch, T. et al. Cross-regional homeostatic and reactive glial signatures in multiple sclerosis. _Acta Neuropathol._ 144, 987–1003 (2022).
* Sharma, K. et al. Cell type–and brain region–resolved mouse brain proteome. _Nat. Neurosci._ 18, 1819–1831 (2015).
* Wang, F. et al. Deep domain adversarial neural network for the deconvolution of cell type mixtures in tissue proteome profiling. _Nat. Mach. Intell._ 5, 1236–1249 (2023).
* Ruzicka, W. B. et al. Single-cell multi-cohort dissection of the schizophrenia transcriptome. _medRxiv_ 2022–08 https://doi.org/10.1101/2022.08.31.22279406 (2022).
* Notaras, M. et al. Schizophrenia is defined by cell-specific neuropathology and multiple neurodevelopmental mechanisms in patient-derived cerebral organoids. _Mol. Psychiatry_ 27, 1416–1434 (2022).
* Puvogel, S. et al. Single-nucleus RNA sequencing of midbrain blood-brain barrier cells in schizophrenia reveals subtle transcriptional changes with overall preservation of cellular proportions and phenotypes. _Mol. Psychiatry_ 27, 4731–4740 (2022).
* Hibar, D. P. et al. Novel genetic loci associated with hippocampal volume. _Nat. Commun._ 8, 13624 (2017).
* Tripathy, R. et al. Mutations in MAST1 cause mega-corpus-callosum syndrome with cerebellar hypoplasia and cortical malformations. _Neuron_ 100, 1354–1368 (2018).
* Ressler, K. J. et al. Post-traumatic stress disorder is associated with PACAP and the PAC1 receptor. _Nature_ 470, 492–497 (2011).
* Krug, K. et al. Proteogenomic landscape of breast cancer tumorigenesis and targeted therapy. _Cell_ 183, 1436–1456 (2020).
* Gray, G. K. et al. A human breast atlas integrating single-cell proteomics and transcriptomics. _Dev. Cell_ 57, 1400–1420 (2022).
* Chen, Y. et al. Deep autoencoder for interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis. _Nat. Commun._ 13, 6735 (2022).
* Polyak, R. A. Projected gradient method for non-negative least square. _Contemp Math_ 636, 167–179 (2015).
* Li, Z. & Wu, H. TOAST: improving reference-free cell composition estimation by cross-cell type differential analysis. _Genome Biol._ 20, 1–17 (2019).
* Li, Z., Wu, Z., Jin, P. & Wu, H. Dissecting differential signals in high-throughput data from complex tissues. _Bioinformatics_ 35, 3898–3905 (2019).
* Meng, G., Tang, W., Huang, E., Li, Z. & Feng, H. A comprehensive assessment of cell type-specific differential expression methods in bulk data. _Briefings Bioinf._ 24, bbac516 (2023).
* Feng, H. et al. ISLET: individual-specific reference panel recovery improves cell-type-specific inference. _Genome Biol._ 24, 174 (2023).
ACKNOWLEDGEMENTS This work was partially supported by Cancer Center Support Grant P30CA21765 (Y.P., Q.L.), the American Lebanese Syrian Associated
Charities (Y.P., X.W., J.S., J.P., Q.L.), and NIH grant R01MH110920 (C.L.). AUTHOR INFORMATION AUTHORS AND AFFILIATIONS
* Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, 38105, USA: Yue Pan, Jiao Sun & Qian Li
* Center for Proteomics and Metabolomics, St. Jude Children’s Research Hospital, Memphis, TN, 38105, USA: Xusheng Wang
* Department of Genetics, Genomics & Informatics, University of Tennessee Health Science Center, Memphis, TN, 38105, USA: Xusheng Wang
* Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, 13210, USA: Chunyu Liu
* Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN, 38105, USA: Junmin Peng
* Department of Developmental Neurobiology, St. Jude Children’s Research Hospital, Memphis, TN, 38105, USA: Junmin Peng
CONTRIBUTIONS Y.P., X.W., J.P., and Q.L. conceived the study. Y.P. developed the algorithms and the R package, serves as its maintainer, performed the simulation and real data analyses, and visualized the analysis results. Q.L. proposed the algorithm and designed the R package, simulation study, and real data analysis. J.S. ran the scpDeconv experiments in the simulation study and on the breast cancer data, and visualized the results. Q.L. and Y.P. wrote the manuscript. X.W., C.L., and J.P. generated and shared the real data, contributed to the methodology discussion, and interpreted the real data analysis results. All authors reviewed the manuscript. CORRESPONDING AUTHOR Correspondence to
Qian Li. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. PEER REVIEW PEER REVIEW INFORMATION Communications Biology thanks Oliver Crook and the other,
anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Yuedong Yang and Luke R. Grinham. A peer review file is available. ADDITIONAL
INFORMATION PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and
reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes
were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS
ARTICLE Pan, Y., Wang, X., Sun, J. _et al._ Multimodal joint deconvolution and integrative signature selection in proteomics. _Commun Biol_ 7, 493 (2024).
https://doi.org/10.1038/s42003-024-06155-z * Received: 13 October 2023 * Accepted: 08 April 2024 * Published: 24 April 2024