Daily estimates of clinical severity of symptoms in bipolar disorder from smartphone-based self-assessments

ABSTRACT Currently, the gold standard for assessing the severity of depressive and manic symptoms in patients with bipolar disorder (BD) is clinical evaluation using validated rating

scales such as the Hamilton Depression Rating Scale 17-items (HDRS) and the Young Mania Rating Scale (YMRS). Frequent automatic estimation of symptom severity could potentially help support

monitoring of illness activity and allow for early treatment intervention between outpatient visits. The present study aimed (1) to assess the feasibility of producing daily estimates of

clinical rating scores based on smartphone-based self-assessments of symptoms collected from a group of patients with BD; (2) to demonstrate how these estimates can be utilized to compute

individual daily risk of relapse scores. Based on a total of 280 clinical ratings collected from 84 patients with BD along with daily smartphone-based self-assessments, we applied a

hierarchical Bayesian modelling approach capable of providing individual estimates while learning characteristics of the patient population. The proposed method was compared to common

baseline methods. The model concerning depression severity achieved a mean predicted _R_² of 0.57 (SD = 0.10) and RMSE of 3.85 (SD = 0.47) on the HDRS, while the model concerning mania

severity achieved a mean predicted _R_² of 0.16 (SD = 0.25) and RMSE of 3.68 (SD = 0.54) on the YMRS. In both cases, smartphone-based self-reported mood was the most important predictor

variable. The present study shows that daily smartphone-based self-assessments can be utilized to automatically estimate clinical ratings of severity of depression and mania in patients with

BD and assist in identifying individuals with high risk of relapse.

INTRODUCTION Bipolar disorder (BD) is a

common and complex illness with an estimated prevalence of 1–2% and is regarded as one of the most important causes of disability worldwide1,2. BD is characterized by recurrent episodes of

depression, (hypo)mania and mixed episodes interspersed with periods of euthymia3 and by a high degree of comorbidity and functional impairment4. BD is associated with an elevated risk of

mortality due to suicide and medical comorbidities such as cardiovascular disease and diabetes5,6,7, and among people with BD, life expectancy is decreased by 8–12 years8,9. In clinical

practice, there are major challenges in diagnosing and treating BD10. Patients with BD are often misdiagnosed, and the correct diagnosis can be delayed for several years after illness

onset11,12,13. Currently, due to the lack of objective tests, the diagnostic process and the clinical assessment of the severity of depressive and manic symptoms rely on subjective

information, clinical evaluation and rating scales14. Periodic clinical evaluations using clinical rating scales such as the Hamilton Depression Rating Scale (HDRS)15 and the Young Mania

Rating Scale (YMRS)16 are currently used as the gold standard for assessing the severity of depressive and manic symptoms in patients with BD. Each rating scale consists of a series of

items reflecting various symptoms of depression and mania, and these items are finally added up to produce a total score summarizing the current severity of depressive (HDRS) or manic (YMRS)

state of the patient. However, the use of clinical rating scales involves a risk of potential patient recall bias, other recall distortions, decreased illness insight (mainly during

affective episodes) and individual clinician observer bias17,18,19,20,21. In addition, the clinical evaluations are time consuming and require a specialist who is trained and experienced in

using the rating scales to produce consistent, valid and reliable results. As part of treatment, patients may be asked to perform daily self-assessments to track changes in symptoms between

clinical evaluations. Modern smartphones provide a unique platform for fine-grained real-time symptom monitoring and management, and a convenient means of performing self-assessments that have

traditionally been carried out on paper22,23,24. A smartphone-based monitoring system enables users to ubiquitously record and review their own data, receive reminders, and even share data

with carers and clinicians. From the perspective of health care providers, it offers efficient, online monitoring of a group of patients and enables intervention in case any deterioration is

observed. Electronic self-monitoring has the additional benefit of making data available for immediate and automatic analysis that can help support monitoring and treatment tasks between

outpatient visits. Correlations between smartphone-based self-reported mood scores and clinical ratings of depressive and manic symptoms measured using the HDRS and the YMRS in patients with

BD have already been demonstrated by previous work25,26,27, but to our knowledge this is the first study to predict scores of clinical ratings directly from combinations of smartphone-based

self-assessed data in patients with BD. In related work, detection of daily self-reported mood from smartphone sensor and usage data is well studied23,28,29,30, but remains a difficult

problem due to noisy data. In ref. 31, Grünerbl et al. classified affective states and state changes derived from clinical ratings and phone interviews of patients with BD from a combination

of smartphone sensor modalities and argued that detecting deviations from the euthymic state is more important than the recognition of a particular affective state in practical

applications. Several studies in the field of affective computing have highlighted the need for personalized models to account for individual differences in order to achieve good predictive

performance29,30,32,33. However, a separate analysis is not feasible until sufficient data about each individual is available. Hierarchical Bayesian modelling is a well-suited approach for

providing individual models while borrowing statistical power from the population, which is especially useful when the individual datasets are too small to be analysed separately34. The main

objective of this study was to examine the feasibility of producing daily estimates of clinical ratings of depression and mania based on smartphone self-assessments of symptoms collected

from a group of patients with BD, who were followed as part of a randomized controlled trial (RCT)35. Additionally, we aimed to demonstrate how uncertainty in the estimated quantities could

be used to compute individual, daily risk of relapse, useful for identifying high-risk individuals who need urgent assistance. Our assumption was that daily, automatic estimates of clinical

ratings augmented with individual relapse risk scores are more interpretable and actionable results than observing the smartphone-based self-assessments directly and can be a valuable tool

in continuous monitoring of illness activity and treatment of patients with BD. MATERIALS AND METHODS PATIENTS AND STUDY DESIGN Data analysed in this study was collected between September

2014 and January 2018 during the MONARCA II RCT, investigating the effect of smartphone-based monitoring in patients with BD35. All patients with a diagnosis of BD who had previously been

treated at the Copenhagen Clinic for Affective Disorder, Denmark, in the period from 2004 to January 2016 and who at the time of recruitment were being treated at community psychiatric

centres, private psychiatrists and general practitioners were invited to participate in the trial. The clinic is a specialized outpatient clinic with a catchment area consisting of the

Capital Region in Denmark corresponding to 1.4 million people. Patients with a new diagnosis of BD or with treatment-resistant BD were referred to the clinic. The staff consists of

specialists in psychiatry, psychologists, nurses, and a social worker, all with specific experience and knowledge regarding BD. Treatment at the clinic comprises a two-year program including

combined evidence-based psychopharmacological treatment and supporting therapy, including group psychoeducation36. Patients were included in the study for a nine-month follow-up period if

they had a BD diagnosis according to ICD-10 using the Schedules for Clinical Assessments in Neuropsychiatry (SCAN)37 and previously were treated at the Copenhagen Clinic for Affective

Disorder. Patients with schizophrenia, schizotypal or delusional disorders, previous use of the MONARCA system, pregnancy and lack of Danish language skills were excluded. Patients with

other comorbid psychiatric disorders and substance use were eligible for the trial. As part of the MONARCA II trial, patients were randomized to either using a smartphone-based monitoring

system (the Monsenso system) for daily self-monitoring (the intervention group) or to treatment as usual (the control group). Patients from the intervention group who successfully provided

smartphone-based self-monitoring data were included in the analyses in the present study. DATA DESCRIPTION CLINICAL ASSESSMENTS The dataset consists of 280 clinical ratings collected from 84

patients with BD. Each clinical rating includes ratings for severity of depression and mania using the HDRS15 and the YMRS16, respectively. Each participant was evaluated by a clinician up

to 5 times during the study period (at baseline, after 4 weeks, 3 months, 6 months and 9 months). All clinical assessments were conducted by a researcher (MFJ), who was blinded to all

smartphone-based data. Thus, data on the severity of depressive and manic symptoms were collected rater-blinded. On both rating scales, the first item indicates mood and low severity ratings

indicate low levels of either depressive or manic symptoms while high severity ratings indicate severe symptoms. A score of 13 or more on either rating scale was classified as a depressive

or manic episode, respectively, while a high score on both scales at the same time constituted a mixed episode. The cut-off on the HDRS and the YMRS of 13, in contrast to a lower cut-off,

was chosen a priori to increase the validity of a current affective depressive or manic/mixed state (the more severe, the higher the validity). A euthymic state was defined as HDRS and YMRS

less than 13 thereby also including affective states with partial remission. Clinical ratings with the HDRS and the YMRS were considered to be valid on the day of the assessment as well as

the 3 previous days, thus each rating is attributed a total of 4 days in the present dataset. SMARTPHONE-BASED SELF-ASSESSMENTS In addition to periodic clinical ratings, patients were

instructed to carry out daily self-assessments via a smartphone application (the Monsenso system) configured for the present study. The smartphone application was developed using an

iterative, user-centred design process involving patients, IT researchers, clinicians and clinical researchers, and the items chosen for the self-assessments were designed to capture

clinically important symptoms of bipolar disorder23. The self-assessment included the following items: activity level (scored from −3 to +3); alcohol consumption (number of units from 0 to

10+); anxiety level (scored from 0 to 2); irritability level (scored from 0 to 2); cognitive problems (scored from 0 to 2); medicine adherence (not taken/taken/taken with changes); mixed

mood (yes/no); mood (scored from −3 to +3 including −0.5 and +0.5); sleep duration (in hours); and stress level (scored from 0 to 2). The activity, medicine, mood and sleep items were

mandatory items, which the patients evaluated daily. Additionally, the smartphone application enabled users to configure reminders and users were allowed to provide self-assessments

retrospectively for up to 2 days in case they forgot the daily entry. The entered self-assessed data collected over time was visually presented to the users on their smartphone. STATISTICAL

ANALYSIS DATA PREPROCESSING Three smartphone-based self-assessment variables, _mood_, _sleep_ and _medicine_, required preprocessing prior to analysis. We split the mood variable into a

negative and positive component, _mood negative_ and _mood positive_, allowing for non-linear relationships with the clinical ratings as we expected negative mood to be associated mainly

with severity of depression (reflected by scores on the HDRS) and positive mood to be associated mainly with severity of mania (reflected by scores on the YMRS). Additionally, we expected

the relationship between sleep duration and symptom severity to be non-linear as increased or decreased sleep duration can both represent signs of deterioration during depression and mania.
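As a minimal sketch of the mood split described above, a signed score can be separated into two non-negative components. The function name `split_signed` and the sign convention (the negative component stored as a magnitude) are assumptions made for illustration; the text does not state which convention was used.

```python
import numpy as np


def split_signed(values):
    """Split a signed variable into non-negative 'negative' and 'positive' parts.

    Assumption: the negative component is stored as a magnitude, so that
    larger values mean 'further below zero'.
    """
    values = np.asarray(values, dtype=float)
    negative = np.clip(-values, 0.0, None)  # size of the dip below zero
    positive = np.clip(values, 0.0, None)   # size of the rise above zero
    return negative, positive


# Example self-reported mood scores on the -3 to +3 scale
mood = np.array([-3.0, -0.5, 0.0, 0.5, 2.0])
mood_negative, mood_positive = split_signed(mood)
```

A regression on both components can assign different slopes below and above zero, which is the kind of piecewise relationship the expected sleep effect also calls for.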

To encode this, we subtracted the individual-level mean of the sleep duration variable and split the result into positive and negative components, _sleep negative_ and _sleep positive_. When

testing the out-of-sample predictive performance of statistical models, the individual mean sleep duration was computed on the training set and applied to generate features in the training

set and test set. The medicine adherence variable was categorical by design with categories: _medicine not taken_, _medicine taken as prescribed_, _medicine taken with changes_. To prepare

the data for analysis, the three possible answers were encoded with two exclusive binary variables indicating if medicine was not taken, _medicine omitted_, or if medicine was taken with

changes, _medicine changed_. The expected most common answer, _medicine taken as prescribed_, was not encoded to avoid collinearity in the regression models (a.k.a. “the dummy variable

trap”). Finally, all variables were normalized by their allowed minimum and maximum values to allow for easier selection of model hyperparameters and interpretation of the inferred model

weights. It was a common problem for patients to occasionally forget to fill in their daily self-assessment, resulting in missing values in the dataset. In most cases, self-assessments were

either complete for all items or missing, but in a few instances, they were only partially answered. To avoid discarding observations with only a few missing values, we experimented with

filling in values from the previous day, which is a common method for dealing with missing values in time series data38. However, it resulted in very few additional complete observations and

we therefore decided to leave this step out. MODELLING APPROACH When analysing several related sets of measurements, such as data from individuals of a population, the two extreme

approaches are to either pool the datasets in a one-size-fits-all solution or to analyse the datasets separately, the latter only being possible when sufficient data is available (also known

as the cold start problem). A hierarchical Bayesian approach provides an intermediate solution that enables personalized models while learning the characteristics of the population39. In a

hierarchical Bayesian regression model, individuals have their own set of regression intercept and weights, $\alpha_j, \beta_j$, sampled from a common population distribution parameterized by population-level means _μ_ and variances _τ_ determining the amount of pooling:

$$\begin{array}{l}\alpha_j, \beta_j \sim \mathrm{Normal}(\mu, \tau)\\ y_{ji} \sim \mathrm{Normal}(\alpha_j + \beta_j^T \boldsymbol{x}_{ji}, \sigma),\end{array}$$

where $y_{ji}$ is the $i$th observation of the target variable for individual $j$, $\boldsymbol{x}_{ji}$ are the corresponding predictor variables and _σ_ is the standard error. This hierarchical tying together of parameters means that data from the population helps

regularize the individual-level weights. An additional benefit of the Bayesian approach is that it expresses uncertainty in all the model parameters and predictions by their posterior

distributions, which is important for interpretability of the model. For further details, a complete description of the hierarchical Bayesian model is provided in the Supplementary

Information (SI). In the present study, we used Stan40 to specify and perform inference in the Bayesian models and then compared the predictive results with pooled and separate naïve mean

baselines and common machine learning methods: Ridge Regression from the scikit-learn machine learning library41 and XGBoost regression from the XGBoost Python package42. Details of the Stan

setup are also included in the SI. To estimate the predictive performance of the models we designed a cross-validation experiment where in each iteration we held out one randomly sampled

clinical evaluation (consisting of up to 4 days of data) from each individual and used the remaining data to fit the models. This procedure was repeated _K_ times and the predicted

coefficient of determination (_R_²) and root mean square error (RMSE) were computed on the held-out data in each iteration. We evaluated the models on the HDRS and the YMRS total scores as

well as item 1 of each rating scale, since these items reflect mood only. Additionally, we evaluated the models using all smartphone-based self-assessment items, the mandatory

self-assessment items (activity, medicine, mood and sleep) and using only the mood self-assessment item, respectively. Estimating scores on the HDRS and the YMRS with separate models enables

prediction of high values of the HDRS and the YMRS at the same time, indicating a mixed episode. COMPUTING RISK OF RELAPSE In some practical applications, it may be more relevant to

accurately identify high-risk individuals than to estimate the exact value of the severity score. Applying a Bayesian approach does not only provide a point estimate of the outcome of

interest but provides a probability distribution of unobserved (future) outcomes given previously observed data, i.e. the posterior predictive distribution, which can be utilized to reason

about uncertainty in the predictions. Specifically, samples from the posterior predictive distribution can be used to compute the probability that an unobserved outcome, $\tilde y_{ji}$,

exceeds a predefined threshold, _T_: $${\mathrm{Pr}}\left( {\tilde y_{ji} \ge T} \right).$$ When estimating scores of clinical ratings, by applying a threshold _T_ = 13 we can interpret this

probability as the risk that an individual is experiencing severe symptoms and utilize it as a personal score indicating the risk of relapse. ETHICAL CONSIDERATIONS The MONARCA II RCT was

approved by the Regional Ethics Committee in the Capital Region of Denmark (H-2-2014-059) and the Danish Data Protection Agency (2013-41-1710). The law on handling of personal data was

respected. All potential participants were given both written and oral information about the study before informed consent was obtained. Prior to commencement the trial was registered at

ClinicalTrials.gov (NCT02221336). Electronic data collected from the smartphones were stored at a secure server at Concern IT, Capital Region, Denmark (I-suite number RHP-292 2011-03). The

trial complied with the Helsinki Declaration of 1975, as revised in 2008. RESULTS DESCRIPTIVE STATISTICS The MONARCA II dataset consists of 280 clinical evaluations, with a mean number of

clinical evaluations per patient during the study of 3.33 (SD = 1.14), and a total of 15975 daily smartphone-based self-assessments with a mean number of smartphone-based self-assessments

during the study of 190.18 (SD = 70.97) from 84 patients with BD assigned to the intervention group of the RCT. The age ranged from 21 to 71 years (mean = 43.1, SD = 12.4) and 61.9% (_N_ =

52) were women. During the study period, most patients presented with rather low severity of depressive and manic symptoms resulting in low HDRS and YMRS scores. The mean HDRS total score

was 7.56 (SD = 6.29) and 20.4% of scores were greater than or equal to 13. The mean YMRS total score was 2.85 (SD = 4.17) and 5.0% of scores were greater than or equal to 13. The mean HDRS

item 1 score was 0.69 (SD = 0.85) and the mean YMRS item 1 score was 0.24 (SD = 0.53). Similarly, the majority of the smartphone-based self-reported mood scores were close to zero with a

mean of −0.14 (SD = 0.48), indicating neutral mood (euthymia). After back-filling the clinical severity ratings over 4 days (since the clinical rating scales reflect this time period) there were

764 observations with associated smartphone-based self-assessments. Figure 1 shows the association between the clinical ratings and the smartphone-based self-reported mood scores. Overall, a

high score on the HDRS corresponded to neutral or depressed smartphone-based self-assessed mood (_r_ = −0.40, _P_ < 0.01) while a high score on the YMRS corresponded to neutral or

elevated smartphone-based self-assessed mood (_r_ = 0.22, _P_ < 0.001). Only in a few instances were the HDRS and the YMRS rated high at the same time, indicating a mixed episode (_r_ =

0.13, _P_ = 0.02). MODEL ESTIMATES The hierarchical Bayesian regression model was evaluated on the entire dataset of clinical ratings combined with all self-assessed items of the completed

smartphone-based self-assessments for all participants with at least two data points (_N_ = 433). The model predicting total scores on the HDRS achieved an _R_² of 0.84, indicating that the

model accounted for 84% of the variance in the data, and a residual RMSE of 2.41. The model predicting total scores on the YMRS achieved an _R_² of 0.81 and a residual RMSE of 2.07. The

model predicting the HDRS item 1 score achieved an _R_² of 0.89 and a residual RMSE of 0.30, and the model predicting the YMRS item 1 score achieved an _R_² of 0.86 and a residual RMSE of

0.22. The distributions of inferred population-level mean, _μ_, and variance, _τ_, parameters in the hierarchical Bayesian regression HDRS total and YMRS total models are summarized in Table

1. The absolute _t_-statistic of the mean parameters, computed as the mean scaled by the standard error of the parameter: $t_\mu = \bar \mu /{\it{SE}}(\mu )$, is included as a measure of

variable importance, following the intuition that larger absolute weights and lower variance imply importance43. This shows that negative mood was the most important predictor variable in

the HDRS model while positive mood was the most important predictor in the YMRS model. A visual presentation of the population-level parameters and a weight matrix summarising the

individual parameters are included in the SI. A figure showing the effect size of each self-assessment item is also included in the SI. CROSS-VALIDATION RESULTS The predictive performance of

the hierarchical Bayesian model was evaluated in _K_ = 100 cross-validation experiments on all data where participants had complete observations of clinical ratings and smartphone-based

self-assessments from at least three different clinical evaluations (_N_ = 329). In each iteration, data from one randomly sampled clinical evaluation from each patient was held out and the

remaining data was used to fit the models. Models were fitted to predict HDRS total, YMRS total, HDRS item 1 and YMRS item 1, from (1) all; (2) mandatory and (3) mood self-assessment items,

respectively. The hierarchical Bayesian model was compared to naïve pooled and separate mean models along with pooled and separate ridge regression and XGBoost regression models. Table 2

presents the cross-validation results of predicting HDRS total and YMRS total. Because of low variance in the data, the naïve mean models performed relatively well. Still the hierarchical

Bayesian regression model achieved the best overall performance in every case and was significantly better than the separate mean model in both the HDRS and YMRS case according to

independent _t_-tests (_P_ < 0.001). Overall, the separate models performed better than their pooled counterparts. Table 3 presents the cross-validation results of predicting HDRS item 1

and YMRS item 1, indicating mood. The pooled XGBoost achieved the best result at predicting HDRS item 1 using all self-assessment items. When reducing the feature set to the mandatory or

mood self-assessment items, the hierarchical Bayesian model was best. It was not possible to predict YMRS item 1 significantly better than the naïve mean baselines. PREDICTED RISK OF RELAPSE

SCORES The results from cross-validation experiments predicting the HDRS total score and the YMRS total score using all self-assessment items presented in the previous section were used to

compute risk of relapse scores $\mathrm{Pr}(\tilde y_{ji} \ge T)$ with $T = 13$. The ability of the model to correctly assign high risk to instances with high ratings

can be evaluated as a binary classification problem with severity ratings equal to or greater than the threshold _T_ constituting the positive class. Figure 2 presents receiver operating

characteristic (ROC) curves of the HDRS total and the YMRS total models illustrating the trade-off between true positive rate (TPR) and false positive rate (FPR), comparing the hierarchical

Bayesian regression model to the naïve pooled and separate mean models. The pooled mean model corresponds to a model that either classifies all instances as low risk or high risk, achieving

an area under the curve (AUC) of 0.50 in both the HDRS and YMRS case. The separate mean model independently classifies each individual as either high or low risk based on observed values of

the ratings and achieved an AUC of 0.67 in the HDRS case and AUC of 0.49 in the YMRS case. The hierarchical Bayesian regression model was able to account for information in the

smartphone-based self-assessments as well as individual differences and achieved the highest AUC of 0.89 in the HDRS case and 0.84 in the YMRS case. DISCUSSION In the present study, we

analysed clinical ratings of depression reflected by the HDRS and mania reflected by the YMRS along with daily smartphone-based self-assessments including self-reported mood in a population

of 84 patients with BD. As hypothesized, there was a negative correlation between the HDRS and self-reported mood and a positive correlation between the YMRS and mood. This confirms previous

work25,26,27, and suggests that smartphone-based self-reported mood is a valid indicator of symptom severity in patients with BD and thereby a clinically relevant feature for monitoring and

analysis. Interestingly and as hypothesized, the proposed approach of applying hierarchical Bayesian regression models was able to fit the data distributions of the HDRS total score and the

YMRS total score and all smartphone-based self-assessment items and accounted for more than 80% of the variance in the data according to _R_². Using the absolute _t_-statistic of the

population-level regression weights as a measure of variable importance, decreased and increased smartphone-based self-reported mood were the most important variables for predicting the

severity of depression (HDRS) and mania (YMRS). This is not surprising since sampling of self-reported mood from the patients was designed to collect indicators on the patient’s affective

state and thus should reflect the clinically rated symptoms. Other important variables in the HDRS total model were decreased sleep and feelings of mixed mood and anxiety, while in the YMRS

total model only mood ranked important (see Table 1). To assess the predictive performance of the hierarchical Bayesian model compared to pooled and separate baseline models, we performed

cross-validation experiments of estimating the HDRS total score, the YMRS total score, the HDRS item 1 score and the YMRS item 1 score using all smartphone-based self-assessment items, the

four mandatory items and the mood self-assessment item alone, respectively. Thus, we were able to estimate the total clinical rating scores using regression models based on smartphone-based

self-assessments. The hierarchical Bayesian model achieved the best performance in predicting the HDRS total and was significantly better than a naïve model using the observed individual

(separate) mean as a prediction (_P_ < 0.001). Similarly, the hierarchical Bayesian model was best at predicting the YMRS total score and was significantly better than the naïve separate

mean model. Additionally, we tested models for predicting the first item of the HDRS and the YMRS, indicating mood. The pooled XGBoost model achieved the best result in predicting the HDRS

item 1 score, while estimating the YMRS item 1 score could not be improved over the naïve baseline. In all the presented experiments, we found that models based only on self-assessed mood

were able to retain most of the predictive performance of models based on all self-assessment items. This further shows that mood is the most important self-reported predictor variable for

estimating scores of the HDRS and the YMRS. Overall, the YMRS models did not account for much of the variance in the data, indicated by the low _R_² scores. This could be mainly due to low

variation in the observed YMRS data. In clinical settings of monitoring illness activity in patients with bipolar disorder, detecting individuals with a high risk of relapse is highly

important in order to enable intervention. Therefore, a sensitive indication of whether a symptom severity rating is above a critical threshold might be more useful than estimating the exact value

of the severity rating itself. Thus, we demonstrated how uncertainty in the estimated total severity scores can be utilized to compute individual daily risk of relapse scores by considering

samples from the posterior predictive distribution of the hierarchical Bayesian model. In the case of both the HDRS and the YMRS, using the hierarchical Bayesian approach achieved substantial

improvements over naïve models using pooled and separate means of observed data as predictions. Hence, including self-assessments in a regression model provided additional useful information

for estimating the level of the clinical severity ratings and hence the relapse risk scores, which is a promising and clinically relevant result. The findings that a combination of

fine-grained daily smartphone-based self-assessment items can be used to estimate and predict clinical ratings are interesting and innovative. Daily longitudinal self-monitoring of mood

symptoms gives valuable information of mood fluctuation experienced by patients with BD between clinical outpatient visits. Long-term monitoring of symptoms has been an essential part of the

monitoring and treatment of BD for decades44 and rapidly evolving smartphone technologies have made it possible to monitor symptoms more continuously, at a finer granularity and in real time. This

can be clinically relevant for detection of symptoms before the first or recurrent depressive or manic episodes45, and allow for early intervention on prodromal symptoms. In the latest

version of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), increased activity level or energy is acknowledged as a core feature of hypomania and mania together with mood

changes46. Several studies using factor analysis have described activation and not mood state as the primary symptom in manic episodes47,48. However, in the present study we found mood to be

the most important predictor variable for estimating the HDRS and the YMRS severity ratings while activity presented with low importance in both models. Furthermore, sleep disturbances and

anxiety have been identified as early symptoms of depression and mania49,50, which is in line with our findings in the HDRS model, while sleep and anxiety were less important in the YMRS

model. ADVANTAGES The patients included in the present study were clinically well characterized and were receiving treatment or had received treatment at the Copenhagen Clinic for Affective

Disorders, Denmark. The clinical evaluations were conducted multiple times during follow-up by experienced researchers with specific expertise in BD. The smartphone-based

self-assessment system used in the present study (the Monsenso system) was developed by the authors and has been shown to be easy to use, with high usability, usefulness, ease of learning to

use and interface quality, also when compared with other smartphone-based self-assessment systems22,51. The use of smartphones for fine-grained real-time monitoring reduced the risk of recall

bias. The proposed hierarchical Bayesian modelling approach is well suited for analysis of small related datasets, especially when the individual datasets are too small to analyse

separately. Additionally, the linear regression method and the ability to express uncertainty in all estimated quantities make the model easy to interpret, which is essential in a clinical

setting. Overall, we consider the findings from the present study to be innovative and generalizable to patients with BD who are not presenting with an acute affective episode and who are willing to

use a monitoring tool for prolonged periods. LIMITATIONS The dataset used in this study primarily contained clinical ratings of low severity of affective symptoms, indicating that most

participants did not experience severe symptoms of depression or mania during the study period. Similarly, a large proportion of the self-reported mood scores were close to zero (indicating

euthymia) and had low variance. Consequently, the naïve mean baseline models could fit the data well and achieved good performance in the prediction task. However, the best regression model

was still significantly better than the naïve mean models, showing that it is possible to utilize smartphone-based self-reported data to produce more accurate estimates of the clinical

ratings of symptom severity. Although we saw significant correlations between self-reported mood and the HDRS and the YMRS, respectively, the correlations were weaker than what has been

reported in some other studies45. Furthermore, the absence of high ratings makes it difficult to reason about the performance of the models in detecting extreme cases, which are the most

critical in a monitoring and intervention application. Our analysis does not explore the distribution of missing data and thus assumes data is missing at random. However, it is reasonable to

believe that individuals who are experiencing severe depression or mania have difficulties coping with self-assessment while euthymic individuals find it less relevant. Thus, analysing the

missing data distribution might yield valuable information regarding symptom severity, which could be explored further. Lastly, our analysis did not include any temporal information in the

models, but rather used smartphone self-assessment data from a given day to estimate clinical ratings on the same day and treated each day independently from other days. Thus, the analysis

made no assumptions regarding temporal patterns of mood but relied entirely on the relationship between data collected on the same day. PERSPECTIVES AND FUTURE IMPLICATIONS Smartphones have

become a ubiquitous technology in modern society and can be utilized to provide improved and personalized illness management and monitoring in psychiatry. Smartphone-based self-assessment

makes data available for immediate analysis and can enable new tools for improved illness monitoring. In particular, accurate, daily estimates of symptom severity could help identify

critical cases and enable timely and individualized intervention. Additionally, advances in sensor technology and algorithms are making it possible to extract a growing range of increasingly

accurate behavioural features directly from sensor data. Using these automatically generated features to infer symptom severity scores could eliminate the need for frequent,

intrusive self-assessments and improve the user experience of illness monitoring systems in psychiatry going forward. In this paper, we have explored the relationship between

smartphone-based self-assessments and clinical ratings observed on the same day with the purpose of identifying current high-risk individuals. A related objective with possibly great

clinical potential would be to predict individual risk of relapse ahead of time. We see this as an important topic for future studies. CONCLUSIONS In the present study, clinical ratings of

the severity of depression and mania were estimated from smartphone-based self-assessments collected from patients with BD. We found that our approach of applying a hierarchical Bayesian

model could estimate the severity of depression and mania with lower error than commonly used baseline methods and with an RMSE below 4 points on the HDRS and the YMRS rating scales.
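To make the baseline comparison concrete, the following is a minimal sketch of the two naïve mean baselines (pooled and separate per-patient means) scored by RMSE. It is an illustration only: the function names, the patient identifiers and the ratings are invented for this example and are not study data, and the fallback to the pooled mean for a patient with no training ratings is an assumption of the sketch rather than a documented detail of the study.

```python
import math
from collections import defaultdict


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pooled_mean_predictor(train):
    """Predict the mean of all training ratings, ignoring patient identity."""
    mean = sum(score for _, score in train) / len(train)
    return lambda patient_id: mean


def separate_mean_predictor(train):
    """Predict each patient's own training mean.

    Patients without training ratings get the pooled mean
    (an assumption made for this sketch).
    """
    by_patient = defaultdict(list)
    for patient_id, score in train:
        by_patient[patient_id].append(score)
    means = {pid: sum(v) / len(v) for pid, v in by_patient.items()}
    pooled = pooled_mean_predictor(train)
    return lambda patient_id: means.get(patient_id, pooled(patient_id))


# Toy (patient_id, clinical rating) pairs, invented for illustration only.
train = [("a", 2.0), ("a", 4.0), ("b", 10.0), ("b", 12.0)]
test = [("a", 3.0), ("b", 13.0), ("c", 7.0)]

y_true = [score for _, score in test]
pooled = pooled_mean_predictor(train)
separate = separate_mean_predictor(train)

rmse_pooled = rmse(y_true, [pooled(pid) for pid, _ in test])
rmse_separate = rmse(y_true, [separate(pid) for pid, _ in test])
print(f"pooled mean RMSE:   {rmse_pooled:.2f}")
print(f"separate mean RMSE: {rmse_separate:.2f}")
```

A regression model is only worth its added complexity if its held-out RMSE is lower than both of these numbers, which is the comparison reported for the hierarchical Bayesian model.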

Furthermore, we showed how uncertainty in the estimates can be utilized to compute personal relapse risk scores suited for identifying critical cases of patients experiencing severe symptoms

and that our approach achieved substantial improvements over naïve pooled and separate mean models. The results presented in this work show that it is feasible to compute daily estimates of

clinical severity ratings of depression and mania from smartphone-based self-assessments, which can be used to improve and automate continuous disease monitoring and treatment of BD.
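The step from uncertain severity estimates to a daily risk score can be sketched as a simple Monte Carlo calculation: given samples from the posterior predictive distribution of a rating for one patient on one day, the risk score is the fraction of samples at or above a severity cut-off. The sketch below assumes this reading of the method. The cut-off value, the function name and the Gaussian stand-in samples are hypothetical; in practice the samples would come from the fitted model rather than from a random number generator.

```python
import random


def relapse_risk(samples, threshold):
    """Monte Carlo estimate of P(score >= threshold) from predictive samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(1 for s in samples if s >= threshold) / len(samples)


# Hypothetical cut-off on the rating scale, chosen only for illustration.
THRESHOLD = 13.0

# Stand-in for posterior predictive draws (assumption: a fitted model would
# supply these). Two days with different predicted severity, same spread.
rng = random.Random(0)
low_day = [rng.gauss(5.0, 4.0) for _ in range(4000)]
high_day = [rng.gauss(15.0, 4.0) for _ in range(4000)]

risk_low = relapse_risk(low_day, THRESHOLD)
risk_high = relapse_risk(high_day, THRESHOLD)
print(f"risk on low-severity day:  {risk_low:.3f}")
print(f"risk on high-severity day: {risk_high:.3f}")
```

Because the score is computed from the full predictive distribution rather than from a point estimate, a day with a moderate predicted rating but wide uncertainty can still receive a non-negligible risk score.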

REFERENCES

* Pini, S. et al. Prevalence and burden of bipolar disorders in European countries. _Eur. Neuropsychopharmacol._ 15, 425–434 (2005).
* Vos, T. et al. Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990–2010: a systematic analysis for the Global Burden of Disease Study 2010. _Lancet_ 380, 2163–2196 (2012).
* Goodwin, F. K. & Jamison, K. R. _Manic-depressive illness_. (Oxford University Press, New York, 1990).
* Sanchez-Moreno, J. et al. Neurocognitive dysfunctions in euthymic bipolar patients with and without prior history of alcohol use. _J. Clin. Psychiatry_ 70, 1120–1127 (2009).
* Angst, F., Stassen, H. H., Clayton, P. J. & Angst, J. Mortality of patients with mood disorders: follow-up over 34–38 years. _J. Affect Disord._ 68, 167–181 (2002).
* Tondo, L., Isacsson, G. & Baldessarini, R. Suicidal behaviour in bipolar disorder: risk and prevention. _CNS Drugs_ 17, 491–511 (2003).
* Hayes, J. F., Miles, J., Walters, K., King, M. & Osborn, D. P. J. A systematic review and meta-analysis of premature mortality in bipolar affective disorder. _Acta Psychiatr. Scand._ 131, 417–425 (2015).
* Kessing, L. V., Vradi, E. & Andersen, P. K. Life expectancy in bipolar disorder. _Bipolar Disord._ 17, 543–548 (2015).
* Kessing, L. V., Vradi, E., McIntyre, R. S. & Andersen, P. K. Causes of decreased life expectancy over the life span in bipolar disorder. _J. Affect Disord._ 180, 142–147 (2015).
* Kupfer, D. J., Frank, E. & Ritchey, F. C. Staging bipolar disorder: what data and what models are needed? _Lancet Psychiatry_ 2, 564–570 (2015).
* Kessing, L. V. Diagnostic stability in bipolar disorder in clinical practise as according to ICD-10. _J. Affect Disord._ 85, 293–299 (2005).
* Agius, M., Murphy, C. L. & Zaman, R. Under-diagnosis of bipolar affective disorder in a Bedford CMHT. _Psychiatr. Danub._ 22(Suppl. 1), S36–S37 (2010).
* Knežević, V. & Nedić, A. Influence of misdiagnosis on the course of bipolar disorder. _Eur. Rev. Med Pharm. Sci._ 17, 1542–1545 (2013).
* Phillips, M. L. & Kupfer, D. J. Bipolar disorder diagnosis: challenges and future directions. _Lancet_ 381, 1663–1671 (2013).
* Hamilton, M. Development of a rating scale for primary depressive illness. _Br. J. Soc. Clin. Psychol._ 6, 278–296 (1967).
* Young, R. C., Biggs, J. T., Ziegler, V. E. & Meyer, D. A. A rating scale for mania: reliability, validity and sensitivity. _Br. J. Psychiatry_ 133, 429–435 (1978).
* Peralta, V. & Cuesta, M. J. Lack of insight in mood disorders. _J. Affect Disord._ 49, 55–58 (1998).
* Cassidy, F. Insight in bipolar disorder: relationship to episode subtypes and symptom dimensions. _Neuropsychiatr. Dis. Treat._ 6, 627–631 (2010).
* Látalová, K. Insight in bipolar disorder. _Psychiatr. Q._ 83, 293–310 (2012).
* de Assis da Silva, R. et al. Insight across the different mood states of bipolar disorder. _Psychiatr. Q._ 86, 395–405 (2015).
* de Assis da Silva, R., Mograbi, D. C., Bifano, J., Santana, C. M. T. & Cheniaux, E. Insight in bipolar mania: evaluation of its heterogeneity and correlation with clinical symptoms. _J. Affect Disord._ 199, 95–98 (2016).
* Bardram, J. E. et al. Designing Mobile Health Technology for Bipolar Disorder: A Field Trial of the Monarca System. in _Proc. SIGCHI Conference on Human Factors in Computing Systems. CHI ’13_, 2627–2636 (ACM, New York, 2013).
* Frost, M., Doryab, A., Faurholt-Jepsen, M., Kessing, L. V. & Bardram, J. E. Supporting Disease Insight Through Data Analysis: Refinements of the Monarca Self-assessment System. in _Proc. 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing. UbiComp ’13_, 133–142 (ACM, New York, 2013).
* Bardram, J. E. & Frost, M. The personal health technology design space. _IEEE Pervasive Comput._ 15, 70–78 (2016).
* Faurholt-Jepsen, M. et al. Behavioral activities collected through smartphones and the association with illness activity in bipolar disorder. _Int J. Methods Psychiatr. Res._ 25, 309–323 (2016).
* Faurholt-Jepsen, M. et al. Smartphone data as an electronic biomarker of illness activity in bipolar disorder. _Bipolar Disord._ 17, 715–728 (2015).
* Faurholt-Jepsen, M. et al. Smartphone data as objective measures of bipolar disorder symptoms. _Psychiatry Res._ 217, 124–127 (2014).
* Ma, Y., Xu, B., Bai, Y., Sun, G. & Zhu, R. Daily Mood Assessment Based on Mobile Phone Sensing. in _Proc. 2012 Ninth International Conference on Wearable and Implantable Body Sensor Networks_, 142–147 (IEEE, 2012).
* LiKamWa, R., Liu, Y., Lane, N. D. & Zhong, L. MoodScope: Building a Mood Sensor from Smartphone Usage Patterns. in _Proc. 11th Annual International Conference on Mobile Systems, Applications, and Services. MobiSys ’13_, 389–402 (ACM, New York, 2013).
* Canzian, L. & Musolesi, M. Trajectories of Depression: Unobtrusive Monitoring of Depressive States by Means of Smartphone Mobility Traces Analysis. in _Proc. 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. UbiComp ’15_, 1293–1304 (ACM, New York, 2015).
* Grünerbl, A. et al. Smartphone-based recognition of states and state changes in bipolar disorder patients. _IEEE J. Biomed. Health Inf._ 19, 140–148 (2015).
* Abdullah, S. et al. Automatic detection of social rhythms in bipolar disorder. _J. Am. Med. Inform. Assoc._ 23, 538–543 (2016).
* Taylor, S. A., Jaques, N., Nosakhare, E., Sano, A. & Picard, R. Personalized multitask learning for predicting tomorrow's mood, stress, and health. _IEEE Transac. Affect. Comput._ 11, 1 (2018).
* Gelman, A. et al. Bayesian Data Analysis, 3rd edn. in Chapman & Hall/CRC Texts in Statistical Science. (Taylor & Francis, 2013).
* Faurholt-Jepsen, M. et al. Daily electronic monitoring of subjective and objective measures of illness activity in bipolar disorder using smartphones – the MONARCA II trial protocol: a randomized controlled single-blind parallel-group trial. _BMC Psychiatry_ 14, 309 (2014).
* Kessing, L. V. et al. Treatment in a specialised out-patient mood disorder clinic v. standard out-patient treatment in the early course of bipolar disorder: randomised clinical trial. _Br. J. Psychiatry_ 202, 212–219 (2013).
* Wing, J. K. et al. SCAN. Schedules for clinical assessment in neuropsychiatry. _Arch. Gen. Psychiatry_ 47, 589–593 (1990).
* Hyndman, R. & Athanasopoulos, G. _Forecasting: Principles and Practice_, 2nd edn. (OTexts, Melbourne, 2018).
* Murphy, K. P. _Machine Learning: A Probabilistic Perspective_. (The MIT Press, 2012).
* Carpenter, B. et al. Stan: a probabilistic programming language. _J. Stat. Softw., Artic._ 76, 1–32 (2017).
* Pedregosa, F. et al. Scikit-learn: machine learning in Python. _J. Mach. Learn. Res._ 12, 2825–2830 (2011).
* Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in _Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16_, 785–794 (ACM, New York, 2016).
* Molnar, C. Interpretable machine learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/. (2019).
* Schärer, L. O., Krienke, U. J., Graf, S. M., Meltzer, K. & Langosch, J. M. Validation of life-charts documented with the personal life-chart app - a self-monitoring tool for bipolar disorder. _BMC Psychiatry_ 15, 49 (2015).
* Faurholt-Jepsen, M., Munkholm, K., Frost, M., Bardram, J. E. & Kessing, L. V. Electronic self-monitoring of mood using IT platforms in adult patients with bipolar disorder: a systematic review of the validity and evidence. _BMC Psychiatry_ 16, 7 (2016).
* Diagnostic and Statistical Manual of Mental Disorders (DSM–5). American Psychiatric Association. (http://www.webcitation.org/78BxWU0gk). https://www.psychiatry.org/psychiatrists/practice/dsm. (2019).
* Bauer, M. S. et al. Independent assessment of manic and depressive symptoms by self-rating. Scale characteristics and implications for the study of mania. _Arch. Gen. Psychiatry_ 48, 807–812 (1991).
* Scott, J. et al. Activation in bipolar disorders: a systematic review. _JAMA Psychiatry_ 74, 189–196 (2017).
* Jackson, A., Cavanagh, J. & Scott, J. A systematic review of manic and depressive prodromes. _J. Affect Disord._ 74, 209–217 (2003).
* Pavlova, B., Perlis, R. H., Alda, M. & Uher, R. Lifetime prevalence of anxiety disorders in people with bipolar disorder: a systematic review and meta-analysis. _Lancet Psychiatry_ 2, 710–717 (2015).
* Faurholt-Jepsen, M. et al. Smartphone-based self-monitoring in bipolar disorder: evaluation of usability and feasibility of two systems. _Int J. Bipolar Disord._ 7, 1 (2019).

ACKNOWLEDGEMENTS We would like to

thank the participants of the MONARCA II RCT as well as the clinical staff at the Psychiatric Center Copenhagen who helped facilitate the trial and assemble the dataset. The study was

funded by the Innovation Fund Denmark through the RADMIS project and the Copenhagen Center for Health Technology (CACHET).

AUTHOR INFORMATION AUTHORS AND AFFILIATIONS

* Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark: Jonas Busk & Ole Winther
* Department of Health Technology, Technical University of Denmark, Lyngby, Denmark: Jonas Busk & Jakob E. Bardram
* Copenhagen Affective Disorder Research Center (CADIC), Psychiatric Center Copenhagen, Rigshospitalet, Copenhagen, Denmark: Maria Faurholt-Jepsen & Lars Vedel Kessing
* Monsenso ApS, Copenhagen, Denmark: Mads Frost
* Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark: Lars Vedel Kessing
* Center for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark: Ole Winther
* Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark: Ole Winther

CORRESPONDING AUTHOR Correspondence to Jonas Busk. ETHICS DECLARATIONS CONFLICT OF INTEREST J.B., M.F.J., and

O.W. have no conflicts of interest. M.F. and J.E.B. are founders and shareholders of Monsenso. L.V.K. has within the past three years been a consultant for Lundbeck. ADDITIONAL INFORMATION

PUBLISHER’S NOTE Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTAL INFORMATION

RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and

reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes

were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If

material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain

permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS

ARTICLE Busk, J., Faurholt-Jepsen, M., Frost, M. _et al._ Daily estimates of clinical severity of symptoms in bipolar disorder from smartphone-based self-assessments. _Transl Psychiatry_ 10,

194 (2020). https://doi.org/10.1038/s41398-020-00867-6 * Received: 04 November 2019 * Revised: 18 April 2020 * Accepted: 29 April 2020 * Published: 18 June 2020 * DOI: https://doi.org/10.1038/s41398-020-00867-6