ABSTRACT On-line comprehension of natural speech requires segmenting the acoustic stream into discrete linguistic elements. This process is argued to rely on theta-gamma oscillation
coupling, which can parse syllables and encode them in decipherable neural activity. Speech comprehension also strongly depends on contextual cues that help predict speech structure and
content. To explore the effects of theta-gamma coupling on bottom-up/top-down dynamics during on-line syllable identification, we designed a computational model (Precoss—predictive coding
and oscillations for speech) that can recognise syllable sequences in continuous speech. The model uses predictions from internal spectro-temporal representations of syllables and theta
oscillations to signal syllable onsets and duration. Syllable recognition is best when theta-gamma coupling is used to temporally align spectro-temporal predictions with the acoustic input.
This neurocomputational modelling work demonstrates that the notions of predictive coding and neural oscillations can be brought together to account for on-line dynamic sensory processing.
INTRODUCTION Neural oscillations are
involved in many different cognitive operations1,2,3, and considering their cross-frequency coupling brings us closer to understanding their function, e.g., in perception, memory, and attention4. In the domain of natural speech recognition, an important role has been assigned to the coupling of theta and gamma oscillations5,6,7, as it permits hierarchical coordination of the encoding of phonemes within syllables, without prior knowledge of their duration and temporal occurrence, i.e. in a purely bottom-up online manner8. Natural speech recognition also strongly
relies on contextual cues to anticipate the content and temporal structure of the speech signal9,10,11. Recent studies underline the importance of top-down predictive mechanisms during
continuous speech perception and relate them to another range of oscillatory activity, the low-beta band12,13,14,15,16,17. Predictive coding18,19,20,21, on the other hand, offers a theory of
brain function that, in common with Analysis-by-Synthesis22,23 and the Bayesian Brain hypothesis24, relies on the agent having an internal model of how sensory signals are generated from
their underlying hidden causes. Predictive coding also provides a message passing scheme in which top-down predictions and bottom-up prediction errors work together to identify the hidden
causes of sensory signals. It therefore incorporates the contextual and prior knowledge that are invoked as critical in speech processing25. Bottom-up and top-down approaches to speech
processing both find support in theoretical studies. A neurocomputational model involving the coupling of realistic theta and gamma excitatory/inhibitory networks was able to pre-process
speech in such a way that it could then be correctly decoded8. This model aimed at understanding the computational potential of biophysically realistic oscillatory neural processes rather
than simply fitting existing data. A radically different model, solely based on predictive coding, could faithfully recognise isolated speech items (such as words, or full sentences when
considered as a single speech item)26. Although both approaches intend to describe speech perception, one model focused on the on-line parsing aspect of speech processing, and the other on
the recognition of isolated speech segments (no parsing needed). Combining the physiological notion of neural oscillations with the cognitive notion of predictive coding is appealing27 as it
could broaden the capacity, improve performance, and enhance the biological realism of neurocomputational models of speech processing. More fundamentally, such an attempt offers the
opportunity to explore the possible orchestration between two equally important neuroscientific levels of description, computational/algorithmic for analysis-by-synthesis and
algorithmic/implementational for neural oscillations28. In this study, we addressed whether a predictive coding speech recognition model could benefit from neural oscillation processes. We
designed the Precoss neurocomputational model based on the predictive coding framework in which we included theta and gamma oscillatory functions to deal with the continuous nature of
natural speech. Although natural sentence comprehension involves many processing steps up to the syntax level, the parsing of syllables and their on-line recognition is a crucial step and a
challenging issue, even for current automatic speech recognition (ASR) systems29,30,31. The specific goal of this modelling work was hence to address whether combining predictive coding and
neural oscillations could be advantageous for on-line identification of the syllabic components of natural sentences. Specifically, we examined the possible mechanisms by which theta
oscillations can interact with bottom-up and top-down information flows and assessed the effects of this interaction on the efficacy of the syllable decoding process. We show that on-line
syllable identification from speech works best when theta-gamma coupling is combined with the internal knowledge about syllables’ spectral structure, and more broadly when continuous
inferential processes are informed by dynamic oscillation-based cues. RESULTS PRECOSS ARCHITECTURE AND THETA-GAMMA COUPLING An important prerequisite of the model is that it must be able to
use the temporal information/cues present in continuous speech, to define syllable boundaries. We hypothesised that internal generative models including temporal predictions should benefit
from such cues. To address this hypothesis and to account for recurrent processes occurring during speech recognition13,32,33,34, we used a continuous predictive coding model (described in
Methods). Our model explicitly separates _what_ from _when_, with _what_ referring to the identity of a syllable and its spectral representation (a non-temporal but ordered sequence of
spectral vectors), and _when_ to the prediction of the timing and duration of syllables as implemented through periodic/oscillatory processes35,36. _When_ predictions take two forms:
syllable _onset_ as signalled by a theta module, and syllable _duration_ signalled either by exogenous or endogenous theta oscillations that set the duration of a sequence of gamma-timed
units (see Methods for details) (Fig. 1). Precoss retrieves the sensory signal from internal representations about its causes by inverting a generative model. In this case the sensory input
corresponds to the slow amplitude modulation of the speech waveform (see Methods section) and a 6-channel auditory spectrogram37 of a full natural sentence (bottom rows of Fig. 1), which the
model internally generates from the four components depicted in Fig. 1: 1/ a theta oscillation, 2/ a slow amplitude modulation unit in a _theta_-module, 3/ a pool of syllable units (as many
syllables as present in the natural input sentence, i.e., from 4 to 25), and 4/ a bank of eight gamma units in a _spectrotemporal_-module. Together, gamma and syllable units generate
top-down predictions about the input spectrogram. Each of the eight gamma units represents a phase in the syllable; they activate sequentially and the whole activation sequence repeats
itself. Each syllable unit is hence associated with a sequence of eight vectors (one per gamma unit) with 6 components each (one per frequency channel) (see Supplementary Fig. 1). The
acoustic spectrogram of a single syllable is generated by the activation of the corresponding syllable unit over the entire syllable duration. While the syllable unit encodes a specific
acoustic pattern, the gamma units temporally deploy the corresponding spectral prediction over the syllable duration. Information about syllable duration is provided by the theta oscillation
(dashed arrow in Fig. 1), its instantaneous rate affecting the rate/duration of the gamma sequence. Finally, the accumulated evidence about the inferred syllable has to be erased before the
next syllable is processed. To achieve this, the last (8th) gamma unit (Fig. 1 upwards blue arrows, Supplementary Fig. 1), which encodes the last part of the syllable, resets all syllable
units to a common low activation level, enabling new evidence accumulation. The model performance depends on whether the gamma sequence aligns with syllable onsets, and whether its duration
is consistent with the syllable duration (50–600 ms, mean = 182 ms in the dataset used here, Fig. 2). To meet these essential criteria, the theta oscillator, as modelled using the
Ermentrout-Kopell canonical model38, is fed by the dynamics of the slow amplitude modulation of the speech signal inferred from the input, and whenever it reaches a specific, predefined
phase, the model generates a Gaussian trigger signalling syllable onset. This setup implies that the operating theta frequency dynamically changes as a function of the inferred slow
amplitude fluctuations and can hence estimate syllable duration. This estimate is used to adjust the duration of the gamma sequence (via a hidden variable controlling the rate of the
gamma sequence—see Methods section, Eq. (12)). The model’s estimate about the syllable sequence is provided by the syllable units, which, together with gamma units, generate expected
spectrotemporal patterns that are compared with the input spectrogram. The model updates its estimates about the ongoing syllable to minimise the difference between the generated and the
actual spectrogram. The activity level increases in those syllable units whose spectrogram is consistent with the sensory input and decreases in the others. In the ideal case, online
prediction error minimisation leads to elevated activity in one single syllable unit matching the input syllable. We quantify the model’s performance by its ability to correctly identify
syllables in the input, which is determined by the activity of syllable units within the temporal window between two consecutive activations of the first gamma unit. For each window the
model selects the syllable unit that has the highest average activation, the _winning syllable_. Optimal recognition occurs when the winning syllable and the inferred gamma sequence duration
correspond to the real identity and duration of the syllable in the input (see corresponding Methods section and Supplementary Fig. 2 for more details). MODEL VARIANTS AND PERFORMANCE The
model presented above includes a physiologically motivated theta oscillation that is driven by the slow amplitude modulations of the speech waveform and signals information about syllable
onset and duration to a gamma component. This theta-gamma coupling achieves the temporal alignment of internally generated predictions with the syllable boundaries detected from the input
(variant A in Fig. 3). To assess the relevance of syllable timing based on the slow amplitude modulation, we compared the performance of model A against that of a variant in which theta
activity is not modelled by an oscillation but emerges from the self-repetition of the gamma sequence (Model B, Fig. 3). In this alternative version, the duration of the gamma sequence is no
longer controlled exogenously by the theta oscillation, but endogenously using a preferred gamma rate that, as the sequence repeats, results in an intrinsic theta rhythm (Eqs. (12) and
(15)). As with the theta oscillation, the duration of the gamma sequence has a preferred rate in the theta range that can potentially adapt to variable syllable durations during the
inference process, via prediction errors that change the hidden variable responsible for its rate. With this variant, the model allows for testing a theta rhythm arising from the repetition
of the gamma sequence rather than from an explicit stimulus driven oscillator. This scheme where theta emerges from endogenously rhythmic modulations of gamma activity is argued in the
literature39 as a possible alternative to a separate cortical theta rhythm driving gamma activity40. Note that both model versions (A and B) operate with a reset of accumulated evidence by
silencing syllable units at the end of each syllable. In order to more precisely assess the specific effects of theta-gamma coupling and of the reset of accumulated evidence in syllable
units, we ran additional simulations in two more variants of models A and B. Variants C and D differed by having no preferred gamma rate (Supplementary Fig. 1). Variants E and F further
differed from variants C and D by having no reset of accumulated evidence in syllable units (Fig. 3, left panel, Table 1 in Methods). Among all model variants only A has a true theta-gamma
coupling, where gamma activity is determined by the theta module (Fig. 3, red dashed arrow), whereas for B the gamma rate is set endogenously (red dashed circle). In all other cases, the
rate of gamma activity can freely adapt to the input. To address the impact of the reset of gamma units by exogenous theta triggers we compared models A, C and E, with B, D and F,
respectively. Simulations were performed on natural sentences from the TIMIT41 speech corpus. Bar plots in Fig. 3 show the median performance for each architecture. Although all model
variants performed well above chance, there were significant differences across them (p-values for pairwise comparisons and Bonferroni corrected critical α-value are presented in
Supplementary Table 1). Relative to models A and B, performance was significantly lower in models E and F (on average by 23%), and C and D (by 15%) indicating that erasing accumulated
evidence about the previous syllable before processing a new syllable is a crucial factor for syllable stream encoding in natural speech. The comparison of variants A and B versus C and D
indicates that theta-gamma coupling, whether stimulus-driven (A) or endogenous (B), significantly improved the model’s performance (on average by 8.6%). The simulations further show that the
model performed best when syllable units were reset after completion of each gamma-units sequence (based on internal information about the spectral syllable structure), and when the gamma
rate was driven by theta-gamma coupling irrespective of whether it was stimulus- (red dashed arrow in A) or endogenously-timed (red dashed circle in B). The model’s performance with natural
sentences hence depended neither on the precise signalling of syllable onsets via a stimulus-driven theta oscillation, nor on the exact mechanism of the theta-gamma coupling. Although surprising, the absence of a performance difference between stimulus-driven and endogenous theta-gamma coupling likely reflects that the duration of syllables in natural, regular speech is very close to the model's expectations, in which case a theta signal driven directly by the input offers no advantage. To address this point, we ran additional
simulations on model variants A (theta-gamma coupling through stimulus-driven theta) and B (endogenous theta-gamma coupling via preferred gamma rate) using compressed speech signals (normal
speed ×2 and ×3). Behavioural studies14,42,43 show that comprehension remains almost intact when speech is compressed by 2 but drops for a compression factor of 3. Under such adverse
conditions, stimulus-driven theta-gamma coupling might become beneficial for parsing and decoding syllables. Simulation results with compressed speech stimuli are presented in Fig. 4. For
the sake of comparison, we display again performance of variants A and B with natural uncompressed sentences. As expected, overall performance dropped with increased compression factor.
However, while for compression factor 2, there was still no significant difference between stimulus-driven and endogenous theta-gamma coupling, a significant difference emerged for
compression factor 3, (mean difference = 1.74% [0.28, ∞, 95% CI], Cohen’s _d_ = 0.1363, _p_ = 0.0248), indicating that a stimulus-driven theta oscillation driving theta-gamma coupling was
more advantageous for the syllable encoding process than an endogenously-set theta rate. These simulations therefore suggest that natural speech can be processed with a relatively fixed
endogenous theta oscillator but that a more flexible, stimulus-driven, theta oscillator signalling precise syllable timing to the gamma encoding module, might become essential for decoding
speech in adverse conditions such as when speech rate is variable. BAYESIAN INFORMATION CRITERION The ability of the model to accurately recognise syllables in the input sentence does not
take into account the variable complexity of the different models compared. We therefore estimated the Bayesian Information Criterion44 (BIC) for each model, which quantifies the trade-off
between model accuracy and complexity (Fig. 5). Variant A, the model in which the exogenously driven theta oscillation informs the gamma rate and the accumulated evidence is reset based on the internal knowledge of syllable structure, showed the highest BIC value. While the comparison based on syllable recognition accuracy (Fig. 3) could not distinguish between variants A and B,
the BIC (Supplementary Table 2) shows that variant A results in more confident syllable recognition than the model without stimulus-driven theta oscillation (Variant B). Overall, these
simulations show that two processes determine the performance of online syllable identification in the model. The first and most important one is the reset of accumulated evidence based on
the model’s information about the syllable content (here its spectral structure). The second one is the coupling between theta and gamma periodic processes, which ensures that gamma activity
is embedded within a theta cycle corresponding to the expected syllable duration. Importantly, the coupling was more efficient when it arose from a theta oscillator driven by the speech
acoustics; the model was marginally more resilient to speech rate variations (Fig. 4), and it had the best accuracy versus complexity trade-off (Fig. 5). DISCUSSION This work focuses on a
necessary operation for speech recognition: the parsing of the continuous speech signal into discrete elements. Precoss was exapted from a speech recognition model26 inspired by birdsong
generation45. While most simulations described in Yildiz et al.26 were performed on monosyllabic, normalised (in the temporal domain) words (digits), our proposed model extends the approach
to natural continuous speech made up of subunits (here syllables) whose order is not known a priori. Since speech is made of linguistic building blocks allowing for (quasi)infinite
combinatorial possibilities, this constitutes an important step toward neurobiological realism compared with Yildiz et al.26. The model emulates the combinatorial freedom of language by
assuming that syllables present in a given sentence can appear in any order. Given real speech statistics, context can reduce—but not generally eliminate—the uncertainty about which syllable
might come next. Our main observation suggests that when uncertainty about the identity of the next syllable remains, which is true in most cases, coordination of bottom-up and top-down
information flow is critical for recognition and that neural oscillations can enhance this coordination by signalling syllable temporal structure (onsets and duration). Syllabification is a
non-trivial problem that is classically dealt with by off-line methods, e.g., the Mermelstein algorithm46. It was shown, however, that a theta-range natural oscillator built from reciprocal connections between excitatory and inhibitory leaky integrate-and-fire neurons could achieve highly accurate on-line syllabification8,31. The efficiency of such a theta oscillator is due to
the fact that its intrinsic frequency can adapt, within limits, to that of an external stimulus. In Precoss, we took advantage of the theta oscillation for syllable parsing, using a
simplified version of this network including one single canonical theta neuron38, whose function is to signal syllable onset based on the slow amplitude modulation of the speech waveform.
This simplification was necessary because integrate-and-fire neurons are incompatible with a continuous framework. This simplified theta oscillation implementation allowed us to reach
around 53% correct syllable onset detection (theta trigger within 50 ms of a real onset), a relatively modest rate, which however was sufficient to demonstrate the advantage of an
oscillatory mechanism that can adapt to the acoustics. We indeed compared this model, which includes a stimulus-informed theta oscillation, with a model where theta-gamma coupling is
endogenous, i.e., when it arises from the repetition of the gamma sequence at a predetermined rate that flexibly adapts to the variable syllable lengths during the inference process (Eq.
(12), Methods). While both models performed similarly for natural speech up to a compression factor of 2, the model with stimulus informed theta oscillator performed better for a compression
factor 3. The advantage of the latter model is that gamma activity precisely aligns with the syllables in the input, allowing it to encode syllables of variable length as they occur in
continuous natural speech. A critical feature of the model is the coupling between the theta and gamma modules. Many experimental studies indicate that neural theta and gamma activity
interact2,8,47,48,49, and that theta most likely organises gamma activity to align neural encoding timing with the temporal structure of the auditory sensory signal. In speech processing,
cross-frequency coupling is thought to preserve the hierarchy between phonemes, syllables and even higher order linguistic elements such as phrases7,50,51,52,53. Here, by having two
alternative implementations for the theta rhythm, we effectively modelled two forms of theta-gamma coupling: one in which the gamma sequence is controlled (onset and duration) by the
stimulus-driven theta oscillation, and another one in which the theta rhythm emerges from an endogenous gamma sequence. The observation that model variants A and B, which implement one or
the other theta-gamma coupling option perform better than the variants without coupling (C, D, E, F), suggests that the nesting of two temporal timescales facilitates on-line syllable
recognition, most presumably by aligning top-down and bottom-up information flows with the syllabic structures in the input. However, we also observed that theta-gamma coupling was not the
most important factor for correct performance. Irrespective of theta-gamma coupling, the model required an explicit reset of syllable units to function at an optimal level (Fig. 3, C, D vs. E, F):
the high activity level of the winning syllable unit adversely affected the processing of the next syllable, reflecting that syllable units were integrating information across syllable
boundaries. This issue was dealt with by incorporating a reset of the accumulated evidence about the previous syllable, which increased performance by up to 15–20%. Importantly however,
model variants A and B, which combine theta-gamma coupling with the reset of accumulated evidence in syllable units, performed better (by 8–10%) than model variants C and D that only relied
on syllable units reset, even when the latter had a flexible gamma network rhythm and could in principle have adapted to the variable syllable length in the input sentences. The performance
measure was based on the syllable unit with the highest average activity within a cycle of the gamma network (Supplementary Fig. 2). This measure does not take into account model complexity
or the robustness of the model’s syllable estimate. After accounting for both by considering BIC, we found that the model that best accounted for the actual syllables present in the input
was variant A, the model in which both the theta rhythm and the gamma network rhythm are stimulus driven. The log-likelihood value used to calculate the BIC value uses the actual values of
syllable units together with their precision, and therefore is more sensitive than the performance measure used in Fig. 3. The larger BIC for variant A suggests that even in the cases in
which the identity of the syllable unit with highest average activity within a gamma network cycle is the same in both variants, the precision of recognised syllables may be higher for
variant A than B. This also suggests that variant A might be more robust to other input modifications such as background noise. The absence of a difference in our basic measure of
performance between models with stimulus-driven versus endogenous theta-gamma coupling at natural speech rate is a puzzling finding. It suggests that the notion that a theta rhythm drives
the gamma network integration window, i.e. that the gamma sequence has a plausible syllable-length duration, is a more important factor than precise syllable timing. Yet, a possible reason
for this observation could be that theta-signalled syllable onsets were too imprecise to provide an optimal gain (as the single neuron theta model only yielded 53% accuracy). To address
this point, we ran yet another round of simulations where we compared model performance depending on whether we used the spiking theta oscillator or true syllable onsets to reset the gamma
sequence (Fig. 6 model A versus A’). Performance was higher for variant A’ (Fig. 6, by 5.44% ± 2.04 (95% CI), Cohen’s _d_ = 0.363, _p_ = 3.63e-7), i.e. when endogenous theta-gamma coupling
was driven by true syllable onset information, suggesting that the model can still benefit from accurate syllable onset signalling. Although such ideal onset detection is unrealistic in
a biological system and in natural listening conditions (noise, cocktail-party situations, etc.), this simulation suggests that by improving the way the model handles syllable onset
detection the gain provided by exogenous theta-gamma coupling could be further increased. So far, the best performance for on-line syllable boundary detection was obtained using a theta
oscillation network of 32 neurons (16 excitatory and 16 inhibitory) bursting together, which made boundary detection more flexible and robust to fluctuations in the speech waveform8,31.
Increasing the number of theta neurons in the model’s theta module would significantly increase the computational load, yet presumably improve the accuracy of the syllable onset detection.
Although top-down control is constitutively present in the predictive coding implementation, our model lacks the notion that top-down processes involve specific discharge rates, and that, as
recently demonstrated, they occur preferentially at a low-beta rate range12,13,17,21. The predictive coding model we present here works in a continuous inferential mode, which is
discretized only by virtue of syllable timing. In human auditory cortex, bottom-up gamma activity is modulated at the low-beta rate13,54, which could offer top-down integration constants
that are intermediate between syllables and gamma phonemic-range chunks, and whose advantage could be to smooth the decoding process by providing sequential priors at an intermediate scale
between phonemes and syllables. Alternatively, the beta top-down rhythm could be related to expected precision, thus encoding second-order statistics20. Expected precision weighs bottom-up
prediction errors, hypothesised to work at gamma rate, and could control their impact on the evidence integration process. When the sensory input corresponding to a new syllable arrives, the
large prediction error could decrease the estimated confidence in top-down prediction and boost confidence in bottom-up prediction error. If the relative weight of bottom-up and top-down
information is carried by low beta activity, we would then expect an alternation between up- and down-going processes at a theta rhythm, a finding that was experimentally observed13. An
important generalisation for the model would thus consist of adopting a framework that estimates precision55,56,57. These proposals remain speculative, and neurocomputational modelling could
be one way to address whether the principle of frequency and temporal division of bottom-up and top-down processing is functionally interesting and whether the low-beta rate for top-down
flow is optimal or merely a “just so” phenomenon. Although our goal was not to design a speech processing model that can compete with those used in the domain of automatic speech
recognition58,59, it turns out that the notion of neural oscillations can be relevant for the latter8, even at the hardware level60. Hyafil and Cernak31 demonstrated that a biophysically
plausible theta oscillator which can syllabify speech _online_ in a flexible manner makes a speech recognition system more resilient to noise and to variable speech rates. Here, we confirm
that a biologically realistic theta oscillation is useful to signal syllable boundaries, mostly when these fall outside natural speech statistics. More importantly, we show that the coupling between
two processing timescales at theta and gamma rates can boost on-line syllable recognition in natural speech. It is hence possible that introducing oscillatory mechanisms in ASR could further
improve performance and resilience to noise. For instance, using a basic sampling rate in the low-gamma range as implemented in the current model instead of faster ones as commonly used61
could save computing resources without reducing ASR system performance. The possible coupling of a syllable-parsing oscillatory mechanism with an encoding process that is both exogenously
and endogenously timed could also be useful. Finally, using oscillation-based top-down updating, which could deploy predictions about when speech events are expected to happen, a process
that most ASR systems do not yet implement62, might also turn out to be advantageous. This theoretical work shows the interest of extending predictive coding approaches to the domain of
neural oscillations. This combination permits both to emulate a neurally plausible interface with the real world, one able to deal with the continuous nature of biological stimuli and the difficulty of parsing them into elements with discrete representational value, and to provide an internal orchestration of the information flow that can compensate for possible deficiencies of interfacing
mechanisms. Overall, this study suggests an alternative perspective of combining neural oscillations with predictive coding: namely the longitudinal effect of theta-gamma coupling on the
temporal alignment of top-down/bottom-up informational flows during inferential processes. METHODS SPEECH INPUT We used 220 recorded English sentences from the TIMIT database41 for our
simulations. The sentences were spoken by 22 different speakers (10 sentences each). Overall, those 220 sentences include 2888 syllables. We used the first 10 sentences of the dataset to
adjust parameters of the model (Eq. (7) and precisions, Table 2). Thus, these sentences were not included in the model performance analysis. Input to the model consisted of (1) a
time-frequency representation of the sound wave and (2) the slow amplitude modulation of the sound waveform. We used a biologically inspired model of the auditory periphery37 to obtain the
time-frequency representation, therefore we will refer to it as the auditory spectrogram. The model transforms the auditory signal into 128 logarithmically spaced frequency channels, ranging
from 150 Hz up to 7 kHz. We normalised the auditory spectrogram so that the maximum value across all channels is 1 and the minimum 0. After averaging the activity of adjacent channels, we
reduced the number of channels to 6, covering the range of frequencies from 150 Hz to 5 kHz. The slow amplitude modulation is calculated following the procedures described in Hyafil et al.8
and Hyafil and Cernak31. The procedure follows these steps: (1) transform the auditory spectrogram into 32 frequency channels by averaging every 4 channels (from the full 128 channel
auditory spectrogram) into 1 channel. (2) convolve with a spectrotemporal filter optimised to signal syllable boundaries. In Hyafil et al.8 the output of the spectrotemporal filter, which
corresponds to the slow amplitude modulation in the speech waveform, was used to drive a theta oscillator made of integrate and fire model neurons. In Precoss, it is used to drive the theta
module that is implemented by a continuous theta neuron. In summary, each recorded sentence was represented by seven sensory input channels: the slow amplitude modulation (_S_(t)) plus 6 frequency bands (_F__f_(t); _f_ = 1, … 6).
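As an illustration of this preprocessing, the Python sketch below reduces a 128-channel auditory spectrogram to the 6-channel model input. It is a minimal sketch under stated assumptions: the function name is hypothetical, and the exact channel grouping and the number of sub-5 kHz channels kept are not specified in the text, which only states that the spectrogram is normalised to [0, 1] and that adjacent channels between 150 Hz and 5 kHz are averaged.

```python
import numpy as np

def reduce_spectrogram(spec128, keep=118):
    """Collapse a 128-channel auditory spectrogram (channels x time) into
    the 6 frequency bands used as model input F_f(t). The min-max
    normalisation details and `keep` (number of channels below ~5 kHz on
    the logarithmic axis) are assumptions."""
    spec = (spec128 - spec128.min()) / (spec128.max() - spec128.min())
    bands = np.array_split(spec[:keep], 6, axis=0)    # groups of adjacent channels
    return np.stack([b.mean(axis=0) for b in bands])  # shape (6, T)
```

SYLLABIFICATION The model's goal is to recognise syllables on-line, which requires defining syllables in the input to subsequently assess the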
model’s performance. The TIMIT corpus provides phonemic boundaries labelled by professional phoneticians. This information was passed to Tsylb2, a programme that provides candidate syllable
boundaries based on the TIMIT phonemic annotation and on English grammar rules63. These boundaries were then used to calculate spectrotemporal patterns ST_fγω_ (_f_ = 1, … 6, number of
frequency channels, and _γ_ = 1, … 8, number of gamma units) for each syllable _ω_ of a given sentence. Firstly, each syllable (_ω_) was divided into 8 equal Δ_T_ temporal chunks. Then, syllable spectrotemporal patterns ST_fγω_ were calculated by averaging each of the 6 frequency channels (_F__f_(t)) within each of the eight Δ_T_ windows (_γ_) for each syllable (_ω_). We used 6 frequency channels per syllable; thus, the spectrotemporal patterns are matrices with 6 rows and 8 columns for each syllable (_ω_) (Supplementary Fig. 1).
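A minimal sketch of this computation, assuming NumPy arrays and Tsylb2 boundaries given as sample indices (function and argument names are hypothetical):

```python
import numpy as np

def syllable_patterns(F, boundaries, n_gamma=8):
    """Compute the 6 x 8 spectrotemporal pattern ST_fgw for each syllable:
    average the 6 frequency channels F (shape (6, T)) within 8 equal
    dT windows between each syllable's onset and offset."""
    patterns = []
    for onset, offset in boundaries:
        chunks = np.array_split(F[:, onset:offset], n_gamma, axis=1)
        patterns.append(np.stack([c.mean(axis=1) for c in chunks], axis=1))
    return patterns  # one (6, 8) matrix per syllable
```

GENERATIVE MODEL We used a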
predictive coding model to parse and recognise individual syllables from the continuous acoustic waveform of spoken English sentences. The core of the predictive coding framework is a
hierarchically structured generative model that represents the internal knowledge about the statistics and structure of the external world. During the inference process, the brain inverts
the generative model and tries to infer the hidden causes of the sensory input. To invert the generative model, we used the Dynamic Expectation Maximisation algorithm55,64, which is based on
top-down predictions and bottom-up prediction errors. We considered a generative model with two hierarchically related levels. At each level _i_ in the hierarchy, dynamics are determined by
local hidden states (denoted by _x__(i)_) and causal states from the level above (denoted by _υ__(i)_). At the same time, each level generates causal states that pass information to the
level below (_υ__(i-1)_). Hidden states are subject to dynamical equations while causal states are defined by static, generally nonlinear transformations of hidden states and causal states.
Schematically:

Top level (_i_ = 2):

$$\dot{x}^{(2)} = f^{(2)}\left(x^{(2)}\right) + \varepsilon^{(2)} \quad (1)$$

$$v^{(1)} = g^{(2)}\left(x^{(2)}\right) + \eta^{(2)} \quad (2)$$

The dynamics at this level are only determined by hidden states _x_(2). _υ_(1) is the output to the level below. The hidden states at this level include hidden states for a theta oscillator, the slow amplitude modulation of the speech waveform, syllable units, and gamma units (see below for details). _ε_(i) and _η_(i) (_i_ = 1, 2) stand for random fluctuations for hidden and causal states respectively (the same notation is used in the next sections); their precision determines how prediction errors are weighted55. Causal states passing information to the bottom level include causal states for syllable units, gamma units, and the slow amplitude modulation.

Bottom level (_i_ = 1):

$$\dot{x}^{(1)} = f^{(1)}\left(x^{(1)}, v^{(1)}\right) + \varepsilon^{(1)} \quad (3)$$

$$v^{(0)} = g^{(1)}\left(x^{(1)}, v^{(1)}\right) + \eta^{(1)} \quad (4)$$

At this level, there are hidden and causal states related to the 6 frequency channels and a causal state for the slow amplitude modulation (which is relayed without modification from the top level). The output of the bottom level _υ_(0) is then compared with the input **Z**(t), a vector containing the slow amplitude modulation and the reduced 6-channel auditory spectrogram:

$$v^{(0)} = \mathbf{Z}(t) \quad (5)$$

In the following, we write the explicit form of these equations.
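For intuition, the scalar Python sketch below shows one update step of such a two-level scheme, with the causal and hidden states descending precision-weighted prediction errors. It is a didactic simplification, not the Dynamic Expectation Maximisation algorithm used to invert the model, which also operates in generalised coordinates; all names are illustrative.

```python
def pc_step(x2, v1, z, f2, g2, g1, pi1=1.0, pi2=1.0, lr=0.01, h=1e-5):
    """One schematic predictive coding update for the two-level model of
    Eqs. (1)-(5), scalar case. z is the current sensory sample."""
    eps1 = z - g1(v1)                  # bottom-level prediction error
    eps2 = v1 - g2(x2)                 # top-level prediction error
    dg1 = (g1(v1 + h) - g1(v1)) / h    # local slopes, estimated numerically
    dg2 = (g2(x2 + h) - g2(x2)) / h
    v1 += lr * (pi1 * dg1 * eps1 - pi2 * eps2)  # explain input vs. follow prior
    x2 += lr * (f2(x2) + pi2 * dg2 * eps2)      # Eq. (1) flow plus error feedback
    return x2, v1
```

For instance, with g1 = tanh, g2 the identity and f2 a weak decay, iterating pc_step drives v1 towards a compromise between the sensory evidence z and the top-down prediction g2(x2), with the balance set by the precisions pi1 and pi2.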
Supplementary Fig. 3 provides a schematic with all the variables used in the model. TOP LEVEL The top level has two modules: a theta module with _slow amplitude modulation_ and _theta oscillator_ units, and a spectrotemporal module with _gamma_ and _syllable_ units. We used Ermentrout-Kopell's canonical model38 for the model's theta oscillation. The theta model receives the slow amplitude modulation signal as its input. Whenever the theta oscillator reaches a predefined phase, the model generates a Gaussian trigger, which constitutes the model's estimate of syllable onsets (also called theta triggers). The model tracks the slow amplitude modulation in the input with a perfect integrator:

$$\frac{dA}{dt} = 0 + \varepsilon_A^{(2)} \quad (6)$$

During inference, _A_ generates an equivalent signal that is compared with the slow amplitude fluctuations in the input (see Supplementary Fig. 3); precision-weighted prediction errors in generalised coordinates drive Eq. (6) and result in the variable _A_ tracking the slow amplitude modulations. We modify the tracked amplitude and use it as an input to the canonical theta model:

$$R = 0.25 + 0.21A \quad (7)$$

The first parameter on the right-hand side is chosen so that the theta frequency is around 5 Hz whenever _A_ is 0. The coefficient for _A_ ensures that the range of the resulting rhythm is within the biological theta range (3–8 Hz, tested on the first 10 sentences of the dataset). The following pair of equations corresponds to the theta oscillator in the theta module:

$$\begin{aligned} \frac{dq_1}{dt} &= -k q_2 \left(1 + R + q_1 (R - 1)\right) + \varepsilon_{q_1}^{(2)} \\ \frac{dq_2}{dt} &= k q_1 \left(1 + R + q_1 (R - 1)\right) + \varepsilon_{q_2}^{(2)} \\ k &= \frac{2\pi\Omega}{1000} \end{aligned} \quad (8)$$

where 1000 is the sampling rate and _Ω_ = 5 Hz is the frequency of theta in the absence of input (_A_ = 0). The quantity within brackets on the right-hand side of Eq. (8) stands for the normalised instantaneous rate of the theta oscillation (instantaneous rate = _k_ ∙ _s_θ):

$$s_{\theta} = 1 + R + q_1 (R - 1) \quad (9)$$

The value of _k_ ∙ _s_θ determines the interval between two consecutive theta triggers, hence also the interval between two consecutive syllable onsets. As this interval indirectly estimates the syllable duration between signalled onsets, we have used the value of _s_θ to set the preferred rate of the gamma sequence (and by extension the gamma sequence duration) as discussed in the next section.
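The Python sketch below illustrates this mechanism by integrating Eqs. (7)–(9) with a forward Euler scheme at 1 kHz and emitting the Gaussian trigger of Eq. (25) near the trigger phase; the noise terms are omitted and the trigger width σ is an assumed value.

```python
import numpy as np

def theta_triggers(A, omega=5.0, fs=1000, sigma=0.05):
    """Integrate the canonical theta oscillator (Eqs. 7-9, noise-free)
    driven by the slow amplitude modulation A (one sample per ms) and
    return a trigger trace peaking near the phase (q1, q2) = (-1, 0)."""
    k = 2 * np.pi * omega / fs
    q1, q2 = 1.0, 0.0                      # arbitrary initial phase
    trig = np.zeros(len(A))
    for t, a in enumerate(A):
        R = 0.25 + 0.21 * a                # Eq. (7)
        s_theta = 1 + R + q1 * (R - 1)     # Eq. (9): normalised instantaneous rate
        q1, q2 = q1 - k * q2 * s_theta, q2 + k * q1 * s_theta  # Eq. (8), Euler step
        n = np.hypot(q1, q2)               # project onto the unit circle
        trig[t] = np.exp(-((q1 / n + 1) ** 2 + (q2 / n) ** 2) / (2 * sigma ** 2))  # Eq. (25)
    return trig
```
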
Gamma units are modelled as a stable heteroclinic channel, which results in their sequential activation65 (for details see26,45). The duration of the gamma sequence depends on the hidden
variable _s_ (Eq. (12)), which sets the rate of the gamma sequence through _κ_2 in Eq. (10) and whose dynamics depend on whether the model variant (1) has no preferred rate for the gamma sequence, (2) has a fixed preferred rate, or (3) has a preferred rate determined via the exogenous theta oscillation. For the second case, an endogenous rhythm is associated with the gamma
sequence (endogenous theta) whose rate is set to generate a rhythm at 5 Hz (Eq. (15)), therefore corresponding approximately to the average syllable duration in English30 and to the syllabic
rate in our data set (Fig. 2). In the model, gamma units provide processing windows for the syllable encoding process. The active gamma unit determines which part of the syllable is encoded
at each moment of time. For example, if the first gamma unit is active, then the first 1/8 part of the spectral content of a syllable is encoded; if the second gamma unit is active, then the
second 1/8 part is encoded, and so on. The mathematical equations are adapted from Yildiz et al.26:

$$\frac{d\mathbf{z}}{dt} = \kappa_2(s)\left[-\lambda \mathbf{z} - \rho S(\mathbf{z}) + 1\right] - \beta\left(\mathbf{z} - \mathbf{z}_0\right) T_\gamma + \varepsilon_z^{(2)} \quad (10)$$

$$\frac{dy_i}{dt} = e^{z_i} - y_i \sum_{j=1}^{N_\gamma} e^{z_j} - \beta\left(y_i - y_{i,0}\right) T_\gamma + \varepsilon_{y_i}^{(2)} \quad (11)$$

$$\frac{ds}{dt} = f(s) + \varepsilon_s^{(2)} \quad (12)$$

where

* _i_ represents the index of the gamma unit and takes values from 1 to _N_γ = 8.
* **z** is a vector of 8 units encoding the amplitude fluctuations of the gamma units, whereas the vector **y** represents the amplitudes of the gamma units scaled to the [0, 1] interval.
* **z**0 and **y**0 represent the reset values of **z** and **y**, corresponding to the state when the first gamma unit is active (the start of the gamma sequence).
* _T_γ stands for the trigger that gamma units receive from the theta module (Table 1).
* _β_ = 0.5 is a scaling factor for theta triggers.
* _S_(**z**) = 1/(1 + e^(−**z**)) is applied component-wise.
* ρ_ij_ ≥ 0 is the connectivity matrix, determining the inhibition strength from unit _j_ to _i_. Its values are:

$$\rho_{ij} = \begin{cases} 0 & i = j \\ 1.5 & j = i + 1 \\ 0.5 & j = i - 1 \\ 1 & \text{otherwise} \end{cases} \quad (13)$$

The first term on the right-hand side of both Eqs. (10) and (11) is taken from Yildiz et al.
(2013)26. The trigger term introduced on the right-hand side of Eqs. (10) and (11) ensures that irrespective of the current state of the network, the first gamma unit is activated, and the
gamma sequence is deployed from the beginning whenever there is a trigger. When the trigger corresponds to a syllable onset, it ensures that the gamma sequence is properly aligned with the
input and therefore that the spectrotemporal predictions temporally align with the input. We have also modified the equations so that the value of κ2 now is a function of the hidden variable
_s_ (Eq. (12)):

$$\kappa_2(s) = \kappa_0 e^{(s-1)}, \qquad s(t=0) = 1, \qquad \kappa_0 = 0.2625 \quad (14)$$

The transformation from _s_ to _κ_2 guarantees that the latter stays positive. _s_ was always initialised to _s_(_t_ = 0) = 1. This initial value leads to κ2 = κ0 in Eq. (10), which corresponds to a 25 ms duration for each gamma unit (40 Hz frequency) and an overall duration of 200 ms for the whole gamma sequence. During inference, the value of _s_ (and consequently the value of _κ_2) can vary depending on the form of _f_(_s_) in Eq. (12):

$$f(s) = \begin{cases} 0 \\ 1 - s \\ s_{\theta} - s \end{cases} \quad (15)$$

_f(s)_ = 0,
corresponds to the no-preferred-gamma-rate condition (no theta-gamma coupling); the value of _s_ can change freely. For _f(s)_ = 1−_s_, _s_ can change its value (and as a consequence the duration of the gamma sequence) but tends to decay back to _s_ = 1, that is, the model has a preferred gamma sequence duration of 200 ms; we refer to this as the endogenous theta rhythm and endogenous theta-gamma coupling. Finally, when _f(s) = s_θ−_s_, the preferred rate of the gamma sequence is set by the exogenous theta oscillation, corresponding to exogenous theta-gamma coupling. For all model variants except A, B and A', _f(s)_ = 0; variant A features _f(s) = s_θ−_s_ (exogenous theta-gamma coupling); and variants B and A' have endogenous theta-gamma coupling with _f(s)_ = 1−_s_. A summary of the model variants can be found in Table 1.
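To make these three options concrete, here is a noise-free Python sketch of the gamma module (Eqs. (10)–(15)); the trigger term is omitted, the _y_ dynamics of Eq. (11) are replaced by their softmax fixed point, and the relaxation step for _s_ is an assumed value.

```python
import numpy as np

def gamma_sequence(T, variant="endogenous", s_theta=None, lam=1.0,
                   kappa0=0.2625, n=8):
    """Sketch of the gamma sequence with 1 ms steps. `variant` selects
    f(s): "free" (f = 0), "endogenous" (f = 1 - s) or "exogenous"
    (f = s_theta - s, with s_theta supplied per time step)."""
    rho = np.ones((n, n))                  # Eq. (13): asymmetric inhibition
    np.fill_diagonal(rho, 0.0)
    for i in range(n - 1):
        rho[i, i + 1] = 1.5                # inhibition from unit j = i + 1
        rho[i + 1, i] = 0.5                # inhibition from unit j = i - 1
    z = np.full(n, -1.0)
    z[0] = 1.0                             # start with the first gamma unit active
    s = 1.0                                # s = 1 gives a 200 ms sequence (Eq. 14)
    y_hist = np.zeros((T, n))
    for t in range(T):
        kappa2 = kappa0 * np.exp(s - 1.0)                            # Eq. (14)
        z += kappa2 * (-lam * z - rho @ (1 / (1 + np.exp(-z))) + 1)  # Eq. (10)
        y_hist[t] = np.exp(z) / np.exp(z).sum()  # softmax = fixed point of Eq. (11)
        f_s = {"free": 0.0, "endogenous": 1.0 - s,
               "exogenous": (s_theta[t] - s) if s_theta is not None else 0.0}[variant]
        s += 0.001 * f_s                   # Eq. (12), assumed relaxation step
    return y_hist
```

Across the simulations, in the case of the exogenous theta-gamma coupling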
(variant A, Fig. 3) the average value of κ2 across all sentences was equal to 1.23κ0 (with standard deviation equal to 0.45κ0), which corresponds to a gamma sequence duration around 163 ms.
This suggests that the stimulus driven theta oscillation adapts to the syllable duration in the input. Even though the theta frequency was tuned to 5 Hz (corresponding to a gamma sequence
duration of 200 ms), the resulting average frequency corresponds to the median syllable duration in our dataset (Fig. 2). In the case of variant B (endogenous theta), the gamma rate remained
fixed at κ2 = κ0 (with standard deviation equal to 0.0025κ0). The last module of the top-level contains the syllable units; they represent evidence that the associated syllable corresponds
to the syllable in the input. The number of syllable units varies from sentence to sentence and corresponds to the number of syllables in the input sentence plus a silent unit which
generates silence, i.e. it should be maximally active whenever there is no signal in the input. The equation for the syllable units is:

$$\frac{d\omega}{dt} = -\left(\omega - \omega_0\right) T_\omega + \varepsilon_\omega^{(2)} \quad (16)$$

where _ω_ is a vector with as many components as syllables in the sentence (plus the silent unit). _T_ω corresponds to triggers (Table 1) that reset the activation level of the syllable units. A trigger drives the activity level of all
syllable units towards an equal value _ω__0_. As we will specify below, triggers originated from the last gamma unit, signalling internal expectations about the end of a syllable; in the
case of model variants without the reset of syllable units, the trigger was set to 0. Between triggers, syllable units act as evidence accumulators. The activation level of each unit
determines its contribution to the generated auditory spectrogram (Eqs. (18) and (21)). The causal states of the second level pass information to the bottom level:

$$v_\gamma^{(1)} = \mathbf{y} + \eta_\gamma^{(2)} \quad (17)$$

$$v_\omega^{(1)} = \frac{e^{-\omega}}{\sum e^{-\omega}} + \eta_\omega^{(2)} \quad (18)$$

$$v_A^{(1)} = A + \eta_A^{(2)} \quad (19)$$

Equation (17) corresponds to the 8 scaled gamma units (Eq. (11)), which activate sequentially. Equation (18) corresponds to the syllable units, where we
used the softmax function to scale the activity of the syllable units. Since all the syllables in the input are present in the syllable pool, prediction error (the difference between
predicted and actual spectrotemporal patterns at the first level) will be minimised when the causal state of the corresponding syllable unit in the model is driven close to 1 while all
others are driven close to 0. Finally, Eq. (19) sends information about the current estimate of the slow amplitude modulation. BOTTOM LEVEL The bottom level contains variables related to the
amplitude fluctuations of the frequency channels as well as the slow amplitude modulation. The amplitude fluctuations of the frequency channels are modelled with a Hopfield attractor-based
neural network66. The following equations were adapted from Yildiz et al.26:

$$\frac{d\mathbf{x}^{(1)}}{dt} = \kappa_1\left[-D\mathbf{x}^{(1)} + W\tanh\left(\mathbf{x}^{(1)}\right) + I\right] + \varepsilon^{(1)} \quad (20)$$

$$I_f = \sum_{\gamma=1}^{8} \sum_{\omega=1}^{N_{syl}} v_\gamma^{(1)} v_\omega^{(1)} P_{f\gamma\omega} \quad (21)$$

**x**(1) is a vector with 6 components (one per frequency channel), _D_ is a diagonal self-connectivity matrix and _W_ is an asymmetric synaptic connectivity matrix; they were designed so that the Hopfield network has a global attractor whose location depends on the vector I26. In Eq. (21), _v_γ(1) and _v_ω(1) are the causal states for the gamma and syllable units from the top level (Eqs. (17) and (18)). P_fγω_ is defined from the spectrotemporal patterns ST_fγω_ associated with each syllable as follows:

$$P_{f\gamma\omega} = \sum_{i=1}^{6} D_{fi}\, ST_{i\gamma\omega} - \sum_{i=1}^{6} W_{fi}\tanh\left(ST_{i\gamma\omega}\right) \quad (22)$$

Because the vector _I__f_ determines the global
attractor, sequential activation of the gamma units makes the global attractor change continuously over time and generate the pattern corresponding to syllable _ω_ when _v_ω(1) = 1 and _v_ω′(1) = 0 for all other syllables ω′. The outputs of this level are the states of the Hopfield network, which predict the activity of the frequency channels in the input, and the causal state associated with
the slow amplitude modulation (relayed from the top level):

$$v_f^{(0)} = \mathbf{x}^{(1)} + \eta_f^{(1)}, \qquad v_A^{(0)} = v_A^{(1)} + \eta_A^{(1)} \quad (23)$$

These quantities are compared with the slow amplitude modulation (_S_(t)) and the frequency channels (_F__f_(t)) in the input signal:

$$v_A^{(0)} = S(t), \qquad v_f^{(0)} = F_f(t) \quad (24)$$

The discrepancy between top-down predictions and sensory input is propagated
back in the hierarchy to update hidden state and causal state estimates so that prediction errors at every level of the hierarchy are minimised. The values of all parameters used in the model, as well as the precisions for hidden and causal states at both levels, are presented in Tables 3 and 2, respectively.
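A sketch of the generative (top-down) direction of this bottom level is given below, assuming the spectrotemporal patterns P from Eq. (22) and top-level causal-state trajectories; here D and W are placeholders, whereas in the model they are designed so that the global attractor tracks the input term I.

```python
import numpy as np

def generate_spectrogram(P, v_gamma, v_omega, kappa1=1.0, dt=0.001):
    """Run the bottom-level Hopfield network (Eqs. 20-22) generatively.
    P: (6, 8, n_syl) patterns; v_gamma: (T, 8) and v_omega: (T, n_syl)
    causal states from the top level. D, W and the time constants are
    stand-in values for illustration."""
    D = np.eye(6)                          # assumed diagonal self-connectivity
    W = 0.1 * np.random.randn(6, 6)        # placeholder asymmetric coupling
    x = np.zeros(6)
    out = np.zeros((v_gamma.shape[0], 6))
    for t in range(v_gamma.shape[0]):
        I = np.einsum('g,w,fgw->f', v_gamma[t], v_omega[t], P)   # Eq. (21)
        x += dt * kappa1 * (-D @ x + W @ np.tanh(x) + I)         # Eq. (20)
        out[t] = x                         # predicted 6-channel spectrogram
    return out
```

RESETS/TRIGGERS AND MODEL VARIANTS To ensure that predictions are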
temporally aligned with the input, the model needs to align the gamma network with syllable onsets. Moreover, ideally, evidence accumulation should be reset before syllable onset. Although
both resets could in principle be driven by prediction errors, our basic model also involves explicit resets. When present, the trigger to reset gamma units (denoted by _T_γ in Eqs. (10) and
(11)) was driven either by exogenous theta signalled onset triggers (referred to as theta triggers _T_θ) or by explicitly provided onsets _Tr_. Theta-triggers _T_θ: A Gaussian pulse was
generated whenever the exogenous theta oscillation reached a specific phase:

$$T_\theta = \exp\left(-\frac{\left(q_{1n} + 1\right)^2 + q_{2n}^2}{2\sigma^2}\right), \qquad q_{1n} = \frac{q_1}{\sqrt{q_1^2 + q_2^2}}, \quad q_{2n} = \frac{q_2}{\sqrt{q_1^2 + q_2^2}} \quad (25)$$

When present, the reset to syllable units (_T_ω, Eq. (16)) was driven by the model's
knowledge about syllable spectral structure (e.g. knowledge that each syllable is a sequence of 8 spectral target points). Since the last (eighth) gamma unit signals the end of the syllable,
it can define a trigger that we refer to as the internal trigger (denoted _T_int):

$$T_{int} = y_8 \quad (26)$$

In summary, to reset gamma units the model uses theta triggers, which
are derived from the slow amplitude modulation of the sound waveform, whereas to reset syllable units the model uses internal information about the spectral structure of syllables (_T_int).
To explore the relative importance of each resetting mechanism for the overall performance of the model, we compared different model variants (Table 1). MODEL OUTPUT To define the syllables
identified by the model, we considered the time average of the causal state (_v_ω, Eq. (18)) of each of the syllable units taken within the boundaries defined by the gamma sequence
(Supplementary Fig. 2). We calculated for how long the unit with the highest average activation between the internally signalled boundaries corresponds to the syllable in the input at that time. Thus, for each sentence, we calculate the percentage of the total sentence duration that was correctly identified by the model (Supplementary Fig. 2).
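A minimal scoring sketch under these definitions (function and argument names are hypothetical; per-millisecond labels assumed):

```python
import numpy as np

def percent_correct(v_omega, windows, true_labels):
    """Within each window between consecutive activations of the first
    gamma unit, pick the syllable unit with the highest mean causal
    state (the winning syllable) and count the samples for which it
    matches the true per-sample syllable labels."""
    correct = 0
    for start, stop in windows:
        winner = v_omega[start:stop].mean(axis=0).argmax()
        correct += np.sum(true_labels[start:stop] == winner)
    return 100.0 * correct / len(true_labels)
```

STATISTICAL ANALYSES As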
described in the previous section, a single number (% correctly identified sentence duration) describes the performance of the model for each sentence. Simulations were run on 210 sentences,
and the performance of each model architecture is thus described by a vector of 210 numbers. The non-parametric Wilcoxon signed-rank test for repeated measures was used to compare models’
performance. The Bonferroni correction was used to control for multiple comparisons. Supplementary Table 1 reports p-values for pairwise comparisons for the model variants presented in Fig.
3. The table also includes the adjusted α-value. For the simulations with compressed speech, we only compared the performance of the two model variants (A and B) for each compression factor separately, using a paired t-test. For the simulations presented in Figs. 4 and 6, Cohen's _d_ was calculated as the difference between the means of each group divided by the standard deviation of the paired differences.
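The comparisons can be sketched with SciPy as follows (the Bonferroni step is shown here as a simple p-value multiplication for a given number of comparisons):

```python
import numpy as np
from scipy import stats

def compare_variants(perf_a, perf_b, n_comparisons=1):
    """Paired comparison of per-sentence performance vectors: Wilcoxon
    signed-rank test with Bonferroni adjustment, plus Cohen's d computed
    from the paired differences, as described in the text."""
    _, p = stats.wilcoxon(perf_a, perf_b)
    diffs = np.asarray(perf_a) - np.asarray(perf_b)
    d = diffs.mean() / diffs.std(ddof=1)
    return min(1.0, p * n_comparisons), d
```

Finally, to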
estimate the chance level, the following steps were performed: for each sentence in the database, we mimicked the sequence of model-selected syllables (Supplementary Fig. 2, panel C) by randomly choosing (1) the duration of each "selected" syllable from the syllable duration distribution in Fig. 2a and (2) its identity from the syllables of the corresponding sentence. Thus, we calculated what the model's performance would be if syllable identity and duration were selected by chance. For each sentence, the procedure was repeated 1000 times and the mean
performance value was stored (a chance level for that sentence). Furthermore, the whole procedure was repeated 1000 times on all 220 sentences in the database, and the median value of the resulting distribution was selected as the chance level for the whole dataset, which was equal to 6.9%.
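A Monte-Carlo sketch of this chance estimate (array formats, names and the generator seed are assumptions):

```python
import numpy as np

def chance_level(sentences, duration_pool, n_rep=1000, seed=0):
    """sentences: list of per-ms true syllable labels (int arrays);
    duration_pool: corpus syllable durations in ms. Fill each sentence
    with randomly chosen identities and durations, score the overlap
    with the true labels, and return the median per-sentence mean."""
    rng = np.random.default_rng(seed)
    chance = []
    for labels in sentences:
        scores = []
        for _ in range(n_rep):
            t, correct = 0, 0
            while t < len(labels):
                dur = int(rng.choice(duration_pool))     # random duration
                syl = rng.choice(np.unique(labels))      # random identity
                correct += np.sum(labels[t:t + dur] == syl)
                t += dur
            scores.append(100.0 * correct / len(labels))
        chance.append(np.mean(scores))
    return np.median(chance)
```

MODEL COMPARISON We have calculated the Bayesian information criterion value based on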
the posterior probability of the true syllable sequence S (Supplementary Fig. 2d) for each sentence and model variant m: _p_(S|m). This quantity also represents the likelihood of the
model given the input (true syllables that the model aims to recover). Similar to the syllable recognition accuracy presented in Fig. 3, the likelihood value is based on the dynamics of the
second level causal states (Supplementary Fig. 2) but it also takes into account the precisions (confidence) of the estimated dynamics of each variable (e.g. dynamics of syllable and gamma
units in Supplementary Fig. 2a, b). To derive the BIC, we first calculated model m likelihood for each sentence _i_, and each moment of time (for convenience we do not write the explicit
time dependence of _D_ and Νμ on the right-hand side of the equations below): $$\begin{array}{l}p\left( {{\mathbf{s}}_i\left( t \right)|m} \right) = {\int} {dz\,p\left(
{{\mathbf{s}}_i,{\mathbf{z}}|m_i} \right)} \equiv \frac{{\sqrt {\det \,D} }}{{\sqrt {(2\pi )^d} }}{\int} {dz\,\exp \left[ { - \frac{1}{2}\left( {\left( {{\mathbf{\upsilon }} -
{\mathbf{\upsilon }}_\mu } \right)^TD\left( {{\mathbf{\upsilon }} - {\mathbf{\upsilon }}_\mu } \right)} \right)} \right]} \\ D \equiv \left[ {\begin{array}{*{20}{c}} A & C \\ {C^T} &
B \end{array}} \right],\,\upsilon \equiv \left[ {\begin{array}{*{20}{c}} {\mathbf{s}} \\ {\mathbf{z}} \end{array}} \right]\end{array}$$ (27) S_i_ is the vector of syllable units as they
appear in the input sentence _i_ at time _t_ (Supplementary Fig. 2d). _D_ is the overall conditional precision matrix, which we separate into the components corresponding to the syllable
units _A_, components corresponding to the other causal states _B_, and their interaction terms _C_. Ν represents the vector of causal states in the second level and Νμ the model’s estimated
means. It contains the syllable units S (mean values Sμ, Supplementary Fig. 2a) and all other causal states represented here by Z (mean values Zμ) for variants B, D, F Z represents the
dynamics of the gamma units, e.g. Supplementary Fig. 2b and, for variants A, D and E, the gamma units and the slow amplitude modulation. After integration over Z, the likelihood of the model
given the true syllables is: $$p\left( {{\mathbf{s}}_i\left( t \right)|m_i} \right) = \frac{{\sqrt {\det \,D} }}{{\sqrt {\left( {2\pi } \right)^{n_s}} \sqrt {\det B} }}\exp \left[ { -
\frac{1}{2}\left( {{\mathbf{s}}_i - {\mathbf{s}}_\mu } \right)^T\left( {A - CB^{ - 1}C^T} \right)\left( {{\mathbf{s}}_i - {\mathbf{s}}_\mu } \right)} \right]$$ (28) This value was calculated
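Equation (28) follows from the standard identity for marginalising a Gaussian given in precision (information) form: completing the square in Z and integrating yields

$$\int d\mathbf{z}\, \exp\left[ -\frac{1}{2} \left( \upsilon - \upsilon_\mu \right)^T D \left( \upsilon - \upsilon_\mu \right) \right] = \sqrt{\frac{(2\pi)^{n_z}}{\det B}}\, \exp\left[ -\frac{1}{2} \left( \mathbf{s} - \mathbf{s}_\mu \right)^T \left( A - C B^{-1} C^T \right) \left( \mathbf{s} - \mathbf{s}_\mu \right) \right]$$

so that the marginal precision of the syllable units is the Schur complement $A - CB^{-1}C^T$ and, since $\det D = \det B \cdot \det(A - CB^{-1}C^T)$ and $d = n_s + n_z$, the normalisation of Eq. (27) reduces to the prefactor of Eq. (28).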
This value was calculated for each moment of time; to obtain the value for the whole sentence, we computed the time-average of log p(s_i(t) | m) within each syllable and summed across all syllables in the sentence:

$$\log p\left( \mathbf{s}_i \mid m \right) \approx \sum_{j = 1}^{N_i} \sum_k \Delta t\, \frac{\log p\left( \mathbf{s}_i(t_k) \mid m \right)}{T_j}$$ (29)

where T_j is the duration of syllable j in sentence i, N_i is the number of syllables in the sentence, Δt = 1 ms, and the inner sum runs over the time bins t_k within syllable j. Equations (28) and (29) give the approximate log-likelihood of model m given sentence i. The value for the whole dataset is therefore:

$$\log p\left( \mathbf{s} \mid m \right) = \sum_i \log p\left( \mathbf{s}_i \mid m \right)$$ (30)

This was used to compute the Bayesian information criterion (BIC), which also takes into account the number of free parameters in the model:

$$\mathrm{BIC}\left( m \right) = \log p\left( \mathbf{s} \mid m \right) - \frac{N_p\left( m \right)}{2} \log N_{\mathrm{sentences}}$$ (31)

where N_p(m) is the number of free parameters of model variant m, which includes the state precision values (Table 2), the number of resets, and the presence of gamma rate information. Thus, the most complex variant A has 14 (precisions) + 2 (reset of gamma units by theta-signalled onsets, and reset of syllable units by the last gamma unit) + 1 (rate) = 17 parameters. The simplest variant F, which lacks the precision values for theta and the slow amplitude modulation (−(1 + 3)), the resets of the syllable and gamma units (−2), and a preferred gamma rate (−1), has only 10 parameters.
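Given the per-millisecond values of Eq. (28), Eqs. (29)–(31) amount to a weighted sum; a minimal MATLAB sketch (all variable names hypothetical):

```matlab
% Dataset-level BIC for one model variant (sketch).
% logpst{i} : log p(s_i(t_k)|m) of Eq. (28), sampled at dt = 1 ms, sentence i
% sylOf{i}  : syllable index j of each 1-ms bin k of sentence i
% Tsyl{i}   : duration T_j (ms) of each syllable j of sentence i
% Np        : number of free parameters of this variant (Table 4)
dt     = 1;                                  % ms
logLik = zeros(1, numel(logpst));
for i = 1:numel(logpst)
    w = dt ./ Tsyl{i}(sylOf{i});             % per-bin weight dt/T_j, Eq. (29)
    logLik(i) = sum(w .* logpst{i});         % sum over syllables and bins
end
bic = sum(logLik) - (Np / 2) * log(numel(logLik));   % Eqs. (30)-(31)
```

With the sign convention of Eq. (31), higher BIC values indicate a better model.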
The number of parameters for each model variant is given in Table 4.

REPORTING SUMMARY

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY

Source data files are provided for Figs. 3, 4 and 6. The subset of the TIMIT dataset41 used for simulations and the corresponding syllable boundaries generated with Tsylb263 can be found at: https://github.com/sevadah/precoss/tree/master/data_construction. Source data are provided with this paper.

CODE AVAILABILITY

Simulations were performed with the DEM toolbox in SPM64, using MATLAB 2018b (The MathWorks, Inc., Natick, MA, USA). Custom code is available at: https://github.com/sevadah/precoss.
REFERENCES

1. Buzsáki, G. & Draguhn, A. Neuronal oscillations in cortical networks. _Science_ 304, 1926–1929 (2004).
2. Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I. & Schroeder, C. E. Entrainment of neuronal oscillations as a mechanism of attentional selection. _Science_ 320, 110–113 (2008).
3. Wang, X. J. Neurophysiological and computational principles of cortical rhythms in cognition. _Physiol. Rev._ 90, 1195–1268 (2010).
4. Hyafil, A., Giraud, A. L., Fontolan, L. & Gutkin, B. Neural cross-frequency coupling: connecting architectures, mechanisms, and functions. _Trends Neurosci._ 38, 725–740 (2015).
5. Canolty, R. T. et al. High gamma power is phase-locked to theta oscillations in human neocortex. _Science_ 313, 1626–1628 (2006).
6. Ghitza, O. Linking speech perception and neurophysiology: speech decoding guided by cascaded oscillators locked to the input rhythm. _Front. Psychol._ 2, 130 (2011).
7. Giraud, A. L. & Poeppel, D. Cortical oscillations and speech processing: emerging computational principles and operations. _Nat. Neurosci._ 15, 511–517 (2012).
8. Hyafil, A., Fontolan, L., Kabdebon, C., Gutkin, B. & Giraud, A. L. Speech encoding by coupled cortical theta and gamma oscillations. _eLife_ 4, 1–45 (2015).
9. Rimmele, J. M., Zion Golumbic, E., Schröger, E. & Poeppel, D. The effects of selective attention and speech acoustics on neural speech-tracking in a multi-talker scene. _Cortex_ 68, 144–154 (2015).
10. Klimovich-Gray, A. et al. Balancing prediction and sensory input in speech comprehension: the spatiotemporal dynamics of word recognition in context. _J. Neurosci._ 39, 519–527 (2019).
11. Donhauser, P. W. & Baillet, S. Two distinct neural timescales for predictive speech processing. _Neuron_ https://doi.org/10.1016/j.neuron.2019.10.019 (2019).
12. Chao, Z. C., Takaura, K., Wang, L., Fujii, N. & Dehaene, S. Large-scale cortical networks for hierarchical prediction and prediction error in the primate brain. _Neuron_ 100, 1252–1266.e3 (2018).
13. Fontolan, L., Morillon, B., Liegeois-Chauvel, C. & Giraud, A. L. The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex. _Nat. Commun._ 5, 4694 (2014).
14. Pefkou, M., Arnal, L. H., Fontolan, L. & Giraud, A. L. θ-band and β-band neural activity reflects independent syllable tracking and comprehension of time-compressed speech. _J. Neurosci._ 37, 7930–7938 (2017).
15. Park, H., Ince, R. A. A., Schyns, P. G., Thut, G. & Gross, J. Frontal top-down signals increase coupling of auditory low-frequency oscillations to continuous speech in human listeners. _Curr. Biol._ 25, 1649–1653 (2015).
16. Lewis, A. G., Schoffelen, J. M., Schriefers, H. & Bastiaansen, M. A predictive coding perspective on beta oscillations during sentence-level language comprehension. _Front. Hum. Neurosci._ 10, 85 (2016).
17. Sedley, W. et al. Neural signatures of perceptual inference. _eLife_ 5, e11476 (2016).
18. Mumford, D. On the computational architecture of the neocortex. II. The role of cortico-cortical loops. _Biol. Cybern._ 66, 241–251 (1992).
19. Rao, R. P. N. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. _Nat. Neurosci._ 2, 79–87 (1999).
20. Friston, K. & Kiebel, S. Predictive coding under the free-energy principle. _Philos. Trans. R. Soc. B Biol. Sci._ 364, 1211–1221 (2009).
21. Bastos, A. M. et al. Canonical microcircuits for predictive coding. _Neuron_ 76, 695–711 (2012).
22. Liberman, A. M., Cooper, F. S., Shankweiler, D. P. & Studdert-Kennedy, M. Perception of the speech code. _Psychol. Rev._ 74, 431–461 (1967).
23. Halle, M. & Stevens, K. Speech recognition: a model and a program for research. _IRE Trans. Inf. Theory_ 8, 155–159 (1962).
24. Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. _Trends Neurosci._ 27, 712–719 (2004).
25. Poeppel, D., Idsardi, W. J. & van Wassenhove, V. Speech perception at the interface of neurobiology and linguistics. _Philos. Trans. R. Soc. B Biol. Sci._ 363, 1071–1086 (2008).
26. Yildiz, I. B., von Kriegstein, K. & Kiebel, S. J. From birdsong to human speech recognition: Bayesian inference on a hierarchy of nonlinear dynamical systems. _PLoS Comput. Biol._ 9, e1003219 (2013).
27. Giraud, A. L. & Arnal, L. H. Hierarchical predictive information is channeled by asymmetric oscillatory activity. _Neuron_ 100, 1022–1024 (2018).
28. Marr, D. C. & Poggio, T. From understanding computation to understanding neural circuitry. _Neurosci. Res. Program Bull._ 15, 470–488 (1977).
29. Kamakshi Prasad, V., Nagarajan, T. & Murthy, H. A. Automatic segmentation of continuous speech using minimum phase group delay functions. _Speech Commun._ 42, 429–446 (2004).
30. Greenberg, S. Speaking in shorthand - a syllable-centric perspective for understanding pronunciation variation. _Speech Commun._ 29, 159–176 (1999).
31. Hyafil, A. & Cernak, M. Neuromorphic based oscillatory device for incremental syllable boundary detection. In _Proc. INTERSPEECH 2015_ 1191–1195 (ISCA, 2015).
32. Wacongne, C. et al. Evidence for a hierarchy of predictions and prediction errors in human cortex. _Proc. Natl Acad. Sci. USA_ 108, 20754–20759 (2011).
33. Gagnepain, P., Henson, R. N. & Davis, M. H. Temporal predictive codes for spoken words in auditory cortex. _Curr. Biol._ 22, 615–621 (2012).
34. Lewis, A. G. & Bastiaansen, M. A predictive coding framework for rapid neural dynamics during sentence-level language comprehension. _Cortex_ 68, 155–168 (2015).
35. Arnal, L. H. & Giraud, A. L. Cortical oscillations and sensory predictions. _Trends Cogn. Sci._ 16, 390–398 (2012).
36. Arnal, L. H., Doelling, K. B. & Poeppel, D. Delta-beta coupled oscillations underlie temporal prediction accuracy. _Cereb. Cortex_ 25, 3077–3085 (2015).
37. Chi, T., Ru, P. & Shamma, S. A. Multiresolution spectrotemporal analysis of complex sounds. _J. Acoust. Soc. Am._ 118, 887–906 (2005).
38. Ermentrout, G. B. & Kopell, N. Parabolic bursting in an excitable system coupled with a slow oscillation. _SIAM J. Appl. Math._ 46, 233–253 (1986).
39. Kösem, A., Basirat, A., Azizi, L. & van Wassenhove, V. High-frequency neural activity predicts word parsing in ambiguous speech streams. _J. Neurophysiol._ 116, 2497–2512 (2016).
40. Schroeder, C. E. & Lakatos, P. Low-frequency neuronal oscillations as instruments of sensory selection. _Trends Neurosci._ 32, 9–18 (2009).
41. Garofolo, J. et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download (Linguistic Data Consortium, Philadelphia, 1993).
42. Nourski, K. V. et al. Temporal envelope of time-compressed speech represented in the human auditory cortex. _J. Neurosci._ 29, 15564–15574 (2009).
43. Ghitza, O. & Greenberg, S. On the possible role of brain rhythms in speech perception: intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. _Phonetica_ 66, 113–126 (2009).
44. Schwarz, G. Estimating the dimension of a model. _Ann. Stat._ 6, 461–464 (1978).
45. Yildiz, I. B. & Kiebel, S. J. A hierarchical neuronal model for generation and online recognition of birdsongs. _PLoS Comput. Biol._ 7, 1–18 (2011).
46. Mermelstein, P. Automatic segmentation of speech into syllabic units. _J. Acoust. Soc. Am._ 58, 880–883 (1975).
47. Luo, H. & Poeppel, D. Cortical oscillations in auditory perception and speech: evidence for two temporal windows in human auditory cortex. _Front. Psychol._ 3, 170 (2012).
48. Lisman, J. E. & Jensen, O. The theta-gamma neural code. _Neuron_ 77, 1002–1016 (2013).
49. Lam, N. H. L., Schoffelen, J. M., Uddén, J., Hultén, A. & Hagoort, P. Neural activity during sentence processing as reflected in theta, alpha, beta, and gamma oscillations. _Neuroimage_ 142, 43–54 (2016).
50. Ding, N. et al. Characterizing neural entrainment to hierarchical linguistic units using electroencephalography (EEG). _Front. Hum. Neurosci._ 11, 481–490 (2017).
51. Martin, A. E. & Doumas, L. A. A. A mechanism for the cortical computation of hierarchical linguistic structure. _PLoS Biol._ 15, 1–23 (2017).
52. Venezia, J. H., Thurman, S. M., Richards, V. M. & Hickok, G. Hierarchy of speech-driven spectrotemporal receptive fields in human auditory cortex. _Neuroimage_ 186, 647–666 (2019).
53. de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L. & Theunissen, F. E. The hierarchical cortical organization of human speech processing. _J. Neurosci._ 37, 6539–6557 (2017).
54. Bouton, S. et al. Focal versus distributed temporal cortex activity for speech sound category assignment. _Proc. Natl Acad. Sci. USA_ 115, E1299–E1308 (2018).
55. Friston, K. J., Trujillo-Barreto, N. & Daunizeau, J. DEM: a variational treatment of dynamic systems. _Neuroimage_ 41, 849–885 (2008).
56. Friston, K. J. Variational filtering. _Neuroimage_ 41, 747–766 (2008).
57. Friston, K., Stephan, K., Li, B. & Daunizeau, J. Generalised filtering. _Math. Probl. Eng._ 2010, e621670 (2010).
58. Li, J., Deng, L., Gong, Y. & Haeb-Umbach, R. An overview of noise-robust automatic speech recognition. _IEEE Trans. Audio Speech Lang. Process._ 22, 745–777 (2014).
59. Prabhavalkar, R. et al. A comparison of sequence-to-sequence models for speech recognition. In _Proc. INTERSPEECH 2017_ 939–943 (ISCA, 2017).
60. Romera, M. et al. Vowel recognition with four coupled spin-torque nano-oscillators. _Nature_ 563, 230–234 (2018).
61. Hirsch, H. G., Hellwig, K. & Dobler, S. Speech recognition at multiple sampling rates. In _Proc. EUROSPEECH 2001 - 7th European Conference on Speech Communication and Technology_ (ISCA, 2001).
62. Davis, M. H. & Scharenborg, O. Speech perception by humans and machines. In _Speech Perception and Spoken Word Recognition_ 181–203 (Psychology Press, 2016). https://doi.org/10.4324/9781315772110.
63. Fisher, W. M. Tsylb2-1.1 Syllabification Software. http://www.nist.gov/speech/tools (1996).
64. SPM - Statistical Parametric Mapping. https://www.fil.ion.ucl.ac.uk/spm/.
65. Rabinovich, M. I., Varona, P., Selverston, A. I. & Abarbanel, H. D. I. Dynamical principles in neuroscience. _Rev. Mod. Phys._ 78, 1213–1265 (2006).
66. Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. _Proc. Natl Acad. Sci. USA_ 79, 2554–2558 (1982). https://doi.org/10.1073/pnas.79.8.2554

ACKNOWLEDGEMENTS

This work was funded by a grant from the Swiss National Science Foundation (#320030_163040) and the National Centre of Competence in Research - Evolving Language. The authors thank Julien Diard and Jaime Delgado for helpful discussions and feedback.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

Department of Basic Neurosciences, University of Geneva, Biotech Campus, 9 Chemin des Mines, C.P. 87, 1211 Genève, Switzerland: Sevada Hovsepyan, Itsaso Olasagasti & Anne-Lise Giraud.

CONTRIBUTIONS

A.L.G. and I.O. designed the study; S.H. carried out the study under the guidance of I.O. and A.L.G.; S.H., I.O. and A.L.G. wrote the manuscript.
CORRESPONDING AUTHOR

Correspondence to Sevada Hovsepyan.

ETHICS DECLARATIONS

COMPETING INTERESTS

The authors declare no competing interests.

ADDITIONAL INFORMATION

PEER REVIEW INFORMATION _Nature Communications_ thanks Andrea Martin and Jonas Obleser for their contribution to the peer review of this work. Peer reviewer reports are available.

PUBLISHER'S NOTE Springer Nature
remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION

Peer Review File; Supplementary Information; Reporting Summary; Source Data.

RIGHTS AND PERMISSIONS

OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation,
distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and
indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to
the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will
need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

ABOUT THIS ARTICLE

CITE THIS ARTICLE

Hovsepyan, S., Olasagasti, I. & Giraud, A. L. Combining predictive coding and neural oscillations enables online syllable recognition in natural speech. _Nat. Commun._ 11, 3117 (2020). https://doi.org/10.1038/s41467-020-16956-5

Received: 15 January 2019; Accepted: 04 June 2020; Published: 19 June 2020.