Generic predictions of output probability based on complexities of inputs and outputs

ABSTRACT

For a broad class of input-output maps, arguments based on the coding theorem from algorithmic information theory (AIT) predict that simple (low Kolmogorov complexity) outputs are


exponentially more likely to occur upon uniform random sampling of inputs than complex outputs are. Here, we derive probability bounds that are based on the complexities of the inputs as


well as the outputs, rather than just on the complexities of the outputs. The more that outputs deviate from the coding theorem bound, the lower the complexity of their inputs. Since the


number of low complexity inputs is limited, this behaviour leads to an effective lower bound on the probability. Our new bounds are tested for an RNA sequence to structure map, a finite


state transducer and a perceptron. The success of these new methods opens avenues for AIT to be more widely used.

INTRODUCTION

Deep links between physics and theories of computation1,2 are being increasingly exploited


to uncover new fundamental physics and to provide novel insights into theories of computation. For example, advances in understanding quantum entanglement are often expressed in


sophisticated information theoretic language, while providing new results in computational complexity theory such as polynomial time algorithms for integer factorization3. These connections


are typically expressed in terms of Shannon information, with its natural analogy with thermodynamic entropy. There is, however, another branch of information theory, called algorithmic


information theory (AIT)4, which is concerned with the information content of individual objects. It has been much less applied in physics (although notable exceptions occur, see5 for a


recent overview). Reasons for this relative lack of attention include that AIT’s central concept, the Kolmogorov complexity _K__U_(_x_) of a string _x_, defined as the length of the shortest


program that generates _x_ on an optimal reference universal Turing machine (UTM) _U_, is formally uncomputable due to its link to the famous halting problem of UTMs6 — see7 for technical


details. Moreover, many important results, such as the invariance theorem, which states that for two UTMs _U_ and _W_ the Kolmogorov complexities are equivalent up to an additive constant, \({K}_{U}(x)={K}_{W}(x)+{\mathcal{O}}(1)\), hold asymptotically up to \({\mathcal{O}}(1)\) terms that are independent of _x_ but are not always well understood, and are therefore hard to control. Another reason applications


of AIT to many practical problems have been hindered can be understood in terms of hierarchies of computing power. For example, one of the oldest such categorisations, the Chomsky


hierarchy8, ranks automata into four different classes, of which the UTMs are the most powerful, and finite state machines (FSMs) are the least. Many key results in AIT are derived by


exploiting the power of UTMs. Interestingly, if physical processes can be mapped onto UTMs, then certain properties can be shown to be uncomputable9,10. However, many problems in physics are


fully computable, and therefore lower on the Chomsky hierarchy than UTMs. For example, finite Markov processes are equivalent to FSMs, and RNA secondary structure (SS) folding algorithms


can be recast as context-free grammars, the second level in the hierarchy. Thus, an important cluster of questions for applications of AIT revolves around extending its methods to processes


lower in computational power than UTMs. To explore ways of moving beyond these limitations and towards practical applications, we consider here one of the most iconic results of AIT, namely


Levin’s coding theorem11 (see also the work of Solomonoff12, who had a version of the a priori distribution), which predicts that when programs are chosen at random (e.g. a binary string program constructed by coin flips), the probability _P__U_(_x_) that a universal Turing machine (UTM) generates output _x_ can be bounded as \({2}^{-K(x)}\le P(x)\le {2}^{-K(x)+{\mathcal{O}}(1)}\).
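As a simple worked illustration of the strength of this bias (our own example, not a claim from refs. 11,12): for two outputs with complexities _K_(_x_1) = 10 bits and _K_(_x_2) = 30 bits, the two sides of the coding theorem combine to give $$\frac{P({x}_{1})}{P({x}_{2})}\ge \frac{{2}^{-K({x}_{1})}}{{2}^{-K({x}_{2})+{\mathcal{O}}(1)}}={2}^{K({x}_{2})-K({x}_{1})-{\mathcal{O}}(1)}={2}^{20-{\mathcal{O}}(1)},$$ so the simpler output is favoured by roughly six orders of magnitude.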


Technically, because we require the programs of the Turing machine to be prefix-free, we restrict ourselves to what are called prefix Turing machines. Given this profound prediction of a


general exponential bias towards simplicity (low Kolmogorov complexity) one might have expected widespread study and applications in science and engineering. This has not been the case


because the theorem unfortunately suffers from the general issues of AIT described above. Nevertheless, it has recently been shown13,14 that a related exponential bias towards low complexity


outputs obtains for a range of non-universal input-output maps _f_: _I_ → _O_ that are lower on the Chomsky hierarchy than UTMs. In particular, an upper bound on the probability _P_(_x_)


that an output obtains upon uniform random sampling of inputs, $$P(x)\le {2}^{-a\widetilde{K}(x)-b}\ $$ (1) was recently derived13 using a computable approximation \(\widetilde{K}(x)\) to


the Kolmogorov complexity of _x_, typically calculated using lossless compression techniques. Here _a_ and _b_ are constants that are independent of _x_ and which can often be determined


from some basic information about the map. The so-called simplicity bias bound (1) holds for computable maps _f_ where the number of inputs _N__I_ is much greater than the number of


outputs _N__O_ and the map _f_ is simple, meaning that asymptotically \(K(f)+K(n)\ll K(x)+{\mathcal{O}}(1)\) for a typical output _x_, where _n_ specifies the size of _N__I_, e.g. _N__I_ = 


2^n. It is important to distinguish the map _f_ (which we assume is simple) from the initial conditions15, i.e. a program for _x_ (which may or may not be simple). Equation (1) typically


works better for larger _N__I_ and _N__O_. Approximating the true Kolmogorov complexity also means that the bound shouldn’t work for maps where a significant fraction of outputs have


complexities that are not qualitatively captured by compression-based approximations. For example, many pseudo-random number generators are designed to produce outputs that appear to be


complex when measured by compression or other types of Kolmogorov complexity approximations. Yet these outputs must have low _K_(_x_) because they are generated by relatively simple


algorithms with short descriptions. Nevertheless, it has been shown that the bound (1) works remarkably well for a wide class of input-output maps, ranging from the sequence to RNA secondary


structure map, to systems of coupled ordinary differential equations, to a stochastic financial trading model, to the parameter-function map for several classes of deep neural


networks13,16,17. The simplicity bias bound (1) predicts that high _P_(_x_) outputs will be simple, and that complex outputs will have a low _P_(_x_). But, in sharp contrast to the full AIT


coding theorem, it doesn’t have a lower bound, allowing low \(\widetilde{K}(x)\) outputs with low _P_(_x_) that are far from the bound. Indeed, this behaviour is generically observed for


many (non-universal) maps13,16 (see also Fig. 1), but should not be the case for UTMs that obey the full coding theorem. Understanding the behaviour of outputs far from the bound should shed


light on fundamental differences between UTMs and maps with less computational power that are lower on the Chomsky hierarchy, and may open up avenues for wider applications of AIT in


physics.

RESULTS

With this challenge in mind, we take an approach that contrasts with the traditional coding theorem of AIT and with the simplicity bias bound, both of which consider only the


complexity of the outputs. Instead, we derive bounds that also take into account the complexity of the inputs that generate a particular output _x_. While this approach is not possible for


UTMs, since the halting problem means one cannot enumerate all inputs4, and so averages over input complexity cannot be calculated, it can be achieved for non-UTM maps. Among our main


results, we show that the further outputs are from the simplicity bias bound (1), the lower the complexity of the set of inputs. Since, by simple counting arguments, most strings are


complex4, the cumulative probability of outputs far from the bound is therefore limited. We also show that by combining the complexity of the output with that of the inputs, we can obtain


better bounds on and estimates of _P_(_x_). Whether such bounds nevertheless have real predictive power needs to be tested empirically. Because input based bounds typically need exhaustive


sampling, full testing is only possible for smaller systems, which restricts us here to maps where finite size effects may still play a role13. We test our bounds on three systems, the


famous RNA sequence to secondary structure map (which falls into the context-free class in the Chomsky hierarchy), here for a relatively small size with length _L_ = 15 sequences, a finite


state transducer (FST), a very simple input-output map that is lowest on the Chomsky hierarchy8, with length _L_ = 30 binary inputs, and finally the parameter-function map16,17 of a


perceptron18 with discretized weights to allow complexities of inputs to be calculated. The perceptron plays a key role in deep learning neural network architectures19. Nevertheless, as can


be seen in Fig. 1(a–c), all three maps exhibit the simplicity bias predicted by Eq. (1), even if they are relatively small. In ref. 13, much cleaner simplicity bias behaviour can be observed for


larger RNA maps, but these are too big to exhaustively sample inputs. Similarly, cleaner simplicity bias behaviour occurs for the undiscretised perceptron17, but then it is hard to analyse


the complexity of the inputs. Figure 1(a–c) shows that the complexity of the input strings that generate each output _x_ decreases for further distances from the simplicity bias bound. This


is the kind of phenomenon that we will attempt to explain. To study input based bounds, consider a map _f_: _I_ → _O_ between _N__I_ inputs and _N__O_ outputs that satisfies the


requirements for simplicity bias13. Let _f_(_p_) = _x_, where _p_ is some input program _p_ ∈ _I_ producing output _x_ ∈ _O_. For simplicity let _p_ ∈ {0, 1}^n, so that all inputs have length _n_ and _N__I_ = 2^n (this restriction can be relaxed later). Define _f_ −1(_x_) to be the set of all the inputs that map to _x_, so that the probability that _x_ obtains upon


sampling inputs uniformly at random is $$P(x)=\frac{| \,{f}^{-1}(x)| }{{2}^{n}}\ $$ (2) Any arbitrary input _p_ can be described using the following \({\mathcal{O}}(1)\) procedure13:


Assuming _f_ and _n_ are given, first enumerate all 2^n inputs and map them to outputs using _f_. The index of a specific input _p_ within the set _f_ −1(_x_) can be described using at most


\({{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )\) bits. In other words, this procedure identifies each input by first finding the output _x_ it maps to, and then finding its label within the set


_f_ −1(_x_). Given _f_ and _n_, an output _x_ = _f_(_p_) can be described using \(K(x| \,f,n)+{\mathcal{O}}(1)\) bits13. Thus, the Kolmogorov complexity of _p_, given _f_ and _n_ can be


bounded as: $$K(p| \,f,n)\le K(x| \,f,n)+{{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )+{\mathcal{O}}(1).$$ (3) We note that this bound holds in principle for all _p_, but that it is tightest for


\({K}_{\max }(p| x)\equiv {\max }_{p}\{K(p| \,f,n)\}\) for _p_ ∈ _f_ −1(_x_). More generally, we can expect these bounds to be fairly tight for the maximum complexity \({K}_{\max }(p| \,f,n)\) of inputs due to the following argument. First note that $${K}_{\max }(p| \,f,n)\ge {{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )+{\mathcal{O}}(1)$$ (4) because any set of | _f_ −1(_x_)|


different elements must have strings of at least this complexity. Next, $$K(x| \,f,n)\le K(p| \,f,n)+{\mathcal{O}}(1)$$ (5) because each _p_ can be used to generate _x_. Therefore: $$\max (K(x| \,f,n),{{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| ))\le {K}_{\max }(p| \,f,n)+{\mathcal{O}}(1),$$ (6) so the bound (3) cannot be too weak. In the worst case scenario, where \({K}_{\max }(p| n)\approx {{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )\approx K(x| \,f,n)\), the right hand side of the bound (3) is approximately twice the left hand side (up to additive \({\mathcal{O}}(1)\)


terms). It is tighter if either _K_(_x_| _f_, _n_) is small, or if _K_(_x_| _f_, _n_) is big relative to \({{\rm{\log }}}_{2}(| \,{f}^{-1}(x)| )\). As is often the case for AIT predictions,


the stronger the constraint/prediction, the more likely it is to be observed in practice, because, for example, the \({\mathcal{O}}(1)\) terms are less likely to drown out the effects. By


combining with Eq. (2), the bound (3) can be rewritten in two complementary ways. Firstly, a lower bound on _P_(_x_) can be derived of the form: $$P(x)\ge {2}^{-K(x| f,n)-[n-K(p| f,n)]+{\mathcal{O}}(1)}$$ (7) ∀ _p_ ∈ _f_ −1(_x_), which complements the simplicity bias upper bound (1). This bound is tightest for \({K}_{\max }(p| n)\). In ref. 13 it was shown that


\(P(x)\le {2}^{-K(x| f,n)+{\mathcal{O}}(1)}\) by using a similar counting argument to that used above, together with a Shannon-Fano-Elias code procedure. Similar results can be found in


standard works4,20. A key step is to move from the conditional complexity to one that is independent of the map and of _n_. If _f_ is simple, then the explicit dependence on _n_ and _f_ can


be removed by noting that, since \(K(x)\le K(x| \,f,n)+K(\,f)+K(n)+{\mathcal{O}}(1)\) with \(K(\,f)+K(n)\) small for a simple map, and \(K(x| \,f,n)\le K(x)+{\mathcal{O}}(1)\), we have \(K(x| \,f,n)\approx K(x)+{\mathcal{O}}(1)\). In Eq.


(1) this is further approximated as \(K(x| \,f,n)+{\mathcal{O}}(1)\approx a\widetilde{K}(x)+b\), leading to a practically useable upper bound. The same argument can be used to remove


explicit dependence on _n_ and _f_ for _K_(_p_| _f_, _n_). If we define a maximum randomness deficit \({\delta }_{\max }(x)=n-{K}_{\max }(p| n)\), then this tightest version of bound (7) can


be written in a simpler form as $$P(x)\ge {2}^{-a\widetilde{K}(x)-b-{\delta }_{\max }(x)+{\mathcal{O}}(1)}$$ (8) In Fig. 1(d–f) we plot this lower bound for all three maps studied.
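To make these quantities concrete, the following minimal Python sketch (our own illustration, not the code used for the figures) exhaustively enumerates a toy computable map, approximates complexities with a simple LZ-style phrase count rescaled in the spirit of Eq. (12), and compares \({{\rm{\log }}}_{2}P(x)\) with an output-only estimate and with the input-based lower bound of Eq. (8); the constants _a_ and _b_ are absorbed into the rescaling, so the numbers are only indicative.

```python
# Toy illustration of bounds (1) and (8) on an exhaustively enumerable map.
# The map f, the LZ-style phrase count and the linear rescaling are crude
# stand-ins chosen for brevity; they are not the maps or the exact K-tilde of the paper.
from collections import defaultdict
from itertools import product
from math import log2

n = 14  # input length, so there are 2^n inputs

def f(p):
    # toy computable map: AND of neighbouring bits (output has length n - 1)
    return "".join(str(int(a) & int(b)) for a, b in zip(p, p[1:]))

def lz_phrases(x):
    # number of phrases in a simple Lempel-Ziv (1976) style parsing, as a complexity proxy
    m, pos, c = len(x), 0, 0
    while pos < m:
        l = 1
        while pos + l <= m and x[pos:pos + l] in x[:pos + l - 1]:
            l += 1
        c += 1
        pos += l
    return c

def rescaled(raw, lo, hi, nbits):
    # linear rescaling to [0, nbits] bits, in the spirit of Eq. (12)
    return nbits * (raw - lo) / (hi - lo) if hi > lo else 0.0

inputs = ["".join(b) for b in product("01", repeat=n)]
preimages = defaultdict(list)          # f^{-1}(x) for every output x
for p in inputs:
    preimages[f(p)].append(p)

raw_in = {p: lz_phrases(p) for p in inputs}
lo_i, hi_i = min(raw_in.values()), max(raw_in.values())
raw_out = {x: lz_phrases(x) for x in preimages}
lo_o, hi_o = min(raw_out.values()), max(raw_out.values())

for x in sorted(preimages, key=lambda s: -len(preimages[s]))[:5]:
    P = len(preimages[x]) / 2 ** n                               # Eq. (2)
    K_out = rescaled(raw_out[x], lo_o, hi_o, n - 1)              # plays the role of a*K~(x)+b
    K_max = max(rescaled(raw_in[p], lo_i, hi_i, n) for p in preimages[x])
    delta_max = n - K_max                                        # randomness deficit of the inputs
    print(f"x={x}  log2 P={log2(P):7.2f}  upper ~{-K_out:7.2f}  lower ~{-K_out - delta_max:7.2f}")
```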


Throughout the paper, we use a scaled complexity measure, which ensures that \(\widetilde{K}(x)\) ranges between ≈0 and ≈_n_ bits, for strings of length _n_, as expected for Kolmogorov


complexity. See Methods for more details. When comparing the data in Fig. 1(d–f) to Fig. 1(a–c), it is clear that including the input complexities reduces the spread in the data for RNA and


the FST, although for the perceptron model the difference is less pronounced. This success suggests using the bound (8) as a predictor \(P(x)\approx {2}^{-K(x| f,n)-{\delta }_{\max }(x)}\),


with the additional constraint that ∑_x__P_(_x_) = 1 to normalise it. As can be seen in Fig. 1(d–f), this simple procedure works reasonably well, showing that the input complexity provides


additional predictive power to estimate _P_(_x_) from some very generic properties of the inputs and outputs. A second, complementary way that bound (3) can be expressed is in terms of how


far _P_(_x_) differs from the simplicity bias bound (1): $$[{{\rm{\log }}}_{2}({P}_{0}(x))-{{\rm{\log }}}_{2}(P(x))]\le [n-K(p| \,f,n)]+{\mathcal{O}}(1)$$ (9) where \({P}_{0}(x)={2}^{-K(x| \,f,n)}\approx {2}^{-a\widetilde{K}(x)-b}\) is the upper bound (1) shown in Fig. 1(a–c). For a random input _p_, with high probability we expect \(K(p| \,f,n)=n+{\mathcal{O}}(1)\)4. Thus, Eqs.


(7) and (9) immediately imply that large deviations from the simplicity bias bound (1) are only possible with highly non-random inputs with a large randomness deficit \({\delta }_{\max }(x)\). In Fig. 2(a–c) we directly examine bound (9), showing explicitly the prediction that a drop of probability _P_(_x_) by Δ bits from the simplicity bias bound (1) corresponds to a Δ


bit randomness deficit in the set of inputs. Simple counting arguments can be used to show that the number of non-random inputs is a small fraction of the total number of inputs21. For


example, for binary strings of length _n_, with _N__I_ = 2^n, the number of inputs with complexity _K_ = _n_ − _δ_ is approximately \({2}^{-\delta }{N}_{I}\). If we define a set \({\mathcal{D}}(f)\) of all outputs _x__i_ that satisfy \({{\rm{\log }}}_{2}({P}_{0}({x}_{i}))-{{\rm{\log }}}_{2}(P({x}_{i}))\ge \Delta \), i.e. the set of all outputs for which \({{\rm{\log }}}_{2}P(x)\) is


at least Δ bits below the simplicity bias bound (1), then this counting argument leads to the following cumulative bound: $$\sum _{x\in {\mathcal{D}}(f)}P(x)\le {2}^{-\Delta +1+{\mathcal{O}}(1)}$$ (10) which predicts that, upon randomly sampling inputs, most of the probability weight is for outputs with _P_(_x_) relatively close to the upper bound. There may be


many outputs that are far from the bound, but their cumulative probability drops off exponentially the further they are from the bound because the number of simple inputs is exponentially


limited. Note that this argument is for a cumulative probability over all inputs. It does not predict that, for a given complexity _K_(_x_), the outputs should all be near the bound. In


that sense this lower bound is not like that of the original coding theorem which holds for any output _x_. See the Supplementary Information for an alternative derivation of this bound.
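As a simple book-keeping sketch of how bound (10) can be tested (the numbers below are made up purely to illustrate the calculation; they are not data from any of the maps studied, and the \({\mathcal{O}}(1)\) term is ignored), one can sum the probability of all outputs lying at least Δ bits below the bound and compare it with \({2}^{-\Delta +1}\):

```python
# Book-keeping for bound (10): cumulative probability of outputs at least Delta bits
# below the simplicity bias bound, compared with 2^(1-Delta). The lists below are
# synthetic toy values used only to show the calculation.
def cumulative_weight_below_bound(log2P, log2P0, delta):
    """Total probability of outputs whose log2 P(x) sits >= delta bits below log2 P0(x)."""
    return sum(2.0 ** lp for lp, lp0 in zip(log2P, log2P0) if lp0 - lp >= delta)

log2P0 = [-2.0, -5.0, -8.0, -4.0, -6.0]    # value of the upper bound for five outputs
log2P = [-2.0, -5.0, -8.0, -10.0, -18.0]   # actual log2 P(x); two outputs fall below the bound

for delta in (2, 4, 8, 16):
    w = cumulative_weight_below_bound(log2P, log2P0, delta)
    print(f"Delta={delta:2d}  cumulative weight={w:.3e}  2^(1-Delta)={2.0**(1 - delta):.3e}")
```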


Bound (10) does not need an exhaustive enumeration to be tested. In Fig. 3 we show this bound for a series of different maps, including many maps from13. The cumulative probability weight


scales roughly as expected, implying that most of the probability weight is relatively close to the bound (at least on a log scale). What is the physical nature of these low complexity, low


probability outputs that occur far from the bound? They must arise in one way or another from the lower computational power of these maps, since they don’t occur in the full AIT coding


theorem. Low complexity, low probability outputs correspond to output patterns which are simple, but which the given computable map is not good at generating. In RNA it is easy to construct


outputs which are simple but will have low probability. Compare two _L_ = 15 structures _S_1 = ((.(.(...).).)). and _S_2 = .((.((...)).))., which are both symmetric and thus have a


relatively low complexity _K_(_S_1) = _K_(_S_2) = 21.4. Nevertheless, they have a significant difference in probability, _P_(_S_2)∕_P_(_S_1) ≈ 560, because _S_1 has several single bonds, which are much harder to make according to the biophysics of RNA. Only specially ordered input sequences can make _S_1; in other words, the inputs are simple, with \({K}_{\max }(p| n)=8.6\). By contrast, the inputs of _S_2 have a much higher complexity of \({K}_{\max }(p| n)=21.4\) because they need to be constrained less to produce this structure. This example illustrates how the system specific details


of the RNA map can unfavourably bias away from some outputs due to a system specific constraint. Similar examples of system specific constraint for the FST and perceptron can be found in


the SI. We hypothesise that such low complexity, low probability structures highlight specific non-universal aspects of the maps, and that extra information (in the form of a reduced set of inputs) is needed to generate such structures.

DISCUSSION

In conclusion, it is striking that bounds based simply on the complexity of the inputs and outputs can make powerful and general


predictions for such a wide range of systems. Although the arguments used to derive them suffer from the well known problems – e.g. the presence of uncomputable Kolmogorov complexities and


unknown \({\mathcal{O}}(1)\) terms – that have led to the general neglect of AIT in the physics literature, the bounds are undoubtedly successful. It appears that, just as is found in other


areas of physics, these relationships hold well outside of the asymptotic regime where they can be proven to be correct. This practical success opens up the promise of using such AIT based


techniques to derive other results for computable maps from across physics. Many new questions arise. Can it be proven when the \({\mathcal{O}}(1)\) terms are relatively unimportant? Why do


our rather simple approximations to _K_(_x_) work? It would be interesting to find maps where these classical objections to the practical use of AIT are important. There may also be


connections between our work and finite state complexity22 or minimum description length23 approaches. Progress in these domains should generate new fundamental understandings of the physics


of information.

METHODS

RNA SEQUENCE TO SECONDARY STRUCTURE MAPPING

RNA is made of a linear sequence of 4 different kinds of nucleotides, so that there are _N__I_ = 4^L possible sequences


for any particular length _L_. A versatile molecule, it can store information, as messenger RNA, or else perform catalytic or structural functions. For functional RNA, the three-dimensional


(3D) structure plays an important role in its function. In spite of decades of research, it remains difficult to reliably predict the 3D structure from the sequence alone. However, there are


fast and accurate algorithms to calculate the so-called secondary structure (SS) that determines which base binds to which base. Given a sequence, these methods typically minimize the


Turner model24 for the free-energy of a particular bonding pattern. The main contributions in the Turner model are the hydrogen bonding and stacking interactions between the nucleotides, as


well as some entropic factors to take into account motifs such as loops. Fast algorithms based on dynamic programming allow for rapid calculations of these SS, and so this mapping from


sequences to SS has been a popular model for many studies in biophysics. In this context, we view it as an input-output map, from _N__I_ input sequences to _N__O_ output SS structures. This


map has been extensively studied (see e.g.25,26,27,28,29,30,31,32,33) and provided profound insights into the biophysics of folding and evolution. Here we use the popular Vienna package26 to


fold sequences to structures, with all parameters set to their default values (e.g. the temperature _T_ = 37 °C). We folded all _N__I_ = 4^15 ≈ 10^9 sequences of length 15 into 346


different structures which were the free-energy minimum structures for those sequences. The number of sequences mapping to a structure is often called the _neutral set size_. The structures


can be abstracted in standard dot-bracket notation, where brackets denote bonded bases, and dots denote unbonded bases. For example, ...((....)).... means that the first three bases are not bonded, the fourth and fifth are bonded, the sixth through ninth are unbonded, the tenth base is bonded to the fifth base, the eleventh base is bonded to the fourth base, and the final four bases are unbonded. To estimate the complexity of an RNA SS, we first converted the dot-bracket representation of the structure into a binary string _x_, and then used the complexity


estimator described below to estimate its complexity. To convert to binary strings, we replaced each dot with the bits 00, each left-bracket with the bits 10, and each right-bracket with 01.
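The conversion can be made concrete with a short sketch (our own illustration; the folding step assumes the ViennaRNA Python bindings, imported as RNA, are installed, and the example sequence is arbitrary). The worked example given just below is used as a check:

```python
# Map a dot-bracket secondary structure to a binary string: . -> 00, ( -> 10, ) -> 01.
def dotbracket_to_bits(ss):
    code = {".": "00", "(": "10", ")": "01"}
    return "".join(code[c] for c in ss)

# the length-15 example from the text and its 30-bit encoding
assert dotbracket_to_bits("..(((...)))....") == "000010101000000001010100000000"

try:
    import RNA  # ViennaRNA Python bindings (optional dependency for this sketch)
    seq = "GGGAAACCCAAAGGG"              # an arbitrary length-15 sequence
    ss, mfe = RNA.fold(seq)              # minimum free-energy structure in dot-bracket notation
    print(seq, ss, f"{mfe:.2f} kcal/mol", dotbracket_to_bits(ss))
except ImportError:
    print("ViennaRNA bindings not installed; dotbracket_to_bits still works on any structure.")
```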


Thus an RNA SS of length _n_ becomes a bit-string of length 2_n_. As an example, the following _n_ = 15 structure yields the displayed 30-bit string $$..(((...)))....\to 000010101000000001010100000000$$ Because we are interested in exhaustive calculations, we are limited to rather small RNA sequence lengths. This means that finite-size effects may play an


important role. In ref. 13, we compared the simplicity bias bound (1) from the main text to longer sequences where only partial sampling can be achieved, and showed that much clearer simplicity bias is


evident in those systems.

FINITE STATE TRANSDUCER

Finite state transducers (FSTs) are a generalization of finite state machines that produce an output. They are defined by a finite set of


states \({\mathcal{S}}\), finite input and output alphabets \({\mathcal{I}}\) and \({\mathcal{O}}\), and a transition function \(T:{\mathcal{S}}\times {\mathcal{I}}\to {\mathcal{S}}\times {\mathcal{O}}\) defining, for each state and input symbol, a next state and output symbol. One also needs to define a distinguished state, \({S}_{0}\in {\mathcal{S}}\), which will be the


initial state, before any input symbol has been read. Given an _input sequence_ of _L_ input symbols, the system visits different states, and simultaneously produces an _output sequence_ of


_L_ output symbols. FSTs form a popular toy system for computable maps. They can express any computable function that requires only a finite amount of memory, and the number of states in the


FSTs offers a good parameter to control the complexity of the map. The class of machines we described above is also known as Mealy machines34. If one instead restricts the output symbol to depend only on the current state, one obtains Moore machines35. If one considers the input sequence to a Moore machine to be stochastic, it immediately follows that its state sequence


follows a Markov chain, and its output sequence is a Markov information source. Therefore, FSTs can be used to model many stochastic systems in nature and engineering, which can be described


by finite-state Markov dynamics. FSTs lie in the lowest class in the Chomsky hierarchy. However, they appear to be biased towards simple outputs in a manner similar to Levin’s coding


theorem. In particular, Zenil _et al_.14 show evidence of this by correlating the probability of FSTs and UTMs producing particular outputs. More precisely, they sampled random FSTs with


random inputs, and random UTMs with random inputs, and then compared the empirical frequencies with which individual output strings are obtained by both families, after many samples of


machines and inputs. For both types of machines, simple strings were much more likely to be produced than complex strings. We use randomly generated FSTs with 5 states. The FSTs are


generated by uniformly sampling complete initially connected DFAs (where every state is reachable from the initial state, and the transition function is defined for every input) using the


library FAdo36, which uses the algorithm developed by Almeida _et al_.37. Output symbols are then added to each transition independently and with uniform probability. In our experiments, the


inputs are binary strings and the outputs are binary strings of length _L_ = 30. The outputs for the whole set of 2^L input strings are computed using the HFST library


(https://hfst.github.io/). Not all FSTs show bias, but we have observed that all those that show bias show simplicity bias, and have the same behavior as that shown in Fig. 1 for low


complexity - low probability outputs. We can see why some simple outputs will occur with low probability by considering system specific details of the FST. For an FST, an output of length


_n_ which is _n_/2 zeros followed by _n_/2 ones is clearly simple, but we find that it has a low probability. We can understand this intuitively as follows. Producing such a string requires “counting” up to _n_/2 to know when to switch output, and counting requires a memory that grows with _n_, while FSTs have finite memory. We can also prove that, for instance, an FST


that only produces such strings (for any _n_) is impossible. The set of possible strings that an FST can produce comprises a regular language, as constructed by using the output symbols at


each transition as input symbols, giving us a non-deterministic finite automaton. Finally, using the pumping lemma38, it is easy to see that this family of strings isn’t a regular language.
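The following self-contained sketch illustrates an FST as an input-output map (this is our own toy code, not the FAdo/HFST pipeline described above: the transition table is sampled uniformly rather than via the initially-connected sampling of ref. 37, and _L_ is kept small so that all inputs can be enumerated). As noted above, any particular random machine may or may not show strong bias:

```python
# A random 5-state Mealy machine over binary input/output alphabets, applied to all
# 2^L inputs so that P(x) can be computed exactly by counting.
import random
from collections import Counter
from itertools import product

random.seed(0)
n_states, L = 5, 16

# transition table: (state, input symbol) -> (next state, output symbol)
delta = {(s, a): (random.randrange(n_states), random.choice("01"))
         for s in range(n_states) for a in "01"}

def run_fst(inp, start=0):
    state, out = start, []
    for a in inp:
        state, o = delta[(state, a)]
        out.append(o)
    return "".join(out)

counts = Counter(run_fst("".join(bits)) for bits in product("01", repeat=L))
total = 2 ** L
for x, c in counts.most_common(5):
    print(f"P({x}) = {c / total:.4f}")
print(f"{len(counts)} distinct outputs from {total} inputs")
```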


PERCEPTRON

The perceptron18 is the simplest type of artificial neural network. It consists of a single linear layer, and a single output neuron with binary activation. Because modern deep


neural network architectures are typically made of many layers of perceptrons, this simple system is important to study17. In this paper we use perceptrons with Boolean inputs and


discretized weights. For inputs _x_ ∈ {0, 1}^n, the discretized perceptron uses the following parametrized class of functions: $${f}_{w,b}(x)={\bf{1}}(w\cdot x+b),$$ where _w_ ∈ {−_a_, −_a_ + _δ_, …, _a_ − _δ_, _a_}^n and _b_ ∈ {−_a_, −_a_ + _δ_, …, _a_ − _δ_, _a_} are the weight vector and bias term, which take values in a discrete lattice with _D_ := 2_a_/_δ_ + 1 possible values per weight. We used _D_ = 2^k, so that each weight can be represented by _k_ bits, and _a_ = (2^k − 1)/2, so that _δ_ = 1. Note that rescaling all the weights _w_ and the bias _b_


by the same fixed constant wouldn’t change the family of functions. To obtain the results in Fig. 1, for which _n_ = 7, we represented the weights and bias with _k_ = 3 bits. We exhaustively enumerated all 2^{3(7+1)} possible values of the weights and bias, and we counted how many times we obtained each possible Boolean function on the Boolean hypercube {0, 1}^7. The weight-bias


pair was represented using 3 × (7 + 1) = 24 bits. A pair (_w_, _b_) is an input to the parameter-function map of the perceptron. The complexity of inputs to this map can therefore be


approximated by computing the Lempel-Ziv complexity of the 24-bit representation of the pair (_w_, _b_). In Fig. 4, we compare the simplicity bias of a perceptron with real-valued weights


and bias, sampled from a standard Normal distribution, to the simplicity bias of the perceptron with discretized weights. We observe that both display similar simplicity bias, although the


profile of the upper bound changes slightly. For the perceptron we can also understand some simple examples of low complexity, low probability outputs. For example, the function with all 0s


except for a 1 at the inputs (1, 0, 0, 0, 0, 0, 0) and (0, 1, 0, 0, 0, 0, 0), has a similar complexity to the function which only has 1s at the inputs (1, 0, 0, 0, 0, 0, 0) and (0, 1, 1, 1, 1,


1, 1). However, the latter has much lower probability. One can understand this because if we take the dot product of a random weight vector _w_ with two different inputs _x_1 and _x_2, the


results have correlation given by _x_1 ⋅ _x_2∕(||_x_1|| ||_x_2||). Therefore we expect the input (0, 1, 1, 1, 1, 1, 1) to be correlated with more other inputs than (0, 1, 0, 0, 0, 0, 0), so that


the probability of it having a different value than the majority of inputs (as is the case for the second function) is expected to be significantly lower.
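A scaled-down sketch of this enumeration may help (our own code, not that used for Fig. 1: the paper uses _n_ = 7 and _k_ = 3, i.e. 24-bit inputs, while here _n_ = 5 and _k_ = 2 keep the run time negligible; the activation convention, output 1 iff _w_ ⋅ _x_ + _b_ > 0, is an assumption):

```python
# Exhaustive enumeration of the parameter-function map of a discretized perceptron.
from collections import Counter
from itertools import product

n, k = 5, 2                                # paper: n = 7, k = 3
a = (2 ** k - 1) / 2                       # half-width of the lattice, so that delta = 1
lattice = [-a + i for i in range(2 ** k)]  # e.g. [-1.5, -0.5, 0.5, 1.5] for k = 2
X = list(product((0, 1), repeat=n))        # all 2^n Boolean inputs

def boolean_function(w, b):
    # truth table of 1(w.x + b > 0) over all inputs, written as a 2^n-character string
    return "".join("1" if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else "0"
                   for x in X)

counts = Counter(boolean_function(w, b)
                 for w in product(lattice, repeat=n) for b in lattice)
total = (2 ** k) ** (n + 1)                # number of (w, b) pairs, i.e. inputs to the map
print(f"{len(counts)} distinct Boolean functions from {total} (w, b) pairs")
for fn, c in counts.most_common(3):
    print(f"P = {c / total:.3f}  truth table = {fn}")
```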


METHODS TO ESTIMATE COMPLEXITY \(\widetilde{K}(x)\)

There is a much more extensive discussion of different ways to estimate the Kolmogorov complexity in the


supplementary information of13 and16. Here we use compression based measures; as in these previous papers, these are based on the 1976 Lempel Ziv (LZ) algorithm39, but with some small


changes: $${C}_{LZ}(x)=\left\{\begin{array}{ll}{{\rm{\log }}}_{2}(n), & x={0}^{n}\ {\rm{or}}\ {1}^{n}\\ {{\rm{\log }}}_{2}(n)[{N}_{w}({x}_{1}\,...\,{x}_{n})+{N}_{w}({x}_{n}\,...\,{x}_{1})]/2, & {\rm{otherwise}}\end{array}\right.$$ (11) Here _N__w_(_x_) is the number of code words found by the LZ


algorithm. The reason for distinguishing 0^n and 1^n is merely an artefact of _N__w_(_x_), which assigns complexity _K_ = 1 to the string 0 or 1, but complexity 2 to 0^n or 1^n for _n_ ≥


2, whereas the Kolmogorov complexity of such a trivial string actually scales as \({{\rm{\log }}}_{2}(n)\), as one only needs to encode _n_. In this way we ensure that our _C__L__Z_(_x_)


measure not only gives the correct behaviour for complex strings in the \({{\rm{lim}}}_{n\to \infty }\), but also the correct behaviour for the simplest strings. In addition to the


\({{\rm{\log }}}_{2}(n)\) correction, taking the mean of the complexity of the forward and reversed strings makes the measure more fine-grained, since it allows more values for the


complexity of a string. Note that _C__L__Z_(_x_) can also be used for strings of larger alphabet sizes than just 0/1 binary alphabets. To directly test the input based measures we typically


need fairly small systems, where the LZ based measure above may show some anomalies (see also the supplementary information of13 for a more detailed description). Thus, for such small


systems, or when comparing different types and sizes of objects (e.g. RNA SS and RNA sequences) a slightly different scaling may be more appropriate, which not only accounts for the fact


that _C__L__Z_(_x_) > _n_ for strings of length _n_, but also for the fact that the lower complexity limit may not be ~0, which it should be (see also the discussion in the supplementary information of13).
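Before giving that rescaling, a minimal re-implementation of the measure in Eq. (11) may be useful (an unoptimized sketch of our own, not the exact code used for the figures):

```python
# C_LZ(x) of Eq. (11): log2(n) for the trivial strings 0^n and 1^n, otherwise
# log2(n) times the mean number of LZ(1976) code words of the string and its reverse.
from math import log2

def n_w(x):
    """Number of code words in a Lempel-Ziv (1976) parsing of x."""
    m, pos, c = len(x), 0, 0
    while pos < m:
        l = 1
        # extend the candidate word while it can be reproduced from the history
        while pos + l <= m and x[pos:pos + l] in x[:pos + l - 1]:
            l += 1
        c += 1
        pos += l
    return c

def c_lz(x):
    n = len(x)
    if x == x[0] * n:                       # x = 0^n or 1^n (a single repeated symbol)
        return log2(n)
    return log2(n) * (n_w(x) + n_w(x[::-1])) / 2

for s in ("0" * 30, "01" * 15, "000010101000000001010100000000"):
    print(s, round(c_lz(s), 2))
```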


Hence we use a different rescaling of the complexity measure $$\widetilde{K}(x)={{\rm{\log }}}_{2}({N}_{O})\cdot \frac{{C}_{LZ}(x)-\min ({C}_{LZ}(x))}{\max ({C}_{LZ}(x))-\min ({C}_{LZ}(x))}$$ (12) which will now range between \(0\le \widetilde{K}(x)\le {{\rm{\log }}}_{2}({N}_{O})=n\) if, for example, _N__O_ = 2^n. For large objects, this different scaling will


reduce to the simpler one, because \(\max ({C}_{LZ}(x))\gg \min ({C}_{LZ}(x))\). We note that there is nothing fundamental about using LZ to generate approximations to true Kolmogorov


complexity. Many other approximations could be used, and their merits may depend on the details of the problems involved. For further discussion of other complexity measures, see for example


the supplementary information of refs. 13,16.

DATA AVAILABILITY

The data that support the findings of this work are available from the corresponding authors upon request.

REFERENCES


1. Mezard, M. & Montanari, A. _Information, Physics, and Computation_ (Oxford University Press, 2009).

2. Moore, C. & Mertens, S. _The Nature of Computation_ (Oxford University Press, 2011).

3. Shor, P. W. Algorithms for quantum computation: discrete logarithms and factoring. In _Proceedings 35th Annual Symposium on Foundations of Computer Science_, 124–134 (IEEE, 1994).

4. Li, M. & Vitanyi, P. _An Introduction to Kolmogorov Complexity and its Applications_ (Springer-Verlag New York Inc., 2008).

5. Devine, S. Algorithmic information theory: review for physicists and natural scientists (2014).

6. Turing, A. M. On computable numbers, with an application to the Entscheidungsproblem. _J. of Math_ 58, 5 (1936).

7. Shen, A., Uspensky, V. A. & Vereshchagin, N. _Kolmogorov Complexity and Algorithmic Randomness_, vol. 220 (American Mathematical Soc., 2017).

8. Chomsky, N. Three models for the description of language. _IRE Transactions on Information Theory_ 2, 113–124 (1956).

9. Lloyd, S. Quantum-mechanical computers and uncomputability. _Physical Review Letters_ 71, 943 (1993).

10. Cubitt, T. S., Perez-Garcia, D. & Wolf, M. M. Undecidability of the spectral gap. _Nature_ 528, 207–211 (2015).

11. Levin, L. Laws of information conservation (nongrowth) and aspects of the foundation of probability theory. _Problemy Peredachi Informatsii_ 10, 30–35 (1974).

12. Solomonoff, R. J. A formal theory of inductive inference. Part I. _Information and Control_ 7, 1–22 (1964).

13. Dingle, K., Camargo, C. Q. & Louis, A. A. Input-output maps are strongly biased towards simple outputs. _Nature Communications_ 9, 761 (2018).

14. Zenil, H., Badillo, L., Hernández-Orozco, S. & Hernández-Quiroz, F. Coding-theorem like behaviour and emergence of the universal distribution from resource-bounded algorithmic probability. _International Journal of Parallel, Emergent and Distributed Systems_ 34, 161–180 (2019).

15. Chibbaro, S., Rondoni, L. & Vulpiani, A. _Reductionism, Emergence and Levels of Reality_ (Springer, 2014).

16. Pérez, G. V., Louis, A. A. & Camargo, C. Q. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522 (2018).

17. Mingard, C. _et al_. Neural networks are a priori biased towards Boolean functions with low entropy. arXiv preprint arXiv:1909.11522 (2019).

18. Rosenblatt, F. The perceptron: a probabilistic model for information storage and organization in the brain. _Psychological Review_ 65, 386 (1958).

19. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. _Nature_ 521, 436 (2015).

20. Gács, P. _Lecture Notes on Descriptional Complexity and Randomness_ (Boston University, Graduate School of Arts and Sciences, Computer Science Department, 1988).

21. Chaitin, G. Information-theoretic computation complexity. _IEEE Transactions on Information Theory_ 20, 10–15 (1974).

22. Calude, C. S., Salomaa, K. & Roblot, T. K. Finite state complexity. _Theoretical Computer Science_ 412, 5668–5677 (2011).

23. Grünwald, P. & Roos, T. Minimum description length revisited. arXiv preprint arXiv:1908.08484 (2019).

24. Mathews, D. H. _et al_. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. _Proceedings of the National Academy of Sciences_ 101, 7287–7292 (2004).

25. Schuster, P., Fontana, W., Stadler, P. & Hofacker, I. From sequences to shapes and back: a case study in RNA secondary structures. _Proceedings: Biological Sciences_ 255, 279–284 (1994).

26. Hofacker, I. _et al_. Fast folding and comparison of RNA secondary structures. _Monatshefte für Chemie/Chemical Monthly_ 125, 167–188 (1994).

27. Fontana, W. Modelling ‘evo-devo’ with RNA. _BioEssays_ 24, 1164–1177 (2002).

28. Cowperthwaite, M., Economo, E., Harcombe, W., Miller, E. & Meyers, L. The ascent of the abundant: how mutational networks constrain evolution. _PLoS Computational Biology_ 4, e1000110 (2008).

29. Aguirre, J., Buldú, J. M., Stich, M. & Manrubia, S. C. Topological structure of the space of phenotypes: the case of RNA neutral networks. _PLoS ONE_ 6, e26324 (2011).

30. Schaper, S. & Louis, A. A. The arrival of the frequent: how bias in genotype-phenotype maps can steer populations to local optima. _PLoS ONE_ 9, e86635 (2014).

31. Wagner, A. _Robustness and Evolvability in Living Systems_ (Princeton University Press, 2005).

32. Wagner, A. _The Origins of Evolutionary Innovations: A Theory of Transformative Change in Living Systems_ (Oxford University Press, 2011).

33. Dingle, K., Schaper, S. & Louis, A. A. The structure of the genotype-phenotype map strongly constrains the evolution of non-coding RNA. _Interface Focus_ 5, 20150053 (2015).

34. Holcombe, M. & Holcombe, W. _Algebraic Automata Theory_, vol. 1 (Cambridge University Press, 2004).

35. Moore, E. F. Gedanken-experiments on sequential machines. _Automata Studies_ 34, 129–153 (1956).

36. Almeida, A., Almeida, M., Alves, J., Moreira, N. & Reis, R. FAdo and GUItar: tools for automata manipulation and visualization. In _International Conference on Implementation and Application of Automata_, 65–74 (Springer, 2009).

37. Almeida, M., Moreira, N. & Reis, R. Enumeration and generation with a string automata representation. _Theoretical Computer Science_ 387, 93–102 (2007).

38. Rabin, M. O. & Scott, D. Finite automata and their decision problems. _IBM Journal of Research and Development_ 3, 114–125 (1959).

39. Lempel, A. & Ziv, J. On the complexity of finite sequences. _IEEE Transactions on Information Theory_ 22, 75–81 (1976).

40. Vilar, J., Kueh, H., Barkai, N. & Leibler, S. Mechanisms of noise-resistance in genetic oscillators. _Proceedings of the National Academy of Sciences of the United States of America_ 99, 5988 (2002).

ACKNOWLEDGEMENTS

K.D. acknowledges partial financial support from the Kuwait Foundation for the Advancement of Sciences (KFAS)


grant number P115-12SL-06. G.V.P. acknowledges financial support from EPSRC through grant EP/G03706X/1.

AUTHOR INFORMATION

AUTHORS AND AFFILIATIONS

* Centre for Applied Mathematics and Bioinformatics, Department of Mathematics and Natural Sciences, Gulf University for Science and Technology, Hawally, 32093, Kuwait: Kamaludin Dingle

* Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Parks Road, Oxford, OX1 3PU, United Kingdom: Kamaludin Dingle, Guillermo Valle Pérez & Ard A. Louis

CONTRIBUTIONS

K.D. and A.A.L. conceived the project. K.D., A.A.L. and G.V.P. derived the theoretical results.


K.D. performed the numerical calculations for RNA, and G.V.P. performed the numerical calculations for the finite state transducer and the perceptron. K.D., G.V.P. and A.A.L. wrote the


paper.

CORRESPONDING AUTHOR

Correspondence to Ard A. Louis.

ETHICS DECLARATIONS

COMPETING INTERESTS

The authors declare no competing interests.

ADDITIONAL INFORMATION

PUBLISHER’S NOTE


Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

SUPPLEMENTARY INFORMATION

Supplementary Information.

RIGHTS AND PERMISSIONS

OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any


medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The


images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not


included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly


from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

ABOUT THIS ARTICLE

CITE THIS ARTICLE

Dingle, K., Pérez, G.V. & Louis, A.A. Generic predictions of output probability based on complexities of inputs and outputs. _Sci Rep_ 10, 4415 (2020). https://doi.org/10.1038/s41598-020-61135-7

* Received: 12 November 2019

* Accepted: 16 February 2020

* Published: 10 March 2020

* DOI: https://doi.org/10.1038/s41598-020-61135-7