Optimising the function of a protein of length N amino acids by directed evolution involves navigating a ‘search space’ of some $20^N$ possible sequences. Optimising the expression levels of P proteins that materially affect host performance, each of which might also take 20 (logarithmically spaced) values, implies a similar search space of $20^P$. In this combinatorial sense, then, the problems of directed protein evolution and of host engineering are broadly equivalent. In practice, however, they have different means for avoiding the inevitable difficulties of implementation. The spare capacity exhibited in metabolic networks implies that host engineering may admit substantial increases in flux to targets of interest. Thus, we rehearse the relevant issues for those wishing to understand and exploit those modern genome-wide host engineering tools and thinking that have been designed and developed to optimise fluxes towards desirable products in biotechnological processes, with a focus on microbial systems. The aim throughout is ‘making such biology predictable’. Strategies have been aimed at both transcription and translation, especially for regulatory processes that can affect multiple targets. However, because there is a limit on how much protein a cell can produce, increasing kcat in selected targets may be a better strategy than increasing protein expression levels for optimal host engineering.

Much of microbial biotechnology consists conceptually of two main optimisation problems [1]: (i) deciding which proteins should have their levels changed, and (ii) deciding by how much. The former is ostensibly somewhat simpler, e.g. when a specific enzyme is the target for overproduction, since the assumption is then that the aim is simply the maximal production of the active target (whether intracellularly or in a secreted form). Where the overproduction of a small molecule is the target, the optimal levels of individual metabolic network enzyme proteins depend on their specific kinetic properties and the consequent distribution of flux control (e.g. [2–8]). Since both circumstances ultimately seek to maximise the flux to the product of interest, we shall discuss them both, albeit mostly at a high level. Recognising that many pathways are poorly expressed in their natural hosts, we shall be somewhat organism-agnostic [9,10] (though we largely ignore cell-free systems), since we are more interested in the principles (whether microscopic [11] or macroscopic [12]) than the minutiae.

The possible number of discrete manipulations one can perform on a given system is referred to as the ‘search space’. The overriding issue is that the number of changes one might make scales exponentially with the number of those considered, and is simply astronomical; the trick is to navigate the search space intelligently [13]. Modern methods, especially those recognising the potential of synthetic biology and host engineering to make ‘anything’ (e.g. [14–23]), are improving both computational [24] and experimental approaches. The main means of making such navigation more effective is by seeking to recognise those areas that are most ‘important’ or ‘difficult’ for the problem of interest, and focusing on them; this is generally true of combinatorial search problems (and to illustrate this, a nice example is given by the means by which the Eternity puzzle https://en.wikipedia.org/wiki/Eternity_puzzle was solved).

We find it useful to classify problems into ‘forward’ and ‘inverse’ problems, because (Figure 1) this is in fact how they are commonly presented [13]. In areas such as drug discovery, a typical forward problem might be represented by starting with a set of structures and paired quantitative properties or activities (QSAR/QSPR) with which one can set up a model or a nonlinear mapping in which the molecular structures are the inputs and the activities are the outputs. A good model will be able to ‘generalise’, in the sense that it can give accurate predictions for novel molecular structures. On a good day (such as here [25]), it will even be able to extrapolate, to make predictions of activities larger than any it saw when it was being trained. This can be seen as a ‘forward’ problem (‘have molecule, want to predict properties’), nowadays known as a ‘discriminative’ problem [26–28]. However, what we really wish to solve [29,30] is the normally far harder inverse problem (‘have desired properties, want molecules’), nowadays referred to as a ‘generative’ problem [26,27,31–35] since the output is (the generation of) the solution of interest. Such generative methods are now well known in image and natural language processing, and are becoming available in all kinds of related areas of present interest such as drug discovery [36] and protein sequence generation (e.g. [37–40]). Thus, while we have (and can more or less easily create [41]) reasonable whole-genome models of all kinds of microbes (and see below), what we effectively need to solve here again is the ‘inverse problem’ [30,42]. For organism optimisation this is mainly ‘have desired flux, need to optimise the gene sequences and expression profiles of my producer organism to create it’.

### The cycle of knowledge in directed protein evolution.

Figure 1.
The cycle of knowledge in directed protein evolution.

It is useful to contrast the worlds of (i) mental constructs, including ideas and hypotheses, with (ii) more physical worlds that include data and ‘observations’. Their interrelations are iterative but their nature depends on their directionality. In the post-genomic era, there has rightly been a trend away from the primacy in biology of hypothesis-dependent deductive reasoning towards data-driven biology in which the best explanations are induced from available data.


The genetic search space in biology is enormous; even considering just a 30 mer of the standard nucleic acid bases can produce $4^{30}$ (∼$10^{18}$) different sequences. The enormity of this number can be illustrated by the fact that if each such sequence was arrayed as a 5 μm spot the array would take up an area of ∼29 km$^2$ [43]. Obviously, the number of bases in just a smallish bacterial genome such as that of Escherichia coli MG1655 is some $10^5$ times greater than 30, and it remains the case that we still know next to nothing about approximately one-third of the genes encoded therein [44], the so-called y-genes [45]. The expression levels of the identical protein sequence can vary several 100-fold just by changing the codons used [46], mainly because of the expression levels [47,48] and the stability of mRNA [49,50], as well as because of codon bias [51] and for other reasons [52]. Also, note that obtaining the best expression of the active protein is not simply a question of using the commonest codons [53], since (i) over-usage of an individual codon will necessarily deplete its tRNA, and (ii) sometimes it is necessary to slow down protein expression so as to avoid inclusion body formation [54]. Consequently, the problem is not made easier by substituting the term ‘nucleic acid bases’ in the above reasoning by the words ‘amino acids’ or ‘codons’. Indeed, the control of gene expression is distributed over the whole genome [55].
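The arithmetic behind the 30-mer illustration can be checked in a few lines. The sketch below is only a back-of-envelope calculation; it assumes square 5 μm spots tiled with no gaps, which is our simplification rather than a detail given in [43], and the function names are purely illustrative.

```python
def sequence_count(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of a given length over an alphabet."""
    return alphabet_size ** length


def array_area_km2(n_spots: int, spot_side_um: float = 5.0) -> float:
    """Area of an array of square spots tiled edge to edge (assumed geometry)."""
    side_m = spot_side_um * 1e-6          # micrometres to metres
    area_m2 = n_spots * side_m ** 2       # total area in square metres
    return area_m2 / 1e6                  # square metres to square kilometres


n_30mers = sequence_count(30)
print(f"30-mers: {n_30mers:.3e} sequences")
print(f"Array area: {array_area_km2(n_30mers):.1f} km^2")
```

Under these assumptions the result lands close to the ∼29 km$^2$ quoted above.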

Considering just the 20 ‘common’ amino acids, the number of sequence variants for M substitutions in a given protein of N amino acids is $19^M \cdot \frac{N!}{(N-M)!\,M!}$ [56]. For a protein of 300 amino acids with random changes in just 1, 2, or 3 amino acids in the whole protein this evaluates to 5700, ca 16 million, and ca 30 billion, while even for a small protein of N = 165 amino acids (smaller than half that of the average protein length in Uniprot), the number of variants exceeds $10^{15}$ when M = 8. If we wish to include insertions and deletions, they can be considered as simply increasing the length of N and the number of variants to 21 (with a ‘gap’ being coded as a 21st amino acid). Obviously, if we just consider a fixed number of positions N the number of possibilities scales as $20^N$ if any amino acid substitution is allowed. At all events, the dimensionality of the problem is equal to the number of things that can be varied, and it is the exponent in a relationship where the base (in the mathematical sense) is the number of values that it can take. In contrast, for a 165-, 350-, or 700-residue protein, although the number of ways of finding ‘the best’ five amino acids to vary is, respectively ∼$10^8$, ∼$2 \times 10^{10}$, and $10^{12}$, exhaustive search of those five amino acids always involves ‘just’ $20^5$ = 3.2 million variants. Thus strategies (such as ProSAR [57–60]) that seek the best elements to mutate at all, even in the necessary absence of epistatic analyses (see below), have considerable merit.
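As a quick check on these figures, the count of variants with exactly M substitutions, i.e. 19^M multiplied by the binomial coefficient ‘N choose M’, can be evaluated directly. This is a minimal sketch; the function name `variant_count` is our own and not taken from [56].

```python
from math import comb


def variant_count(n_residues: int, n_substitutions: int, alternatives: int = 19) -> int:
    """Variants with exactly n_substitutions changed positions.

    Choose which positions change (binomial coefficient), then pick one of
    the `alternatives` other amino acids at each changed position.
    """
    return alternatives ** n_substitutions * comb(n_residues, n_substitutions)


for m in (1, 2, 3):
    print(f"N=300, M={m}: {variant_count(300, m):,}")

print(f"N=165, M=8: {variant_count(165, 8):.2e}")
print(f"Exhaustive search of 5 chosen positions: {20 ** 5:,}")
```

The contrast between the last two lines is the point made in the text: choosing where to mutate dominates the search, whereas exhaustively varying a handful of chosen positions remains tractable.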

Overall, the solution to such a combinatorial search problem, as is of course used by biology itself, is not to try to make these massive numbers of changes ‘in one go’ but to build on earlier successes in a generational or evolutionary manner, known in protein engineering as the design–build–test–learn (DBTL) cycle (Figure 2).

### The design–build–test–learn (DBTL) paradigm for engineering biology.

Figure 2.
The design–build–test–learn (DBTL) paradigm for engineering biology.

Although usually considered solely at the level of protein directed evolution (and on which this diagram is based [13]), the DBTL strategy applies equally to host engineering.


Algorithmic strategies for doing so in general (including in physical sciences and engineering) are known variously as ‘genetic algorithms’ or ‘evolutionary computing’, and come in a variety of flavours (e.g. [61–69]). They have become well known for individual proteins in the form of directed evolution, as popularised in particular by Frances Arnold (e.g. [70–73]). Some recent reviews include [74–81]. Increasingly, the use of ‘deep mutational scanning’ [82–90], sometimes coupled to FACS-based sorting [91] (‘sort-seq’ [52,82,83,92]), is making available large amounts of sequence-activity pairs [85]. (We ourselves made available one million paired aptamer activity sequences in 2010 [93].) The general structure of an evolutionary algorithm is outlined in Figure 3. As (to some degree [94]) with natural evolution and organism breeding, it is up to the experimenter to select individuals to mutate or to recombine, specifically as one seeks combinations of traits that overall provide the desired phenotype.
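The loop of mutation, recombination and selection described above can be sketched in a few lines. This is a toy illustration only, not the method of any cited study: the fitness function (matches to a hidden target sequence), the target itself, and all parameter values are hypothetical stand-ins for an experimental assay.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "MKTAYIAKQR"  # hypothetical 'ideal' sequence, used only by the toy fitness


def toy_fitness(seq: str) -> int:
    """Stand-in for an assay: number of positions matching the hidden target."""
    return sum(a == b for a, b in zip(seq, TARGET))


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each residue with a random one with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq)


def recombine(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover between two parents of equal length."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(fitness, length, pop_size=50, n_keep=10, generations=40, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Test and select: rank the population and keep the best n_keep as parents.
        parents = sorted(population, key=fitness, reverse=True)[:n_keep]
        history.append(fitness(parents[0]))
        # Build the next generation by recombining and mutating the parents.
        children = [mutate(recombine(rng.choice(parents), rng.choice(parents), rng), rate, rng)
                    for _ in range(pop_size - n_keep)]
        population = parents + children  # parents are retained (elitism)
    return max(population, key=fitness), history


best, history = evolve(toy_fitness, len(TARGET))
print(best, toy_fitness(best), history[0], history[-1])
```

Because the parents are carried over unchanged, the best fitness in this sketch can never decrease between generations; real campaigns face assay noise and epistasis that a toy additive fitness does not capture.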

### Generalized evolutionary algorithms.

Figure 3.
Generalized evolutionary algorithms.

The elements of an evolutionary algorithm, in which a population of candidate solutions is mutated and recombined, iteratively with selection, to develop improved variants. Based in part on [95].


Notwithstanding the numerical combinatorial problems, the biggest problem in natural evolution is that of epistasis, i.e. the very common circumstance in which the ‘best’ amino acid at a certain location depends on the precise nature of the amino acid at one or more other residues. The commonest way to think of these problems is in terms of the fitness landscape metaphor [96], as illustrated in Figure 4. In this representation, ‘where’ one is in the multidimensional search space is encoded via the X- and Y- co-ordinates, while the value of the (composite desired) property of interest, or the fitness, is represented as the height. Epistasis manifests as a sort of ruggedness in the landscape, and is more-or-less inevitable when residues that are ‘distant’ in the primary sequence are in contact; indeed their covariance provides an important strategy for detecting such contacts from sequences alone (e.g. [97–100]). In particular, ‘sign’ epistasis occurs when A is better than B at location one when C is at location two, but B is better than A at location one when D is at location two. It is easy to understand this in simple biophysical terms with respect to the likelihood of contact formation, in this case via ion pairs, if A, B, C, and D are, respectively, glutamate, lysine, arginine and aspartate. Indeed, the covariation of residues in protein families is widely used as a means of predicting 3D structure from sequence alone [97,98,101–103].
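The definition of sign epistasis given above translates directly into a two-site test. In the sketch below, the residues follow the ion-pair example in the text (E = glutamate, K = lysine, R = arginine, D = aspartate), but the fitness values are invented purely for illustration and the function name is our own.

```python
def has_sign_epistasis(fitness, site1=("E", "K"), site2=("R", "D")) -> bool:
    """True if the preferred residue at site 1 flips with the residue at site 2.

    `fitness` maps (residue_at_site1, residue_at_site2) to a measured value.
    """
    a, b = site1
    c, d = site2
    effect_with_c = fitness[(a, c)] - fitness[(b, c)]  # benefit of a over b, given c
    effect_with_d = fitness[(a, d)] - fitness[(b, d)]  # benefit of a over b, given d
    return effect_with_c * effect_with_d < 0           # opposite signs


# Hypothetical values: oppositely charged pairs (E with R, K with D) score high.
ion_pair = {("E", "R"): 1.0, ("K", "R"): 0.2, ("E", "D"): 0.2, ("K", "D"): 1.0}
# Hypothetical additive case: E is always better than K by the same margin.
additive = {("E", "R"): 1.0, ("K", "R"): 0.5, ("E", "D"): 0.8, ("K", "D"): 0.3}

print(has_sign_epistasis(ion_pair))
print(has_sign_epistasis(additive))
```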

### The landscape metaphor for understanding genotype–phenotype relationships in directed evolution and similar protein optimisation experiments.

Figure 4.
The landscape metaphor for understanding genotype–phenotype relationships in directed evolution and similar protein optimisation experiments.

The XY co-ordinates indicate where one is in the sequence space, while the height indicates the value of the desired objective function(s). Reproduced from an open-access publication at [13].


The ‘ruggedness’ of the fitness landscape is widely taken to reflect the ease with which it may be searched, and which kinds of search algorithms may be optimal [62,93,104]. Biological landscapes tend to be somewhat rugged, but not pathologically so [105,106]. The so-called NK landscapes (e.g. [107–110]) are convenient models, and are completely non-rugged when K = 0; experimentally, we found K ∼ 1 for protein binding to DNA sequences [93]. Ruggedness necessarily increases as protein length L increases, and reasonable routes joining everything upwards or neutrally so as to escape local minima decrease [111], though they do exist in high dimensions [112].
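To make the role of K concrete, the following sketch builds a small NK-style landscape and counts its local optima by exhaustive enumeration. It is a simplified illustration under stated assumptions: binary alleles, each site interacting with its K right-hand neighbours on a ring, and uniformly random contribution tables. None of these choices is taken from [93] or [107–110].

```python
import itertools
import random


def make_nk_landscape(n: int, k: int, seed: int = 0):
    """Return a fitness function for an NK-style landscape with binary alleles."""
    rng = random.Random(seed)
    # One random lookup table per site, indexed by that site and its k neighbours.
    tables = [{bits: rng.random() for bits in itertools.product((0, 1), repeat=k + 1)}
              for _ in range(n)]

    def fitness(genotype):
        total = 0.0
        for i in range(n):
            key = tuple(genotype[(i + j) % n] for j in range(k + 1))
            total += tables[i][key]
        return total / n

    return fitness


def count_local_optima(n: int, fitness) -> int:
    """Count genotypes that are fitter than every single-site neighbour."""
    count = 0
    for g in itertools.product((0, 1), repeat=n):
        f = fitness(g)
        neighbours = (g[:i] + (1 - g[i],) + g[i + 1:] for i in range(n))
        if all(fitness(nb) < f for nb in neighbours):
            count += 1
    return count


smooth = count_local_optima(10, make_nk_landscape(10, 0))
rugged = count_local_optima(10, make_nk_landscape(10, 3))
print(f"K=0: {smooth} local optimum; K=3: {rugged} local optima")
```

With K = 0 every site contributes independently, so there is a single peak reachable by any uphill walk; increasing K typically multiplies the number of local peaks.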

Such sign epistasis is both common and highly important [113–116], and is especially responsible for ruggedness and local isolation under selection. From what we know (e.g. [117]), while pairwise epistasis of this type is indeed very common [118], including in adjacent residues [119], higher-order epistasis is somewhat less so (plausibly for steric reasons). Armed with paired sequence and activity values, all one can do is to seek to interpolate between the few positions with known values and those without. However, if one simply keeps climbing locally one is likely to be trapped in a local minimum (or maximum in the landscape metaphor) from which it is very hard to escape by mutation alone. Weak mutation and strong selection are commonplace in natural evolution [120–127] and consequently tend to disfavour lower fitnesses [128], exacerbating the problem of being trapped in a local minimum. This largely constrains natural evolution, and means that we can anticipate great improvements in organisms by seeking previously unknown sequences distant from known ‘peaks’.

Overall, then, the concept of epistasis implies that there is no monotonic ordering of the utility or performance of individual residues within a complete fitness landscape, and that depending on what else is going on there is some kind of a bell-shaped curve relating the utility of a given amino acid in the performance of a protein, to the rest of the protein landscape when that is allowed to be varied. As we shall see, and it is in fact inevitable, this is commonly mirrored more generally.

As presaged, the basic combinatorial problem of host engineering [129] is largely equivalent to that of protein engineering. We have P enzymes, each of which might be expressed at Q levels (to make life more reasonable in practice we would let these levels vary logarithmically, so 20 levels of a twofold increment gives a range of just over $10^6$ ($2^{20}$ = 1 048 576)). Our problem comes because again the number of combinations NC explodes as P increases (this is easily checked using the ‘COMBIN’ function in a spreadsheet programme such as MS-Excel, or online): for Q = 20, NC ∼ $10^{14}$ for P = 50, NC > $10^{20}$ for P = 100, and NC > $10^{30}$ for P = 300, a value of P that is lower than one-tenth of the number of gene products in E. coli. However, for P = 100, NC is only 100 and 4950 when just one or two variant levels are introduced that differ from those of the wild type (WT), respectively. As with any combinatorial search problem, appropriate application of modern Bayesian, machine learning, and design of experiment principles can assist with finding optimal combinations (e.g. [130–133]).
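The figures quoted above can be reproduced without a spreadsheet. The sketch below assumes, following the reference to the ‘COMBIN’ function, that NC is the binomial coefficient ‘P choose Q’; the function name is illustrative only.

```python
from math import comb


def n_combinations(p_enzymes: int, q: int = 20) -> int:
    """Binomial coefficient 'P choose Q', as returned by a spreadsheet COMBIN(P, Q)."""
    return comb(p_enzymes, q)


for p in (50, 100, 300):
    print(f"P={p}, Q=20: NC = {n_combinations(p):.2e}")

# Changing just one or two of 100 enzymes away from the wild-type level.
print(n_combinations(100, 1), n_combinations(100, 2))

# Dynamic range spanned by twenty twofold steps.
print(2 ** 20)
```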

Much evidence exists (modulo ‘bet hedging’ [134–138]) that the majority of organisms in a population seek to maximise their instantaneous growth rate (the flux to biomass) [12,139], and thus understanding the control of flux is a core issue, whatever the flux of interest. The relationship between the flux through a metabolic pathway or network and the concentration of an individual enzyme typically follows some kind of curve like a rectangular hyperbola (similar to that relating activity to substrate concentration in the simplest Michaelis–Menten equation). An illustration is given in Figure 5.

### Some potential relationships between the activity of an individual enzyme in a metabolic ‘pathway’ and a flux of interest.

Figure 5.
Some potential relationships between the activity of an individual enzyme in a metabolic ‘pathway’ and a flux of interest.

For these purposes, we consider that enzyme concentration and activity are proportionate. The broadly expected result is similar to that of curve 1, where there is a monotonic increase in flux as the enzyme's activity is increased. At lower enzyme activities (i) the slope of the tangent is reasonably high, while at higher activities (ii) a further increase in enzyme activity has little effect on pathway flux and the slope is correspondingly low. In some circumstances (curve 2), whether because of pleiotropic effects or because of the negative effects of an increased protein burden (see text), increases in enzyme activity beyond an optimum (marked in red) lead to decreases in pathway flux. The flux-control coefficient is the normalised slope relating pathway flux to enzyme activity at the operating point of interest.


Understanding why this is so is the territory of metabolic control analysis (MCA), which originated in the work of Kacser and Burns [140–142] and of Heinrich and Rapoport [143,144]. We refer readers to some reviews (e.g. [2,3,5–8,145–147]) and an online tutorial http://dbkgroup.org/metabolic-control-analysis/. MCA can be seen as a kind of local sensitivity analysis [148] (cf. [149]), in which the sensitivities (known as control coefficients, illustrated for a flux-control coefficient in Figure 5) add up either to zero or to one. The chief points of MCA for our purposes are that (i) every enzyme can contribute to the control of flux, but because their flux-control coefficients add up to one the contribution of individual enzymes is mostly small, (ii) the distribution of control varies as the activity of an individual enzyme is increased (this is somewhat equivalent to epistasis), (iii) because it is activities that matter and because enzyme concentrations cannot be increased without limit, it is better to increase activities directly (through increasing individual kcat values) rather than by increasing enzyme expression levels (in terms of kinetics this might only differ during transients, not in the steady state [42]), (iv) the best way to increase fluxes is to modulate multiple enzyme activities simultaneously [5,150], and (v) because the sum of concentration-control coefficients is zero, individual steps can and do have substantial effects on the concentrations of metabolic intermediates (this is precisely why the metabolome can serve to amplify comparatively small changes in the transcriptome or the proteome [151–153]). However, normally it is fluxes to the product in which one is interested for biotechnology, and in terms of increasing metabolic fluxes, one has to make choices from a combinatorial space [154], because there are necessarily fairly strong limitations on the total amount of protein that can be made by a given organism.
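The summation property of flux-control coefficients, and the way control shifts as one enzyme is raised, can be illustrated numerically. The sketch below is a deliberately simple toy model and not taken from the cited MCA literature: a linear pathway with fixed external substrate and product, in which step i has the reversible linear rate law v_i = e_i k_i (x_{i-1} − x_i / K_i). The parameter values are arbitrary, and the coefficients are estimated as log-log slopes by finite differences.

```python
import math

RATE_CONSTANTS = [1.0, 5.0, 20.0]  # arbitrary illustrative values
K_EQ = [2.0, 2.0, 2.0]             # arbitrary equilibrium constants


def steady_state_flux(enzymes, rate_constants, k_eq, substrate=10.0, product=1.0):
    """Closed-form steady-state flux of a linear chain with linear reversible steps."""
    resistance = 0.0
    cumulative_keq = 1.0
    for e, k, q in zip(enzymes, rate_constants, k_eq):
        resistance += 1.0 / (e * k * cumulative_keq)
        cumulative_keq *= q
    return (substrate - product / cumulative_keq) / resistance


def flux_control_coefficients(enzymes, rate_constants, k_eq, rel_step=1e-4):
    """Normalised slopes d ln J / d ln e_i, estimated by central differences."""
    coeffs = []
    for i in range(len(enzymes)):
        up, down = list(enzymes), list(enzymes)
        up[i] *= 1 + rel_step
        down[i] *= 1 - rel_step
        d_ln_j = (math.log(steady_state_flux(up, rate_constants, k_eq))
                  - math.log(steady_state_flux(down, rate_constants, k_eq)))
        d_ln_e = math.log(1 + rel_step) - math.log(1 - rel_step)
        coeffs.append(d_ln_j / d_ln_e)
    return coeffs


coeffs = flux_control_coefficients([1.0, 1.0, 1.0], RATE_CONSTANTS, K_EQ)
shifted = flux_control_coefficients([2.0, 1.0, 1.0], RATE_CONSTANTS, K_EQ)
print([round(c, 3) for c in coeffs], round(sum(coeffs), 6))
print([round(c, 3) for c in shifted], round(sum(shifted), 6))
```

In this toy model the coefficients sum to one in both cases, and doubling the first enzyme lowers its own coefficient while raising those of the other steps. The flux also depends on each step only through the product of enzyme level and rate constant, so at steady state doubling a rate constant has the same effect as doubling the corresponding enzyme concentration.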

The fact that most enzymes have small flux-control coefficients (because they must add up to one) necessarily means that they must tend to have ‘spare capacity’; this is simply another way of saying that increasing or lowering their activity has relatively little effect on a pathway flux. Such spare capacity also allows for rapid responses in the face of changes in the environment [155–159]. This spare capacity of itself implies that there is plenty of ‘room for manoeuvre’ in host engineering. Indeed, the ‘spare capacity’ has been identified explicitly in a variety of systems [160], for instance in mitochondrial respiration (e.g. [161,162]) and others discussed below. This said, some other respiratory systems are barely able to keep pace with the need to oxidise reducing equivalents that can be produced at high rates (e.g. [12,163–166]). Depending on one's point of view of the desirability of forming the relevant products, a failure of spare capacity in some pathways might also be seen as contributing to so-called overflow metabolism [167], as may be evidenced by ‘metabolic footprinting’ [168,169] or ‘exometabolomics’ [170–173]. Ultimately, of course, the ‘adaptability’ of an organism typically depends on the environments in which it has naturally evolved [174,175].

Transporters are both massively important for biotechnology (e.g. [176–178]) and variably promiscuous (and see later for transporter engineering). In one recent approach [179,180], we have used flow cytometry to assess the ability of single-gene knockouts of some 530 genes of E. coli (mostly transporter genes; the full list and datasets are in [179,180]) to take up a variety of fluorescent dyes. Some of the data, for SYBR Green uptake (whose fluorescence is massively enhanced upon binding to DNA), are replotted in Figure 6. The range is some 70-fold, reflecting the ability of multiple transporters to influence the uptake and efflux of the dye. Shown are the value for the WT and lines representing one half of and twofold that uptake (encompassing 361 of the 531 knockouts, ca 68%, studied). As expected, most manipulations have comparatively little effect, but there is a ‘long tail’ [181] (here in either direction) of a few that do. This is quite typical of biology, where it is also worth noting that the flow cytometric analysis of clonal cultures indicates a massive heterogeneity therein, presumably as a result of the differential expression of many hundreds of different enzymes. In some ways, the only surprise is that this variation is so small, and that is likely a result of evolution's necessary selection for robustness (e.g. [182–191]).

### The modal extent of uptake of SYBR Green in single-gene knockouts of E. coli.

Figure 6.
The modal extent of uptake of SYBR Green in single-gene knockouts of E. coli.

The data reflect flow cytometric analysis of the uptake of the fluorescent/fluorogenic dye SYBR Green into 530 different single-gene knockouts of E. coli from the Keio collection. Data are replotted from the supplementary table given in [179]. Red symbols indicate y-genes (genes of ‘unknown’ function). The wild type (WT) is marked in green. The horizontal lines indicate uptakes of one half or double that of the WT.


As is well known, microbes adjust their ribosomal content to match their growth rate [12,192,193]. Thus, when the opportunity arises, they can funnel excess amino acids into ribosomal biosynthesis [194]. Indeed, increasing amino acid availability, as in the ‘terrific broth’ [195], does assist protein production. Equally, it has been known for many years that, although there is considerable flexibility [196], cells do, as they must, have limitations on the total amount of protein that they can make [12,197], both as a flux to protein biosynthesis ‘as a whole’ and as a percentage of total biomass. To this extent, then, especially in laboratory cultures in rich media, the ability to biosynthesise protein is essentially a zero-sum game [198,199]: increasing the amount (concentration) of some proteins necessarily means decreasing the concentration of others. Since consensus metabolic networks have become established in standard organisms such as E. coli [200] and yeast [201] (and indeed humans [202,203]), attention has thus begun to shift to the wider proteome [204–207]. Experimentally, while the cell may ‘wish’ to retain some spare capacity, sometimes it is simply not possible. Thus, an early study [208] showed that increases in the expression of a variety of glycolytic enzymes in Zymomonas mobilis actually decreased the glycolytic flux; it was as though the cells were already at an optimal point (as marked in Figure 5 with a red symbol). Other studies [155,209,210] are reviewed by Bruggeman et al. [12]. A recent analysis in baker's yeast [211] by Nielsen and colleagues covers many of the issues. In this work [211], it was found that under nitrogen-limiting conditions, 75% of the total transcriptome and 50% of the proteome were produced in excess of what is necessary to maintain growth. This necessarily implies that there is scope for improving the host for the purposes of the biotechnologist.

Given both the spare capacity, and the fact that many proteins are commonly expressed that are not essential either for cell growth or for assisting fluxes to the product of interest [212,213], it is obvious that attention might usefully be applied to proteome engineering [214–216] as part of host engineering. This becomes especially obvious when it is recognised that the expression levels of even the commonest proteins span some 5 orders of magnitude (in a roughly log-normal distribution) when assessed in baker's yeast using methods that provide absolute numbers [217]. While transcript levels were an important contributor to this variation, the number of protein molecules per transcript in the same study also showed an impressive range, from 40 to 180 000 proteins per transcript [217], implying considerable post-transcriptional control. A more recent study in the same organism detailed absolute protein abundances under some 42 conditions [218], with data replotted therefrom in Figure 7. This serves to show the relatively limited range available in both the total transcriptome and the total proteome in growing cell cultures. Such kinds of datasets (and others such as those in [219] with simple growth as the output) will be of massive value in the future for the purposes of host engineering, as they at once allow one to understand the conditions under which genes are expressed, the strength of the relevant promoters, as well as other features such as genes whose expression varies little and might thus be used for purposes of normalisation [220,221].

### Relationship between absolute proteome and absolute transcriptome in baker's yeast grown under different conditions.

Figure 7.
Relationship between absolute proteome and absolute transcriptome in baker's yeast grown under different conditions.

Data are replotted from supplementary table S1a of [218]. The dilution rate from 0.1 to 0.35 h$^{-1}$ is encoded by the style of symbol, and the nitrogen source by the colour indicated (optimised for colour-blind legibility via the palettes at http://colorbrewer.org/). Numbers refer to the sample numbers in that table.


It is sometimes considered that because biotechnologists often aim to grow cells in rich media, one might usefully delete a lot of the biosynthetic capacity of a cell to ‘streamline’ a genome to make a ‘minimal’ genome. Actually, because of redundancy (A or B is required but not both), the number of possible ‘minimal’ genomes scales exponentially (as $2^n$) with the number n of redundant pairs, so the concept of a single minimal genome is quite inaccurate. Anyway, any real ‘burden’ comes from the expression, not the possession, of a particular gene, so we focus on strategies that optimise expression. More interesting are essential genes, and a very nice genome-wide study provided a clever method for assessing them [222].

To vary the amount of a particular protein one can act at the transcriptional or translational levels (or of course both). The former might involve truncations, knockouts, transcription factor engineering (see later), mRNA stability, RNA polymerase engineering, transcription factor binding sites, and direct promoter engineering. A suite of such approaches has been referred to as global transcription machinery engineering (gTME) [223–227]. Translation engineering will tend to have a focus on translation initiation, elongation, and codon optimisation [228,229]. Table 1 gives some examples. We are not aware of any studies that would point experimenters towards a general preference for one or the other (from the MCA analysis above, more likely both), implying that such studies would be of value; it may be, of course, that every problem is more or less bespoke. What is certain, however, is that very little of the search space has ever been covered by natural evolution. A nice example of this is given by the work of Wu et al. [230] who used a transformer model (see [39,231–240]) to assess the ability of various signal peptide sequences (SPs) to induce protein secretion. They found that successful, model-generated SPs were diverse in sequence, sharing as little as 58% sequence identity with the closest known native signal peptide and possessed just 73 ± 9% on average. Unsurprisingly, given the tiny population of sequence space accessed during natural evolution, this is more generally true [241]. Given the many recent advances in deep learning for solving a variety of biological problems (e.g. [31,37–39,242–255]), it is clear that these data-driven [29] strategies will be in the vanguard of the ‘learn’ part of the DBTL cycle.

Table 1.
Some examples of yield improvement by transcription or translation engineering

| Approach | Example or comment | References |
| --- | --- | --- |
| **Transcription** | | |
| mRNA stability | Contributes as much to transcription as does codon usage | [46,47] |
| Promoter engineering | Guided, empirical strategies | [265–268] |
| | Inducible Trp-T7 for serine production | [269] |
| | Novel tet-based use of machine learning | [270] |
| | Random variation on a trc promoter, allowing 60-fold variation in expression levels | [263] |
| | Promoter library module combinatorics for use in threonine production | [271] |
| | Reviews | [215,272–279] |
| | Sigma-factor-specific promoters | [280] |
| Transcription factor engineering | See below | |
| DNA/RNA polymerase engineering | | [281–284] |
| Chromosomal integration site | Increased isobutanol production from E. coli, involving chromosomal integration at random sites, selection by cell sorting | [285] |
| | β-carotene synthesis in Yarrowia lipolytica by simultaneous integration of a 3-module biosynthetic pathway plus selection by colony colour | [286] |
| Riboswitches | Can provide effective control | [287] |
| σ-factor engineering | Major transcriptional control point | [288] |
| | Improved antibody production in E. coli | [289] |
| | Extracytoplasmic σ factors | [290] |
| | Use in cyanobacteria | [291] |
| Terminator engineering (acts both transcriptionally and translationally) | Increased protein expression through reduced read-through, including itaconic acid and betaxanthin production | [292–297] |
| **Translation** | | |
| Codon usage | Strong selection in S. cerevisiae leads to plasmid copy variation | [298,299] |
| | Role in regulating protein folding | [300] |
| | Review of codon usage tables | [301] |
| Ribosome binding sites (RBS) | RBS calculator | [302] |
| | Machine learning in E. coli | [303] |
| | Multiprotein RBS optimisation in various bacteria | [304] |
| | Review of RBS calculator | [305] |
| | Phenotypic recording with deep learning, using more than 2.7 M sequence-function pairs | [306] |
| Translation initiation optimisation | Reviews | [228,307,308] |
| | Significant increase in serine overproduction | [309] |
| | 5-Methylpyrazine-2-carboxylic acid production | [310] |
| | 2,5-furandicarboxylic acid production | [311] |
| tRNA engineering | Admits novel codons, of which some can encode non-canonical amino acids | [312] |
| | tRNA synthetase engineering | [313] |

It is generally agreed that while the use of extrachromosomal plasmids is useful for high-throughput screening applications, integration of pathways into the chromosomal DNA of the host organism is ultimately preferable in most production strains due to the well-established instability of plasmids during continuous growth [256,257]. Recent advances in CRISPR and other molecular biology techniques have allowed the integration of reporter genes into a large number of defined genomic sites. Significant variations in the expression levels of reporter proteins with the site of genomic integration have been demonstrated in Saccharomyces [258], E. coli [259,260], Bacillus subtilis [261], Pseudomonas putida [262], and Acinetobacter baylyi [263]. Generally, higher levels of expression are seen for integration at sites closer to the origin of replication, as during replication there is in essence a higher copy number of genes that are on the DNA strands replicated first [264]. As discussed in detail above, it is rarely a sensible goal to maximise the expression level of all proteins in a relevant pathway, so the genomic integration site of heterologous proteins is an axis on which optimisation can be performed.

Much of the systems biology agenda (e.g. [30,145,146,314–318]) has recognised that to understand complex, nonlinear systems such as biochemical networks it is wise to model them in parallel with analysing them experimentally. This allows the performance in silico of ‘what if?’ kinds of experiments in a manner far less costly than doing them all, allowing one to choose a subset of the most promising. This is also sometimes referred to as e-science [319–321], or having a ‘digital twin’ [322,323] of the process of interest. It is, of course, very well established in fields such as chemical or electronic engineering, where it would be inconceivable to design a process plant or a new chip without modelling it in parallel.

In part, the success of those fields is because we know (because we have designed them) both the wiring diagram of how components or modules interact, and in addition, we know, quantitatively, the input–output characteristics of each module. This allows one to produce what amounts to a series of ordinary differential equations that, given a starting set of conditions, can model the time evolution of the system (by integrating the ordinary differential equations). Such models can be set up in biochemistry-friendly systems such as Copasi [324,325] (http://copasi.org/), CellDesigner [326–329] (http://www.celldesigner.org/), and Cytoscape [330–332] (https://cytoscape.org/). However, prerequisite to this being done accurately is that one has knowledge of the expression levels, kinetic rate equations, and rate constants for each of the steps. This is only rarely achieved (e.g. [333]), even when such details are not known and generalised equations that cover a wide range of force–flux conditions are used [334–336]. Consequently, so-called constraint-based methods have come to the fore. Chief among these is flux balance analysis (FBA) [146,315,317,337–341].
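As a minimal sketch of what such a model amounts to, the following integrates a hypothetical two-step pathway S → I → P with irreversible Michaelis–Menten kinetics, using a fixed-step fourth-order Runge–Kutta scheme. The pathway, the function names and all parameter values (Vmax, Km, starting concentrations) are illustrative assumptions rather than measured quantities; dedicated tools such as Copasi handle stiffness, units and parameter estimation far more robustly.

```python
def michaelis_menten(vmax, km, conc):
    """Irreversible Michaelis-Menten rate law."""
    return vmax * conc / (km + conc)


def derivatives(state, params):
    """Right-hand side of the ODEs for the hypothetical pathway S -> I -> P."""
    s, i, _p = state
    v1 = michaelis_menten(params["vmax1"], params["km1"], s)  # S -> I
    v2 = michaelis_menten(params["vmax2"], params["km2"], i)  # I -> P
    return (-v1, v1 - v2, v2)


def rk4_step(state, params, dt):
    """Advance the state by one fixed fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(x + h * dx for x, dx in zip(state, k))

    k1 = derivatives(state, params)
    k2 = derivatives(shifted(k1, dt / 2), params)
    k3 = derivatives(shifted(k2, dt / 2), params)
    k4 = derivatives(shifted(k3, dt), params)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(initial_state, params, dt=0.01, t_end=50.0):
    """Integrate from the starting conditions to t_end and return the final state."""
    state = initial_state
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(state, params, dt)
    return state


# Illustrative (assumed) parameters in arbitrary units.
PARAMS = {"vmax1": 1.0, "km1": 0.5, "vmax2": 0.5, "km2": 0.5}
INITIAL = (10.0, 0.0, 0.0)  # starting concentrations of S, I and P

final_s, final_i, final_p = simulate(INITIAL, PARAMS)
print(f"S={final_s:.3f}  I={final_i:.3f}  P={final_p:.3f}")
```

Changing `vmax2` in `PARAMS` is the in silico equivalent of changing the expression level of the second enzyme, which is the kind of ‘what if?’ experiment referred to above.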

FBA recognises and exploits the massively important ‘stoichiometric’ constraints engendered by the fact that mass must be conserved [342–344], leading to atomic and molecular constraints reflected in reaction stoichiometries, and that consequently only certain kinds of fluxes and flux ratios are possible in a known metabolic network. This simple but exceptionally powerful idea, equivalent to Kirchhoff's laws in electrical circuit theory, comes into its own when one seeks to optimise fluxes to the desired end [345–348] (as in host engineering).
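The stoichiometric constraint can be illustrated with a toy network. The sketch below uses a hypothetical four-reaction network with two internal metabolites, and finds the steady-state flux distribution that maximises product export. Note the assumptions: the network, bounds and names are invented for illustration, and we enumerate integer fluxes by brute force in place of the linear programming solver that real FBA software (e.g. the COBRA toolboxes) would use.

```python
from itertools import product

# Stoichiometric matrix: rows are internal metabolites, columns are reactions.
# R1: -> A (uptake)   R2: A -> B   R3: B -> product (export)   R4: A -> biomass
S = [
    [1, -1, 0, -1],  # metabolite A
    [0, 1, -1, 0],   # metabolite B
]
LOWER = [0, 0, 0, 1]      # R4 >= 1: a minimum biomass flux is demanded (assumed)
UPPER = [10, 10, 10, 10]  # uptake capped at 10 arbitrary units (assumed)
OBJECTIVE = 2             # column index of R3, the product export flux


def is_steady_state(fluxes):
    """Mass balance: S . v = 0 for every internal metabolite."""
    return all(
        sum(coeff * flux for coeff, flux in zip(row, fluxes)) == 0 for row in S
    )


def best_flux_distribution():
    """Enumerate integer flux vectors within bounds and keep the best feasible one."""
    ranges = [range(lo, hi + 1) for lo, hi in zip(LOWER, UPPER)]
    feasible = [v for v in product(*ranges) if is_steady_state(v)]
    return max(feasible, key=lambda v: v[OBJECTIVE])


best = best_flux_distribution()
print("Optimal fluxes (R1, R2, R3, R4):", best)
```

Even in this tiny case the mass-balance condition removes almost all candidate flux vectors, which is why the constraint-based approach is so effective on genome-scale networks.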

Software for performing FBA is also more or less widely available [340,349], the generic COBRA toolboxes [347,350–353] being especially popular. Such software is much aided by the development of various kinds of linguistic standards for describing systems biology models, such as BioPAX [354–357] (http://www.biopax.org/) and SBML [358] (http://sbml.org/Main_Page).

An especially potent implementation comes from the recognition that if the expression level of a given enzyme is treated as a surrogate for (or an approximation to) the actual flux through that step, then methods that maximise the correlation between predicted and real fluxes, while still admitting mass conservation, can, in fact, predict real fluxes astonishingly well (e.g. [359,360]). Given such a base model, it is then just a question of navigating the space of expression profiles to see those (combinations of) changes that have the greatest effect on the flux of interest. Note, however, that FBA (i) is blind to regulatory effects and (ii) cannot predict metabolite concentrations (only fluxes). Finally, here, it is worth remarking that advanced analyses based on molecular dynamics simulations are beginning to allow the calculation of enzymatic activities and epistatic interactions de novo (or at least to account for them) (e.g. [116,361–363]); as with other areas [364–366], the increasing availability of cheap computing will continue to make such methods both more potent and more accurate.

As seen in early work in E. coli [367], promoter engineering allows one to vary the amount of target enzymes both smoothly and extensively. Of course, nowadays this can be done on a genome-wide scale using methods such as CRISPR–Cas [368,369]. Thus Alper and colleagues [370] assessed the effects of the expression level of all 969 genes that comprise the ‘iTO977’ model of Saccharomyces cerevisiae metabolism, with overproduction of betaxanthins as one of the objective functions. A particularly important finding was that in a good many cases knockdown rather than complete knockout was preferable, and that there was almost always an optimal level (as per Figure 5) in the range considered. This optimality has been widely reported (e.g. [371–375]), and interestingly (presumably for evolutionary reasons) typically corresponds to the expression level seen in the WT [12]! RNAi engineering can also be used to modulate expression levels [376].

Of the various means of genetic manipulation widely available (transformation, transduction, mating, etc.), transformation by exogenous DNA remains the most popular. This said, transformation using libraries of DNA is far less efficient than one would like [377,378], and it varies considerably with the organism of interest. Some cells [379] such as certain bacilli [380], streptococci [381], acinetobacters [382], and Vibrio spp. [383–386] are more-or-less ‘naturally’ competent, while others require considerable optimisation to achieve acceptable rates [387]. A veritable witches’ brew of cocktail components has been considered; at this stage, it seems that an empirical approach is needed for every organism (e.g. [388–393]). There is also the question of whether the vector to be used is intended to be or remain episomal or to integrate by recombination into the host chromosome. These are areas that will require especial attention for improved host engineering.

The arrival of CRISPR–Cas9 and related genome editing tools [394,395] is well enough known as not to need detailed review (and many are available, e.g. [369,396–410]).

A recent advance adds a simple coupling between each gRNA that might have had an effect and a barcode that encodes its identity. This is the CRISPR-enabled trackable genome engineering (CREATE) technology developed by Gill and colleagues (e.g. [396,404,411–413]). CREATE uses array-based oligos to synthesise and clone hundreds of thousands of cassettes, each containing a genome-targeting gRNA covalently linked to a dsDNA repair cassette encoding a designed mutation. After CRISPR/Cas9 genome editing, the frequency of each designed mutant can be tracked by high-throughput sequencing using the CREATE plasmid as a barcode. (A commercial version of this approach is now available as the Onyx™ instrument (https://www.inscripta.com/technology).)
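The tracking step reduces to counting barcodes before and after selection and comparing their relative frequencies. The following is a minimal sketch of that bookkeeping, not the published CREATE analysis pipeline; the barcode labels, read lists and the pseudocount are made-up assumptions for illustration.

```python
import math
from collections import Counter


def log2_enrichment(reads_before, reads_after, pseudocount=0.5):
    """Log2 change in relative frequency of each barcode across a selection.

    A pseudocount (an assumed, commonly used device) avoids division by zero
    for barcodes that are absent from one of the two samples.
    """
    before = Counter(reads_before)
    after = Counter(reads_after)
    barcodes = set(before) | set(after)
    total_before = sum(before.values()) + pseudocount * len(barcodes)
    total_after = sum(after.values()) + pseudocount * len(barcodes)
    scores = {}
    for barcode in barcodes:
        freq_before = (before[barcode] + pseudocount) / total_before
        freq_after = (after[barcode] + pseudocount) / total_after
        scores[barcode] = math.log2(freq_after / freq_before)
    return scores


# Toy sequencing reads, each already reduced to the barcode it carries.
reads_before = ["bc_A"] * 4 + ["bc_B"] * 4 + ["bc_C"] * 4
reads_after = ["bc_A"] * 9 + ["bc_B"] * 2 + ["bc_C"] * 1

scores = log2_enrichment(reads_before, reads_after)
for barcode, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{barcode}: {score:+.2f}")
```

Barcodes with positive scores mark designed mutations that were enriched under the selection, and those with negative scores mark mutations that were depleted.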

A biotechnological example of the CREATE technology is that for lysine production (a mature, multi-billion US$ market [414]) in E. coli [412]. Here the authors [412] designed over 16 000 mutations to perturb lysine production, and mapped their contributions toward resistance to a lysine antimetabolite (toxic amino acid analogue). They thereby identified a variety of different routes that can alter pathway function and flux, uncovering mechanisms that would have been difficult to design rationally; many were, in fact, unknown! In the event, mutations in genes linked to transport, biosynthesis, regulation, and degradation were uncovered, with some being as expected (showing the virtue of the strategy) and others (especially in DapF acting as a regulator) being entirely novel. Overall, this strategy provides an exceptionally potent, efficient and effective approach to the principled discovery of ‘novel’ genes involved in any bioprocess of interest that can be run at different ‘levels’ or in different ‘states’.

It is a curious fact that much of the community that studies plants has focused on the control of flux via transcription factors (TFs, e.g. for pigment production [415–418]), while microbiologists have tended historically to focus more directly on metabolic networks per se. This is starting to change.

The transcription factor-based regulatory network of E. coli is probably the best studied (e.g. [419–421]), with over 200 TFs [422–424] organised into some 150 regulons [423,425,426]. Independent components analysis (ICA) is a useful, convenient, multivariate linear, and well-established technique for separating mixed signals into orthogonal contributions; it has been used to group these differential gene expression changes into over 300 iModulons [427,428]. One may suppose that semi-supervised methods of deep learning [31] will prove even more rewarding in terms of understanding coregulation.

A related study in yeast manipulated some 47 TFs (via a library containing over 83 000 mutations) affecting over 3000 genes, leading to a substantial improvement in both isopropanol and n-butanol tolerance. An analysis of the relevant gene expression changes showed that genes related to glycolysis played a role in the tolerance to isobutanol, while changes in mitochondrial respiration and oxidative phosphorylation were significant for tolerance to both isobutanol and isopropanol.

The number and nature of the genes regulated by TFs can vary considerably, and in a nice strategy Lastiri-Pancardo et al. [429] worked out those whose removal would provide maximal flexibility for the reorganisation of allocation of the rest of the proteome. For instance, feast/famine regulatory proteins/transcription factors [430,431] are common to both archaea and eubacteria; Lrp, in particular, is especially responsive to the concentration of leucine as an indicator of the cell's nutritional status. Overall, TFs seem a particularly useful target for intelligent host engineering (e.g. [432–434], including in biosensors [435–444]).

In addition to changing the expression levels of target genes, we will also wish to change their activities, and one obvious means is via mutation. The kinds of diversity creation that can effect mutation are summarised in Figure 8.

### Types of diversity creation and genome engineering.

Figure 8.
Types of diversity creation and genome engineering.

Different strategies for creating strain diversity as part of host engineering, set out as a ‘Boston matrix’ reflecting the variation between different strategies in terms of the number of variants created and the number of genomic locations tested. Based on the material at https://www.youtube.com/watch?v=tb97SghfL_8&t=256s.


Many of the genetic variations that improve the performance of microbial cell factories are not currently possible to design rationally, despite the large degree of genetic knowledge around many platform strains [378]. This is in large part due to the high degree of epistasis and the combinatorial problems discussed in detail above. While advances in AI are rapidly changing this (see above), improvements in microbial cell factories are presently still in many cases being found by wet laboratory techniques (Table 2) that introduce more-or-less random mutations across the genome and then select for strains with desired properties. These strains can be used directly, or with the plummeting costs of next-generation sequencing, beneficial mutations can be identified revealing new mechanisms and targets for further rational design. A further advantage of random mutagenesis relevant to some applications is that strains generated through random mutagenesis are considered ‘GMO free’, which allows one to avoid legal regulations that have been set up around some kinds of so-called genetically modified organisms [445,446].

Table 2.
Example applications of techniques to introduce genome-wide mutations

| Technique | Species | Purpose | Notes | References |
| --- | --- | --- | --- | --- |
| UV | Kluyveromyces marxianus | Improved ethanol production | Used an automated platform incorporating UV mutagenesis | [480] |
| | Yarrowia lipolytica | Improved oil production | | [481] |
| Chemical mutagenesis | Chlorella vulgaris | Light tolerance | | [482] |
| | Brettanomyces bruxellensis | Reduced production of 4-ethylphenol, an undesirable by-product in wine fermentation | | [483] |
| | Yarrowia lipolytica | Increased lipid production | | [484] |
| | Lipomyces starkeyi | Increased production of triacylglycerol | | [485] |
| Atmospheric and room temperature plasma mutagenesis | Zymomonas mobilis | Acetic acid tolerance | | [486] |
| | Spirulina platensis | Astaxanthin production | | [487] |
| | Escherichia coli | L-lysine production | Incorporated a biosensor for cell sorting | [488] |
| | Actinosynnema pretiosum | Production of the antibiotic ansamitocin | Used in combination with genome shuffling | [489] |
| | Streptomyces mobaraensis | Production of the enzyme transglutaminase | | |
| epWGA | S. cerevisiae | Ethanol tolerance | | [451] |
| | Lactobacillus pentosus | Lactic acid production | | [490] |
| | Zymomonas mobilis | Furfural tolerance | | [491] |
| | E. coli | Butanol tolerance | | [492] |
| Serialised ALE | Saccharomyces cerevisiae | β-caryophyllene production | | [493] |
| | Corynebacterium glutamicum | Glutarate production | | [494] |
| | E. coli | Ionic liquid tolerance | | [454] |
| Continuous ALE | Methylobacterium extorquens | Methanol tolerance | | [495] |
| | E. coli | Conversion to generate all its biomass from CO2 | | [496] |
| GREACE | E. coli | Lysine production | | [478] |
| | E. coli | Butanol tolerance | | [476] |
| | S. cerevisiae | Acetic acid tolerance, reduced acetaldehyde production | | [479] |

The ability of UV radiation [447] and certain chemicals [448] to cause mutation has been established since the 1930s and 1940s, respectively. While there have been massive advances in the tools available for metabolic engineering and strain generation in the subsequent decades (some of which are outlined below), several recent papers illustrate that there is still utility in using UV radiation and mutagenic chemicals to introduce genetic diversity. These techniques are especially relevant when working with novel or poorly characterised strains for which other tools to introduce variation are lacking, since UV and chemical mutagens cause mutations efficiently in nearly all species.

Atmospheric and room temperature plasma mutagenesis (ARTM) is a novel technique for introducing random mutations. The application of plasma as a mutagenic agent was first described by Li and colleagues in a 2012 paper [449], in which it was used to generate a mutant library of Methylosinus trichosporium. In ARTM, a jet of helium, ionised by an electric field, is blown onto a sample, which (through a yet to be fully elucidated mechanism) causes DNA damage and mutations.

The ARTM technique has, according to a recent review [445], been applied to industrially relevant improvements in over 20 species including both Gram-positive and -negative bacteria, filamentous fungi, yeasts, algae, and cyanobacteria. It has been shown in the umu test on Salmonella typhimurium that ARTM generates a higher rate of surviving mutated cells than do UV and chemical mutagenesis methods [450]. Despite the apparent advantages, the commercial unit is thus far only available in China, and the publications using ARTM appear to be exclusively from Chinese institutions.

Another technique to introduce mutations across the genome is error-prone whole-genome amplification (epWGA). In this, genomic DNA from the strain of interest is extracted and subjected to error-prone PCR, then retransformed into the initial strain [451]. The transformed cells are subjected to relevant selective pressure, for instance, to isolate strains that have an improved property such as tolerance to an inhibitor or increased product titre. This process can be performed iteratively, and with full genome sequencing beneficial mutations can be identified and isolated to quantify their effects.

One of the most widely used and well-established techniques to introduce (i.e. select) beneficial mutations is adaptive laboratory evolution (ALE) [452,453]. ALE is in principle a very simple technique in which cells are cultured under some form of selective pressure, such as the presence of a toxic substance. Cultures are generally serially propagated into media with incremental increases in selective pressure. During this, mutations that confer a fitness advantage accumulate and become fixed in the population. These mutations can then be discovered by sequencing and reintroduced explicitly into a strain of interest, or the evolved strain can be used directly as a platform in downstream applications (e.g. [454,455]).
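The takeover of a culture by a fitter mutant during serial propagation can be sketched with a deterministic two-genotype model. This is a deliberately crude illustration: it assumes pure exponential growth with no lag or saturation, a single pre-existing mutant lineage, and invented growth rates and passage times.

```python
import math


def serial_passage(mutant_fraction, mu_wt, mu_mut, hours_per_passage, passages):
    """Mutant fraction at the start and after each of a series of batch passages.

    Both genotypes grow exponentially at their own specific growth rate
    (per hour); dilution into fresh medium leaves the fraction unchanged.
    """
    history = [mutant_fraction]
    for _ in range(passages):
        wild_type = (1.0 - mutant_fraction) * math.exp(mu_wt * hours_per_passage)
        mutant = mutant_fraction * math.exp(mu_mut * hours_per_passage)
        mutant_fraction = mutant / (wild_type + mutant)
        history.append(mutant_fraction)
    return history


# Assumed values: one mutant per million cells, with a 20% growth-rate advantage.
history = serial_passage(
    mutant_fraction=1e-6, mu_wt=0.5, mu_mut=0.6, hours_per_passage=24, passages=10
)
for passage, fraction in enumerate(history):
    print(f"passage {passage:2d}: mutant fraction = {fraction:.6f}")
```

Under these assumptions a rare mutant remains effectively invisible for several passages and then sweeps through the population within a few more, which is the step-like behaviour typically seen in serialised ALE.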

In the most straightforward use case, tolerance ALE (TALE), cells are propagated in increasing concentrations of some compound that normally inhibits growth in order to improve tolerance (‘tolerance engineering’, Figure 9). Tolerance to toxic environments is still a major limiting factor in achievable yields from microbial cell factories [456,457]. This may be, for example, toxicity of the desired product (as is the case in fermentative butanol production [458]) or toxic inhibitors present in feed stocks (which is a major challenge in attempts to process lignocellulose hydrolysates [459–461]). ALE may also be used to improve utilisation of a preferred energy source, or to increase product titre directly (although the latter usually requires more advanced experimental design in order to couple production to a fitness advantage [462,463]). A very extensive recent review covering the applications of ALE in more detail is found in [464].

### Adaptive laboratory evolution (ALE), illustrated here for tolerance engineering.

Figure 9.
Adaptive laboratory evolution (ALE), illustrated here for tolerance engineering.

Cultures are grown in batch mode under conditions in which a stress leads their overall growth rate or yield to be suboptimal. As mutants that are more tolerant to the stress emerge they are selected for and take over the culture, with concomitant increases in growth rate or yield. The magnitude of the stress can then be increased and the process repeated as often as desired.


While ALE in its simplest form involves serial propagation of cells, continuous evolution techniques utilise variations on what are commonly referred to as ‘-ostat’ bioreactors (chemostats, turbidostats and the like). These use some form of detection from a growth chamber (commonly optical density (OD), but such x-stats may also detect pH, dissolved oxygen and many other parameters). Cultures are maintained in a constant state of growth under steady conditions by dilution through automated addition of fresh media along with other supplements or inhibitors. In this way, a constant growth rate and smooth evolution curve can be achieved, compared with the more ‘punctuated equilibrium’ that is the hallmark of serialised ALE [465]. Traditionally cost has been something of a barrier in the use of turbidostats (albeit far from insurmountable [198,199,466–468]), which unlike serialised ALE require specialised detection probes and feedback systems [469]. Recently, however, several open source and low-cost chemostats have become available, reducing the financial barriers to entry at the cost of a requirement for significantly greater hands-on expertise [470–472].

The adaptive mutations that appear in both serialised and continuous ALE occur through the natural mutations occurring during DNA replication in growing cell populations. Although DNA replication in microbes is generally of very high fidelity (estimated to be on the order of 10⁻¹⁰ errors per base pair per generation [473] in wild-type strains), the high density of cells during cultivation (10⁸–10¹⁰ per ml) still means that enough mutations will occur to generate strains with a fitness advantage. A higher mutation rate may be desirable, however [474], to increase the rate of adaptation or to allow adaptations towards more specialised phenotypes. Indeed, the mutation rate is itself adaptive [475].
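The arithmetic behind this claim is easily made explicit. The sketch below uses the error rate and cell densities quoted above; the genome size (roughly 4.6 Mbp, as for E. coli K-12) is an added assumption, and the function names are purely illustrative.

```python
ERROR_RATE = 1e-10       # errors per base pair per generation (from the text)
GENOME_SIZE_BP = 4.6e6   # assumption: approximate size of the E. coli K-12 genome


def mutations_per_genome(error_rate=ERROR_RATE, genome_size=GENOME_SIZE_BP):
    """Expected number of new mutations per cell per generation."""
    return error_rate * genome_size


def new_mutations_per_ml(cells_per_ml):
    """Expected number of new mutations arising per ml of culture per generation."""
    return cells_per_ml * mutations_per_genome()


def hits_per_site_per_ml(cells_per_ml, error_rate=ERROR_RATE):
    """Expected number of cells per ml mutated at one given base pair per generation."""
    return cells_per_ml * error_rate


for density in (1e8, 1e9, 1e10):
    print(
        f"{density:.0e} cells/ml: "
        f"{new_mutations_per_ml(density):.1e} new mutations per ml per generation, "
        f"{hits_per_site_per_ml(density):.2f} hits per given site"
    )
```

On these figures fewer than one cell in a thousand acquires any mutation in a given generation, yet at the higher densities any particular base pair is expected to be mutated somewhere in each millilitre roughly once per generation.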

The mutation rate in E. coli has been increased in a principled way through a technique called genome replication engineering assisted continuous evolution (GREACE) [476–478]. In this approach, a plasmid carrying a modified DNA proofreading element (the dnaQ gene) is transformed into the initial strain of interest, and then the transformed cells are subject to continuous ALE. Cells carrying the modified PE plasmid have deficiencies in proofreading ability and, therefore, accumulate mutations at a higher rate than do untransformed cells. Under strong selective pressure, higher mutation rates themselves confer a fitness advantage and the cells carrying the plasmid outcompete those that lose the plasmid. As the deficient proofreading machinery is present on a plasmid as opposed to the genome, once this is removed a strain with the accumulated mutations but a native DNA proofreading system can be recovered, allowing direct use in downstream industrial applications.

Since the initial demonstration of GREACE to generate tolerance to butanol [476], the GREACE methodology has also been extended to S. cerevisiae, substituting the dnaQ gene with an error-prone DNA polymerase from S. cerevisiae. Here it was successfully used to increase the tolerance to acetic acid and reduce the production of acetaldehyde in an ethanol-producing strain [479].

The activity of an enzyme, as expressed in the term Vmax, is the product of two terms, viz. the concentration of the enzyme E and its catalytic turnover rate kcat. Consequently, there are, broadly, two ways to speed up an individual step in a metabolic network: (i) increase the amount of catalyst (E) or (ii) increase the activity of each catalyst molecule (kcat). While the former is the more common via well-established promoter engineering methods, we have long taken the view that the latter should be more effective. The reason is simple: to increase an enzyme concentration 10-fold requires the production of 10-fold more protein, and this is not always possible (see above). Indeed, especially for membrane proteins, the available real estate may be especially limited [498,499]. In contrast, a 100-fold increase in kcat, which is often easily obtainable in directed evolution programmes, means that one could increase the rate of an individual step by 10-fold while using ten times less of the relevant protein. One example where massive overexpression of a target protein has been used is that of the efflux transporter for serine [499].
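The comparison can be checked with a few lines of arithmetic, taking Vmax as the product of kcat and enzyme concentration. The baseline values of kcat and enzyme concentration below are arbitrary assumptions; only the fold changes matter.

```python
def vmax(kcat, enzyme_conc):
    """Maximal rate of a step: turnover number times enzyme concentration."""
    return kcat * enzyme_conc


# Arbitrary baseline (assumed): kcat in 1/s, enzyme concentration in uM.
BASE_KCAT, BASE_ENZYME = 10.0, 1.0
baseline = vmax(BASE_KCAT, BASE_ENZYME)

# Route (i): 10-fold more enzyme, same kcat.
more_enzyme = vmax(BASE_KCAT, 10 * BASE_ENZYME)

# Route (ii): 100-fold higher kcat, with only one tenth of the enzyme.
better_enzyme = vmax(100 * BASE_KCAT, BASE_ENZYME / 10)

print("Fold increase in Vmax, route (i): ", more_enzyme / baseline)
print("Fold increase in Vmax, route (ii):", better_enzyme / baseline)
```

Both routes give the same 10-fold gain in Vmax, but route (ii) does so with a hundredfold lower demand on the cell's protein budget than route (i).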

Although we are aiming not to focus excessively on specific areas, we mention transporter engineering because (i) transporters normally exhibit considerable flux control for both substrate influx and product efflux, and (ii) they illustrate more generally how an often-neglected scientific area may benefit from significant study [500]. In addition, it is (somewhat astonishingly [501]) still widely believed (or at least assumed) that all kinds of substrates simply cross biological membranes via passage through any bilayer that may be present. The facts are otherwise [176,502–509]. Those references rehearse the fact that even tiny molecules like water [510,511] do not pass unhindered through phospholipid bilayers in real biological membranes (whose protein : lipid ratio by mass is often 3 : 1), but require transporters. Recent examples of transporter engineering for biotechnological purposes include glycolipid surfactants [512] and fatty acids [513]. Flow cytometry can provide a convenient means of assessing the activities of certain transporters [179,180].

Classically, the predominance of diploidy in organisms such as penicillia has been seen as a significant disadvantage, as it prevents the emergence of traits that rely on similar activities in both genomes for expression. Indeed, MCA serves to explain the molecular basis of genetic dominance [141], and the use of haploid cells can provide a much greater signal : noise ratio in genetic competition experiments [199,514–521]. Yeasts such as those of the genus Saccharomyces are of special interest here, since they can sporulate as haploid forms of different mating types that can then interbreed, including interspecifically [522]. Perhaps surprisingly, the effects of this on transcription can be quite modest [523].

As noted above, natural evolution tends to select for growth rate rather than growth yield [139]. However, typically if the product is not directly growth-associated (e.g. as with ‘secondary’ metabolites in idiophase [10,524]) or with two-stage fed-batch regimes where a growth phase is followed by a production phase, one wants cells not to grow at the expense of making product [525,526]. Certainly, ‘dormant’ (non-replicating) cells can be quite active metabolically [527–530]. Consequently, although not usually a focus of biotechnology, it remains the case that the more time cells spend in a fermentor non-productively the less good the process. This has led to the consideration of hosts such as Vibrio natriegens [50,531–541], whose optimal doubling time can be as little as 7 min, some threefold quicker than the widely quoted 20 min for E. coli in rich media. Whether or not organisms such as V. natriegens turn out to be valuable production hosts, there is no doubt that understanding how to make cell growth quicker might help enhance the rates of recombinant protein production. Turbidostats [542,543] could be seen as a ‘revved-up’ version of ALE in that they too select for (and demonstrate the levels of any) growth rate enhancement. However, they remain a surprisingly under-utilised system for manipulating microbial physiology, despite many advantages [466,467,544]. Continuing the theme of comparative ‘growth-omics’, the growth rates of yeasts are significantly slower than those of bacteria, the record (in terms of rate of biomass doubling) apparently being Kluyveromyces marxianus with a doubling time of some 52 min [498] (a growth rate approximately twice that of S. cerevisiae [541,545–547]). This was achieved [498] via a different kind of growth rate selection in a kind of ‘turbidostat’ called a pHauxostat [548–550].
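For comparison across organisms, the doubling times quoted above convert to specific growth rates via the standard relation μ = ln 2 / t_d. The short sketch below does this for the three figures given in the text; the function and dictionary names are illustrative.

```python
import math


def specific_growth_rate(doubling_time_min):
    """Specific growth rate (per hour) from a doubling time in minutes."""
    return math.log(2) / (doubling_time_min / 60.0)


# Doubling times (minutes) as quoted in the text.
DOUBLING_TIMES_MIN = {"V. natriegens": 7, "E. coli": 20, "K. marxianus": 52}

for organism, doubling_time in DOUBLING_TIMES_MIN.items():
    rate = specific_growth_rate(doubling_time)
    print(f"{organism}: t_d = {doubling_time} min, mu = {rate:.2f} per hour")
```

Because growth is exponential, these differences compound quickly: over the same period a faster-growing host goes through proportionally more doublings.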

Consequently, selection for faster growth rates and the concomitant analysis of gene expression changes [546,551–555] would seem to be a powerful means of understanding how to improve cellular performance.

At one level, the fact that growth rates [219] and expression profiles [198,556–558] differ as different enzymes are expressed at different levels in different growth media is trivial, and essentially describes the whole of microbial physiology. At another level, it is far from trivial because medium optimisation represents yet another combinatorial search problem [559]. If the optimal concentration of each constituent is considered to lie within a known range of two orders of magnitude, and is to be located to within a twofold concentration range, each constituent could take at least 6 values (simply because 100 lies between 2⁶ and 2⁷). With 20 medium constituents, there is then a ‘search space’ of some 6²⁰ (∼4 × 10¹⁵) recipes in which to find the optimum. Even just taking metal ions, and noting that approximately half of all enzymes are metalloenzymes [560–562], it is clear that organisms have significant preferences for particular levels of metal ions [563]. In an early example, Weuster-Botz and Wandrey [564] used a genetic algorithm to increase the productivity of formate dehydrogenase in an established fermentation by more than 50%, finding that Ca++, Mn++, Zn++, Cu++, and Co++ had all been used at excessive levels previously. Obviously, the optimum can also change with the host genotype, so is not fixed even for a given species. Consequently, we feel that automated medium optimisation algorithms should also be at the heart of any host engineering programme. As a classical combinatorial optimisation problem [1,95], this is arguably best attacked by evolutionary algorithms (e.g. [61,62,565–567]); Link and Weuster-Botz [559] give an excellent summary of their applications in medium optimisation, including the rather infrequent cases (e.g. [568]) in which multiple objectives are to be optimised.
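The arithmetic of this search space, and the general shape of an evolutionary search over it, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any of the cited studies: the objective `toy_productivity` is a hypothetical stand-in for a measured productivity, and the population size, mutation rate and number of generations are arbitrary choices.

```python
import math
import random


def levels_per_constituent(fold_range=100.0, resolution=2.0):
    """Number of whole `resolution`-fold steps that fit inside `fold_range`."""
    return math.floor(math.log(fold_range) / math.log(resolution))


n_levels = levels_per_constituent()      # twofold steps across a 100-fold range
n_constituents = 20
search_space = n_levels ** n_constituents


def toy_productivity(recipe, optimum):
    """Hypothetical stand-in for a measured productivity; it peaks at `optimum`."""
    return -sum((a - b) ** 2 for a, b in zip(recipe, optimum))


def evolve(pop_size=30, generations=40, mutation_rate=0.05, seed=0):
    """Minimal generational evolutionary algorithm over discrete medium recipes."""
    rng = random.Random(seed)
    # The 'true' optimum is hidden from the search; it only defines the toy objective.
    optimum = [rng.randrange(n_levels) for _ in range(n_constituents)]

    def score(recipe):
        return toy_productivity(recipe, optimum)

    population = [[rng.randrange(n_levels) for _ in range(n_constituents)]
                  for _ in range(pop_size)]
    initial_best = max(score(r) for r in population)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        next_population = ranked[:2]     # elitism: keep the two best recipes
        while len(next_population) < pop_size:
            # Binary tournaments choose two parents.
            parent_a = max(rng.sample(population, 2), key=score)
            parent_b = max(rng.sample(population, 2), key=score)
            # Uniform crossover, then occasional random resetting of a level.
            child = [rng.choice(genes) for genes in zip(parent_a, parent_b)]
            child = [rng.randrange(n_levels) if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
    final_best = max(score(r) for r in population)
    recipes_scored = pop_size * (generations + 1)  # upper bound on distinct recipes
    return initial_best, final_best, recipes_scored


initial_best, final_best, recipes_scored = evolve()
print(f"levels per constituent: {n_levels}; search space: {search_space:.2e}")
print(f"best score: {initial_best} -> {final_best} using <= {recipes_scored} recipes")
```

The point of the sketch is one of scale: the search assesses at most some 10³ recipes out of roughly 4 × 10¹⁵, yet elitism ensures that the best recipe found never gets worse from one generation to the next. In a real programme each evaluation would of course be an experiment, not a function call.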

The expression of individual proteins can vary considerably from cell to cell, even within a nominally homogeneous or axenic culture of an isogenic organism, let alone in cultures that are explicitly differentiated (e.g. [569–572]). This is also becoming ever clearer in differentiated organisms via the emerging cell map projects (e.g. [573]). As single-cell transcriptomics, proteomics, and metabolomics become possible, and individual cells are easily sorted in a fluorescence-activated cell sorter, one can contemplate studies in which the expression profiles even of large numbers of nominally isogenic cells are compared with their productivity simultaneously. This may even include understanding of the spatial distribution of proteins within individual cells [574]. One can also imagine a far greater use of the methods of chemical genomics in affecting and understanding cellular behaviour; in this regard, the strategy of chemically induced selective protein degradation [575–580] seems likely to be of significant value.

We have purposely avoided focus on the production of any specific target molecules, since our aim is to help develop the BioEconomy generally. This said, the growth of ‘AI’ and deep learning alluded to above has already shown profound benefits in identifying chemical (e.g. [581–585]) and biosynthetic (e.g. [586,587]) pathways, while our own work has developed deep learning methods for molecular generation [588] and molecular similarity [589], for navigating chemical space in a principled way [238], and in particular for predicting the structure of small molecules from their high-resolution mass spectra [239]. In this latter work, we developed a deep neural network with some 400 million interconnections [239], a number that just 3 years ago (writing in July 2021) would have been the largest published. Such has been the growth of large networks (approaching 1% of the interconnections in the human brain) that that number is now too low by a factor of more than 1000 [234], necessitating the development of specialist hardware and software to deal with it. Innovations in such kinds of computer engineering, including e.g. in optical computing, will be of considerable benefit. With these large networks has come the question of interpreting precisely how they are doing what they do so well (so-called ‘explainable AI’ or XAI [590–594]). XAI will of necessity lead both to better understanding and to sparser networks, and is an important part of the automation [595–598] (not covered here) that will help to speed up the DBTL cycle enormously.

Classically, electronic circuits were and are predictable because the input/output characteristics of the components are known, and because their wiring diagrams are expertly and precisely controlled by their designers. None of these facts is presently true of biology [599,600], and much of the future in both ‘pure’ organismal bioscience and in biotechnology will thus be about ‘making biology predictable’ [30].

This has been a purposely high-level overview of some of the possibilities in host engineering predicated on genome-wide analyses. Our main aim has been to draw attention to these developments, and to some of the means by which readers who are only loosely acquainted with them can incorporate these methods into their own work.

Take-home messages include:

• Host engineering, like directed protein evolution, is a combinatorial search problem.

• Every enzyme potentially has an optimal expression level for every process.

• This is not normally its maximal level, since the maximum amount of protein a cell can produce is fixed, including for a given growth rate; protein synthesis is largely a zero-sum game.

• Changes in the individual concentrations of most enzymes at their operating point necessarily have little effect on fluxes.

• Some areas of transcription and translation effect a more global control and thus can have greater effects and hence serve as better targets for host engineering.

• kcat is a much better target for host and protein engineering than is Vmax.

• Modern methods of modelling, including deep learning, are beginning to provide the ability to assess desirable changes in silico, as a prelude to developing a fully predictive biology.
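The message that changing the concentration of most individual enzymes has little effect on flux follows from the flux-control summation theorem of MCA, whereby the flux-control coefficients of all the enzymes in a system sum to 1. A minimal numerical sketch is given below. It assumes a hypothetical linear pathway of ten reversible first-order steps with identical, arbitrarily chosen parameters; the function names and all parameter values are ours and are not taken from any real pathway.

```python
import math


def pathway_flux(enzymes, k=1.0, K=1.0, s0=10.0, sn=0.1):
    """Steady-state flux through a linear chain of reversible first-order steps.

    Step i has rate v_i = e_i * k * (S_{i-1} - S_i / K), with the first (s0) and
    last (sn) metabolite concentrations held fixed. All values are illustrative.
    """
    resistance = 0.0
    cumulative_K = 1.0  # product of the equilibrium constants of preceding steps
    for e in enzymes:
        resistance += 1.0 / (e * k * cumulative_K)
        cumulative_K *= K
    driving_force = s0 - sn / cumulative_K
    return driving_force / resistance


def flux_control_coefficients(enzymes, rel_step=1e-6):
    """Estimate C_i = dlnJ/dln(e_i) for each enzyme by central finite differences."""
    coefficients = []
    for i in range(len(enzymes)):
        up = list(enzymes)
        up[i] *= 1 + rel_step
        down = list(enzymes)
        down[i] *= 1 - rel_step
        dlnJ = math.log(pathway_flux(up)) - math.log(pathway_flux(down))
        dlne = math.log(1 + rel_step) - math.log(1 - rel_step)
        coefficients.append(dlnJ / dlne)
    return coefficients


enzymes = [1.0] * 10                      # ten enzymes at equal concentration
coefficients = flux_control_coefficients(enzymes)
total_control = sum(coefficients)         # summation theorem: expected to be ~1

# Effect on flux of doubling the concentration of a single enzyme.
doubled = list(enzymes)
doubled[4] *= 2
flux_gain = pathway_flux(doubled) / pathway_flux(enzymes)

print(f"sum of flux-control coefficients: {total_control:.6f}")
print(f"flux gain on doubling one enzyme: {flux_gain:.4f}")
```

In this toy system each of the ten enzymes has a flux-control coefficient of about 0.1, so doubling the amount of any one of them raises the flux by only about 5%. With more steps, or with control distributed unevenly, the share available to most individual enzymes is smaller still.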

The success of these messages will be judged by the rapidity with which the strategies they contain are adopted.

Open access for this article was enabled by the participation of University of Liverpool in an all-inclusive Read & Publish pilot with Portland Press and the Biochemical Society under a transformative agreement with JISC.

The authors declare that there are no competing interests associated with the manuscript.

L.J.M. and D.B.K. are funded by the Novo Nordisk Foundation (grant NNF20CC0035580). Present funding also includes the UK BBSRC projects BB/R014744/1 (with GSK) and BB/T017481/1. We apologise to authors whose contributions were not included due to lack of space.

• ALE, adaptive laboratory evolution
• ARTM, atmospheric and room temperature plasma mutagenesis
• DBTL, design–build–test–learn
• epWGA, error-prone whole-genome amplification
• FBA, flux balance analysis
• MCA, metabolic control analysis
• SPs, signal peptide sequences
• TFs, transcription factors
• WT, wild type

1. Kell, D.B. (2012) Scientific discovery as a combinatorial optimisation problem: how best to navigate the landscape of possible experiments? Bioessays 34, 236–244
2. Kell, D.B. and Westerhoff, H.V. (1986) Metabolic control theory: its role in microbiology and biotechnology. FEMS Microbiol. Rev. 39, 305–320
3. Kell, D.B. and Westerhoff, H.V. (1986) Towards a rational approach to the optimization of flux in microbial biotransformations. Trends Biotechnol. 4, 137–142
4. Brown, G.C. (1991) Total cell protein concentration as an evolutionary constraint on the metabolic control distribution in cells. J. Theor. Biol. 153, 195–203
5. Cornish-Bowden, A., Hofmeyr, J.-H.S. and Cárdenas, M.L. (1995) Strategies for manipulating metabolic fluxes in biotechnology. Bioorg. Chem. 23, 439–449
6. Heinrich, R., Schuster, S. and Holzhütter, H.G. (1991) Mathematical analysis of enzymatic reaction systems using optimization principles. Eur. J. Biochem. 201, 1–21
7. Heinrich, R. and Schuster, S. (1996) The Regulation of Cellular Systems, Chapman & Hall, New York, NY
8. Fell, D.A. (1998) Increasing the flux in metabolic pathways: a metabolic control analysis perspective. Biotechnol. Bioeng. 58, 121–124
9. Zhang, M.M., Wang, Y., Anga, E.L. and Zhao, H. (2015) Engineering microbial hosts for production of bacterial natural products. Nat. Prod. Rep. 33, 963
10. Wang, G., Kell, D.B. and Borodina, I. (2021) Harnessing the yeast Saccharomyces cerevisiae for the production of fungal secondary metabolites. Essays Biochem. 65, 277–291
11. Jun, S., Si, F., Pugatch, R. and Scott, M. (2018) Fundamental principles in bacterial physiology-history, recent progress, and the future with focus on cell size control: a review. Rep. Prog. Phys. 81, 056601
12. Bruggeman, F.J., Planqué, R., Molenaar, D. and Teusink, B. (2020) Searching for principles of microbial physiology. FEMS Microbiol. Rev. 44, 821–844
13. Currin, A., Swainston, N., Day, P.J. and Kell, D.B. (2015) Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem. Soc. Rev. 44, 1172–1239
14. Jensen, M.K. and Keasling, J.D. (2018) Synthetic Metabolic Pathways: Methods and Protocols, Humana Press, New York, NY
15. Nielsen, J. and Keasling, J.D. (2011) Synergies between synthetic biology and metabolic engineering. Nat. Biotechnol. 29, 693–695
16. Way, J.C., Collins, J.J., Keasling, J.D. and Silver, P.A. (2014) Integrating biological redesign: where synthetic biology came from and where it needs to go. Cell 157, 151–161
17. Redden, H., Morse, N. and Alper, H.S. (2015) The synthetic biology toolbox for tuning gene expression in yeast. FEMS Yeast Res. 15, 1–10
18. de Lorenzo, V., Prather, K.L., Chen, G.Q., O'Day, E., von Kameke, C., Oyarzun, D.A. et al. (2018) The power of synthetic biology for bioproduction, remediation and pollution control: the UN's sustainable development goals will inevitably require the application of molecular biology and biotechnology on a global scale. EMBO Rep. 19, e45658
19. Katz, L., Chen, Y.Y., Gonzalez, R., Peterson, T.C., Zhao, H. and Baltz, R.H. (2018) Synthetic biology advances and applications in the biotechnology industry: a perspective. J. Ind. Microbiol. Biotechnol. 45, 449–461
20. Freemont, P.S. (2019) Synthetic biology industry: data-driven design is creating new opportunities in biotechnology. Emerg. Top. Life Sci. 3, 651–657
21. Clarke, L. and Kitney, R. (2020) Developing synthetic biology for industrial biotechnology applications. Biochem. Soc. Trans. 48, 113–122
22. Zhang, Y., Ding, W., Wang, Z., Zhao, H. and Shi, S. (2021) Development of host-orthogonal genetic systems for synthetic biology. 5, e2000252
23. Wang, T., Ma, X., Du, G. and Chen, J. (2012) Overview of regulatory strategies and molecular elements in metabolic engineering of bacteria. Mol. Biotechnol. 52, 300–308
24. Zielinski, D.C., Patel, A. and Palsson, B.O. (2020) The expanding computational toolbox for engineering microbial phenotypes at the genome scale. Microorganisms 8, 2050
25. Goodacre, R., Trew, S., Wrigley-Jones, C., Saunders, G., Neal, M.J., Porter, N. et al. (1995) Rapid and quantitative analysis of metabolites in fermentor broths using pyrolysis mass spectrometry with supervised learning: application to the screening of Penicillium chrysogenum fermentations for the overproduction of penicillins. Anal. Chim. Acta 313, 25–43
26. Ng, A.Y. and Jordan, M.I. (2001) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Proc. NIPS 14, 841–848
27. Baggenstoss, P.M. (2020) The Projected Belief Network Classfier: both Generative and Discriminative. arXiv, 2008.06434
28. Verma, V.K., Liang, K.J., Mehta, N., Rai, P. and Carin, L. (2021) Efficient feature transformations for discriminative and generative continual learning. arXiv, 2103.13558
29. Kell, D.B. and Oliver, S.G. (2004) Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. Bioessays 26, 99–105
30. Kell, D.B. and Knowles, J.D. (2006) The role of modeling in systems biology. In System Modeling in Cellular Biology: From Concepts to Nuts and Bolts (Szallasi, Z., Stelling, J. and Periwal, V., eds), pp. 3–18, MIT Press, Cambridge, U.K.
31. Kell, D.B., Samanta, S. and Swainston, N. (2020) Deep learning and generative methods in cheminformatics and chemical biology: navigating small molecule space intelligently. Biochem. J. 477, 4559–4580
32. Abid, M.A., Hedhli, I. and Gagné, C. (2021) A generative model for hallucinating diverse versions of super resolution images. arXiv, 2102.06624
33. Dupont, E., Teh, Y.W. and Doucet, A. (2021) Generative models as distributions of functions. arXiv, 2102.04776
34. Lamb, A. (2021) A brief introduction to generative models. arXiv, 2103.00265
35. Ruthotto, L. and Haber, E. (2021) An introduction to deep generative modeling. arXiv, 2103.05180
36. Jiménez-Luna, J., Grisoni, F., Weskamp, N. and Schneider, G. (2021) Artificial intelligence in drug discovery: recent advances and future perspectives. Expert Opin. Drug Discov. 16, 949–959
37. Biswas, S., Khimulya, G., Alley, E.C., Esvelt, K.M. and Church, G.M. (2021) Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396
38. Wu, Z., Johnston, K.E., Arnold, F.H. and Yang, K.K. (2021) Protein sequence design with deep generative models. arXiv, 2104.04457
39. Hie, B.L. and Yang, K.K. (2021) Adaptive machine learning for protein engineering. arXiv, 2106.05466
40. Li, G., Qin, Y., Fontaine, N.T., Ng Fuk Chong, M., Maria-Solano, M.A., Feixas, F. et al. (2021) Machine learning enables selection of epistatic enzyme mutants for stability against unfolding and detrimental aggregation. Chembiochem 22, 904–914
41. Swainston, N., Smallbone, K., Mendes, P., Kell, D.B. and Paton, N.W. (2011) The SuBliMinaL Toolbox: automating steps in the reconstruction of metabolic networks. Integrative Bioinf. 8, 186
42. Kell, D.B. (2006) Metabolomics, modelling and machine learning in systems biology: towards an understanding of the languages of cells. The 2005 Theodor Bücher lecture. FEBS J. 273, 873–894
43. Knight, C.G., Platt, M., Rowe, W., Wedge, D.C., Khan, F., Day, P. et al. (2009) Array-based evolution of DNA aptamers allows modelling of an explicit sequence-fitness landscape. Nucleic Acids Res. 37, e6
44. Ghatak, S., King, Z.A., Sastry, A. and Palsson, B.O. (2019) The y-ome defines the 35% of Escherichia coli genes that lack experimental evidence of function. Nucleic Acids Res. 47, 2446–2454
45. Rudd, K.E. (1998) Linkage map of Escherichia coli K-12, edition 10: the physical map. Microbiol. Mol. Biol. Rev. 62, 985–1019
46. Kudla, G., Murray, A.W., Tollervey, D. and Plotkin, J.B. (2009) Coding-sequence determinants of gene expression in Escherichia coli. Science 324, 255–258
47. Plotkin, J.B. and Kudla, G. (2011) Synonymous but not the same: the causes and consequences of codon bias. Nat. Rev. Genet. 12, 32–42
48. Lu, P., Vogel, C., Wang, R., Yao, X. and Marcotte, E.M. (2007) Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat. Biotechnol. 25, 117–124
49. Boël, G., Letso, R., Neely, H., Price, W.N., Wong, K.H., Su, M. et al. (2016) Codon influence on protein expression in E. coli correlates with mRNA levels. Nature 529, 358–363
50. Eichmann, J., Oberpaul, M., Weidner, T., Gerlach, D. and Czermak, P. (2019) Selection of high producers from combinatorial libraries for the production of recombinant proteins in Escherichia coli and Vibrio natriegens. Front. Bioeng. Biotechnol. 7, 254
51. Tuller, T., Waldman, Y.Y., Kupiec, M. and Ruppin, E. (2010) Translation efficiency is determined by both codon bias and folding energy. 107, 3645–3650
52. Schmitz, A. and Zhang, F. (2021) Massively parallel gene expression variation measurement of a synonymous codon library. BMC Genom. 22, 149
53. Swainston, N., Currin, A., Day, P.J. and Kell, D.B. (2014) Genegenie: optimised oligomer design for directed evolution. Nucleic Acids Res. 12, W395–W400
54. Caspers, M., Brockmeier, U., Degering, C., Eggert, T. and Freudl, R. (2010) Improvement of Sec-dependent secretion of a heterologous model protein in Bacillus subtilis by saturation mutagenesis of the N-domain of the AmyE signal peptide. Appl. Microbiol. Biotechnol. 86, 1877–1885
55. Zrimec, J., Börlin, C.S., Buric, F., A.S., Chen, R., Siewers, V. et al. (2020) Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat. Commun. 11, 6141
56. Moore, J.C., Jin, H.M., Kuchner, O. and Arnold, F.H. (1997) Strategies for the in vitro evolution of protein function: enzyme evolution by random recombination of improved sequences. J. Mol. Biol. 272, 336–347
57. Berland, M., Offmann, B., Andre, I., Remaud-Simeon, M. and Charton, P. (2014) A web-based tool for rational screening of mutants libraries using ProSAR. Protein Eng. Des. Sel. 27, 375–381
58. Chen, H., Borjesson, U., Engkvist, O., Kogej, T., Svensson, M.A., Blomberg, N. et al. (2009) ProSAR: a new methodology for combinatorial library design. J. Chem. Inf. Model. 49, 603–614
59. Fox, R.J., Davis, S.C., Mundorff, E.C., Newman, L.M., Gavrilovic, V., Ma, S.K. et al. (2007) Improving catalytic function by ProSAR-driven enzyme evolution. Nat. Biotechnol. 25, 338–344
60. Zaugg, J., Gumulya, Y., Gillam, E.M. and Boden, M. (2014) Computational tools for directed evolution: a comparison of prospective and retrospective strategies. Methods Mol. Biol. 1179, 315–333
61. Bäck, T., Fogel, D.B. and Michalewicz, Z. (1997) Handbook of Evolutionary Computation, IOP Publishing/Oxford University Press, Oxford, U.K.
62. Reeves, C.R. and Rowe, J.E. (2002) Genetic Algorithms - Principles and Perspectives: A Guide to GA Theory, Dordrecht, the Netherlands
63. Onwubolu, G.C. and Davendra, D. (2010) Differential Evolution: A Handbook for Global Permutation-Based Combinatorial Optimization, Springer, Berlin, Germany
64. Ashlock, D. (2006) Evolutionary Computation for Modeling and Optimization, Springer, New York, NY
65. Coello, C., van Veldhuizen, C.A., and Lamont, D.A. and B, G. (2002) Evolutionary Algorithms for Solving Multi-Objective Problems, New York, NY
66. Fogel, D.B. (2000) Evolutionary Computation: Toward A new Philosophy of Machine Intelligence, IEEE Press, Piscataway, NJ
67. Iba, H. and Noman, N. (2020) Deep Neural Evolution: Deep Learning with Evolutionary Computation, Springer, Berlin, Germany
68. O'Hagan, S., Knowles, J. and Kell, D.B. (2012) Exploiting genomic knowledge in optimising molecular breeding programmes: algorithms from evolutionary computing. PLoS One 7, e48862
69. Ventura, S. and Luna, J.M. (2016) Pattern Mining with Evolutionary Algorithms, Springer, Berlin, Germany
70. Arnold, F.H. and Volkov, A.A. (1999) Directed evolution of biocatalysts. Curr. Opin. Chem. Biol. 3, 54–59
71. Bloom, J.D. and Arnold, F.H. (2009) In the light of directed evolution: pathways of adaptive protein evolution. 106 Suppl 1, 9995–10000
72. Renata, H., Wang, Z.J. and Arnold, F.H. (2015) Expanding the enzyme universe: accessing non-natural reactions by mechanism-guided directed evolution. Angew. Chem. Int. Ed. Engl. 54, 3351–3367
73. Tracewell, C.A. and Arnold, F.H. (2009) Directed enzyme evolution: climbing fitness peaks one amino acid at a time. Curr. Opin. Chem. Biol. 13, 3–9
74. Bunzel, H.A., Anderson, J.L.R. and Mulholland, A.J. (2021) Designing better enzymes: insights from directed evolution. Curr. Opin. Struct. Biol. 67, 212–218
75. Engqvist, M.K.M. and Rabe, K.S. (2019) Applications of protein engineering and directed evolution in plant research. Plant Physiol. 179, 907–917
76. Frey, R., Hayashi, T. and Buller, R.M. (2019) Directed evolution of carbon-hydrogen bond activating enzymes. Curr. Opin. Biotechnol. 60, 29–38
77. Kan, A. and Joshi, N.S. (2019) Towards the directed evolution of protein materials. MRS Commun. 9, 441–455
78. Morrison, M.S., Podracky, C.J. and Liu, D.R. (2020) The developing toolkit of continuous directed evolution. Nat. Chem. Biol. 16, 610–619
79. Qu, G., Li, A., Acevedo-Rocha, C.G., Sun, Z. and Reetz, M.T. (2020) The crucial role of methodology development in directed evolution of selective enzymes. Angew. Chem. Int. Ed. Engl. 59, 13204–13231
80. Yang, K.K., Wu, Z. and Arnold, F.H. (2019) Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694
81. Zeymer, C. and Hilvert, D. (2018) Directed evolution of protein catalysts. Annu. Rev. Biochem. 87, 131–157
82. Araya, C.L. and Fowler, D.M. (2011) Deep mutational scanning: assessing protein function on a massive scale. Trends Biotechnol. 29, 435–442
83. Bloom, J.D. (2015) Software for the analysis and visualization of deep mutational scanning data. BMC Bioinform. 16, 168
84. Fowler, D.M., Stephany, J.J. and Fields, S. (2014) Measuring the activity of protein variants on a large scale using deep mutational scanning. Nat. Protoc. 9, 2267–2284
85. Fowler, D.M. and Fields, S. (2014) Deep mutational scanning: a new style of protein science. Nat. Methods 11, 801–807
86. Klesmith, J.R., Bacik, J.P., Wrenbeck, E.E., Michalczyk, R. and T.A. (2017) Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning. 114, 2265–2270
87. Livesey, B.J. and Marsh, J.A. (2020) Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol. Syst. Biol. 16, e9380
88. Mehlhoff, J.D. and Ostermeier, M. (2020) Biological fitness landscapes by deep mutational scanning. Methods Enzymol. 643, 203–224
89. Munro, D. and Singh, M. (2020) Demask: a deep mutational scanning substitution matrix and its use for variant impact prediction. Bioinformatics 6, 5322–5329
90. Reeb, J., Wirth, T. and Rost, B. (2020) Variant effect predictions capture some aspects of deep mutational scanning experiments. BMC Bioinform. 21, 107
91. Behrendt, L., Stein, A., Shah, S.A., Zengler, K., Sørensen, S.J., Lindorff-Larsen, K. et al. (2018) Deep mutational scanning by FACS-sorting of encapsulated E. coli micro-colonies. bioRxiv, 274753
92. Katz, N., Tripto, E., Granik, N., Goldberg, S., Atar, O., Yakhini, Z. et al. (2021) Overcoming the design, build, test bottleneck for synthesis of nonrepetitive protein-RNA cassettes. Nat. Commun. 12, 1576
93. Rowe, W., Platt, M., Wedge, D., Day, P.J., Kell, D.B. and Knowles, J. (2010) Analysis of a complete DNA-protein affinity landscape. J. R. Soc. Interface 7, 397–408
94. Kell, D.B. (2017) Evolutionary algorithms and synthetic biology for directed evolution: commentary on ‘on the mapping of genotype to phenotype in evolutionary algorithms’ by Peter A. Whigham, Grant Dick, and James Maclaurin. Genet. Program. Evol. Mach. 18, 373–378
95. Kell, D.B. and Lurie-Luke, E. (2015) The virtue of innovation: innovation through the lenses of biological evolution. J. R. Soc. Interface 12, 20141183
96. Wright, S. (1932) The roles of mutation, inbreeding, crossbreeding and selection in evolution. In Proceedings of the Sixth International Congress of Genetics (Jones, D.F., ed.), pp. 356–366, Genetics Society of America, Austin/Ithaca, TX/NY
97. Marks, D.S., Hopf, T.A. and Sander, C. (2012) Protein structure prediction from sequence variation. Nat. Biotechnol. 30, 1072–1080
98. Kosciolek, T. and Jones, D.T. (2014) De novo structure prediction of globular proteins aided by sequence variation-derived contacts. PLoS One 9, e92197
99. Kosciolek, T. and Jones, D.T. (2016) Accurate contact predictions using covariation techniques and machine learning. Proteins 84 Suppl 1, 145–151
100. Hopf, T.A., Green, A.G., Schubert, B., Mersmann, S., Scharfe, C.P.I., Ingraham, J.B. et al. (2019) The EVcouplings python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584
101. Jones, D.T., Buchan, D.W., Cozzetto, D. and Pontil, M. (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28, 184–190
102. Rawi, R., Mall, R., Kunji, K., El Anbari, M., Aupetit, M., Ullah, E. et al. (2016) COUSCOus: improved protein contact prediction using an empirical Bayes covariance estimator. BMC Bioinform. 17, 533
103. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589
104. Richter, H. and Engelbrecht, A.P. (2014) Recent Advances in the Theory and Application of Fitness Landscapes, Springer, Berlin, Germany
105. Aguilar-Rodríguez, J., Payne, J.L. and Wagner, A. (2017) A thousand empirical adaptive landscapes and their navigability. Nat. Ecol. Evol. 1, 45
106. Blanco, C., Janzen, E., Pressman, A., Saha, R. and Chen, I.A. (2019) Molecular fitness landscapes from high-coverage sequence profiling. Annu. Rev. Biophys. 48, 1–18
107. Kauffman, S.A. (1993) The Origins of Order, Oxford University Press, Oxford, U.K.
108. Kauffman, S.A. and W.G. (1995) Search strategies for applied molecular evolution. J. Theor. Biol. 173, 427–440
109. Aita, T., Hayashi, Y., Toyota, H., Husimi, Y., Urabe, I. and Yomo, T. (2007) Extracting characteristic properties of fitness landscape from in vitro molecular evolution: a case study on infectivity of fd phage to E. coli. J. Theor. Biol. 246, 538–550
110. Mater, A.C., Sandhu, M. and Jackson, C. (2020) The NK landscape as a versatile benchmark for machine learning driven protein engineering. bioRxiv, 2020.2009.2030.319780
111. Hwang, S., Schmiegelt, B., Ferretti, L. and Krug, J. (2018) Universality classes of interaction structures for NK fitness landscapes. J. Stat. Phys. 172, 226–278
112. Zagorski, M., Burda, Z. and Waclaw, B. (2016) Beyond the hypercube: evolutionary accessibility of fitness landscapes with realistic mutational networks. PLoS Comput. Biol. 12, e1005218
113. Weinreich, D.M., Watson, R.A. and Chao, L. (2005) Perspective: sign epistasis and genetic constraint on evolutionary trajectories. Evolution 59, 1165–1174
114. de Visser, J.A.G.M., Cooper, T.F. and Elena, S.F. (2011) The causes of epistasis. Proc. Biol. Sci. 278, 3617–3624
115. de Visser, J.A.G.M. and Krug, J. (2014) Empirical fitness landscapes and the predictability of evolution. Nat. Rev. Genet. 15, 480–490
116. Acevedo-Rocha, C.G., Li, A., D'Amore, L., Hoebenreich, S., Sanchis, J., Lubrano, P. et al. (2021) Pervasive cooperative mutational effects on multiple catalytic enzyme traits emerge via long-range conformational dynamics. Nat. Commun. 12, 1621
117. Weinreich, D.M., Lan, Y.H., Jaffe, J. and Heckendorn, R.B. (2018) The influence of higher-order epistasis on biological fitness landscape topography. J. Stat. Phys. 172, 208–225
118. R.M., Kinney, J.B., Walczak, A.M. and Mora, T. (2019) Epistasis in a fitness landscape defined by antibody-antigen binding free energy. Cell Syst. 8, 86–93.e83
119. Gonzalez, C.E. and Ostermeier, M. (2019) Pervasive pairwise intragenic epistasis among sequential mutations in TEM-1 beta-lactamase. J. Mol. Biol. 431, 1981–1992
120. Gillespie, J.H. (1983) A simple stochastic gene substitution model. Theor. Popul. Biol. 23, 202–215
121. Gillespie, J.H. (1984) Molecular evolution over the mutational landscape. Evolution 38, 1116–1129
122. Orr, H.A. (2005) The genetic theory of adaptation: a brief history. Nat. Rev. Genet. 6, 119–127
123. Orr, H.A. (2006) The population genetics of adaptation on correlated fitness landscapes: the block model. Evolution 60, 1113–1124
124. Orr, H.A. (2006) The distribution of fitness effects among beneficial mutations in Fisher's geometric model of adaptation. J. Theor. Biol. 238, 279–285
125. Orr, H.A. (2009) Fitness and its role in evolutionary genetics. Nat. Rev. Genet. 10, 531–539
126. Unckless, R.L. and Orr, H.A. (2009) The population genetics of adaptation: multiple substitutions on a smooth fitness landscape. Genetics 183, 1079–1086
127. Szendro, I.G., Franke, J., de Visser, J.A.G.M. and Krug, J. (2013) Predictability of evolution depends nonmonotonically on population size. 110, 571–576
128. Carneiro, M. and Hartl, D.L. (2011) 107 Suppl 1, 1747–1751
129. Naseri, G. and Koffas, M.A.G. (2020) Application of combinatorial optimization strategies in synthetic biology. Nat. Commun. 11, 2446
130. Swainston, N., Dunstan, M., Jervis, A.J., Robinson, C.J., Carbonell, P., Williams, A.R. et al. (2018) Partsgenie: an integrated tool for optimising and sharing synthetic biology parts. Bioinformatics 34, 2327–2329
131. M., Chao, R., Weisberg, S., Lian, J., Sinha, S. and Zhao, H. (2019) Towards a fully automated algorithm driven platform for biosystems design. Nat. Commun. 10, 5150
132. T., Costello, Z., Workman, K. and Garcia Martin, H. (2020) A machine learning automated recommendation tool for synthetic biology. Nat. Commun. 11, 4879
133. Zhang, J., Petersen, S.D., T., Ramirez, A., Pérez-Manríquez, A., Abeliuk, E. et al. (2020) Combining mechanistic and machine learning models for predictive engineering and optimization of tryptophan metabolism. Nat. Commun. 11, 4880
134. Beaumont, H.J.E., Gallie, J., Kost, C., Ferguson, G.C. and Rainey, P.B. (2009) Experimental evolution of bet hedging. Nature 462, 90–93
135. Kaldalu, N., Hauryliuk, V. and Tenson, T. (2016) Persisters-as elusive as ever. Appl. Microbiol. Biotechnol. 100, 6545–6553
136. Levy, S.F., Ziv, N. and Siegal, M.L. (2012) Bet hedging in yeast by heterogeneous, age-correlated expression of a stress protectant. PLoS Biol. 10, e1001325
137. Kell, D.B., Potgieter, M. and Pretorius, E. (2015) Individuality, phenotypic differentiation, dormancy and ‘persistence’ in culturable bacterial systems: commonalities shared by environmental, laboratory, and clinical microbiology. F1000Research 4, 179
138. Salcedo-Sora, J.E. and Kell, D.B. (2020) A quantitative survey of bacterial persistence in the presence of antibiotics: towards antipersister antimicrobial discovery. Antibiotics 9, 508
139. Westerhoff, H.V., Hellingwerf, K.J. and van Dam, K. (1983) Thermodynamic efficiency of microbial growth is low but optimal for maximal growth rate. 80, 305–309
140. Kacser, H. and Burns, J.A. (1973) The control of flux. In Rate Control of Biological Processes. Symposium of the Society for Experimental Biology (Davies, D.D., ed.), vol. 27, pp. 65–104, Cambridge University Press, Cambridge, U.K.
141
Kacser
,
H.
and
Burns
,
J.A.
(
1981
)
The molecular basis of dominance
.
Genetics
97
,
639
666
142
Kacser
,
H.
(
1983
)
The control of enzyme systems in vivo: elasticity analysis of the steady state
.
Biochem. Soc. Trans.
11
,
35
40
143
Heinrich
,
R.
and
Rapoport
,
T.A.
(
1973
)
Linear theory of enzymatic chains: its application for the analysis of the crossover theorem and of the glycolysis of human erythrocytes
.
Acta Biol. Med. Ger.
31
,
479
494
144
Heinrich
,
R.
and
Rapoport
,
T.A.
(
1974
)
A linear steady-state treatment of enzymatic chains. General properties, control and effector strength
.
Eur. J. Biochem.
42
,
89
95
145
Klipp
,
E.
,
Herwig
,
R.
,
Kowald
,
A.
,
Wierling
,
C.
and
Lehrach
,
H.
(
2005
)
Systems Biology in Practice: Concepts, Implementation and Clinical Application
,
Wiley/VCH
,
Berlin, Germany
146
Palsson
,
B.Ø
. (
2006
)
Systems Biology: Properties of Reconstructed Networks
,
Cambridge University Press
,
Cambridge, U.K.
147
Moreno-Sánchez
,
R.
,
Saavedra
,
E.
,
Rodríguez-Enríquez
,
S.
and
Olin-Sandoval
,
V.
(
2008
)
Metabolic control analysis: a tool for designing strategies to manipulate metabolic pathways
.
J. Biomed. Biotechnol.
2008
,
597913
148
Rand
,
D.A.
(
2008
)
Mapping global sensitivity of cellular network dynamics: sensitivity heat maps and a global summation law
.
J. R. Soc. Interface
5
Suppl 1
,
S59
S69
149
Saltelli
,
A.
,
Ratto
,
M.
,
Andres
,
T.
,
Campolongo
,
F.
,
Cariboni
,
J.
,
Gatelli
,
D.
et al (
2008
)
Global Sensitivity Analysis: the Primer
,
WileyBlackwell
,
New York, NY
150
Fell
,
D.A.
and
Thomas
,
S.
(
1995
)
Physiological control of metabolic flux: the requirement for multisite modulation
.
Biochem. J.
311
,
35
39
151
Oliver
,
S.G.
,
Winson
,
M.K.
,
Kell
,
D.B.
and
Baganz
,
F.
(
1998
)
Systematic functional analysis of the yeast genome
.
Trends Biotechnol.
16
,
373
378
152
Raamsdonk
,
L.M.
,
Teusink
,
B.
,
Broadhurst
,
D.
,
Zhang
,
N.
,
Hayes
,
A.
,
Walsh
,
M.
et al (
2001
)
A functional genomics strategy that uses metabolome data to reveal the phenotype of silent mutations
.
Nat. Biotechnol.
19
,
45
50
153
Kell
,
D.B.
and
Oliver
,
S.G.
(
2016
)
The metabolome 18 years on: a concept comes of age
.
Metabolomics
12
,
148
154
Karim
,
A.S.
,
Dudley
,
Q.M.
,
Juminaga
,
A.
,
Yuan
,
Y.
,
Crowe
,
S.A.
,
Heggestad
,
J.T.
et al (
2020
)
In vitro prototyping and rapid optimization of biosynthetic enzymes for cell design
.
Nat. Chem. Biol.
16
,
912
919
155
Scott
,
M.
,
Gunderson
,
C.W.
,
Mateescu
,
E.M.
,
Zhang
,
Z.
and
Hwa
,
T.
(
2010
)
Interdependence of cell growth and gene expression: origins and consequences
.
Science
330
,
1099
1102
156
Shoval
,
O.
,
Sheftel
,
H.
,
Shinar
,
G.
,
Hart
,
Y.
,
Ramote
,
O.
,
Mayo
,
A.
et al (
2012
)
Evolutionary trade-offs, Pareto optimality, and the geometry of phenotype space
.
Science
336
,
1157
1160
157
Schuetz
,
R.
,
Zamboni
,
N.
,
Zampieri
,
M.
,
Heinemann
,
M.
and
Sauer
,
U.
(
2012
)
Multidimensional optimality of microbial metabolism
.
Science
336
,
601
604
158
Mori
,
M.
,
Schink
,
S.
,
Erickson
,
D.W.
,
Gerland
,
U.
and
Hwa
,
T.
(
2017
)
Quantifying the benefit of a proteome reserve in fluctuating environments
.
Nat. Commun.
8
,
1225
159
Basan
,
M.
,
Honda
,
T.
,
Christodoulou
,
D.
,
Horl
,
M.
,
Chang
,
Y.F.
,
Leoncini
,
E.
et al (
2020
)
A universal trade-off between growth and lag in fluctuating environments
.
Nature
584
,
470
474
160
Peebo
,
K.
,
Valgepea
,
K.
,
Maser
,
A.
,
Nahku
,
R.
,
Adamberg
,
K.
and
Vilu
,
R.
(
2015
)
Proteome reallocation in Escherichia coli with increasing specific growth rate
.
Mol. Biosyst.
11
,
1184
1193
161
Desler
,
C.
,
Hansen
,
T.L.
,
Frederiksen
,
J.B.
,
Marcker
,
M.L.
,
Singh
,
K.K.
and
Juel Rasmussen
,
L.
(
2012
)
Is there a link between mitochondrial reserve respiratory capacity and aging?
J. Aging Res.
2012
,
192503
162
Marchetti
,
P.
,
Fovez
,
Q.
,
Germain
,
N.
,
Khamari
,
R.
and
Kluza
,
J.
(
2020
)
Mitochondrial spare respiratory capacity: mechanisms, regulation, and significance in non-transformed and cancer cells
.
FASEB J.
34
,
13106
13124
163
Rieger
,
M.
,
Käppeli
,
O.
and
Fiechter
,
A.
(
1983
)
The role of limited respiration in the incomplete oxidation of glucose by Saccharomyces cerevisiae
.
J. Gen. Microbiol.
129
,
653
661
164
Sonnleitner
,
B.
and
Käppeli
,
O.
(
1986
)
Growth of Saccharomyces cerevisiae is controlled by its limited respiratory capacity - formulation and verification of a hypothesis
.
Biotechnol. Bioeng.
28
,
927
937
165
van Hoek
,
P.
,
van Dijken
,
J.P.
and
Pronk
,
J.T.
(
1998
)
Effect of specific growth rate on fermentative capacity of baker's yeast
.
Appl. Environ. Microbiol.
64
,
4226
4233
166
Basan
,
M.
,
Hui
,
S.
,
Okano
,
H.
,
Zhang
,
Z.
,
Shen
,
Y.
,
Williamson
,
J.R.
et al (
2015
)
Overflow metabolism in Escherichia coli results from efficient proteome allocation
.
Nature
528
,
99
104
167
Neijssel
,
O.M.
and
Tempest
,
D.W.
(
1976
)
The role of energy-splitting reactions in the growth of Klebsiella aerogenes NCTC 418 in aerobic chemostat culture
.
Arch. Microbiol.
110
,
305
311
168
Allen
,
J.K.
,
Davey
,
H.M.
,
Broadhurst
,
D.
,
Heald
,
J.K.
,
Rowland
,
J.J.
,
Oliver
,
S.G.
et al (
2003
)
High-throughput characterisation of yeast mutants for functional genomics using metabolic footprinting
.
Nat. Biotechnol.
21
,
692
696
169
Kell
,
D.B.
,
Brown
,
M.
,
Davey
,
H.M.
,
Dunn
,
W.B.
,
Spasic
,
I.
and
Oliver
,
S.G.
(
2005
)
Metabolic footprinting and systems biology: the medium is the message
.
Nat. Rev. Microbiol.
3
,
557
565
170
Dörries
,
K.
and
Lalk
,
M.
(
2013
)
Metabolic footprint analysis uncovers strain specific overflow metabolism and D-isoleucine production of Staphylococcus aureus COL and HG001
.
PLoS One
8
,
e81500
171
Paczia
,
N.
,
Nilgen
,
A.
,
Lehmann
,
T.
,
Gätgens
,
J.
,
Wiechert
,
W.
and
Noack
,
S.
(
2012
)
Extensive exometabolome analysis reveals extended overflow metabolism in various microorganisms
.
Microb. Cell Fact.
11
,
122
172
Schmitz
,
K.
,
Peter
,
V.
,
Meinert
,
S.
,
Kornfeld
,
G.
,
Hardiman
,
T.
,
Wiechert
,
W.
et al (
2013
)
Simultaneous utilization of glucose and gluconate in Penicillium chrysogenum during overflow metabolism
.
Biotechnol. Bioeng.
110
,
3235
3243
173
Szenk
,
M.
,
Dill
,
K.A.
and
de Graff
,
A.M.R.
(
2017
)
Why do fast-growing bacteria enter overflow metabolism? Testing the membrane real estate hypothesis
.
Cell Syst.
5
,
95
104
174
Koch
,
A.L.
(
1971
)
The adaptive responses of Escherichia coli to a feast and famine existence
.
Adv. Microb. Physiol.
6
,
147
217
175
Poindexter
,
J.
(
1981
)
Oligotrophy: fast and famine existence
.
Adv. Microb. Ecol.
5
,
63
89
176
Kell
,
D.B.
,
Swainston
,
N.
,
Pir
,
P.
and
Oliver
,
S.G.
(
2015
)
Membrane transporter engineering in industrial biotechnology and whole-cell biocatalysis
.
Trends Biotechnol.
33
,
237
246
177
Kell
,
D.B.
(
2019
) Control of metabolite efflux in microbial cell factories: current advances and future prospects. In
Fermentation Microbiology and Biotechnology
, 4th edn (
El-Mansi
,
E.M.T.
,
Nielsen
,
J.
,
Mousdale
,
D.
,
Allman
,
T.
and
Carlson
,
R.
, eds), pp.
117
138
,
CRC Press
,
Boca Raton, FL
178
Wang
,
G.
,
Møller-Hansen
,
I.
,
Babaei
,
M.
,
D'Ambrosio
,
V.
,
Christensen
,
H.B.
,
Darbani
,
B.
et al (
2021
)
Transportome-wide engineering of Saccharomyces cerevisiae
.
Metab. Eng.
64
,
52
63
179
Jindal
,
S.
,
Yang
,
L.
,
Day
,
P.J.
and
Kell
,
D.B.
(
2019
)
Involvement of multiple influx and efflux transporters in the accumulation of cationic fluorescent dyes by Escherichia coli
.
BMC Microbiol.
19
,
195
.
also bioRxiv 603688v603681
180
Salcedo-Sora
,
J.E.
,
Jindal
,
S.
,
O'Hagan
,
S.
and
Kell
,
D.B.
(
2021
)
A palette of fluorophores that are differentially accumulated by wild-type and mutant strains of Escherichia coli: surrogate ligands for bacterial membrane transporters
.
Microbiology
167
,
001016
181
Anderson
,
C.M.
(
2006
)
The Long Tail: how Endless Choice is Creating Unlimited Demand
,
Random House
,
London, U.K.
182
Bornholdt
,
S.
and
Sneppen
,
K.
(
2000
)
Robustness as an evolutionary principle
.
Proc. R. Soc. B-Biol. Sci.
267
,
2281
2286
183
Morohashi
,
M.
,
Winn
,
A.E.
,
Borisuk
,
M.T.
,
Bolouri
,
H.
,
Doyle
,
J.
and
Kitano
,
H.
(
2002
)
Robustness as a measure of plausibility in models of biochemical networks
.
J. Theor. Biol.
216
,
19
30
184
Kitano
,
H.
(
2004
)
Biological robustness
.
Nat. Rev. Genet.
5
,
826
837
185
Stelling
,
J.
,
Sauer
,
U.
,
Szallasi
,
Z.
,
Doyle
, III,
F.J.
and
Doyle
,
J.
(
2004
)
Robustness of cellular functions
.
Cell
118
,
675
685
186
Wagner
,
A.
(
2005
)
Robustness and Evolvability in Living Systems
,
Princeton University Press
,
Princeton, NJ
187
Ma
,
W.
,
Lai
,
L.
,
Ouyang
,
Q.
and
Tang
,
C.
(
2006
)
Robustness and modular design of the Drosophila segment polarity network
.
Mol. Syst. Biol.
2
,
70
188
Lehár
,
J.
,
Krueger
,
A.
,
Zimmermann
,
G.
and
Borisy
,
A.
(
2008
)
High-order combination effects and biological robustness
.
Mol. Syst. Biol.
4
,
215
189
Gong
,
Z.
,
Nielsen
,
J.
and
Zhou
,
Y.J.
(
2017
)
Engineering robustness of microbial cell factories
.
Biotechnol. J.
12
,
1700014
190
Klug
,
A.
,
Park
,
S.C.
and
Krug
,
J.
(
2019
)
Recombination and mutational robustness in neutral fitness landscapes
.
PLoS Comput. Biol.
15
,
e1006884
191
Donati
,
S.
,
Kuntz
,
M.
,
Pahl
,
V.
,
Farke
,
N.
,
Beuter
,
D.
,
Glatter
,
T.
et al (
2021
)
Multi-omics analysis of CRISPRi-knockdowns identifies mechanisms that buffer decreases of enzymes in E. coli metabolism
.
Cell Syst.
12
,
56
67
192
Schaechter
,
M.
,
Maaløe
,
O.
and
Kjeldgaard
,
N.O.
(
1958
)
Dependency on medium and temperature of cell size and chemical composition during balanced growth of Salmonella typhimurium
.
J. Gen. Microbiol.
19
,
592
606
193
Neidhardt
,
F.C.
,
Ingraham
,
J.L.
and
Schaechter
,
M.
(
1990
)
Physiology of the Bacterial Cell: A Molecular Approach
,
Sinauer Associates
,
Sunderland, MA
194
Björkeroth
,
J.
,
Campbell
,
K.
,
Malina
,
C.
,
Yu
,
R.
,
Di Bartolomeo
,
F.
and
Nielsen
,
J.
(
2020
)
Proteome reallocation from amino acid biosynthesis to ribosomes enables yeast to grow faster in rich media
.
Proc. Natl Acad. Sci. U.S.A.
117
,
21804
21812
195
Tartof
,
K.D.
and
Hobbs
,
C.A.
(
1987
)
Improved media for growing plasmid and cosmid clones
.
Bethesda Res. Lab. Focus
9
,
12
196
Hill
,
W.G.
(
2005
)
A century of corn selection
.
Science
307
,
683
684
197
de Groot
,
D.H.
,
Hulshof
,
J.
,
Teusink
,
B.
,
Bruggeman
,
F.J.
and
Planqué
,
R.
(
2020
)
Elementary growth modes provide a molecular description of cellular self-fabrication
.
PLoS Comput. Biol.
16
,
e1007559
198
Castrillo
,
J.I.
,
Zeef
,
L.A.
,
Hoyle
,
D.C.
,
Zhang
,
N.
,
Hayes
,
A.
,
Gardner
,
D.C.J.
et al (
2007
)
Growth control of the eukaryote cell: a systems biology study in yeast
.
J. Biol.
6
,
4
199
Delneri
,
D.
,
Hoyle
,
D.C.
,
Gkargkas
,
K.
,
Cross
,
E.J.
,
Rash
,
B.
,
Zeef
,
L.
et al (
2008
)
Identification and characterization of high-flux-control genes of yeast through competition analyses in continuous cultures
.
Nat. Genet.
40
,
113
117
200
Reed
,
J.L.
and
Palsson
,
B.Ø
. (
2003
)
Thirteen years of building constraint-based in silico models of Escherichia coli
.
J. Bacteriol.
185
,
2692
2699
201
Herrgård
,
M.J.
,
Swainston
,
N.
,
Dobson
,
P.
,
Dunn
,
W.B.
,
Arga
,
K.Y.
,
Arvas
,
M.
et al (
2008
)
A consensus yeast metabolic network obtained from a community approach to systems biology
.
Nat. Biotechnol.
26
,
1155
1160
202
Thiele
,
I.
,
Swainston
,
N.
,
Fleming
,
R.M.T.
,
Hoppe
,
A.
,
Sahoo
,
S.
,
Aurich
,
M.K.
et al (
2013
)
A community-driven global reconstruction of human metabolism
.
Nat. Biotechnol.
31
,
419
425
203
Thiele
,
I.
,
Sahoo
,
S.
,
Heinken
,
A.
,
Hertel
,
J.
,
Heirendt
,
L.
,
Aurich
,
M.K.
et al (
2020
)
Personalized whole-body models integrate metabolism, physiology, and the gut microbiome
.
Mol. Syst. Biol.
16
,
e8982
204
Covert
,
M.W.
,
Xiao
,
N.
,
Chen
,
T.J.
and
Karr
,
J.R.
(
2008
)
Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli
.
Bioinformatics
24
,
2044
2050
205
Hou
,
J.
,
Tyo
,
K.E.
,
Liu
,
Z.
,
Petranovic
,
D.
and
Nielsen
,
J.
(
2012
)
Metabolic engineering of recombinant protein secretion by Saccharomyces cerevisiae
.
FEMS Yeast Res.
12
,
491
510
206
Chen
,
Y.
and
Nielsen
,
J.
(
2019
)
Energy metabolism controls phenotypes by protein efficiency and allocation
.
Proc. Natl Acad. Sci. U.S.A.
116
,
17592
17597
207
Chen
,
K.
,
Anand
,
A.
,
Olson
,
C.
,
Sandberg
,
T.E.
,
Gao
,
Y.
,
Mih
,
N.
et al (
2021
)
Bacterial fitness landscapes stratify based on proteome allocation associated with discrete aero-types
.
PLoS Comput. Biol.
17
,
e1008596
208
Snoep
,
J.L.
,
Yomano
,
L.P.
,
Westerhoff
,
H.V.
and
Ingram
,
L.O.
(
1995
)
Protein burden in Zymomonas mobilis - negative flux and growth control due to overproduction of glycolytic enzymes
.
Microbiology
141
,
2329
2337
209
Bentley
,
W.E.
,
Mirjalili
,
N.
,
Andersen
,
D.C.
,
Davis
,
R.H.
and
Kompala
,
D.S.
(
1990
)
Plasmid-encoded protein: the principal factor in the ‘metabolic burden’ associated with recombinant bacteria
.
Biotechnol. Bioeng.
35
,
668
681
210
Dong
,
H.
,
Nilsson
,
L.
and
Kurland
,
C.G.
(
1995
)
Gratuitous overexpression of genes in Escherichia coli leads to growth inhibition and ribosome destruction
.
J. Bacteriol.
177
,
1497
1504
211
Yu
,
R.
,
Campbell
,
K.
,
Pereira
,
R.
,
Björkeroth
,
J.
,
Qi
,
Q.
,
Vorontsov
,
E.
et al (
2020
)
Nitrogen limitation reveals large reserves in metabolic and translational capacities of yeast
.
Nat. Commun.
11
,
1881
212
Holms
,
W.H.
,
Hamilton
,
I.D.
and
Mousdale
,
D.
(
1991
)
Improvements to microbial productivity by analysis of metabolic fluxes
.
J. Chem. Technol. Biotechnol.
50
,
139
141
213
Holms
,
H.
(
1996
)
Flux analysis and control of the central metabolic pathways in Escherichia coli
.
FEMS Microbiol. Rev.
19
,
85
116
214
Valgepea
,
K.
,
Peebo
,
K.
,
Adamberg
,
K.
and
Vilu
,
R.
(
2015
)
Lean-proteome strains - next step in metabolic engineering
.
Front. Bioeng. Biotechnol.
3
,
11
215
Glasscock
,
C.J.
,
Lucks
,
J.B.
and
DeLisa
,
M.P.
(
2016
)
Engineered protein machines: emergent tools for synthetic biology
.
Cell Chem. Biol.
23
,
45
56
216
Guirimand
,
G.
,
Kulagina
,
N.
,
Papon
,
N.
,
Hasunuma
,
T.
and
Courdavault
,
V.
(
2021
)
Innovative tools and strategies for optimizing yeast cell factories
.
Trends Biotechnol.
39
,
488
504
217
Lawless
,
C.
,
Holman
,
S.W.
,
Brownridge
,
P.
,
Lanthaler
,
K.
,
Harman
,
V.M.
,
Watkins
,
R.
et al (
2016
)
Direct and absolute quantification of over 1800 yeast proteins via selected reaction monitoring
.
Mol. Cell. Proteom.
15
,
1309
1322
218
Yu
,
R.
,
Vorontsov
,
E.
,
Sihlbom
,
C.
and
Nielsen
,
J.
(
2021
)
Quantifying absolute gene expression profiles reveals distinct regulation of central carbon metabolism genes in yeast
.
eLife
10
,
e65722
219
Nichols
,
R.J.
,
Sen
,
S.
,
Choo
,
Y.J.
,
Beltrao
,
P.
,
Zietek
,
M.
,
Chaba
,
R.
et al (
2011
)
Phenotypic landscape of a bacterial cell
.
Cell
144
,
143
156
220
O'Hagan
,
S.
,
Wright Muelas
,
M.
,
Day
,
P.J.
,
Lundberg
,
E.
and
Kell
,
D.B.
(
2018
)
GeneGini: assessment via the Gini coefficient of reference ‘housekeeping’ genes and diverse human transporter expression profiles
.
Cell Syst.
6
,
230
244
221
Wright Muelas
,
M.
,
Mughal
,
F.
,
O'Hagan
,
S.
,
Day
,
P.J.
and
Kell
,
D.B.
(
2019
)
The role and robustness of the Gini coefficient as an unbiased tool for the selection of Gini genes for normalising expression profiling data
.
Sci. Rep.
9
,
17960
222
Gallagher
,
L.A.
,
Bailey
,
J.
and
Manoil
,
C.
(
2020
)
Ranking essential bacterial processes by speed of mutant death
.
Proc. Natl Acad. Sci. U.S.A.
117
,
18010
18017
223
Alper
,
H.
,
Moxley
,
J.
,
Nevoigt
,
E.
,
Fink
,
G.R.
and
Stephanopoulos
,
G.
(
2006
)
Engineering yeast transcription machinery for improved ethanol tolerance and production
.
Science
314
,
1565
1568
224
Alper
,
H.
and
Stephanopoulos
,
G.
(
2007
)
Global transcription machinery engineering: a new approach for improving cellular phenotype
.
Metab. Eng.
9
,
258
267
225
Liu
,
H.
,
Yan
,
M.
,
Lai
,
C.
,
Xu
,
L.
and
Ouyang
,
P.
(
2010
)
gTME for improved xylose fermentation of Saccharomyces cerevisiae
.
Appl. Biochem. Biotechnol.
160
,
574
582
226
Tan
,
F.
,
Wu
,
B.
,
Dai
,
L.
,
Qin
,
H.
,
Shui
,
Z.
,
Wang
,
J.
et al (
2016
)
Using global transcription machinery engineering (gTME) to improve ethanol tolerance of Zymomonas mobilis
.
Microb. Cell Fact.
15
,
4
227
El-Rotail
,
A.A.M.M.
,
Zhang
,
L.
,
Li
,
Y.
,
Liu
,
S.P.
and
Shi
,
G.Y.
(
2017
)
A novel constructed SPT15 mutagenesis library of Saccharomyces cerevisiae by using gTME technique for enhanced ethanol production
.
AMB Express
7
,
111
228
Knight
,
J.R.P.
,
Garland
,
G.
,
Poyry
,
T.
,
Mead
,
E.
,
Vlahov
,
N.
,
Sfakianos
,
A.
et al (
2020
)
Control of translation elongation in health and disease
.
Dis. Model. Mech.
13
,
dmm043208
229
Masoudi
,
M.
,
Teimoori
,
A.
,
Tabaraei
,
A.
,
Shahbazi
,
M.
,
Divbandi
,
M.
,
Lorestani
,
N.
et al (
2021
)
Advanced sequence optimization for the high efficient yield of human group A rotavirus VP6 recombinant protein in Escherichia coli and its use as immunogen
.
J. Med. Virol.
93
,
3549
3556
230
Wu
,
Z.
,
Yang
,
K.K.
,
Liszka
,
M.
,
Lee
,
A.
,
Batzilla
,
A.
,
Wernick
,
D.
et al (
2020
)
Signal peptides generated by attention-based neural networks
.
ACS Synth. Biol.
9
,
2154
2161
231
Vaswani
,
A.
,
Shazeer
,
N.
,
Parmar
,
N.
,
Uszkoreit
,
J.
,
Jones
,
L.
,
Gomez
,
A.N.
et al (
2017
)
Attention is all you need. arXiv, 1706.03762
232
Devlin
,
J.
,
Chang
,
M.-W.
,
Lee
,
K.
and
Toutanova
,
K
. (
2018
)
BERT: pre-training of deep bidirectional transformers for language understanding. arXiv, 1810.04805
233
Bepler
,
T.
and
Berger
,
B.
(
2021
)
Learning the protein language: evolution, structure, and function
.
Cell Syst.
12
,
654
669.e3
234
Hutson
,
M.
(
2021
)
The language machines
.
Nature
591
,
22
25
235
Kreutter
,
D.
,
Schwaller
,
P.
and
Reymond
,
J.-L.
(
2021
)
Predicting enzymatic reactions with a molecular transformer
.
Chem. Sci.
12
,
8648
8659
236
Lin
,
T.
,
Wang
,
Y.
,
Liu
,
X.
and
Qiu
,
X
. (
2021
)
A survey of transformers. arXiv, 2106.04554
237
Singh
,
S.
and
Mahmood
,
A
. (
2021
)
The NLP cookbook: modern recipes for transformer based deep learning architectures. arXiv, 2104.10640
238
Shrivastava
,
A.D.
and
Kell
,
D.B.
(
2021
)
Fragnet, a contrastive learning-based transformer model for clustering, interpreting, visualising and navigating chemical space
.
Molecules
26
,
2065
239
Shrivastava
,
A.D.
,
Swainston
,
N.
,
Samanta
,
S.
,
Roberts
,
I.
,
Wright Muelas
,
M.
and
Kell
,
D.B
. (
2021
)
MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra. bioRxiv, 2021.06.25.449969
240
Wu
,
Z.
,
Johnston
,
K.E.
,
Arnold
,
F.H.
and
Yang
,
K.K.
(
2021
)
Protein sequence design with deep generative models
.
Curr. Opin. Chem. Biol.
65
,
18
27
241
Repecka
,
D.
,
Jauniskis
,
V.
,
Karpus
,
L.
,
Rembeza
,
E.
,
Rokaitis
,
I.
,
Zrimec
,
J.
et al (
2021
)
Expanding functional protein sequence space using generative adversarial networks
.
Nat. Mach. Intell.
3
,
324
333
242
LeCun
,
Y.
,
Bengio
,
Y.
and
Hinton
,
G.
(
2015
)
Deep learning
.
Nature
521
,
436
444
243
Schmidhuber
,
J.
(
2015
)
Deep learning in neural networks: an overview
.
Neural Netw.
61
,
85
117
244
Elton
,
D.C.
,
Boukouvalas
,
Z.
,
Fuge
,
M.D.
and
Chung
,
P.W.
(
2019
)
Deep learning for molecular design: a review of the state of the art
.
Mol. Syst. Des. Eng.
4
,
828
849
245
Gupta
,
A.
,
Harrison
,
P.J.
,
Wieslander
,
H.
,
Pielawski
,
N.
,
Kartasalo
,
K.
,
Partel
,
G.
et al (
2019
)
Deep learning in image cytometry: a review
.
Cytometry A
95
,
366
380
246
Islam
,
M.M.
,
Karray
,
F.
,
Alhajj
,
R.
and
Zeng
,
J
. (
2020
)
A review on deep learning techniques for the diagnosis of novel coronavirus (COVID-19). arXiv, 2008.04815
247
Langkvist
,
M.
,
Karlsson
,
L.
and
Loutfi
,
A.
(
2014
)
A review of unsupervised feature learning and deep learning for time-series modeling
.
Pattern Recognit. Lett.
42
,
11
24
248
Minaee
,
S.
,
Kalchbrenner
,
N.
,
Cambria
,
E.
,
Nikzad
,
N.
,
Chenaghlu
,
M.
and
Gao
,
J.
(
2020
)
Deep learning based text classification: a comprehensive review. arXiv, 2004.03705
249
Paliwal
,
K.
,
Lyons
,
J.
and
Heffernan
,
R.
(
2015
)
A short review of deep learning neural networks in protein structure prediction problems
.
3
,
3
250
Tripathi
,
N.
,
Goshisht
,
M.K.
,
Sahu
,
S.K.
and
Arora
,
C.
(
2021
)
Applications of artificial intelligence to drug design and discovery in the big data era: a comprehensive review
.
Mol. Divers.
25
,
1643
1664
251
Zhang
,
H.-M.
and
Dong
,
B.
(
2020
)
A review on deep learning in medical image reconstruction
.
J. Oper. Res. Soc. China
8
,
311
340
252
Zhou
,
S. K.
,
Greenspan
,
H.
,
Davatzikos
,
C.
,
Duncan
,
J.S.
,
Ginneken
,
B.V.
,
Madabhushi
,
A.
et al (
2020
)
A review of deep learning in medical imaging: Image traits, technology trends, case studies with progress highlights, and future promises. arXiv, 2008.09104
253
Le
,
N.Q.K.
,
Ho
,
Q.T.
,
Nguyen
,
T.T.
and
Ou
,
Y.Y.
(
2021
)
A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information
.
Brief. Bioinform.
22
,
bbab005
254
Song
,
B.
,
Li
,
Z.
,
Lin
,
X.
,
Wang
,
J.
,
Wang
,
T.
and
Fu
,
X.
(
2021
)
Pretraining model for biological sequence data
.
Brief. Funct. Genom.
20
,
181
195
255
Wittmann
,
B.J.
,
Johnston
,
K.E.
,
Wu
,
Z.
and
Arnold
,
F.H.
(
2021
)
Advances in machine learning for directed evolution
.
Curr. Opin. Struct. Biol.
69
,
11
18
256
Imanaka
,
T.
and
Aiba
,
S.
(
1981
)
A perspective on the application of genetic engineering: stability of recombinant plasmid
.
Ann. N. Y. Acad. Sci.
369
,
1
14