Synthetic biology routes to bio-artificial intelligence

The design of synthetic gene networks (SGNs) has advanced to the extent that novel genetic circuits are now being tested for their ability to recapitulate archetypal learning behaviours first defined in the fields of machine and animal learning. Here, we discuss the biological implementation of a perceptron algorithm for linear classification of input data. An expansion of this biological design that encompasses cellular ‘teachers’ and ‘students’ is also examined. We also discuss implementation of Pavlovian associative learning using SGNs and present an example of such a scheme and in silico simulation of its performance. In addition to designed SGNs, we also consider the option to establish conditions in which a population of SGNs can evolve diversity in order to better contend with complex input data. Finally, we compare recent ethical concerns in the field of artificial intelligence (AI) and the future challenges raised by bio-artificial intelligence (BI).


Introduction
Artificial intelligence (AI) can be defined as the decision-making capabilities of machines [1]. Machines are most commonly regarded as designed, multi-part objects that perform predetermined mechanical tasks. To date all commercial machines are constructed using electronics directed by sets of instructions (algorithms) encoded within circuits patterned onto semiconductor materials such as silicon. Machine learning can occur when the algorithms that control the machine are written such that the algorithms themselves can independently use prior data sets to inform future decisions.
Human learning, by contrast, is understood to be a phenomenon that emerges in part from the dynamic and adaptive exchange of information between neurons in the brain and within individual neuronal cells. Individual cells can adapt to and anticipate environmental signals, such as the onset of stress and the availability of nutrients. For example, cells of the mammalian immune system can acquire memory of previous pathogen invasions and prepare for future infections. In experiments, the slime mould Physarum polycephalum, a single-celled organism, found the shortest path between two points in a labyrinth [2,3], and anticipated future events that it had previously experienced on a periodic basis [4].
Networks of genes [5] and enzymes [6] have been described in terms of their ability to support adaptive behaviours. However, research is ongoing into the network topologies and behaviours that underlie single cell learning. A key question is whether learning in single cells occurs in a manner analogous to multicellular systems, or with an architecture that is predetermined by genetically encoded programs. The development of synthetic biology in recent years provides a novel avenue to address this question from a biological engineering perspective.
Association and classification of external stimuli are two fundamental concepts used to define learning in the field of AI. A number of theoretical studies indicate that single cells can exhibit these types of learning [7][8][9]. Cell-free biological systems have also been established which exploit DNA strand hybridisation and displacement to perform neural network computations [10]. To date, however, no artificial single-cell-based learning system has been realised experimentally. Rapid development of synthetic biology in the previous decade has now made engineering of such a system feasible.
In this review, we discuss a selection of synthetic biologists' efforts to design, model, build and test synthetic gene networks (SGNs) that enable living cells to associate and classify external stimuli. In doing so we hope to stimulate researchers to consider and debate how synthetic biology could be used to implement AI using biological material as an alternative to the silicon, metal and plastic materials used in conventional AI.
While mathematical models were applied in the development and analysis of the SGNs discussed here, this review focuses on the biological aspects of SGNs. As such, a complete description of the relevant models is not necessary to understand the concepts presented here. Readers who do wish to examine the mathematical models further should refer to the cited literature and reviews by Bates et al. [11] and Borg et al. [12]. Specific technical details can be provided via the corresponding author.

Supervised learning in artificial intelligence: students, teachers and classification
A key goal of machine learning is the development of algorithms that can infer a set of rules from a predetermined 'training' data set. Once the training data have been analysed, the algorithm should ideally be able to sort previously unseen data into the correct categories [13], in what is termed 'supervised learning'. One mode of this sorting, also known as 'classification', is to assign every data input to one of two states, for instance above or below a given linear threshold. This type of supervised learning is known as linear classification, and a number of algorithms have been developed to achieve it. The perceptron is one of the earliest linear classification algorithms and has been used to identify translation initiation sites in Escherichia coli mRNA molecules [14]. In a perceptron algorithm, a given input signal is classed as being above or below a line (or threshold). The position of this threshold is altered as part of the learning process until all data points have been successfully classified. Figure 1 sets out a scheme for biological implementation of a perceptron in which a toggle switch (Figure 3A) classifies the sum of two input signals as lying on one side or the other of a given threshold, resulting in expression of either RFP or GFP. The position of the threshold is determined by a central element, 'node 0'. The nodes in this context represent one or more genes that function to repress or stimulate other nodes.
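The threshold-shifting procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the classic perceptron update rule, not the biological implementation of Figure 1: the function names, the learning rate, the toy data and the mapping of the two classes onto 'RFP' and 'GFP' are all assumptions made for the example. The bias term `b` plays the role that 'node 0' plays in the scheme, since it sets the position of the threshold.

```python
# Minimal perceptron sketch (illustrative only; names and data are hypothetical).

def train_perceptron(samples, labels, lr=0.1, max_epochs=100):
    """Learn weights w and bias b so that sign(w.x + b) matches labels (+1/-1)."""
    w = [0.0, 0.0]
    b = 0.0  # threshold position, loosely analogous to 'node 0'
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in zip(samples, labels):
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if predicted != target:
                # Shift the line toward the misclassified point.
                w[0] += lr * target * x1
                w[1] += lr * target * x2
                b += lr * target
                errors += 1
        if errors == 0:  # every training point is on the correct side
            break
    return w, b


def classify(w, b, x):
    """Report which side of the learned threshold an input pair falls on."""
    return "RFP" if w[0] * x[0] + w[1] * x[1] + b > 0 else "GFP"


# Toy, linearly separable training set: two input levels per sample.
samples = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8), (0.7, 0.9)]
labels = [-1, -1, 1, 1]  # assumed mapping: +1 -> RFP, -1 -> GFP

w, b = train_perceptron(samples, labels)
for sample in samples:
    print(sample, classify(w, b, sample))
```

Because the toy data are linearly separable, the loop stops as soon as one full pass produces no misclassification; data that cannot be split by a single line would simply exhaust `max_epochs`.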

Supervised learning in synthetic biology: student cells and teacher cells
Algorithms and mathematical models for perceptron-based supervised learning can encompass a 'teacher' element that provides data sets and determines responses to those data, and a 'student' element, whose learning is directed by the teacher [15]. The biological student-teacher (BST) network consists of sets of genes within teacher and student cells that interact via promoting or repressing outputs. Taken individually, each network can be considered as a switch, with either RFP or GFP output as an indirect response to levels of a small molecule that can traverse cell membranes (Figure 2A). The classification threshold of the teacher can be adjusted externally by designing the 0 T node to be influenced by an inducer molecule such as isopropyl β-d-1-thiogalactopyranoside (IPTG). The classification threshold of the 0 S node in student cells would be set by the level of a second small molecule inducer, not IPTG, the concentration of which is influenced most strongly by teacher cells. In this way, students are effectively 'taught' by the teachers which classification threshold to use.
Within industrial biotechnology this supervised learning could be used to optimise performance of a biotransformation step. For instance, in nature, material such as agricultural waste often consists of a diversity of substances that, collectively, are most commonly decomposed by consortia of different microbial species [16]. In conventional biotechnology, a lone species, typically E. coli, is engineered to express recombinant enzymes encoded by transgenes controlled by exogenous, strong, constitutive promoters, IPTG-inducible promoters or promoters present in a locus ported en bloc from another species. In future, synthetic consortia of different cell types could be designed in which a particular objective, or master instruction, would be set by controlling the classification threshold of teacher cells. Subsequent delivery of classification weighting instructions to different student cell types would be influenced by the biological status of teacher cells, providing more dynamic and sensitive signalling. This is particularly relevant when consortia grow as 3D structures such as biofilms [17].

Mathematical modelling of a biological student-teacher network
Suzuki et al. [18] proposed a mathematical model of a network using ordinary differential equations that can be applied to the network proposed in Figure 2A. The model incorporated the ability to vary the levels of gene transcription 'noise' (unexplained variation) used in simulations of gene network behaviour. Solving the equations of the model numerically, using a range of biologically relevant parameter levels for factors such as transcriptional noise, demonstrated the sensitivity of the BST SGN when changing the threshold of the switch within the teacher (Figure 2B). The simulation also showed that a change in the teacher is followed by a change in the student after a short delay. However, comparison with experimental observations is necessary to robustly assess the validity of this simulation.
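To make the teacher-student coupling concrete, the following is a minimal deterministic sketch in Python. It is not the model of Suzuki et al. [18]: it uses a generic two-gene mutual-repression toggle switch, omits transcriptional noise, and every parameter value, function name and the simple Euler integration scheme is an assumption chosen for illustration. It also assumes that the summed inputs from G1 and G2 push each switch toward G4 while node 0 pushes it toward G3, and that the student's node 0 S is driven directly by the teacher's G3 level.

```python
# Illustrative teacher-student toggle switches (hypothetical parameters, no noise).

def step(g3, g4, push_g3, push_g4, dt=0.01, n=2):
    """One Euler step of a mutual-repression toggle switch."""
    dg3 = push_g3 / (1.0 + g4 ** n) - g3  # G3 is repressed by G4
    dg4 = push_g4 / (1.0 + g3 ** n) - g4  # G4 is repressed by G3
    return g3 + dt * dg3, g4 + dt * dg4


def reporter(g3, g4):
    """RFP is linked to G3 and GFP to G4."""
    return "RFP" if g3 > g4 else "GFP"


def simulate(schedule, inputs=2.0, gain=1.0, dt=0.01):
    """Run teacher and student together.

    schedule: list of (teacher_threshold, duration) phases, where the
    threshold stands for the externally set activity of node 0 T.
    Returns the (teacher, student) reporter at the end of each phase.
    """
    t3 = t4 = s3 = s4 = 0.0
    readouts = []
    for threshold, duration in schedule:
        for _ in range(round(duration / dt)):
            teacher = step(t3, t4, threshold, inputs, dt)
            # Node 0 S is set by the signal made alongside teacher G3.
            student = step(s3, s4, gain * t3, inputs, dt)
            (t3, t4), (s3, s4) = teacher, student
        readouts.append((reporter(t3, t4), reporter(s3, s4)))
    return readouts


# Phase 1: high teacher threshold; phase 2: threshold lowered externally.
readouts = simulate([(4.0, 50.0), (1.0, 50.0)])
print(readouts)
```

In this sketch the student never sees the external inducer; it changes state only because the teacher's output changes, which is the sense in which the threshold is 'taught'.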

Associative learning
Association of two stimuli is perhaps most intuitively illustrated by the classic experiments of Pavlov [19], in which a dog learned to associate the ringing of a bell with feeding time. After simultaneous application of both stimuli, the dog learned to associate them, exhibiting the same response (salivation) to either of the two stimuli alone. Such classical associative learning is advantageous because it enables an organism to anticipate and adapt to environmental changes quickly and has been observed in all animals with bilateral symmetry so far studied [20].

Building an associative perceptron with synthetic gene networks
To perform learning tasks, cells must 'remember' past stimuli, and genetically encode the memory. Basic synthetic genetic memory circuits that achieve this task have been demonstrated previously, such as the genetic toggle switch depicted in Figure 3A [21] and the transcriptional positive feedback loop in Figure 3B [22]. Both of these circuits have two stable memory states dictated by the expression of the genes in the circuits.
In the genetic toggle switch (Figure 3A), either gene X or gene Y is switched on due to their mutual repression. These two memory states can be flip-flopped by two different input signals. In the positive feedback circuit (Figure 3B), the expression of gene X is switched on by an input stimulus. Once activated, the ON state is self-sustaining due to positive feedback. SGNs for associative learning can be built based upon these memory circuits. Several groups have proposed such SGNs, including Lu et al. [23] who put forward an associative learning SGN based on a toggle switch. Elegant systems in which memory states are defined within DNA sequences have also been demonstrated by Farzadfard and Lu [24] and Yang et al. [25], using recombinase-mediated flipping of segments of genomic DNA. These systems represent potentially powerful basic research tools for discovering the provenance of different cell types, for instance by determining the events experienced by a given cell type as it matures from stem cell to terminally differentiated cell.
Figure 2. (A) Node 0 for a teacher cell is labelled 0 T and node 0 for a student cell is labelled 0 S. As in Figure 1, nodes G3 and G4 comprise a toggle switch. The output position of the toggle switch is tipped toward G3, resulting in RFP expression, or G4, resulting in GFP expression, depending on the net activity level of nodes G1 and G2. In effect the G3/G4 toggle switch classifies the activities of the G1 and G2 nodes as inputs. As in Figure 1, node 0 (0 T or 0 S) pushes the equilibrium of the toggle switch toward G3. Unlike in Figure 1, in this BST network, activity of 0 T can be controlled exogenously by addition of a small molecule inducer to the growth medium. Furthermore, in addition to RFP, node G3 also directs expression of a small molecule that can traverse cell membranes and activate node 0 S. This has the effect that, when teacher cells are in excess, the activity of 0 S in student cells is set ('learned') by the level of signal produced by teacher cells. Arrowhead connectors indicate activation of one node by another and hammerhead connectors indicate inhibition. Curled arrowhead connectors indicate auto-induction. (B) Mathematical simulation of the BST network learning dynamics. Outputs of the student cells (red for RFP from G3, green for GFP from G4) are constantly 'learned' from changes in the teacher cells, which determine the activity (threshold) of node 0 S in the student cells. This scheme is proposed here by A.Z. and D.N. and the simulation was performed by C.G. and Y.S.
For dynamic and rapid memory establishment and erasure, SGNs have been designed to be capable of associating two different environmental signals in a manner analogous to the animal learning behaviour revealed by Pavlov. One such SGN is based on the combination of a positive feedback loop memory circuit and a negative modifier (Figure 4). This 'positive feedback/negative modified' (PFNM) network has the important advantage that it requires only a transient signal to form a sustained memory.
Memory erasure in the PFNM circuit would be achieved post-translationally via inducible protein degradation, using a system such as auxin-inducible protein degradation [26,27]. Steps that are achieved post-translationally allow greater network responsiveness compared with steps that are mediated by transcriptional repression. The proposed network could be implemented experimentally using genetic tools that conform to the BioBrick™ synthetic biology standard, including a transcription activator, transcription repressor, fluorescent reporter protein and a small molecule regulator of protein degradation. A mathematical model, which applies four ordinary differential equations, activating and inhibiting Hill functions and mass-action law, can be used to assess the capacity of the PFNM circuit for associative learning. Simulation using this model predicted an initial low level of network response to pulses of either input 1 or input 2 when experienced separately (Figure 4A). The network was then subjected to a pulse of both input 1 and input 2 at the same time. After this double-input pulse had been detected, the network was then predicted to give a boosted level of response to separate pulses of either input 1 or input 2 (Figure 4B). In this way, the double-input pulse establishes a memory. This memory informs an increased level of response to single inputs relative to the level of response before the memory was established.
Figure 4. Either input 1 or 2 alone leads to a weak activation of the output y, at times t1 and t2. When both inputs 1 and 2 are applied simultaneously, a 'memory' is formed by a self-sustained expression of u due to its positive auto-regulation. Because of this memory a subsequent input 1 or input 2 alone can cause a strong induction of y. In this way the network has learned to associate inputs 1 and 2. This memory can be erased by a sufficiently large input 1 (due to the direct activation of v), bringing the system back to the default state. This scheme is proposed here by Y.S. and M.C.R. and the simulation was performed by Y.S.
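The qualitative behaviour described for the PFNM network (weak response before pairing, memory formation on a double pulse, boosted response afterwards, erasure by a large input 1) can be reproduced with a deliberately simplified sketch in Python. This is not the four-equation model referred to in the text: it uses three variables named after the u, v and y of the scheme, and every rate constant, Hill coefficient, pulse time and the Euler integration step is a hypothetical choice made only so that the example runs.

```python
# Simplified associative-memory sketch (hypothetical parameters throughout).

def hill(x, k, n):
    """Activating Hill function x^n / (k^n + x^n)."""
    xn = x ** n
    return xn / (k ** n + xn)


def pulse(t, start, amp=1.0, width=5.0):
    """Square input pulse of height `amp` beginning at time `start`."""
    return amp if start <= t < start + width else 0.0


def simulate(t_end=150.0, dt=0.01):
    """Euler integration of memory u, negative modifier v and output y."""
    u = v = y = 0.0
    trace = []
    for i in range(round(t_end / dt)):
        t = i * dt
        # Protocol: single pulses, a paired pulse at t=50, single pulses
        # again, a large erasing pulse of input 1 at t=110, then input 2.
        in1 = pulse(t, 10) + pulse(t, 50) + pulse(t, 70) + pulse(t, 110, amp=5.0)
        in2 = pulse(t, 30) + pulse(t, 50) + pulse(t, 90) + pulse(t, 130)
        s1, s2 = min(in1, 1.0), min(in2, 1.0)  # assumed saturating sensing
        # u: induced only by both inputs together, sustained by positive
        # feedback, degraded faster when the modifier v is present.
        du = 2.0 * s1 * s2 + 2.5 * hill(u, 1.0, 2) - (1.0 + 5.0 * v) * u
        # v: activated appreciably only by a large input 1.
        dv = hill(in1, 3.0, 4) - v
        # y: weak basal response to either input, boosted by memory u.
        dy = (s1 + s2) * (0.2 + 2.0 * hill(u, 1.0, 2)) - y
        u, v, y = u + dt * du, v + dt * dv, y + dt * dy
        trace.append((t, u, y))
    return trace


def peak_output(trace, start, width=5.0):
    """Largest value of y during the pulse that begins at `start`."""
    return max(y for t, _, y in trace if start <= t <= start + width)


trace = simulate()
for label, start in [("input 1, naive", 10), ("input 2, naive", 30),
                     ("input 1, after pairing", 70),
                     ("input 2, after pairing", 90),
                     ("input 2, after erasure", 130)]:
    print(f"{label}: peak y = {peak_output(trace, start):.2f}")
```

The memory in this sketch comes from bistability of u: below an unstable threshold it decays to zero, above it the positive feedback holds it at a high level, so a transient paired pulse is enough to switch states.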

Classification of complex inputs
Until now we have considered relatively simple classes of inputs of the type that can be separated by a single threshold and do not overlap. In these cases, the SGN merely classifies binary inputs that switch between the simple states such as being absent or present, or above or below a line. Biological reality, however, inevitably poses more complex situations. Classifying a more complex input, such as a concentration of a biological solute or signalling protein that falls within an upper and lower threshold, can also be addressed with SGN design (Figure 5A). Classification of a given two-input signature, for instance 10-20 nM of solute X and 600-800 nM solute Y, can be achieved with an ab initio designed SGN (Figure 5B) but begins to place a significant burden on the SGN designer (human or machine) to engineer or source sensor elements with the precise desired sensitivity to detect the two different solute concentration ranges. For example a given SGN design may require multiple promoters, each sensitive to different concentrations of the same, or different, solutes. In this situation, it is essential for the overall function of the network that there is no 'cross-talk' between the different inputs and the different promoters intended to be activated or repressed in response to those inputs. For instance, if solute A induces promoter A, but also induces promoters B, C and D unintentionally, the conditionality of outputs is compromised. As such, 'orthogonal' partners of inducer and promoter must be identified, in which a given inducer influences only a specific promoter type and has no effect on any other promoter. This orthogonality is a non-trivial objective for synthetic biologists [28] because it is arguably a defining feature of natural biology that genes within a genome tend to influence each other's expression [29].
Figure 5. In the case of two input ranges, X 1 and X 2, the sensor/output modules feed into an AND gate which sums the output signals as either the presence or absence of GFP expression [30,33]. Adapted with permission from Dydovik et al. [30] and Kanakov et al. [33].
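The two-input signature used as an example above (10-20 nM of solute X and 600-800 nM of solute Y) can be expressed as a pair of idealised band-pass sensors feeding an AND gate. The Python sketch below is illustrative only: each sensor is modelled as the product of an activating and a repressing Hill term, the steep Hill coefficient of 20 is an assumption that is unlikely to be achievable with real promoters, and the function names and the 0.5 cutoff are hypothetical.

```python
# Idealised band-pass sensors combined by an AND gate (illustrative values).

def band_pass(conc, low, high, n=20):
    """Output near 1 only when `conc` lies between `low` and `high`."""
    rise = (conc / low) ** n
    activation = rise / (1.0 + rise)               # switches on above `low`
    repression = 1.0 / (1.0 + (conc / high) ** n)  # switches off above `high`
    return activation * repression


def gfp_on(x_nm, y_nm, cutoff=0.5):
    """AND gate: GFP is expressed only if both sensors report 'in range'."""
    x_in_range = band_pass(x_nm, 10.0, 20.0) > cutoff
    y_in_range = band_pass(y_nm, 600.0, 800.0) > cutoff
    return x_in_range and y_in_range


for x_nm, y_nm in [(15, 700), (5, 700), (15, 1000), (40, 700)]:
    print(f"X = {x_nm} nM, Y = {y_nm} nM -> GFP {'on' if gfp_on(x_nm, y_nm) else 'off'}")
```

The narrow Y window (600-800 nM is only a 1.33-fold range) is what forces such a steep response in this sketch, which illustrates the burden on the designer that the text describes. Cross-talk is not modelled: each sensor here sees only its own solute, which is exactly the orthogonality that is hard to guarantee in cells.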

Ensembles of SGNs for classification of complex inputs
To meet these challenges, so-called 'ensemble' classifiers have been proposed [30,31]. The ensemble concept requires establishment of a heterogeneous population of simple classifier SGNs that encompasses a random distribution of sensitivities to input signals, each responding to only a narrow range of input levels. The overall output signal is the sum of the outputs of each SGN of the population and so can be considered as a tuneable collective response.
The SGN set out in Figure 5B features distinct ribosome binding site (RBS) elements, RBS U1 and RBS U2, each of which responds to a distinct concentration of its cognate input molecule, X 1 or X 2, respectively. High-throughput (HTP) mutation approaches could be readily applied to generate a diverse library of RBS variants, from RBS U1 to RBS Un. Once each variant has been introduced into cells, a population is generated that harbours an ensemble of SGNs with different input sensitivities. Across the ensemble population, expression of the reporter would produce a bell-shaped response curve. The randomised sensitivity of the sensor RBS within each SGN of the ensemble is key: the resulting distribution of sensitivities controls the position of the maximal signal output produced in response to the concentration of a chemical input signal.
The ensemble of SGNs could be trained by selective deletion of the cells hosting SGNs that produce an incorrect response to positive or negative control signals. Total ensemble size, in terms of cell numbers, can be maintained by addition of new cells or by proliferation of the remaining non-deleted cells. Furthermore, probabilistic deletion, whereby incorrectly responding cells would have a finite probability of persisting within the ensemble population, would enable the 'soft learning' required for classification of input signals that have regions of overlap.
The sharp bell-shaped output of single synthetic circuits makes it possible to meet the challenge of distinguishing input classes that have a complex structure in the signal space. Effectively, training reshapes the distribution of individual sensitivities in the population, allowing them to cover the signal subspace corresponding to one of the classes by a union of 'pixel' responses. As a result, the SGN ensemble can be trained to classify inputs that are not linearly separable (Figure 6).
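Training by selective deletion can be illustrated with a small stochastic simulation in Python. This is a sketch under stated assumptions rather than a reproduction of the published ensemble models [30,31]: each cell is reduced to a single number (the input level at which its bell-shaped response peaks), the input is one-dimensional, the population is refilled by proliferation of survivors, and the deletion probabilities, population size, response width and all names are hypothetical. The positive class occupies two separate input levels with the negative class between them, so no single threshold can separate the classes.

```python
# Ensemble classifier trained by probabilistic deletion (illustrative only).
import math
import random


def response(centre, x, sigma=0.05):
    """Bell-shaped output of one cell whose sensor is tuned to `centre`."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))


def ensemble_output(population, x):
    """Collective signal: mean response across all cells."""
    return sum(response(c, x) for c in population) / len(population)


def train(population, examples, rounds=20, p_neg=0.8, p_pos=0.1, seed=0):
    """Delete incorrectly responding cells with a finite probability.

    examples: list of (input_level, is_positive) control signals.
    p_neg: chance of deleting a cell that responds to a negative control.
    p_pos: chance of deleting a cell that ignores a positive control.
    """
    rng = random.Random(seed)
    size = len(population)
    cells = list(population)
    for _ in range(rounds):
        for x, is_positive in examples:
            survivors = []
            for c in cells:
                responds = response(c, x) > 0.5
                if responds == is_positive:
                    p_delete = 0.0  # correct response, always kept
                else:
                    p_delete = p_pos if is_positive else p_neg
                if rng.random() >= p_delete:
                    survivors.append(c)
            # Restore ensemble size by proliferation of surviving cells.
            cells = survivors + rng.choices(survivors, k=size - len(survivors))
    return cells


init_rng = random.Random(1)
naive = [init_rng.random() for _ in range(500)]  # random sensitivities in [0, 1]
examples = [(0.2, True), (0.8, True), (0.5, False)]
trained = train(naive, examples)

for x in (0.2, 0.5, 0.8):
    print(f"input {x}: naive {ensemble_output(naive, x):.3f}, "
          f"trained {ensemble_output(trained, x):.3f}")
```

The low deletion probability for positive controls is what allows cells tuned to 0.2 to survive presentations of 0.8 and vice versa, so that both positive regions end up covered by the trained population.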
As SGN size and complexity increase, challenges in biological implementation, such as the availability of orthogonal genetic constructs, also tend to increase. Excellent work by Nielsen et al. [32] demonstrated a robust system, Cello, for design and assembly of up to 45 SGNs with intended function. For ensemble SGNs (Figures 5 and 6), selection

Intercellular communication between synthetic gene networks
Further sophistication in ensemble SGN design is likely to be achieved by integration with engineered intercellular communication. A study by Kanakov et al. [31] demonstrated that quorum sensing could be used to coordinate the function of designed genetic elements that have been distributed across different sub-groups of cells. They showed that toggle switch and oscillator functions could arise from these distributed, coordinated SGNs in a predictable and controllable manner. These distributed, coordinated SGNs were sensitive to modulation by external chemical signalling and the growth dynamics of the host cell population. This opens exciting possibilities for implementing dynamical decision making using distributed SGNs. Terrel et al. [34] also took a major step toward experimental implementation of distributed SGNs capable of classification. They demonstrated a system in which the presence of input was reported by a nanoparticle binding event that could occur only when two different cell types detected input signal.

Applications and implications of bio-artificial intelligence
Synthetic biology has the potential to disruptively reconfigure goods and services that are today bio-based, such as vaccines, and those that are today mainly non-biological, such as sensor devices and computation [35]. The 'designed' biology envisioned by this approach will remain only a vision until basic research enables engineers to build and test sophisticated biological devices that perform predictably within parameters accurately described by mathematical modelling [35]. A major challenge for this vision is to build well-characterised SGNs that go beyond discrete functions (sensing, oscillating) to incorporate the learning SGNs discussed here and networks of networks that provide new functions in cells and consortia of multiple cell types.
The steadily increasing number of advances in DNA synthesis and assembly makes combinatorial assembly of large SGNs accessible and practical. Generating large libraries of DNA fragments of differing sequence allows selection of variants that function well and deletion of variants that perform poorly, which is in effect an evolutionary approach. Furthermore, modular assembly of large DNA molecules allows specific subsections to be removed and replaced with different variants, while the rest of the molecule is unchanged. Together these approaches mean evolutionary strategies can be used to find optimal solutions to circuit design, while modular approaches are used for debugging and error correction.
For example, multiple fragments composing a biosynthetic or signalling pathway can be assembled using a variety of methods for parallel ligation of multiple DNA fragments. Readers interested in detailed discussion of these DNA assembly methods should consult reports by Engler et al. [36] of the 'MoClo' method and by Weber et al. [37] of the 'Golden Gate' assembly method. Several methods have also been developed specifically for manipulation of very large (100 kilo base pairs and larger) DNA fragments [38][39][40]. Methods such as these have ultimately enabled assembly of entire bacterial and eukaryotic chromosomes [41,42].
Possible industrial applications of SGNs that can learn include designing cells that can respond to large, small, intended and unintended perturbations in bioprocess environments while maintaining optimal productivity, such as biotherapeutic production, resource utilisation or biosynthesis of high value chemicals. Smart cells that can respond to the physiological status of the patient in a sophisticated manner could also expand the application and robustness of whole cell therapeutic approaches.
Advances in conventional AI have raised concerns around the use of AI technologies in ways that would not be acceptable to wider society. Examples include the use of voice recognition in public spaces for surveillance purposes or deploying autonomous robots to work as counsellors, soldiers, carers or judges [43]. Bio-artificial intelligence (BI) could enable pheromone recognition or detection of a person's unique signature of volatile biological molecules. Of course these are purely long-term considerations, but we suggest it is prudent to monitor development in the field of AI as an indicator of the possible challenges BI might pose in future. A recent example of such precautionary oversight is the appointment of an ethics board at AI company Lucid (Austin, Texas, USA).
To date no reports exist of the application of SGNs in a commercial biomanufacturing process. As such, the current boundaries of synthetic biology must be pushed in order to deliver enhanced capabilities and a new era of 'intelligent bio-manufacturing'. This might include deployment of 'smart' cells that can adapt to dynamic changes in their production (e.g. bioreactor) or application (e.g. organ, tissue) environments. As the global synthetic biology market grows, developing such capabilities will become a key challenge that will require the development of techniques across an increasingly broad palette of SGN architectures.

Summary
• The classic Perceptron function for linear classification of inputs could in theory be implemented using 'Teacher' and 'Student' SGNs.
• SGNs have been designed to perform Pavlovian associative learning.
• Simulations in silico have provided preliminary confirmation that Pavlovian associative learning and Perceptron-based linear classification could be encoded in SGNs.
• SGNs and experimental schemes have been proposed that could be capable of evolving increased levels of diversity, enabling classification of complex input data.
• In future 'bio-artificial intelligence' may eventually pose ethical concerns that parallel those raised by recent developments in conventional artificial intelligence.