Life science research frequently involves optimizing experimental conditions to maximize product yield, ensuring the research pipeline can progress efficiently. For instance, in biochemistry, many processes in the protein production and characterization pipeline need optimizing, such as cloning, protein expression, protein purification, and a range of structural biology techniques and functional assays. Design of Experiments (DoE) is an approach that designs a set of experiments in which all factors are varied at the same time, enabling rapid process optimization; however, the approach can be challenging to adopt because of its statistical complexities and high software costs. To address this, this article introduces DoE alongside Spreadsheet DoE, an open-access, user-friendly educational factorial design tool that provides step-by-step guidance and transparent spreadsheet calculations. Protein expression optimization in Escherichia coli is used as an example, and other biochemical optimizations are also explored. This guide, along with Spreadsheet DoE, aims to encourage researchers with minimal statistical background to employ DoE methodologies effectively in their life science studies.
Introduction
Biochemical research often involves complex biological systems that are affected by multiple factors (explanatory variables) which need optimizing to maximize product yield (response variable). It is human nature to jump into a trial-and-error approach and try to optimize factors one at a time, as this requires little planning and yields quick results. However, a little extra planning at the start using a statistical approach known as “Design of Experiments” (DoE) saves significant time and resources as well as providing deeper insights into the system. DoE is a method that plans a set of experiments to explore how different factors and their interactions influence a specific outcome. DoE will find the optimal factor levels to maximize the outcome much more quickly than studying the effect of one factor at a time (OFAT), as seen in Figure 1.
DoE tests several variables at once, helping researchers understand the relative impact of each factor and how the factors work together to generate a response, providing deeper insight into the system compared with OFAT. This approach is commonly used in industries like manufacturing, pharmaceuticals, and agriculture to optimize processes, improve product quality, and reduce costs. It provides a practical way to gather insights and make informed decisions, leading to more reliable and faster results. However, DoE approaches can be challenging to understand and implement, particularly for non-statisticians, and have not been widely adopted in academia. To address this, this article demystifies DoE using two-level factorial designs and showcases some current DoE applications in biochemical research, with outputs generated by Spreadsheet DoE.
As an example, protein expression in Escherichia coli is chosen as it often requires optimization to maximize protein yield (the response variable) by adjusting different factors (the explanatory variables). Four common factors (growth temperature, growth time after induction, inducer concentration, and cell type) will be explored with the widely used two-level fractional factorial design. For any given run, each factor takes either a “high” or a “low” value, and certain combinations are tested in a series of eight experimental runs (Table 1). For any given factor, there are four runs where this factor is at the low setting and four runs where it is at the high setting, allowing a comparison of means to be calculated for all factors.
| Standard Run (SR) Number | Factor A: Temp. (°C) | Factor B: Time (hours) | Factor C: IPTG (mM) | Factor D: Cell Type | Response: Protein (mg) |
|---|---|---|---|---|---|
| SR 1 | 25 | 4 | 0.1 | BL21 (DE3)pLysS | 85.50 |
| SR 2 | 35 | 4 | 0.1 | BL21 (DE3) | 69.20 |
| SR 3 | 25 | 12 | 0.1 | BL21 (DE3) | 91.30 |
| SR 4 | 35 | 12 | 0.1 | BL21 (DE3)pLysS | 9.30 |
| SR 5 | 25 | 4 | 0.5 | BL21 (DE3) | 83.60 |
| SR 6 | 35 | 4 | 0.5 | BL21 (DE3)pLysS | 73.10 |
| SR 7 | 25 | 12 | 0.5 | BL21 (DE3)pLysS | 97.40 |
| SR 8 | 35 | 12 | 0.5 | BL21 (DE3) | 12.60 |
An equivalent trial-and-error or “OFAT” approach to compare each factor mean (with four replicates) at the high and low level would require 20 runs. Furthermore, OFAT cannot detect any interactions between factors, the detection of which is one of the most powerful aspects of the DoE approach. For example, if you were investigating two drugs for a positive health response, then just knowing the effect of each (A and B) is not as informative as knowing whether there are any synergies or anti-synergies (+AB or −AB). An interaction effect means that the effect of a high level of A differs depending on the level of B, that is, A and B are not independent of each other. DoE approaches are therefore an efficient way to perform a relatively small number of experiments yet obtain the maximum amount of information, together with a potentially predictive linear regression model (Box 1) of the system under study. This allows the response to be maximized by selecting the factor levels in the correct combination according to the sign and magnitude of the coefficients in the model. DoE is used widely in the pharmaceutical industry, where efficiency equals profitability; however, by demystifying the approach and providing some easy-to-use educational DoE spreadsheets, others can greatly benefit from this approach too.
The effects of each factor or interaction generated by the DoE approach are included as terms in a final linear equation to predict the response. The sign and magnitude of the coefficients of each term help researchers determine their relative importance and whether each factor should be high or low to maximize the response. For example, a model may take the form below, where each letter is replaced by +1 or −1 depending on its level for that run.
Response = 7.5 + 5.2*A + 2.4*AB − 1.1*B
The “−1.1*B” term has a negative coefficient, which on its own would suggest keeping factor B at its low level to increase the response. However, the “2.4*AB” term indicates that there is a positive interaction between factors A and B. In other words, with A at its high level, the interaction adds 2.4 to the response when B is simultaneously at its high level and subtracts 2.4 when B is at its low level. Therefore, setting A and B high at the same time produces a gain from the interaction (+2.4) that outweighs the loss from the B term alone (−1.1). In this case, we would choose both A and B at their high levels to maximize the response (predicted to be 7.5 + 5.2 + 2.4 − 1.1 = 14), even though the “−1.1*B” term by itself reduces the response by 1.1. Linear regression models are a core component of many statistical analyses.
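As a quick illustration of how such an equation is used, the short Python sketch below (based on the hypothetical coefficients above; not part of Spreadsheet DoE) evaluates the predicted response for all four combinations of coded levels and confirms that setting both A and B high gives the maximum of 14.

```python
from itertools import product

# Hypothetical coefficients from the example model in Box 1
intercept, coef_A, coef_AB, coef_B = 7.5, 5.2, 2.4, -1.1

for A, B in product([-1, 1], repeat=2):
    response = intercept + coef_A * A + coef_AB * (A * B) + coef_B * B
    print(f"A={A:+d}, B={B:+d} -> predicted response = {response:.1f}")

# A=+1, B=+1 gives the maximum predicted response of 14.0, because the
# AB interaction (+2.4) outweighs the negative B term (-1.1).
```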
Methodology
Intro to Spreadsheet DoE
As an educational tool to explain and implement DoE for nonstatisticians, a series of five Spreadsheet DoEs is available. Each spreadsheet corresponds to optimizing systems affected by 3–6 factors, with the goal of producing a predictive linear regression model using two-level factorial designs. They are named by the number of factors and the minimum number of runs required to explore the system. For example, DoE spreadsheet 6.16 would be used for exploring a process affected by six factors in a minimum of 16 runs. Within each spreadsheet, there are five design tabs labelled A–E. Design C is the standard design with the minimal number of runs, intended to be performed at the same time and not replicated. Design A can be used when all factors under investigation are continuous and numerical, such as temperature (also known as quantitative factors), and includes four additional replicate runs where all factor values are midway between the high and low levels. This is the most powerful design as it provides an independent measure of both confidence in the model and confidence in predicting a linear trend between and beyond the factor levels tested. Design B can also provide an additional independent error estimate by duplicating Design C, although this increases the number of runs considerably more than Design A. Designs D and E allow you to perform your experiments in two blocks (i.e., half the runs at one time and the other half at a different time) and estimate any unintended effect between blocks. The spreadsheets are standalone educational tools with instructions guiding you throughout and, in many cells, comments and equations provide further explanation. We will go through the key steps to analyze protein expression data from a minimal number of runs collected at the same time using the 4.8 spreadsheet, as outlined in Table 2. DoE encompasses other more complex designs besides two-level factorial designs, although the principles in Table 2 provide the foundation for these, which can be explored using the materials found in the further reading/resources section.
| Step | Statistical analysis approach |
|---|---|
| 1 | Calculate factor and interaction term effects |
| 2 | Estimate which terms to include in the regression model |
| 3 | Form a linear regression model |
| 4 | Use analysis of variance to determine a measure of confidence in the model |
| 5 | Check model assumptions by analyzing residuals |
Step 1: Calculate factor and interaction term effects
Formally, the effect of a factor is the change in response produced by the change in the level (from low to high) of that factor. Mathematically, it is the difference between the mean of the subset of responses at the high level of a given factor and the mean of the subset of responses at its low level. The same is true for an interaction effect (Table 3). This is the power of DoE: the same set of relatively few collected responses from your design can be grouped into several different means to calculate an effect for each term. A term is defined as a factor or factor interaction and is usually represented by one or more letters (top row of Table 3).
| Run | AC | AD | D | C | B | AB | A | Response |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | −1 | −1 | −1 | 1 | −1 | 85.50 |
| 2 | −1 | 1 | 1 | −1 | −1 | −1 | 1 | 69.20 |
| 3 | 1 | −1 | 1 | −1 | 1 | −1 | −1 | 91.30 |
| 4 | −1 | −1 | −1 | −1 | 1 | 1 | 1 | 9.30 |
| 5 | −1 | −1 | 1 | 1 | −1 | 1 | −1 | 83.60 |
| 6 | 1 | −1 | −1 | 1 | −1 | −1 | 1 | 73.10 |
| 7 | −1 | 1 | −1 | 1 | 1 | −1 | −1 | 97.40 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 12.60 |
| Av_high − Av_low (Effect) | 0.75 | 1.85 | −2.15 | 2.85 | −25.20 | −35.00 | −48.40 | |
| Sum of Squared Deviations | 1.13 | 6.85 | 9.24 | 16.24 | 1270.08 | 2450.00 | 4685.12 | |
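For readers who prefer a scripted check, the short Python sketch below reproduces the effect and sum-of-squares calculations in Table 3 (assuming the coded design of Tables 1 and 3, with factor D set by the defining relation D = ABC); Spreadsheet DoE performs the same arithmetic with transparent spreadsheet formulas.

```python
import numpy as np

# Coded factor levels (+1 = high, -1 = low) for the eight runs in standard order,
# with factor D aliased to the ABC interaction (the two-level fractional design).
A = np.array([-1,  1, -1,  1, -1,  1, -1,  1])
B = np.array([-1, -1,  1,  1, -1, -1,  1,  1])
C = np.array([-1, -1, -1, -1,  1,  1,  1,  1])
D = A * B * C
response = np.array([85.5, 69.2, 91.3, 9.3, 83.6, 73.1, 97.4, 12.6])

terms = {"A": A, "B": B, "C": C, "D": D, "AB": A * B, "AC": A * C, "AD": A * D}
for name, column in terms.items():
    # Effect = mean of responses at the high level minus mean at the low level
    effect = response[column == 1].mean() - response[column == -1].mean()
    ss = len(response) * (effect / 2) ** 2   # sum of squares for a balanced design
    print(f"{name:>2}: effect = {effect:6.2f}, SS = {ss:8.2f}")
```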
Step 2: Estimate which terms to include in regression model
A half-normal probability plot helps identify which terms are important and likely to contribute to a predictive regression model, representing “signal” above those terms that represent background “noise”. Under the assumption that every term is noise, all terms will lie on the straight line projected through the earliest (smallest) points. Any terms lying off this line are considered “signal” and may be useful to include in the regression model (Figure 2).
It is not expected that many terms will deviate from the line, as the “sparsity of effects” principle states that only around 20% of the main effects and two-factor interactions will be significant in most systems, with the other terms varying only to the extent of normal error.
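A half-normal plot can also be drawn outside the spreadsheet. The sketch below (a minimal Python example using the effects from Table 3, not part of Spreadsheet DoE) sorts the absolute effects and plots them against theoretical half-normal quantiles, so that A, AB, and B stand out above the line formed by the smallest effects.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import halfnorm

# Effects from Table 3
effects = {"AC": 0.75, "AD": 1.85, "D": -2.15, "C": 2.85,
           "B": -25.20, "AB": -35.00, "A": -48.40}
labels, values = zip(*sorted(effects.items(), key=lambda kv: abs(kv[1])))
abs_effects = np.abs(values)

# Theoretical half-normal quantiles for the ordered absolute effects
n = len(abs_effects)
quantiles = halfnorm.ppf((np.arange(1, n + 1) - 0.5) / n)

plt.scatter(quantiles, abs_effects)
for q, e, lab in zip(quantiles, abs_effects, labels):
    plt.annotate(lab, (q, e))
plt.xlabel("Half-normal quantile")
plt.ylabel("|Effect|")
plt.title("Points far above the line through the smallest effects are candidate 'signal' terms")
plt.show()
```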
Step 3: Form a linear regression model – a predictive equation
A linear regression model (Box 1) is a mathematical equation that uses the data from a completed design to predict the response for any given coded factor level settings, which are either high (+1) or low (−1). Usually, only those terms that are significant, based on the outliers in the half-normal plot, are included in the model. However, additional diagnostic graphs, R2 values, and probability percentages (steps 4 and 5) are also used to determine the final terms included in the model. The remaining terms not used in forming the model are grouped to give an estimate of error (see next section). To construct the model, we simply take the arithmetic average of the responses from all runs as a starting point and then add each model term multiplied by its coefficient (calculated by dividing the term’s “effect” by two). The model from our example using the 4.8 design predicts the following response (amount of purified protein in milligrams) from three terms.
Response = 65.3 − 24.2*A − 17.5*AB − 12.6*B
Therefore, Factor A (temperature) has the biggest effect on the response, followed by Factor B (growth time after induction), while Factors C (IPTG concentration) and D (cell type) make no significant impact. Furthermore, A and B have an interaction (with a bigger effect than Factor B alone), which suggests that choosing a low level for A and a high level for B would maximize the response. We would therefore grow our cells at 25°C for 12 hours after induction, with either high or low levels of IPTG and either cell type; furthermore, we may explore decreasing the temperature or increasing the growth time to see whether the trend continues.
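The following minimal Python sketch illustrates the same construction: each coefficient is half the corresponding effect from Table 3, added to the grand mean, and the resulting equation predicts the yield for the recommended low-A/high-B settings (the function name and structure are illustrative only).

```python
grand_mean = 65.3                      # rounded average of all eight responses
coefficients = {"A": -48.40 / 2,       # each coefficient is half the term's effect
                "AB": -35.00 / 2,
                "B": -25.20 / 2}

def predict(A, B):
    """Predicted protein yield (mg) for coded levels A and B (+1 or -1)."""
    return (grand_mean + coefficients["A"] * A
            + coefficients["AB"] * (A * B) + coefficients["B"] * B)

# Low temperature (A = -1) with long post-induction growth (B = +1)
print(predict(A=-1, B=+1))   # about 94.4 mg, close to the observed 91.3 and 97.4 mg
```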
Step 4: Use analysis of variance to determine a measure of confidence in the model
A linear regression model is only useful if it passes a confidence measure (this section) and all the assumptions made in constructing the model hold true (next section). An important parameter that helps determine confidence is the variance, which is easily calculated from the sum of squared deviations (SS). For a simple set of replicates, SS is calculated by squaring the difference between each data point and the grand arithmetic average and summing these squared differences together. This provides insight into the variability or spread of the observations around the average. Dividing SS by (the sample size − 1) gives the variance, and the square root of the variance is the standard deviation, commonly reported with averages. The SS for all responses in a factorial design can be calculated in the same way against the grand mean to reveal the spread of observations in our design, which we seek to explain from the effects of the factor or factor interaction terms included in the regression model. SS calculations for each term in a DoE design are more complicated than for a single set of replicates, since there are replicates at the high (+1) and low (−1) levels and we want to know how much the change in level of that term contributes to the variability in the response. A small effect for a term will result in a small SS for that term, and for these balanced designs there is a simple formula to convert each term’s effect to its SS (for a design with N runs, SS = N × (effect/2)²) and therefore its variance; together, these allow an analysis of variance (ANOVA) of the model.
ANOVA is based on separating the total variance of a set of responses (i.e., the spread of protein yields in the expression experiments) into two parts, the proposed “signal” (explained variance) and the proposed “noise” (unexplained variance), and then comparing them. The signal is the pooled variance from the terms in the regression model, while the noise is the pooled variance from the left-over residual terms that we assume represent random variability. The F-statistic for the model is calculated by dividing the signal variance by the noise variance; the higher the ratio, the more likely the model is significant (over noise), and this is expressed in Spreadsheet DoE as a probability percentage, where greater than 95% is deemed significant (equivalent to a p-value < 0.05). F-statistics can be evaluated for any signal-to-noise comparison; for example, the significance of individual terms in the model can also be calculated (Table 4). For a model with high confidence, in addition to high probability values for the model terms and the overall model, the model R2 value should be close to one, as this quantifies the proportion of the variation explained by the predictive model (calculated by dividing the model SS by the sum of the residual SS and the model SS). A summary of the ANOVA statistics for models that include increasing numbers of model terms is given in Table 5, which shows that having three terms (A, AB, and B) gives the best statistics and best meets the conditions above, especially since, given similar statistics, it is important to choose the model with fewer terms.
| Source | Sum of squares | df | Mean square | F-statistic | Significance (%) |
|---|---|---|---|---|---|
| Model | 8405.20 | 3 | 2801.73 | 334.94 | 100.00 |
| A | 4685.12 | 1 | 4685.12 | 560.09 | 100.00 |
| AB | 2450.00 | 1 | 2450.00 | 292.89 | 99.99 |
| B | 1270.08 | 1 | 1270.08 | 151.83 | 99.98 |
| Residual | 33.46 | 4 | 8.36 | | |
| Corrected total | 8438.66 | 7 | | | |
| R2 | 1.00 | | | | |
| R2 adjusted | 0.99 | | | | |
| R2 predicted | 0.98 | | | | |
| Model | Model prob % | A (prob %) | AB (prob %) | B (prob %) | C (prob %) | R2 | R2 adj | R2 pred | R2 adj − R2 pred |
|---|---|---|---|---|---|---|---|---|---|
| 1 term in model | 96.61 | 96.61 | | | | 0.56 | 0.48 | 0.21 | 0.27 |
| 2 terms in model | 99.06 | 99.18 | 97.21 | | | 0.85 | 0.78 | 0.60 | 0.18 |
| 3 terms in model | 100.00 | 100.00 | 99.99 | 99.98 | | 1.00 | 0.99 | 0.98 | 0.01 |
| 4 terms in model | 99.98 | 99.99 | 99.98 | 99.93 | 80.89 | 1.00 | 1.00 | 0.99 | 0.01 |
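The ANOVA arithmetic itself is straightforward. The sketch below (a Python illustration using the sums of squares from Table 4 and scipy’s F-distribution, mirroring what Spreadsheet DoE computes with built-in spreadsheet functions) reproduces the model F-statistic, probability percentage, and R2 reported above.

```python
from scipy.stats import f

n_runs = 8
ss_terms = {"A": 4685.12, "AB": 2450.00, "B": 1270.08}   # terms kept in the model
ss_total = 8438.66                                        # corrected total SS (Table 4)

ss_model = sum(ss_terms.values())
df_model = len(ss_terms)
ss_residual = ss_total - ss_model
df_residual = (n_runs - 1) - df_model

ms_model = ss_model / df_model
ms_residual = ss_residual / df_residual
f_stat = ms_model / ms_residual
probability = f.cdf(f_stat, df_model, df_residual) * 100   # >95% deemed significant

r_squared = ss_model / ss_total
print(f"F = {f_stat:.2f}, probability = {probability:.2f}%, R^2 = {r_squared:.3f}")
# Individual terms can be tested the same way, e.g. F_A = SS_A / ms_residual with 1 df.
```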
Step 5: Check model assumptions by analyzing residuals
Residuals are the difference between the observed response values and those predicted by the regression model equation for each experimental run, and they should be normally distributed to fulfil the assumptions used to create the predictive model. Normality can be tested by plotting the residuals against normally distributed theoretical z-scores for the dataset to assess whether a straight line is produced, as well as against the predicted values to assess whether an even vertical spread of data from left to right is produced (Figure 3). If both diagnostic tests are generally passed, then the residuals are deemed normally distributed and independent with constant variance, confirming the statistical assumptions made for the ANOVA approach. Minor deviations from these assumptions are usually acceptable. Furthermore, a plot of the actual responses versus the predicted responses should give a straight line with a slope of 1, showing (in another way) how well the model fits the data. Transforming the responses before forming models can help better meet these assumptions and improve the diagnostic plots; this is discussed further in the spreadsheets.
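As an illustration of this residual check, the short Python sketch below (using the three-term model from step 3; an example under those assumptions rather than part of Spreadsheet DoE) computes the residual for each run and pairs the ordered residuals with theoretical z-scores, the raw ingredients of the normal probability plot in Figure 3.

```python
import numpy as np
from scipy.stats import norm

# Observed responses (standard run order) and model predictions for each run
observed = np.array([85.5, 69.2, 91.3, 9.3, 83.6, 73.1, 97.4, 12.6])
A = np.array([-1, 1, -1, 1, -1, 1, -1, 1])
B = np.array([-1, -1, 1, 1, -1, -1, 1, 1])
predicted = 65.3 - 24.2 * A - 17.5 * (A * B) - 12.6 * B

residuals = observed - predicted

# Normal probability check: ordered residuals vs theoretical z-scores;
# an approximately straight line supports the normality assumption.
order = np.argsort(residuals)
n = len(residuals)
z_scores = norm.ppf((np.arange(1, n + 1) - 0.5) / n)
for z, r in zip(z_scores, residuals[order]):
    print(f"z = {z:+.2f}, residual = {r:+.2f}")
```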
Applications in biochemical research
This method can be applied in various fields for both quick evaluations (screening) and in-depth analysis (characterization). It often delivers 80% of the improvements in results and quality with just 20% of the total effort. When curvature is detected, full optimization (to get that final 20% improvement) may require experimentation with a response surface design, a related approach that includes additional factor levels and requires more specialist software for analysis (for other software, see below). DoE has been used throughout the protein production and characterization pipeline, including PCR, cloning, cell culture, protein expression, purification, crystallization, and characterization (see table 6 in Papaneophytou et al.1 for a summary of typical factors investigated).
DoE helps in the optimization of cell culture conditions in biopharmaceutical production. Factors like oxygen levels, carbon sources, and agitation rates can be adjusted to improve the growth rate and productivity of cultured cells. Protein expression is optimized (often as C-terminal GFP folding reporter fusions) using similar parameters to those in this article’s example (as well as including different N-terminal fusion tags) to maximize protein yields, allowing costs to be reduced and smaller culture volumes to be grown.2 Affinity-tagged protein purification using different bead types, buffers, chromatography method parameters, and other conditions often leads to greater yields,3 and add-on ÄKTA software from Cytiva can be purchased to perform many other chromatography designs. Other popular DoE software packages that can be purchased to perform two-level factorial as well as other DoE designs include Design-Expert (by Stat-Ease), JMP (by SAS), Minitab, and MODDE (by Sartorius). Furthermore, a free online resource called SAmBA (https://www.igs.cnrs-mrs.fr/samba/) will create a DoE design called an (unbalanced) incomplete factorial. This allows researchers to define any number of factors and factor levels in a required number of runs, sampling across the combinatorial space. However, these incomplete factorial designs provide fewer system insights and confidence measures, as they are not balanced and do not have an equal number of experimental runs for each level of every term.
Protein stability in different buffers is often screened by studying factors such as pH, salts, osmolytes such as glycerol, and nondenaturing detergents such as Tween-20. The melting temperature of the protein in the different buffers can be measured to find the most important factors and interactions for keeping the protein stable.4 This is essential for long-term characterization of proteins by NMR and other biophysical methods, where more stable conditions minimize protein degradation during lengthy data collection periods. Protein stability is also essential in the formulation of protein therapeutics and may allow these therapeutics to be stored at higher temperatures while remaining intact and effective as treatments.5 Another area involves finding buffer conditions for optimizing enzyme reactions, protein folding, and protein crystallization.6–8 In fact, DoE can be applied to almost any process where factors can be identified and responses can be measured. Although traditionally used in process industries (agriculture, food, and pharmaceuticals), it has been applied in almost all fields of research and will almost certainly continue to help us better understand and efficiently optimize many more systems in the future.
To help biochemistry researchers who use DoE to optimize the protein production and characterization pipeline and who are looking for emerging optimization trends, any completed Spreadsheet DoE design that researchers share with the author will be categorized and made publicly available as a community resource in the future.
Spreadsheets
The five Spreadsheet DoE workbooks and the 4.8 Spreadsheet with the example data used in this article are available at https://doi.org/10.5281/zenodo.13998442.
Further reading/resources
Reid, D. (2023) Understanding biochemistry: basic aspects of statistics for life sciences. Essays Biochem., 67, 1015–1035 DOI: 10.1042/EBC20220211
Anderson, M.J. and Whitcomb, P.J. (2017) DOE simplified – Practical Tools for Effective Experimentation, Third Edition. CRC Press, Boca Raton
Anderson, M.J. and Whitcomb, P.J. (2016) RSM Simplified – Optimizing Processes Using Response Surface Methods for Design of Experiments, Second Edition. Productivity Press, New York
DoE for chromatography resources from Cytiva. https://www.cytivalifesciences.com/en/gb/solutions/bioprocessing/products-and-solutions/accelerate-biopharmaceutical-process-development/design-of-experiments [Accessed 20 November 2024]
Author information
I work as a Biochemistry Senior Lecturer in the School of Biosciences at the University of Liverpool. I am the programme director for the Biochemistry degree. Before Liverpool, from 2010 to 2017 I worked as an Associate Professor of Biochemistry at Eastern New Mexico University, USA. I also worked as an Assistant Professor in the Chemistry/Biochemistry department at Mount Allison University, Canada (2008-2010), a Post-Doc at the Hospital for Sick Children and University of Toronto, Canada (2004-2008), a Ph.D. student at Cambridge University, UK (1999-2004), an English and Chemistry Teacher in Jalisco, Mexico (1998-1999), a Research Scientist at GlaxoSmithKline (1996-1997), where I first started using DoE, and an Undergraduate student at Leeds University, UK (1994-1998). My research lab uses bioinformatics and biophysical tools to probe protein structure, dynamics and interactions to gain a deeper understanding of protein interaction specificity, and routinely uses Spreadsheet DoE to rapidly optimize the protein production and characterization pipeline. I am also interested in pedagogical research to better understand and generate resources for student success in the lab, which inspired the development of Spreadsheet DoE. Email: [email protected].