Petabytes of increasingly complex and multidimensional live cell and tissue imaging data are generated every year. These videos hold great promise for understanding biology at a deep and fundamental level, as they capture single-cell and multicellular events occurring over time and space. However, the current modalities for analysis and mining of these data are scattered and user-specific, preventing more unified analyses from being performed over different datasets and obscuring possible scientific insights. Here, we propose a unified pipeline for storage, segmentation, analysis, and statistical parametrization of live cell imaging datasets.
Introduction
Cellular behavior is, by definition, dynamic. Cells divide, move, and proliferate, carving out physical trajectories in 2- or 3-dimensional space [1]. They also activate and deactivate signaling pathways [2], and produce RNA and proteins, over both short and long time scales [3]. Much of modern biology has focused on developing techniques to measure these processes; however, traditional fixed-cell measurements impose limitations on retrieving potentially important information.
A great deal of relevant biological information is lost when phenomena are observed with techniques that measure static snapshots of cellular processes without actively tracking them over time and space. Many early fixed-cell methods, such as western or northern blotting [4, 5] or immunostaining [6], leverage specific antibody or oligonucleotide probes to measure the abundance of transcribed RNA, translated protein, or post-translational modifications on proteins that underlie many cellular behaviors. However, such bulk-cell measurements erase information about cell-to-cell heterogeneity that can impact signaling, transcription, or translation between cells. Immunostaining and subsequent imaging can reveal differences between individual cells, but erase information about their movement and spatial dynamics within their niche.
Fixed-cell methods also only capture a single snapshot of cellular behavior, which may represent an incomplete picture of a biological process: for instance, at any given point in time, a pathway of interest may be activated in only a subset of seemingly homogeneous, clonal cells [7, 8]. While improved quantification of pathway activity at the single-cell level has helped attribute a component of this disparity to noise in pathway activation [9], we know now that at least some fixed-snapshot heterogeneity in signaling is produced not by long-lived signaling states but rather by asynchronous temporal signaling dynamics (e.g. frequency, duration) [10, 11]. For instance, screening kinase inhibitors for their effects on single cell Erk activity revealed differences in signaling dynamics — and how they impact cell fate — that could not have been inferred solely from fixed-cell measurements [12]. These dynamics are often themselves heterogeneous between cells, particularly in contexts like the emergence of rare cancer cells where dynamics may eventually help predict and define new cell types [13]. Heterogeneity within populations limits the application of common heuristics like estimating the fraction of time spent in the active state based on the active fraction of a fixed population. On the other hand, live-cell imaging allows for precise measurement of dynamic signaling properties in single cells without the need for such indirect estimates. A recent example of this involved the use of live-cell Erk biosensors combined with oncogene expression, revealing that differential Erk dynamics led to the opposing cell fates of cell cycle arrest or proliferation [13].
Live-cell imaging, which uses brightfield imaging or fluorescent reporters to track and measure behavior in living cells over time, offers a possible solution to the challenges of capturing system dynamics at single-cell resolution. Beginning with the discovery and isolation of green fluorescent protein (GFP) [14] and its subsequent chimerization with proteins to visualize their location and activity, thousands of reporters of protein activity, transcription, post-translational modifications, and even intracellular pH or calcium levels [15–17] have been deployed in both cell culture models and in vivo (for those interested, an extremely comprehensive list can be found at https://biosensordb.ucsd.edu/). Additionally, advancements in imaging resolution, microscope stage-top incubators, and data storage [18–20] have allowed for long-term measurement of cell motility [21, 22], morphology [23, 24], and intracellular activity [25–30] over extended periods of time or at high time resolution, allowing for quantitative statements to be made about dynamic processes, heterogeneity, and information processing in and between cells.
As these imaging modalities have grown increasingly common, petabytes of imaging data have been generated; yet often only a single readout or a small subset of cells is used for subsequent analysis and publication, such as a single measurement of GFP across time, leaving many potential findings about cellular phenotypes and dynamics unexplored. In this sense, the live-cell imaging community lags substantially behind the genomics and transcriptomics communities in terms of accessible and easy-to-use deposition of data, and centralized analysis of those data. The current lack of resources for live-cell imaging is largely technical, as there is no standardization between the many live-cell imaging methods and file formats used to collect these data. Even after the data have been collected, researchers looking to conduct similar analyses or share their analysis pipelines are at the whim of journal-to-journal data- and code-sharing practices, which tend to encourage but not enforce data and code sharing and often say nothing about ease of access. This process can be cumbersome and in fact has led most quantitative microscopists to host their analysis pipelines on decentralized sites such as GitHub, as research-quality code that does not generalize beyond lab-specific data. Rarely are the data made public.
Fortunately, researchers have broadly outlined the phenotypic and dynamics information that they would like to have from live-cell imaging data. While these phenotypes depend heavily on the types of analyses envisioned, the list is both broad and adaptable. The important phenotypes include cellular morphology, population averages, single-cell trajectories, spatial information, and lineage tracing of cell division; more complicated phenotypes related to cellular dynamics include rates and motility, time averages, and pulse detection [31, 32]. Here, we outline and describe a potentially revolutionary vision for the field: a centralized repository of live-cell imaging data coupled with data analysis tools that encompass the range of readouts and downstream analyses imagined by scientists. In particular, we describe the possible uses of these data paired with data processing and analysis tools to develop a broader and deeper understanding of biochemical pathways in cells, as well as local interactions and cellular dynamics in tissues.
The resource
We propose the development of a publicly available and accessible repository for live-cell imaging datasets, and the construction of an accompanying end-to-end workflow that allows scientists to begin with imaging data and an intended set of readouts in mind, and obtain a well-founded and descriptive set of statistics on phenotypes within the data (Figure 1). This system takes as input live-cell imaging data across different experimental conditions (Figure 1a), processes them into a standardized format (Figure 1b), models and controls for biological and experimental confounding, and quantifies both static and dynamic cellular phenotypes (Figure 1c). Importantly, the workflow will include built-in methods for dimension reduction, clustering, and visualization of these experiments, differential analyses across specific conditions, and association tests using metadata (Figure 1d). This allows for both low-dimensional data representation (e.g. 2D principal component analysis, highlighting major sources of statistical differences between two or more samples, shown as an example in the upper panel of Figure 1d) and higher-dimensional data representation (e.g. 3D scatter plots of multiple dimensions quantified from time-course data, such as cell size, movement speed, or signaling activity, shown as an example in the lower panel of Figure 1d).
Figure 1. A centralized repository for live-cell imaging data and analysis.
The proposed repository would enable researchers to (a) access deposited imaging data from previous studies in (b) standardized formats, subjected to standardized processing (e.g. segmentation and tracing), with built-in tools to (c) extract and (d) analyze variables of interest.
The first step of this process involves the development of a public-facing data repository for live-cell imaging data. To do this, we are developing infrastructure for data and software pipeline servers at Gladstone Institutes. Approximating the space required, we assume that a multiplexed imaging experiment consisting of 1000 timepoints imaged per well in a 384-well plate will produce a file requiring roughly 5–10 TB of space (scaling with data resolution and number of imaged channels). Based on the rate at which such imaging datasets are published alongside papers, we estimate that ∼30 such collections would be deposited in the database in the first year and in each subsequent year. However, a far greater number of publications per year (∼100–200) display post-processed data without releasing the associated video files. We expect that the creation of this database will motivate these libraries to be uploaded as well, leading to an estimated 10K live-cell imaging videos plus metadata per year before accounting for growth.
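The storage figures above can be checked with a short back-of-envelope calculation. The sketch below uses only the numbers quoted in the text, reads the quoted size as decimal terabytes, and adds one labeled assumption that is not from the text (a 2048 × 2048, 16-bit single-channel image) purely as a point of comparison.

```python
# Back-of-envelope storage estimate using the figures quoted in the text.
WELLS = 384
TIMEPOINTS = 1000
TB = 10**12  # decimal terabyte in bytes (assumed reading of the quoted unit)

experiment_tb = (5, 10)      # quoted size range per experiment
collections_per_year = 30    # quoted estimate of yearly deposits

# Number of well-timepoints (one or more images each) per experiment.
frames = WELLS * TIMEPOINTS

# Implied data volume per well-timepoint, in megabytes.
per_frame_mb = tuple(size * TB / frames / 10**6 for size in experiment_tb)

# Implied yearly storage from full deposited collections, in terabytes.
yearly_tb = tuple(size * collections_per_year for size in experiment_tb)

# Assumption (not from the text): one 2048 x 2048 image at 16 bits per pixel.
assumed_image_mb = 2048 * 2048 * 2 / 10**6

print(f"{frames} well-timepoints per experiment")
print(f"{per_frame_mb[0]:.1f} to {per_frame_mb[1]:.1f} MB per well-timepoint")
print(f"{yearly_tb[0]} to {yearly_tb[1]} TB per year from {collections_per_year} collections")
print(f"assumed single-channel image: {assumed_image_mb:.1f} MB")
```

Under these assumptions, each well-timepoint carries roughly the volume of a few such single-channel images, which is consistent with a multiplexed, multi-channel acquisition.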
An important bottleneck in current live-cell imaging data sharing is the diversity of file formats produced using different imaging modalities. Different models of microscope and versions of imaging software output different metadata files, and accurate parsing of these files is required for reliable comparisons between conditions within a dataset, as well as between distinct datasets. Currently, open-source software such as ImageJ [33] and the associated Bio-Formats [34] package allow users to view a range of proprietary datasets on most personal computers. We envision a database where such software would be used upon data submission to standardize all data and associated metadata into a common, usable format (Figure 1b,c). In this way, data submissions associated with a particular project can be easily re-analyzed and assessed by the community without first having to download massive files to ensure reproducibility, thus enabling researchers to explore and build on their colleagues’ findings. Previous efforts have assembled a list of minimum metadata recommendations for biological imaging, notably the Recommended Metadata for Biological Images (REMBI) [35] guidelines, which will guide the data standardization built into our pipeline.
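To illustrate what standardizing metadata at submission could look like, the following minimal sketch maps differently named vendor keys onto one common record. The `ImageMetadata` fields, the `KEY_ALIASES` table, and the `normalize` helper are all hypothetical names chosen for illustration; they are not the REMBI schema and not the output of any particular microscope software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageMetadata:
    """Hypothetical common record; field names are illustrative only."""
    pixel_size_um: float
    frame_interval_s: float
    channels: tuple


# Hypothetical vendor-specific key names mapped onto the common fields.
KEY_ALIASES = {
    "pixel_size_um": ("PhysicalSizeX", "pixel_size"),
    "frame_interval_s": ("TimeIncrement", "dt_seconds"),
    "channels": ("Channels", "channel_names"),
}


def normalize(raw: dict) -> ImageMetadata:
    """Translate one vendor-style metadata dict into the common record."""
    values = {}
    for name, aliases in KEY_ALIASES.items():
        found = [raw[key] for key in aliases if key in raw]
        if not found:
            # Refuse incomplete submissions rather than guessing a value.
            raise KeyError(f"missing required metadata field: {name}")
        values[name] = found[0]
    return ImageMetadata(
        pixel_size_um=float(values["pixel_size_um"]),
        frame_interval_s=float(values["frame_interval_s"]),
        channels=tuple(values["channels"]),
    )


# Two made-up submissions describing the same acquisition settings.
scope_a = {"PhysicalSizeX": 0.65, "TimeIncrement": 300, "Channels": ["GFP", "mCherry"]}
scope_b = {"pixel_size": "0.65", "dt_seconds": "300", "channel_names": ("GFP", "mCherry")}
print(normalize(scope_a) == normalize(scope_b))
```

Rejecting records with missing fields, rather than filling defaults, is one possible design choice; it keeps downstream comparisons between datasets from silently mixing physical units or time scales.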
The most novel aspect of this enterprise is the downstream analysis of imaging data within the repository (Figure 1c,d). Currently, some of the most important steps in live-cell imaging analysis — such as the thresholding and watershed functions used in cell segmentation — are largely implemented on a case-by-case basis, and can differ in quality depending on what software is used and how stringent the segmentation parameters are. Unifying segmentation through a trusted workflow using software packages such as TrackMate [36], CellProfiler [37], and DeepCell [38], would make it possible to pass settings seamlessly between users and reduce differences between downstream results stemming from improper, inconsistent, or non-reproducible segmentation workflows. Once the data have been segmented, additional information like location, fluorescence intensity, and measurable physical cell properties (e.g. morphology, size) will be collected across the timecourse and used for downstream analysis.
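As a toy illustration of why segmentation settings need to be shared explicitly, the sketch below segments a tiny made-up frame by thresholding followed by 4-connected component labeling, then collects area, centroid, and mean intensity per object. This is a deliberately simplified stand-in: it omits the watershed step, does not reflect how TrackMate, CellProfiler, or DeepCell work internally, and the function names and pixel values are invented for the example.

```python
from collections import deque


def segment(image, threshold):
    """Label 4-connected regions brighter than `threshold`.

    Returns (labels, n_objects); label 0 is background.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    n_objects = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue
            # New object found: flood-fill it with breadth-first search.
            n_objects += 1
            labels[r][c] = n_objects
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = n_objects
                        queue.append((ny, nx))
    return labels, n_objects


def measure(image, labels, n_objects):
    """Collect area, centroid (row, col), and mean intensity per object."""
    sums = {k: [0, 0, 0, 0] for k in range(1, n_objects + 1)}
    for y, row in enumerate(labels):
        for x, k in enumerate(row):
            if k:
                s = sums[k]
                s[0] += 1
                s[1] += y
                s[2] += x
                s[3] += image[y][x]
    return {
        k: {"area": a, "centroid": (sy / a, sx / a), "mean_intensity": si / a}
        for k, (a, sy, sx, si) in sums.items()
    }


# Made-up 4 x 6 frame containing two bright objects.
frame = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 7, 0],
    [0, 0, 0, 0, 7, 0],
]
labels, n_objects = segment(frame, threshold=5)
cells = measure(frame, labels, n_objects)
print(n_objects, cells)
```

Raising the threshold to 8 in this example would drop the dimmer object entirely, which is the kind of parameter-dependent difference that a shared, recorded workflow is meant to make visible.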
For the final step of the analysis pipeline, we envision a framework that takes variables of interest as input (for example, spatial location, cell size, or GFP expression level) and computes the values of these variables over time and space using established statistical metrics. One example is the analysis of spatial ‘clusters’ of cells; rather than the researcher having to manually label groups of cells that demonstrate a similar phenotype, they could define a specific metric calculated over the entire dataset, such as Geary's C or Moran's I spatial correlation metrics [39]. This would allow for quantitative comparisons between cellular behaviors; for instance, when we see that protein X is expressed in clusters of cells, we can then quantify how frequently such clusters are found throughout the dataset, what the average cluster size is, and what other proteins display similar spatial behavior and in what cell types. Another feature of live-cell imaging data of interest is cellular lineage tracking, which is relevant, for example, to the emergence of differentiated cells in a population of stem cells [40–42], or, more generally, for understanding differential behaviors of daughter cells post-mitosis. There are a number of existing algorithms that can be used to track cell development [43–45], which will output generational and lineage information for each cell in the video. The ability to perform sophisticated analyses on computationally identified lineages of cells can allow scientists to understand the events that precede a particular cell-state change, a goal that unifies many fields within biology.
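As a concrete example of such a metric, the sketch below computes Moran's I for one measured variable across cell positions in a single frame. It assumes binary spatial weights (1 for distinct cells within a chosen `radius`, 0 otherwise); the weighting scheme, the `morans_i` name, and the toy coordinates are illustrative choices, not a prescribed part of the pipeline.

```python
import math


def morans_i(values, coords, radius):
    """Moran's I with binary weights: w_ij = 1 if cells i != j lie within `radius`.

    I = (N / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("values are constant; Moran's I is undefined")
    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                num += dev[i] * dev[j]
                w_sum += 1.0
    if w_sum == 0:
        raise ValueError("no neighboring pairs within radius")
    return (n / w_sum) * (num / denom)


# Four cells in a row, one unit apart (made-up positions and intensities).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = morans_i([1, 1, 0, 0], coords, radius=1.0)
alternating = morans_i([1, 0, 1, 0], coords, radius=1.0)
print(round(clustered, 3), round(alternating, 3))  # 0.333 -1.0
```

Positive values indicate that neighboring cells tend to share similar levels of the variable (spatial clustering), while negative values indicate that neighbors tend to differ.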
The repository proposed here would therefore constitute a fully ‘end-to-end’ resource for biologists. We envision a future in which a researcher attempting to collect live-cell imaging data can use this resource as a hypothesis generator before embarking on experiments such as the endogenous fluorescent tagging of a gene product or the introduction of a specific signaling pathway reporter; preparation of such cell or organism lines often takes substantial time and resources. Much as existing published sequencing datasets can make the construction of a candidate gene list easier for a researcher seeking to conduct a CRISPR screen, leveraging existing data may allow scientists interested in live-cell imaging to narrow their scope and avoid pouring time and resources into the wrong experiments or cell-line development. Moreover, this will allow scientists both to ask harder questions from existing data and to address more challenging problems with future experiments.
Addressing outstanding questions in cell biology
While the repository we have described would be immensely useful for the centralized storage and easy accessibility of high-quality imaging data, leveraging this huge dataset properly using robust, reproducible statistical tools could also offer answers to wide-reaching questions in cell biology. Since the discovery of canonical metazoan signaling pathways over the course of the 20th century [46], much interest has been focused on understanding the behavior of these pathways in individual cells, as well as their impact on neighboring cells and tissues. Over the past decade or so, reporters for signaling pathway activity have enjoyed a renaissance, and newly discovered pathway dynamics have raised a slew of questions that such reporters are uniquely poised to answer. Key measurements to understand the dynamics of a given pathway include the fraction of time spent in an active state, the time that passes between periods of pathway activation in a single cell, and evidence of spatial coupling in pathway activity. With the proper reporters and sufficient cell numbers (usually hundreds of imaged cells), all of these parameters can be recovered using timecourse analysis as well as more sophisticated spatial statistics [12, 47].
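Two of the measurements named above, the fraction of time spent in an active state and the time between periods of activation, can be sketched for a single-cell reporter trace as follows. The fixed `threshold` used to call a frame active, the `pathway_dynamics` name, and the example trace are assumptions made for illustration; real pulse detection would typically need smoothing and a more careful definition of activity.

```python
def pathway_dynamics(trace, dt, threshold):
    """Summarize one single-cell reporter trace.

    trace: reporter values sampled every `dt` time units.
    Returns (active_fraction, gaps), where gaps are the durations between
    the end of one active period and the start of the next.
    """
    active = [value > threshold for value in trace]
    active_fraction = sum(active) / len(active)

    gaps = []
    last_offset = None  # index of the first inactive frame after a pulse
    for i in range(1, len(active)):
        if active[i - 1] and not active[i]:
            last_offset = i
        elif active[i] and not active[i - 1] and last_offset is not None:
            gaps.append((i - last_offset) * dt)
    return active_fraction, gaps


# Made-up trace: three pulses sampled every 5 minutes.
trace = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.2, 0.9, 0.1, 0.2, 0.8, 0.9]
fraction, gaps = pathway_dynamics(trace, dt=5, threshold=0.5)
print(fraction, gaps)
```

Only gaps bounded by activity on both sides are reported, so inactivity before the first pulse or after the last one is not counted as an inter-pulse interval.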
A number of signaling pathways, such as the Ras/ERK pathway, can control a range of biological processes [48], and it remains an open question how information encoded in different dynamics of a single pathway can lead to different downstream outcomes. A centralized database of live-cell signaling data would prove indispensable to answering this question. Looking across all examples of signaling dynamics, across cell types, contexts, and organisms, we may ask whether there are limits on the timescales of signaling behaviors of different pathways, or on the ability of pathways to communicate signaling information between cells. A single research group typically cannot study all of the settings in which dynamics of a certain pathway arise, as complex signaling dynamics have been found in a diverse range of contexts, including cancer, development, and wound healing [2, 49, 50]. Rather, using data collected by research groups around the world, we may uncover a unified set of quantified dynamics that capture the behavior of a certain pathway across cell types, contexts, and treatment conditions. Since many canonical signaling pathways are also evolutionarily conserved, insight about a pathway in a particular cell type or system may inform experiments in different contexts, and serve as a hypothesis generator for future work.
Small molecule or genetic screening would also benefit from the establishment of a live-cell imaging data repository with robust analysis tools. Imaging is commonly used in assays evaluating cancer cell growth in the presence of a drug, or CAR-T cell efficiency in the presence of genetic or chemical perturbations [51, 52]. However, analysis tools for such assays are largely limited to phenotyping counts, e.g. counting the number of live cells over time based on the presence of a fluorescent dye or other such sensor. Live-cell imaging datasets from these experiments may contain rich and complex cellular phenotypes that capture changes to cell state and behavior across time in response to perturbations, including cell morphology, cell motility, and spatial distributions of cells. ‘Freeing’ these data beyond their current limited summary statistics could provide new mechanistic insights into these cellular processes, thereby informing efforts to develop better drugs and personalized cell therapies. This database would ideally parallel the rapidly growing set of repositories for primary cell genomic, transcriptomic, proteomic, and epigenomic data such as PRIDE [53], Cistrome [54], and TCGA [55], which have been widely used to inform and guide basic and translational research.
Previous initiatives have established multiple high-quality repositories for biological imaging data, including the BioImage Archive [56] and Open Microscopy Environment [57]. We anticipate that our resource will integrate seamlessly with these repositories, allowing users to directly import data from outside sources into our pipeline using dedicated plugins. Thus, while our resource would provide overlapping data-storage functions with these existing repositories, users would still be able to take advantage of the data standardization and downstream analysis tools unique to our pipeline regardless of where the data are stored.
Conclusion
As imaging modalities continue to reveal increasingly high-resolution details about the cellular world, the push for accessible, easily processed, and statistically interpretable data has increased as well. We feel that a substantial gap in the cell biology community would be filled by the end-to-end repository and analysis resource presented here.
Competing Interests
The authors declare that there are no competing interests associated with the manuscript.
CRediT Contribution
Barbara E. Engelhardt: Conceptualization, Supervision, Writing — original draft, Writing — review and editing. Siddhartha Gautama Jena: Conceptualization, Writing — original draft, Writing — review and editing. Alexander G. Goglia: Conceptualization, Visualization, Writing — original draft, Writing — review and editing.
Author notes
These authors contributed equally to this work.