Abstract

Recent advances in omics technologies have led to the broad applicability of computational techniques across various domains of life science and medical research. These technologies provide an unprecedented opportunity to collect omics data from hundreds of thousands of individuals and to study gene–disease associations without the aid of prior assumptions about the trait biology. Despite the many advantages of modern omics technologies, interpretation of the big data produced by such technologies requires advanced computational algorithms. I outline key challenges that biomedical researchers face when interpreting and integrating big omics data. I discuss the reproducibility aspect of big data analysis in the life sciences and review current practices in reproducible research. Finally, I explain the skills that biomedical researchers need to acquire to independently analyze big omics data.

Introduction

Recent advances in omics technologies have led to the broad applicability of modern high-throughput technologies across various domains of life science and medical research. These technologies are capable of generating big datasets across large-scale clinical cohorts, which allows connecting complex diseases to relevant genomic features. However, the analysis of big data requires the use of sophisticated bioinformatics algorithms capable of differentiating technical noise from the biological signal in the data. The analysis of big data in the life sciences typically starts with the raw data and concludes with visualization and interpretation of the patterned data produced by analyses (Figure 1). Massive datasets represent a substantial challenge for analysis due to the increased computational footprint associated with handling, processing, and moving information [1,2]. Definitions of big data vary across domains of science, ranging from a simple definition that big data are data too large and complex to be processed using traditional noncomputational approaches [3], to more complex definitions requiring several features to be present before data can be classified as big data. For example, the popular '3 V' definition requires volume, variety, and velocity [3,4]. Applying computational methods to big data will provide the power to make novel biological, translational, and clinical discoveries and push the boundaries of current knowledge of the biology of disease, as well as of the phenotypic and clinical dynamics of disease. The key foundational aspects are bioinformatics methods, which allow the extraction of relevant biological signals from noisy datasets and eventually enable discovery, translation, and actionable applications. Modern biomedical datasets contain tens of thousands of samples and petabytes of raw data (e.g. TCGA [5] and GTEx [6]); they typically satisfy the definition of big data and are considered big biomedical data [6–8]. In contrast with raw biomedical and sequencing data, summary statistics extracted from such datasets (e.g. recount2 [9]) are significantly smaller and require fewer computational resources to process and analyze. In this review, I discuss raw biomedical data and the challenges and opportunities associated with processing and analyzing such datasets. I also discuss the computational skills required for the analysis and interpretation of big omics data. These skills include the ability to operate a command line and to run bioinformatics tools directly on high-performance clusters.

Figure 1.

The workflow of big data analysis and interpretation in the life sciences.

Acquiring such skills can be challenging for many life science and medical researchers. At present, the bioinformatics community lacks mechanisms for researchers to effectively learn to analyze big omics data.

The ability to leverage various types of omics datasets from large-scale clinical cohorts is essential to study the functional mechanisms underlying the connections between genetics, the immune system, and disease etiology. For example, the availability of rich omics data generated by the TCGA consortium [5] provides an exciting opportunity to discover how genetic variation affects the development, progression, and drug response of cancerous tumors [10]. Despite many studies using TCGA data for basic research [11], the translational value of such datasets is yet to be fully unlocked [12].

In addition, bioinformatics methods for large-scale clinical cohorts promise to identify novel markers prognostic of disease risk across a variety of diseases, including cancer.

Despite the increasing size and complexity of datasets in the biological and medical sciences, many biomedical researchers today lack sufficient computational skills to analyze the large-scale data they generate. At present, this digital gap in contemporary biology limits the potential of biomedical researchers to creatively explore their data. Furthermore, the digital gap limits the collaborative potential of biomedical researchers and computational scientists [13]. Training life scientists in computational techniques can potentially narrow this digital gap, expand the skills of biomedical researchers, and improve their ability to leverage the consultation services they seek from computational scientists. Such training should include both hands-on sessions and sessions covering the principles of computational methods, especially the assumptions and heuristics of bioinformatics methods. A conceptual understanding of bioinformatics algorithms allows biomedical researchers to better understand and interpret bioinformatics results. Skills such as automating tasks with the Unix command line are necessary to process big data on high-performance clusters; a minimal example is sketched below. Knowledge of Python or R is required to analyze data on a laptop and visualize the obtained results.
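As an illustration of such task automation, the shell sketch below runs a routine quality-control step over many sequencing files. The directory layout, file names, and the choice of FastQC are assumptions made for the example, not a prescribed workflow.

    # Loop over all FASTQ files in a (hypothetical) project directory and
    # run FastQC on each one, collecting the reports in a separate folder.
    mkdir -p qc_reports
    for reads in data/*.fastq.gz; do
        fastqc "$reads" --outdir qc_reports
    done

A three-line loop of this kind replaces hundreds of repetitive point-and-click operations, which is precisely the benefit command-line training aims to convey.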

Bioinformatics methods in life science research

During the past decade, the rapid advancement of omics technologies has led to the development of an enormous number and diversity of bioinformatics algorithms across various fields of modern biology [14]. When applied to high-dimensional clinical datasets, bioinformatics algorithms allow the identification of novel disease subtypes and the discovery of novel markers that may be prognostic of disease risk. Such bioinformatics algorithms are usually encapsulated as computational software tools [15]. The majority of bioinformatics tools are designed for the UNIX operating system, which requires a user to operate the tools from the command line, without the benefit of a graphical user interface (GUI). In addition, the UNIX framework allows different bioinformatics tools to communicate through pipes without having been designed explicitly to work together. Pipes also avoid the creation of unnecessary temporary files between the steps of a pipeline.
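For example, a read aligner and an alignment sorter can be chained with a pipe so that no intermediate SAM file is ever written to disk. The sketch below assumes the widely used bwa and samtools programs, a pre-indexed reference genome, and hypothetical file names:

    # Align paired-end reads and sort the alignments in a single step;
    # the aligner streams its output directly into the sorter via a pipe,
    # so no temporary SAM file is created.
    bwa mem reference.fa sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -o sample.sorted.bam -

Because an uncompressed SAM file can be several times larger than the compressed BAM output, avoiding the intermediate file saves both disk space and input/output time.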

Barriers to interpreting and integrating big data in the life sciences

A major barrier in interdisciplinary studies is the lack of a common communication style between trained researchers working in increasingly specialized academic disciplines [16]. Today's big data projects require that life science researchers either learn how to use command-line tools or outsource their data analysis to computational experts. Active engagement of life science researchers in analyzing the data they generate is essential for advancing interdisciplinary research in the biological sciences. However, biologists and medical researchers often lack formal training in the use of computational techniques. Scholars have developed teaching models aimed at supporting the transition of biomedical researchers from a graphical interface (e.g. Microsoft Excel) to the UNIX command line [13,17,18]. Biomedical researchers often possess varying levels of computational skills; workshops require adjusting the teaching pace for trainees with different computational backgrounds [19]. In addition, the ability of trainees to switch from a mouse-driven graphical interface to a command-line interface with no mouse support varies across life science researchers.

Training a life science researcher to use computational techniques poses unique challenges and requires a special approach. For example, such training does not require cultivating a deep understanding of fundamental computer science principles. Instead, it is limited to applied learning, requiring quick assimilation of the introduced techniques and acquired skills within the context of the research project of interest. In general, flexible pedagogy is preferred. For example, instead of introducing fundamental computational concepts formally, such concepts are introduced on an as-needed basis, are combined with numerous hands-on examples, and rely on the instructor's guidance to consolidate the learner's newly acquired knowledge.

Biomedical researchers who have completed the training in computational skills often continue to engage with the command line and with training instructors long after the training session has formally ended. Researchers who have completed the workshops are also able to better engage with computer scientists when seeking consultation services for their projects. The training model I developed [13] has helped life science researchers to learn the ‘language of computing’, a skill that allows them to better understand what analyses can be accomplished with UNIX and how to ask for specific types of help from computational scientists.

Alternatively, biomedical researchers can delegate large-scale data analyses to bioinformatics cores. However, outsourcing analyses presents several challenges. Many complex issues that are difficult to predict in advance arise during the analysis of big omics data. In such cases, life science and biomedical researchers can optimize the analysis if they are adequately trained and remain involved in the analytical workflow. In addition, research groups utilizing a core often want to move the project in directions different from what was originally proposed. Another approach is to develop a GUI that allows researchers with a limited computational background to easily create, run, and troubleshoot analytical pipelines. Although useful to researchers with a limited computational background, such interfaces may have limited computational capacity compared with high-performance clusters and might not be suitable for the analysis of big omics data generated from some large-scale clinical cohorts.

Computational resources required to analyze big omics data

Computational resources required to analyze big omics data differ significantly depending on the step of the analysis. Analyses dealing with raw data typically require a significant amount of computational resources to perform essential tasks and space to store the output data; such analyses can typically be performed only on high-performance clusters. For example, the analysis of differentially expressed genes based on RNA-Seq data starts with the raw reads produced by the sequencing machines and concludes with a list of differentially expressed genes and the corresponding gene expression fold changes. The computational resources and storage needed to run the analysis and keep its results vary significantly across the steps of the analysis. For example, the total space required to store the raw sequencing data of one sample is ∼15 GB (Figure 2). It is possible to store such data on a regular workstation with no need for a large high-performance computing (HPC) environment. However, the increasing size of cohorts composed of thousands of individuals makes storing data on a regular workstation impractical; a regular workstation also lacks the computational power to process thousands of samples. As a result, the steps of the analysis requiring significant computational resources and storage can be performed only on high-performance clusters or using cloud computing. The decreasing cost of cloud computing makes it an attractive alternative to well-established high-performance clusters; its cost is compared with that of high-performance clusters elsewhere [20].
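A back-of-envelope estimate makes the scaling problem concrete; the cohort size below is a hypothetical value chosen only for illustration:

    # ~15 GB of raw sequencing data per sample, for a hypothetical
    # cohort of 10,000 individuals:
    echo "$((15 * 10000)) GB of raw data"   # prints: 150000 GB of raw data

Raw data alone would therefore occupy roughly 150 TB, far beyond the capacity of a typical workstation and firmly in HPC or cloud territory.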

Figure 2.

The amount of space needed to store the results of the sequencing analysis.

The size of each step is shown as a grid of boxes, and each box is equivalent to storage of 10 MB.

Other steps require a smaller amount of resources and thus can be performed locally on a personal computer or laptop. Typically, the analysis performed on a local machine does not require command-line skills. Such analyses typically involve various statistical analyses and visualization steps, and these tasks can be performed using the widely popular statistical language R [21]. For example, once gene expression levels have been obtained from RNA-Seq data on a high-performance cluster, one can transfer them to a personal computer and locally perform differential expression analysis using available R packages [22].
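A minimal sketch of this hand-off is shown below. The host name, file names, and sample-table layout are assumptions made for illustration; the embedded R snippet follows the standard DESeq2 workflow [22]:

    # Copy the gene-level count matrix from the cluster to a laptop:
    rsync -avz user@hpc.example.edu:project/counts.tsv .

    # Run a basic DESeq2 analysis locally (assumes counts.tsv and a
    # samples.tsv table containing a column named condition):
    Rscript -e '
      library(DESeq2)
      counts  <- as.matrix(read.table("counts.tsv", header = TRUE, row.names = 1))
      samples <- read.table("samples.tsv", header = TRUE, row.names = 1)
      dds <- DESeqDataSetFromMatrix(countData = counts, colData = samples,
                                    design = ~ condition)
      dds <- DESeq(dds)   # fit the model and perform the tests
      write.csv(as.data.frame(results(dds)), "differential_expression.csv")
    '

The count matrix is typically a few megabytes, so the entire downstream analysis fits comfortably on a laptop even when the raw reads do not.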

Computational reproducibility in life science research

An astonishing number of bioinformatics software tools designed to accommodate increasingly bigger, more complex, and more specialized bio-datasets are developed each year [14]. With the increasing importance and popularity of computational and data-enabled approaches among biomedical researchers, it becomes ever more critical to ensure that the developed software is usable [23] and that the uniform resource locator (URL) through which the software tool is accessible is archivally stable. Consistently usable and accessible software provides a foundation for the reproducibility of published biomedical research, defined as the ability to replicate published findings by running the same computational tool on the data generated by the study [23,24].

Open data, open software, and reproducible research are important aspects of big data analysis in the life sciences [25]. Reproducing previously published results can be made possible by releasing all research objects, such as the raw data and publicly available, archivally stable, and installable computer code. However, a lack of strict implementation or enforcement of journal policies for resource sharing harms rigor and reproducibility, as some authors refuse to share the data [26] or the source code. Even when the code and the data are shared, it can still be challenging to computationally reproduce the results of a published paper [27]. One technique enabling computational reproducibility is literate programming, which generates documents combining code, narrative, and outputs, including figures and tables, so that the reader can understand how the research results were obtained. One such platform able to mix code with accompanying documentation and text notes is Jupyter [28]. This popular platform allows the reader to follow the documentation, run the code, and visualize results in a single notebook, usually opened in the browser [29]. In addition, containers and virtual machines make it possible to avoid installability issues and instantly run code across various operating systems and environments; examples of such technologies include Docker [30], Vagrant [31], and Singularity [32]. A recent case study provided an example of documentation and tutorials allowing readers to easily reproduce results using a Jupyter/IPython notebook or a Docker container [29]. Other techniques and methodologies enabling computational reproducibility are discussed elsewhere [33].
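As an illustration of the container approach, the sketch below launches a Jupyter notebook server inside a Docker container. The jupyter/datascience-notebook image is a public image on Docker Hub, and the mounted directory is a hypothetical project folder:

    # Start a containerized Jupyter environment and mount the current
    # directory so notebooks can read and write project files:
    docker run --rm -p 8888:8888 -v "$PWD":/home/jovyan/work \
        jupyter/datascience-notebook

    # The server prints a tokenized URL; opening it in a browser yields
    # the same software environment on any machine that runs Docker.

Because the image pins the operating system, Python, R, and library versions, a collaborator who runs the same command obtains an identical analysis environment without any manual installation.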

Despite these challenges, consistently usable and accessible software provides a necessary foundation for rigorous and reproducible data-intensive biomedical research [24,26]. In addition, the usability, or 'user-friendliness', of software tools is important and can affect their scientific utility. Currently, an estimated 74% of computational software resources are accessible through the URLs published in the original paper [34]. Many developed tools are difficult to install, and some are impossible to install [34]. Kumar and Dudley [35] warn that poorly maintained or implemented tools will hinder progress in 'big data'-driven fields.

Many journals now require that the omics data generated by a published study be shared when the paper is released, which is an important step toward improving computational reproducibility in our field. However, the bioinformatics community still lacks comprehensive policies on precisely how the openly shared code used to perform analyses and generate figures should be made available. In a promising effort, editors of the journal eLife suggest that the R code used to generate a figure should be shared together with the figure [36].

Discussion

High-throughput technologies have changed the landscape of training, research, and education in biomedical fields [37]. Big data generated by those technologies across large-scale clinical cohorts can potentially enable a researcher to connect complex diseases to relevant omics features. As our knowledge of scientifically validated disease–trait matches increases, new opportunities emerge for the development of novel diagnostic and therapeutic tools. However, the analysis of big data requires the use of sophisticated bioinformatics algorithms, which are often packaged as command-line-driven software tools. A researcher who wishes to use such tools must acquire specific computational skill sets, which are not included in the traditional life science curriculum at major universities. With the increasing size and complexity of big omics datasets in the biological and medical sciences, researchers face a growing dilemma: devote the time to acquire key computational skills, or outsource the analyses to computational researchers.

At present, biomedical researchers are not involved in computational training on a large scale worldwide. In this review, I provide evidence that the computational training model, in which life science research groups receive training and resources to analyze the data they generate, is a more sustainable approach. This model, when successfully applied across many research institutions worldwide, has the potential to change the landscape of contemporary biomedical research, training, and education. If a critical mass of biomedical researchers obtains computational skills sufficient for analyzing big data, computational training will more likely become an integral part of the curricula at these institutions.

The computational training model offers benefits for both individual researchers and the scientific community. Life science and biomedical researchers gain a competitive skill when learning to conduct analysis in a command-line setting. Today's omics experiments generate files too large to be opened on a personal computer. These novice computational researchers often must perform their analyses on a high-performance cluster with command-line tools, and in the process they become familiar with programming and basic system administration tasks. Such valuable skills can be leveraged to further a researcher's projects and career.

One important outcome of a comprehensively implemented computational training model is improved reproducibility in the big data-driven fields of life science and medical research. The standard for rigorous and reproducible analysis is an emerging topic, with multiple initiatives across research groups. The scientific community has identified current challenges to ensuring the reproducibility of big data analysis and interpretation in the life sciences [26]. For example, eLife has raised the bar of reproducibility by challenging the traditional static representation of data and analysis results (usually PDF or HTML formats). Instead, eLife now suggests a code-based publication, which enables data and analysis to be fully reproducible by the reader [38,39].

Summary

  • Recent advances in omics technologies have led to the broad applicability of computational techniques across various domains of life science and medical research. These technologies provide an unprecedented opportunity to collect omics data from hundreds of thousands of individuals and to study gene–disease associations without the aid of prior assumptions about the trait biology.

  • Interpreting and integrating big data produced by omics technologies require advanced computational algorithms. Despite the increasing size and complexity of datasets in the biological and medical sciences, many biomedical researchers today lack sufficient computational skills to analyze the large-scale data they generate.

  • The computational training model of life science researchers, when successfully applied across many research institutions worldwide, has the potential to change the landscape of contemporary biomedical research, training, and education. If a critical mass of life science researchers obtains computational skills sufficient for analyzing big data, computational training will more likely become an integral part of biomedical education.

  • One important outcome of a comprehensively implemented computational training model is to improve reproducibility in the big data-driven fields of life science and medical research.

Abbreviations

     
  • GUI: graphical user interface

  • HPC: high-performance computing

  • URL: uniform resource locator

Funding

S.M. acknowledges support from a QCB Collaboratory Postdoctoral Fellowship, and the QCB Collaboratory community directed by Matteo Pellegrini.

Acknowledgments

I thank the anonymous reviewers for comments on an initial draft of this manuscript, which resulted in an improved publication. I thank Dr. Lana Martin for discussions and helpful comments on the manuscript.

Competing Interests

The Author declares that there are no competing interests associated with this manuscript.

References

1. Mattmann, C.A. (2013) Computing: a vision for data science. Nature 493, 473–475
2. Marx, V. (2013) The big challenges of big data. Nature 498, 255–260
3. Mehta, N. and Pandit, A. (2018) Concurrence of big data analytics and healthcare: a systematic review. Int. J. Med. Inform. 114, 57–65
4. De Mauro, A., Greco, M. and Grimaldi, M. (2016) A formal definition of Big Data based on its essential features. Library Rev. 65, 122–135
5. Cancer Genome Atlas Research Network, Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R.M., Ozenberger, B.A., Ellrott, K. et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120
6. GTEx Consortium (2017) Genetic effects on gene expression across human tissues. Nature 550, 204–213
7. Vivian, J., Rao, A.A., Nothaft, F.A., Ketchum, C., Armstrong, J., Novak, A. et al. (2017) Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol. 35, 314–316
8. Siva, N. (2015) UK gears up to decode 100,000 genomes from NHS patients. Lancet 385, 103–104
9. Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S.E., Taub, M.A., Hansen, K.D. et al. (2017) Reproducible RNA-seq analysis using recount2. Nat. Biotechnol. 35, 319–321
10. Thorsson, V., Gibbs, D.L., Brown, S.D., Wolf, D., Bortone, D.S., Ou Yang, T.-H. et al. (2018) The immune landscape of cancer. Immunity 48, 812–830.e14
11. Park, Y. and Greene, C.S. (2018) A parasite's perspective on data sharing. Gigascience 7, giy129
12. Perera-Bel, J., Leha, A. and Beißbarth, T. (2019) Bioinformatic methods and resources for biomarker discovery, validation, development, and integration. Predictive Biomarkers Oncol., 149–164
13. Mangul, S., Martin, L.S., Hoffmann, A., Pellegrini, M. and Eskin, E. (2017) Addressing the digital divide in contemporary biology: lessons from teaching UNIX. Trends Biotechnol. 35, 901–903
14. Wren, J.D. (2016) Bioinformatics programs are 31-fold over-represented among the highest impact scientific papers of the past two decades. Bioinformatics 32, 2686–2691
15. Altschul, S., Demchak, B., Durbin, R., Gentleman, R., Krzywinski, M., Li, H. et al. (2013) The anatomy of successful computational biology software. Nat. Biotechnol. 31, 894–897
16. Via, A., Blicher, T., Bongcam-Rudloff, E., Brazas, M.D., Brooksbank, C., Budd, A. et al. (2013) Best practices in bioinformatics training for life scientists. Brief. Bioinform. 14, 528–537
17. Schneider, M.V. and Jimenez, R.C. (2019) Bioinformatics: scalability, capabilities and training in the data-driven era. Brief. Bioinform. 20, 735–736
18. Mariano, D., Martins, P., Helene Santos, L. and de Melo-Minardi, R.C. (2019) Introducing programming skills for life science students. Biochem. Mol. Biol. Educ. 47, 288–295
19. Guerfali, F.Z., Laouini, D., Boudabous, A. and Tekaia, F. (2019) Designing and running an advanced bioinformatics and genome analyses course in Tunisia. PLoS Comput. Biol. 15, e1006373
20. Dudley, J.T., Pouliot, Y., Chen, R., Morgan, A.A. and Butte, A.J. (2010) Translational bioinformatics in the cloud: an affordable alternative. Genome Med. 2, 51
21. Cornillon, P.-A., Guyader, A., Husson, F., Jegou, N., Josse, J., Kloareg, M. et al. (2012) R for Statistics, CRC Press
22. Love, M.I., Huber, W. and Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550
23. List, M., Ebert, P. and Albrecht, F. (2017) Ten simple rules for developing usable software in computational biology. PLoS Comput. Biol. 13, e1005265
24. Beaulieu-Jones, B.K. and Greene, C.S. (2017) Reproducibility of computational workflows is automated using continuous analysis. Nat. Biotechnol. 35, 342–346
25. Murphy, F. (2018) Open access, open data, FAIR data and their implications for life sciences researchers. Emerging Top. Life Sci. 2, 759–762
26. Stodden, V., Seiler, J. and Ma, Z. (2018) An empirical analysis of journal policy effectiveness for computational reproducibility. Proc. Natl Acad. Sci. U.S.A. 115, 2584–2589
27. Kenall, A., Edmunds, S., Goodman, L., Bal, L., Flintoft, L., Shanahan, D.R. et al. (2015) Better reporting for better research: a checklist for reproducibility. Genome Biol. 16, 141
28. Project Jupyter. https://www.jupyter.org (accessed 27 May 2019)
29. Kim, Y.-M., Poline, J.-B. and Dumas, G. (2018) Experimenting with reproducibility: a case study of robustness in bioinformatics. Gigascience 7, giy077
30. Enterprise Application Container Platform | Docker. https://www.docker.com/ (accessed 27 May 2019)
31. Introduction — Vagrant by HashiCorp. https://www.vagrantup.com/intro/index.html (accessed 27 May 2019)
32. Singularity. https://singularity.lbl.gov/ (accessed 27 May 2019)
33. Piccolo, S.R. and Frampton, M.B. (2016) Tools and techniques for computational reproducibility. GigaScience 5, 30
34. Mangul, S., Martin, L.S., Eskin, E. and Blekhman, R. (2019) Improving the usability and archival stability of bioinformatics software. Genome Biol. 20, 47
35. Kumar, S. and Dudley, J. (2007) Bioinformatics software for biologists in the genomics era. Bioinformatics 23, 1713–1717
36. Frank, M. and Hartgerink, C. (2017) RMarkdown for writing reproducible scientific papers. https://libscie.github.io/rmarkdown-workshop/handout.html (accessed 15 June 2019)
37. Hayden, E.C. (2015) Genome researchers raise alarm over big data. Nature
38. Perkel, J.M. (2019) Pioneering 'live-code' article allows scientists to play with each other's results. Nature 567, 17–18
39. Introducing eLife's first computationally reproducible article. eLife (2019)