Due to genetic heterogeneity across patients, the identification of effective disease signatures and therapeutic targets is challenging. Addressing this challenge, we have previously developed a network-based approach, which integrates heterogeneous sources of biological information to identify disease specific core-regulatory networks. In particular, our workflow uses a multi-objective optimization function to calculate a ranking score for network components (e.g. feedback/feedforward loops) based on network properties, biomedical and high-throughput expression data. High ranked network components are merged to identify the core-regulatory network(s) that is then subjected to dynamical analysis using stimulus–response and in silico perturbation experiments for the identification of disease gene signatures and therapeutic targets. In a case study, we implemented our workflow to identify bladder and breast cancer specific core-regulatory networks underlying epithelial–mesenchymal transition from the E2F1 molecular interaction map.
In this study, we review our workflow and described how it has developed over time to understand the mechanisms underlying disease progression and prediction of signatures for clinical decision making.
Technological advances generate a remarkable variety of data to characterize the components of molecular and cellular systems. When we started our work in cancer systems biology 20 years ago, we used these data to describe small networks, in particular signaling pathways. The argument has been that these small networks are involved in cellular functions (e.g. apoptosis) that are in turn relevant to higher level process (e.g. metastasis). Over the years, the variety of data becoming available grow exponentially, making it increasingly difficult to rely only on small models built from ordinary differential equations (ODEs). Twenty years on, the boundary between bioinformatics and systems biology approaches is blurred, and for most projects we use a variety of methodologies combining statistical analyses, graph theory, dynamical systems theory, Boolean modeling, sequence analyses and database searches.
As we progress with our understanding of the molecular mechanisms underlying diseases like cancer, our approaches are becoming more complex. Given the complexity of cancer, we are still in the business of ‘intelligent guessing’ and it would be irresponsible to see another major breakthrough right around the corner. In our work, a noticeable trend is the development of ‘workflows’ that combine various approaches and integrate a variety of data. The iterative cycle of data-driven modeling and model-driven experimentation, which characterized systems biology 20 years ago, is still there but embedded in a workflow, designed and implemented by data scientists that handle tools from machine learning to systems theory.
For the present text, we review our efforts toward mechanistic understanding and the development of workflows that combine systems biology and bioinformatics approaches to process data, including (i) tools to identify expression patterns for normal and diseased states; (ii) graph theoretical analyses of biochemical networks to identify important hub nodes and small recurring regulatory motifs; and (iii) dynamical systems theory to simulate the dynamics of core-regulatory networks. Individually these tools and methods cannot capture the entire biological complexity associated with regulatory processes involved in diseases.
For example, the expression patterns identified by data analysis tools cannot explain the underlying mechanisms and causes, the utility of graph theory is confined only to static analysis of the system, and also each of the modeling formalism offered by dynamical systems theory has its own limitations. Integration of multiple systems biology and bioinformatics tools/methods, here named as ‘integrative workflows’, are a promising tool to analyze large-scale and complex biochemical networks to provide insights into complex diseases. The integrative methods in systems biology are scarce but highly demanded [1,2].
In this context, we describe an integrative workflow that we have previously developed and published  by combining heterogeneous sources of biological information to analyze large-scale biochemical networks to reveal mechanisms underlying disease progression (see Figure 1). The workflow combines network structural analysis with high-throughput and biomedical data to identify tumor-type specific ‘core-regulatory networks’ that are amenable for analysis with dynamical systems theory. Using suitable modeling formalism, the dynamics of molecular interactions in core networks can be analyzed to detect disease specific molecular signatures (i.e. sets of network derived diagnostic/prognostic biomarkers) and therapeutic targets. The proposed molecular signatures and drug targets are validated by clinical data that are further subjected to in vitro experiments. The integrative workflow cannot provide an exact representation of cellular events but it can guide the formulation and falsification of hypotheses and suggest validation experiments.
Network-based integrative workflow
Here, we review our own workflow in detail that was applied to a large-scale E2F1 molecular interaction map, created over several years and which was then used to identify core-regulatory networks followed by distinctive molecular signatures for epithelial–mesenchymal transition (EMT) regulation and potential therapeutic targets in invasive bladder and breast cancer .
Integrative workflow for the identification of disease signatures and therapeutic targets
In order to exploit the advantages of large-scale biochemical networks, high-throughput data, in combination with mechanistic modeling, we require integrative approaches and computational workflows to identify disease specific small regulatory/function modules that can be subjected to a more detailed analysis, followed by the prediction of molecular signatures and therapeutic targets. The integrative workflow can be realized in the following steps:
Network construction, curation and annotation
Cellular functionality is realized due to a large variety and number of components interacting in various ways. To elucidate the structure and functional roles of interacting components in a given cellular (mal) function, it is essential to initially organize data and knowledge about them in a network (i.e. vertices connected to edges). Biochemical networks are the formalized representation of large number of experimental studies in a diagrammatic format, which are computer readable and can be analyzed using computational tools. These networks serve as a knowledge base and provide the foundation for computational analyses, providing hints about the causal interactions among components that realize certain functionality. Such networks contain physical associations underlying protein–protein, transcription factor-genes or signaling and metabolic pathways.
Biochemical network construction is a painstaking exercise of manual validation and encoding. Various approaches are used to map out molecular interactions underlying certain biological processes. A number of computational techniques have been developed to infer biochemical networks from high-throughput data [4,5]. Despite their importance, they face many challenges, for example false positive and false negative interactions largely influence the results of network inference. More detailed and highly focused network, centered around particular disease or cellular process, can be constructed by expert domain knowledge (functional and structural information), diligent manual search of published literature and publically available databases, e.g. STRING , Reactome , IntAct , and HPRD . To avoid the laborious manual curation for network construction, some methods are developed to automatically reconstruct networks by retrieving interactions or sub-networks from existing maps and models [10,11]. Combining automatic reconstruction with domain knowledge, manual search of literature and databases would provide a reasonable strategy to construct detailed and fully annotated large-scale biochemical networks. Depending on the context, one can also add various regulatory layers, for example, transcriptional regulation (from TRANSFAC , TRRUST  and HTRIdb ) and post-transcriptional regulation by miRNA and lncRNA (from miRTarBase , miRWalk , TriplexRNA , lncRNADisease  and lnc2Catlas ).
The standard representation of these maps in the form of Systems Biology Graphical Notation (SBGN) allows for an easy re-usability and exchange. A large number of pathway tools provide SBGN compatibility (see http://sbgn.github.io/sbgn/software_support/). The CellDesigner , SBGN-ED  and VANTED  are the widely used tools to draw them in SBGN format.
Network analysis and motif detection
After gathering and managing information regarding biological systems in the form of interactions networks, structural and dynamical analyses provide useful information about the network architecture and dynamics. The structural analysis of networks allows the identification of functional modules, regulatory motifs (including feedback and feedforward loops) and associated node properties (including node degree (ND) and betweenness centrality (BC)). Node degree is the number of edges (or links) connected to a node. For example, in Figure 2 node ‘B’ has six edges, thus its degree is six (i.e. ND (B) = 6).
Network structural properties
Similarly, BC identify gate keeper nodes in the communication between different parts in a network. More specifically, BC of a node is the number of shortest paths from all nodes to all others that pass through that node. For example, in Figure 2 node ‘A’ connects different parts of a network, so it gets the highest BC value . In a network G, a node υ that occurs in many of the shortest paths has highest betweenness centrality. The betweenness centrality BC() of a node is computed as follow:
where σij(υ) denotes the total number of shortest paths between i and j that pass through node and σij denotes the total number of shortest paths between i and j.
Further, it is established that biological networks can be described in terms of modules. A module is the aggregation of densely interconnected neighboring nodes as shown by the green box in Figure 2. It is observed that functionally akin nodes are located in close proximity and, thus, form functional modules [24,25]. These functionally related genes can be associated with the same biological pathway and can have similar effects on certain disease phenotypes and may be targeted by structurally similar drugs [26,27]. Many methods are developed to identify novel disease molecules based on this ‘guilt by association’ approach from the biological network. For example, gene prioritization (GP) is a method to identify important genes or proteins in a network based on the network structure and a set of known markers. Instead of local neighborhood, it uses a global diffusion method to identify genes based on random walk from known genes. The genes that are most frequently traversed by random walks from known genes, are assumed to be important that may be associated with a disease or a phenotype, for more detail on gene prioritization see .
Further, biological networks are enriched in recurring structural patterns called network motifs . The feedback (in Figure 2 loop form by nodes B, E and F) and feedforward (in Figure 2 loop form by nodes B, C and D) loops are important network motifs. In Figure 2 loop form by nodes B, E and F is a feedback loop and feedforward loop is formed by nodes A, C and D. These network motifs induce non-intuitive behavior and play a crucial role in system dynamics [28,29].
Network structure analysis revealed that the functioning and regulation of a network are governed by a certain set of organizing principles . To understand the mechanisms of these organizing principles, mechanistic dynamical models are used to analyze the systems. The dynamical properties of a network characterize the temporal behavior of a network under certain conditions, which can help to explain the nature of regulation of interacting components in response to stimuli/perturbations that affect the functionality of cells, and ultimately their consequences on the cellular phenotype.
Disease specific core-regulatory network prediction
We have developed a flexible and extendable integrated workflow (Figure 3) that combines network structural analysis with high-throughput data and biomedical data to identify smaller ‘module(s)’ for specific disease phenotypes, we hereafter refer this as ‘core-regulatory network’ that is amenable for analysis with dynamical systems theory [31–33].
Integrative workflow for identification of disease specific core-regulatory network
The workflow prioritizes network recurring structures (i.e. motifs including feedback/feedforward loops) using the network structural properties and high-throughput data, more specifically, it includes network structural parameters for each node (Nn) including (i) node degree (ND), (ii) betweenness centrality (BC), (iii) gene prioritization (GP) and disease specific parameters, such as (iv) relatedness to disease pathway (DP) and (v) high-throughput fold-change (FC) expression data.
The relatedness to disease pathway determines the association of motifs (Mk) with a certain disease pathway as follows:
where n is the number of nodes in a motif present in a disease pathway, and its value is (0 ≤ n ≤ length of a motif). k = 1,…m, for motifs M.
Based on these parameters, we created four prioritization sub-functions (fh) (fND+BC; fGP; fDP; fFC) that are used in multi-objective function in eqn (3) to calculate the ranking score () of a motif:
where h = 1…4 number of sub-functions. w is the user-defined weighting factor for l number of scenarios that are used to optimize the function (F). One can iteratively modify the values of the weighting factors to optimize the function (F) for one or multiple sub-functions. In this case, no single solution exists that simultaneously optimizes the function (F). Thus, one can have number of Pareto optimum solutions. A solution is called Pareto optimal, if none of the objective functions can be improved in value without degrading some of the other objective values. For example, to optimize (maximize) F for each objective function f1 and f2:
where the values of w1 and w2 can be iteratively modified to maximize the function (F) for each set of weighting factor values. The Pareto set is obtained by merging all the non-identical solutions.
In the presented case study, in order to approximate the Pareto set of all non-dominated functions (), we used 13 different weighting scenarios (where one or two sub-functions are highly weighted, i.e., assign high importance for ranking). From the Pareto set of motifs P(Mkl), top ranked motifs (e.g. top 10 highest score motifs) are selected. The core-regulatory network is obtained by taking union of the top ranked motifs. The core-regulatory networks are amenable for dynamical analysis to identify disease specific molecular signatures and therapeutic targets, which are subjected to experimental validation.
In one case study, we used the E2F1 molecular interaction map  to identify tumor specific core-regulatory networks for EMT regulation in bladder and breast cancer. We identified a large set of feedback loops (n = 444; 213 positive, 228 negative and 3 neutral feedback loops) from the map using Cytoscape plugin NetDS  (v3.0). We considered only three-node feedback loops due to the fact that larger sized network loops are typically composed of one or more smaller sized loops [29,46]. For the disease pathway association, we used KEGG’s cancer disease pathway (KEGG: hsa05200) and based on each node association with disease pathway, the DP score for every motif was calculated. We integrated gene expression profiles of two cancer cell lines for E2F1-driven highly aggressive tumors in bladder and breast. We used the ArrayExpress database (ArrayExpress accession number: E-MTAB-2706)  to find suitable gene expression data in non-invasive and invasive bladder and breast cancer cell lines. In particular, we used RT-4 as non-invasive and UM-UC-3 as invasive cell lines in bladder cancer, while MCF-7 as non-invasive and MDA-MB231 as invasive cell lines in breast cancer. For the multi-objective optimization function we used 13 different weighting scenarios and selected top 10 ranked motif in each iteration for the identification of core-regulatory network in bladder and breast cancer (Figure 4) (for more detail, please see the original publication ).
Core-regulatory networks that drive tumor invasiveness in (A) bladder and (B) breast cancer
Dynamical analysis of the core-regulatory network
Mathematical models are a well-established tool to elucidate the counter-intuitive non-linear dynamical features of complex biochemical networks in living cells. Models are an abstract representation of reality, and thereby their validity and usefulness depend strongly on the context and the assumptions being made. Modeling is the art of making appropriate assumptions. A variety of modeling formalisms have been proposed [35,36] and applied depending on the availability of data, the type of research question as well as the size and the structure of the system. Two commonly used modeling formalisms are: (i) detailed quantitative model of (small) functional modules using, for instance, ODEs [32–34]; and (ii) coarse grained qualitative model of (large) sub-cellular processes that can be encoded through logic-based formalisms [37,38]. If a network is large and contains important non-linear network motifs whose proper characterization is crucial for biological problem under investigation, then a hybrid modeling strategy, which combines ODEs and logic-based models, seems appropriate .
The case study presented here, we used a logic-based model to analyze the stimulus–response behavior of regulatory networks in tumor invasion in bladder and breast cancers. The simplest logic-based model is the Boolean model, popularized by Kauffmann  where the nodes (X1,…n) of a network can be in one of two states: (i) ‘on’ state (also denoted by 1 or ‘true’ state), which represents the ‘active’ or ‘expressed’ state of the component; and (ii) ‘off’ state (also denoted by 0 and also referred to as the ‘false’ state), which represents the ‘inactive’ or ‘not expressed’ state of the component. The state of each node at t + 1 is determined by a Boolean function (BF) that links the incoming effect from its regulators:
Boolean functions determine the states of the node using Boolean gates (NOT, OR and AND). The Boolean gate NOT is assigned based on the network structure while the rules for OR and AND gates can be determined by training/calibrating the model with qualitative data , see Figure 5. Numerous simulation tools are available to simulate logic-based models, for example, the CellCollective , CellNetAnalyzer , CellNOpt , GINSim , BoolNet , BooleanNet , SimBoolNet , SQUAD  and ADAM .
Boolean functions trained with expression data
For the case study, we derived Boolean functions for signals originating from the input layer and their propagation through nodes constituting the regulatory layer of the disease specific core-regulatory networks using Boolean gates (for details see Khan et al. ).
We determined the steady states of each variable in the model for different initial values of the input nodes using CellNetAnalyzer . We consider two sets of scenarios characterized by (i) high expression and (ii) low expression of all the input nodes (E2F1 and receptor proteins present in the core-regulatory network). We obtained 64 input vectors for bladder cancer and 128 for breast cancer. Next, we simulated the network to determine the impact of the input vectors on the level of EMT (see Table 1). Our simulation results suggest that when E2F1, TGFBR1 and FGFR1 are simultaneously active, bladder cancer cells become highly invasive (EMT = 3). A similar effect was observed in breast cancer when E2F1, TGFBR2 and EGFR are simultaneously active.
Active state of the molecule is represented by ‘1’ and the inactive state by ‘0’. The phenotype output (EMT) can take four ordinal levels ranging from ‘0’ (non-invasive) to ‘3’ (highly invasive). Table (a) is the summary of 64 in silico simulations of bladder cancer. Each row represents the results of eight simulations where for the given Boolean state of E2F1, TGFBR1 and FGFR1, all eight combinations of EGFR, CXCR1 and RARA result in the same phenotypical output. Table (b) is the summary of 128 in silico simulations of breast cancer. Each row represents the results of 16 simulations where for the given Boolean state of E2F1, TGFBR2 and EGFR, all 16 combinations of HMMR, THRB, IL1R1 and RARA result in the same phenotypical output.
Furthermore, we carried out in silico perturbation experiments to identify important nodes that can be exploited for therapeutic interventions. Perturbation experiments were performed for highly invasive phenotype (EMT = 3) by changing the Boolean state of each node in the regulatory layer to reduce invasiveness (Table 2). We used single and double perturbation iteratively and observed the greatest reduction of EMT in the latter case. Our perturbation results suggest that in bladder cancer (i) double knockout of ZEB1 in combination with either SNAI1, TWIST1 or NFKB1; (ii) knockout of ZEB1 and activation of CDH1; or (iii) knockout of SMAD2/3/4 in combination with TWIST1 or NFKB1 reduces EMT to 1. In case of breast cancer, double perturbation by silencing SRC, FN1, SNAI1 and SNAI2 or activation of CDH1 in any of the combinations reduces EMT to 1. All the model-based predictions were experimentally validated in non-invasive and invasive tumor cell lines (for more detail, please see the original publication ).
The active state of the molecule is represented by ‘1’ and the inactive state by ‘0’, while the highest EMT phenotype is represented by ‘3’ and low phenotype by ‘1’. (a) and (b) are the perturbation simulations of bladder and breast cancer. The first row of each table represents the molecular signature (i.e. in case of table (a) high expression of E2F1-TGFBR1/2-FGFR1) for highest EMT level. The brown boxes represent the perturbed (knockin = 1 or knockout = 0) state of genes, and their effect on EMT phenotype is shown in the last column.
Patient stratification and validations
Based on the molecular signatures identified through model simulations, we successfully stratified bladder and breast cancer patients into early and advance stages of disease. For a bladder cancer patient cohort, we used data from the Lee Bladder study (GEO ID: GSE13507)  and for a breast cancer we used data from the TRANSBIG network (GEO id: GSE7390) generated by Desmedt et al. . We grouped the patients into high- and low-expression profiles of model predicted signatures, i.e. E2F1-TGFBR1-FGFR1 in the case of bladder cancer and E2F1-TGFBR2-EGFR in the case of breast cancer cohort. The progression-free survival probability of each subgroup reveals that patients with high expression of these signatures had low survival probability in comparison with patient subgroup with low expression in respective cancer types (Figure 6A,B).
Kaplan–Meier plots of progression-free survival of patients with bladder & breast cancer
We also validated molecular signatures in large bladder cancer (BLCA; n=426) and breast cancer (BRCA; n=1218) cohorts available at TCGA data repositories accessible through UCSC Xena http://xena.ucsc.edu. The simulation based-signatures were able to distribute patients into early versus advanced stages in bladder and breast cancer (P-value < 0.005) (Figure 6C,D).
The approach and methodology for the identification of disease gene signature and therapeutic targets have changed drastically over the past decade. On the one hand, high-throughput technological advances resulted in heterogeneous data to understand the mechanisms of complex, multi-factorial diseases. On the other hand, integration, interpretation and understanding of these heterogeneous datasets are becoming more and more challenging. With the advanced network and database interaction tools (e.g. Cytoscape Bisogenet), it is easy to populate disease networks with multiple regulatory layers and large number of interactions. However, methods and workflows to identify key disease gene signatures and therapeutic targets from large disease networks are still in the development stage. In this study, we reviewed our workflow that combines parameters from network structure analysis, disease-related biomedical information and high-throughput expression data to identify disease specific core-regulatory network(s) amenable for dynamical analyses. The major bottleneck to our workflow is the preparation of a directed regulatory network. This requires huge manual efforts as many state-of-art interaction databases provide false positive/negative interactions complied by text-mining algorithms from published literature. We believe that new methods will be available soon for defining regulatory directions to interacting bio-molecules. This may be possible by extracting regulatory information from databases (e.g. BioModels database) and/or by mapping time series expression data of interacting bio-molecules. Once a directed map is prepared, other steps of our workflow can be easily implemented to determine the core-regulatory network driving the disease phenotype. Using a case study on tumor metastasis driven by E2F1, we presented and discussed various steps of our workflow that can be implemented for any complex disease/clinical phenotype under investigation.
With the advancement in high-throughput technology, heterogeneous data are available to construct disease networks composed of large number of interactions.
We reviewed our workflow that combines parameters from network structure analysis, disease-related biomedical information and high-throughput expression data to identify disease specific core-regulatory network(s).
With a case study, we demonstrated logic-based dynamical analyses (stimulus–response and perturbations) of core-regulatory networks for predicting disease gene signatures and therapeutic targets.
We are very thankful to Prof. Brigitte M. Pützer group for their huge efforts in the E2F1 map construction and the experimental validations of in silico predictions.
This work was supported by the German Federal Ministry of Education and Research (BMBF) as part of the project eBio:SysMet ; eBio:MelEVIR [031L0073B]; and the University of Rostock, Rostock Germany.
The authors declare that there are no competing interests associated with the manuscript.