High-throughput, genome-wide analytical technologies are now commonly used in all fields of medical research. The most commonly applied of these technologies, gene expression microarrays, have been shown to be both accurate and precise when properly implemented. For over a decade, microarrays have provided novel insight into many complex human diseases. Microarray-based discovery can be classified into three components, biomarker detection, disease (sub)classification and identification of causal mechanism, in order of accomplishment. Within the respiratory system, the application of microarrays has achieved significant success in all components, particularly with respect to lung cancer. Numerous studies over the last half-decade have applied this technology to the characterization of non-malignant respiratory diseases, animal models of respiratory disease and normal developmental processes. Studies of obstructive lung diseases by many groups, including our own, have yielded not only disease biomarkers, but also some novel putative pathogenic mechanisms. We have successfully used an integrative genomics approach, combining microarray analysis with human genetics, to identify susceptibility genes for COPD (chronic obstructive pulmonary disease). Interestingly, we find that the assessment of quantitative phenotypic variables enhances gene discovery. Our studies contribute to the identification of obstructive lung disease biomarkers, provide data associated with disease phenotypes and support the use of an integrated approach to move beyond marker identification to mechanism discovery.
The completion of genome sequencing of specific organisms, combined with technological advances in the capability to detect changes in genome-wide expression at both the mRNA and protein levels, has heralded a new era in our quest for understanding the complex pathways and mechanisms in living organisms. In the last decade, there has been significant advancement in microarray technology with the development of many different platforms which are being used for analysing gene expression, genotyping and other applications [1,2] (Figure 1). According to the central dogma of molecular biology, genomic DNA is first transcribed into mRNA, which thereafter is translated into protein. Proteins play critical roles in most intra- and extra-cellular activities, including enzymatic, regulatory and structural functions. However, the relative difficulty of measuring expression at the protein level, and the availability of high-throughput methods (expression microarrays) for detecting individual mRNAs, have led to the wide use of microarrays to simultaneously measure the expression of all mRNAs in a sample, also called the transcriptome. Like most classical methods for analysis of gene expression at the mRNA level, the basic principle of microarray technology is complementary hybridization of nucleotides as explained by the Watson–Crick double helical model of DNA. Microarrays measure transcriptomic modifications that, either at the single gene level or collectively in multiple genes, lead to changes in protein expression. In fact, it is unusual for changes in the level of a specific mRNA to not be accompanied by changes in the protein level for that gene. Furthermore, although not all changes in protein expression and function are captured by the steady-state level of the corresponding mRNA, their downstream effects are captured by the full transcriptome.
Overview of the utility of gene expression microarray technology in lung disease biomarker and therapeutic target discovery
In the early part of the decade, when the use of gene expression microarrays was growing exponentially (see Figure 2), there was considerable concern and debate regarding the precision of the technology. For instance, Tan et al. reported divergence in microarray-based gene expression measurements. Similar observations, and the experiences of many investigators, resulted in concerns being raised regarding the potential utility of microarrays. This was met by an independent and thorough evaluation of the technology by the MAQC (MicroArray Quality Control) project. The MAQC project was developed by the FDA (Food and Drug Administration), the EPA (Environmental Protection Agency) and the NIST (National Institute of Standards and Technology), in association with commercial stakeholders and academic laboratories. The purpose was to evaluate the accuracy of microarray technology, to provide quality-control tools and to develop guidelines for microarray data analysis. The MAQC project involved quantification of gene expression levels using seven microarray platforms tested at three independent sites, with five replicates at each location. Although each microarray platform studied had different performance characteristics, they generated comparable results with up to 95% concordance with regard to defining differential expression. We have recently completed a study showing similar cross-platform concordance with time-series data (R. Du, K. G. Tantisira, V. J. Carey, S. Bhattacharya, S. Metje, B. J. Klanderman, R. Gaedigk, R. Lazarus, A.T. Kho, T. J. Mariani, J. S. Leeder and S.T. Weiss, unpublished work). These data unequivocally demonstrated extremely high levels of gene expression microarray measurement precision.
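To make the notion of cross-platform concordance concrete, the following minimal sketch computes one simple version of it: among genes called differentially expressed on both of two platforms, the fraction whose direction of change agrees. This definition, the function name and the gene calls are illustrative assumptions and are not the metric or data used by the MAQC project.

```python
def concordance(calls_a, calls_b):
    """Fraction of genes called differentially expressed on both platforms
    whose direction of change agrees.

    calls_a, calls_b: dict mapping gene ID -> +1 (up) or -1 (down),
    holding only genes called differentially expressed on that platform.
    (Hypothetical input format, chosen for illustration.)
    """
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return 0.0
    agree = sum(1 for gene in shared if calls_a[gene] == calls_b[gene])
    return agree / len(shared)


# Hypothetical calls from two platforms (illustrative values only).
platform_a = {"GENE1": 1, "GENE2": -1, "GENE3": 1, "GENE4": 1}
platform_b = {"GENE1": 1, "GENE2": -1, "GENE3": -1, "GENE5": 1}

print(f"concordance = {concordance(platform_a, platform_b):.2f}")
```

Other definitions (for example, the overlap between ranked gene lists of fixed length) are equally plausible; the choice affects the number obtained and should be stated alongside any reported concordance.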
Analysis of growth in the application of microarray technology as defined by published research articles (A) or publicly deposited datasets (B)
There is no doubt that the technology has suffered setbacks due to rapid growth. Even though all microarray platforms apply the same basic principle of complementary hybridization, diverse probe designs have led to a multitude of problems in quantitative data acquisition and management. In retrospect, we can conclude that significant limitations in the implementation of microarray technology have led to poor performance. We can categorize these limitations as primarily including deficiencies in: (i) experimental design/study size, (ii) analytical methods and (iii) probe sequences.
Possibly the most important, and most overlooked, aspect of successful expression array analysis is appropriate experimental design including adequately powered sample size. A question often asked by investigators is what number of samples to study. Optimal sample sizes can be determined using modified power calculations. The primary objectives of a microarray experiment are usually class comparison, class discovery or class prediction. For class comparison and class prediction studies, a large number of biological replicates (not technical replicates) are recommended, whereas for class discovery, technical replicates from the same individual provide a better assessment of disease classification. Conversely, for inbred animal models (e.g. mouse), where genetic heterogeneity is limited and exposures are controlled, technical replicates (and pooling of biological replicates) are preferred. Irrespective of the study objectives, the use of improper control samples that are derived from tissues of origin other than that of the treatment samples leads to erroneous predictive classifications due to confounding of the samples.
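As a rough illustration of how a power calculation can be adapted to the many-genes setting, the sketch below uses the standard normal-approximation formula for a two-group comparison and tightens the significance level with a Bonferroni adjustment for the number of genes tested. This is one simple assumption about what a "modified" calculation might look like, not the specific method referred to above; the function name and parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(effect, sd, n_genes, alpha=0.05, power=0.8):
    """Approximate biological replicates needed per group for one gene.

    effect:  smallest log2 expression difference worth detecting
    sd:      assumed per-gene standard deviation on the log2 scale
    n_genes: number of genes tested (used for a Bonferroni adjustment)
    """
    z = NormalDist()
    alpha_adj = alpha / n_genes              # Bonferroni-adjusted level
    z_alpha = z.inv_cdf(1 - alpha_adj / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)


# Illustrative inputs only: a 2-fold change (1 log2 unit), sd of 0.5.
print(samples_per_group(effect=1.0, sd=0.5, n_genes=1))
print(samples_per_group(effect=1.0, sd=0.5, n_genes=20000))
```

The second call requires more samples than the first because the per-gene significance threshold is far stricter when thousands of genes are tested. Bonferroni is conservative; calculations based on a target false discovery rate generally give smaller numbers.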
Although there are established protocols for laboratory procedures leading to the generation of data, there is no clear consensus on a ‘best’ method for data analysis. Early analytical approaches relied heavily on a non-statistical measure of expression differences (fold change) that was repeatedly shown to lack sensitivity and specificity. For instance, we showed that differential expression defined statistically, using measurement precision in technical replicates, is not directly proportional to fold change. Other statistical methods, such as standard t tests, were recognized as being subject to problems of multiple testing resulting from repeated measures in a limited number of samples. Numerous complex and/or specific mathematical and statistical approaches have been subsequently developed and applied to microarray data (see  and references therein). Like experimental design, the methods for statistical analysis should be chosen based on the structure and distribution of the data and the objective of the study. Fortunately, the MAQC project and user consensus have pointed to a limited number of preferred analytical approaches that are recognized as being robust and effective.
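The multiple-testing problem mentioned above is commonly handled by adjusting per-gene P values. The sketch below implements the Benjamini-Hochberg false discovery rate adjustment, chosen here only as a widely used example; it is not presented as the approach favoured by the MAQC project or used in the studies cited. The P values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values
    # never increase as the raw P value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene P values from a two-class comparison.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw P = {p:.3f}  adjusted = {q:.3f}")
```

Genes would then be called differentially expressed when the adjusted value falls below a chosen false discovery rate, rather than on fold change alone.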
Another long-appreciated source of ‘noise’ or technical variability within expression arrays is inaccuracy of probe sequences, which may not match the transcripts they are intended to measure. This was first widely appreciated due to a ‘manufacturing’ (annotation) error made by the leading commercial source of the technology, which led to a large number of arrays/experiments that inadvertently measured sense transcripts. However, this proved to be no more than a minor setback in the evolution of a powerful and complex technology. More recently, it has been appreciated that sequence databases evolve, resulting in transient (in)accuracy of probes. We have previously shown improved accuracy and cross-platform comparisons when accounting for these inaccuracies.
Having survived these temporary setbacks and limitations, microarrays have developed into standard tools for high-throughput analysis of gene expression, and continue to grow in information quality and in new applications. The evolution of microarray technology has been a gradual process. The technological principles involving combinatorial chemistry have been in development since the late 1960s with the works of R. Bruce Merrifield (Nobel Prize in Chemistry, 1984), with arrays in their current forms first appearing in the early-to-mid-1990s. Both Stephen Fodor and Mark Schena developed the early prototypes of cDNA and oligonucleotide arrays almost simultaneously [1,2]. While Fodor, in collaboration with Lubert Stryer of Stanford University, received a Small Business Innovation Research grant from the NIH (National Institutes of Health) and went on to establish Affymetrix Inc., Schena pioneered the cDNA technology under the guidance of Pat Brown at Stanford University.
By the late 1990s, the power and potential of microarray technology were fully appreciated and it was being applied in an effort to develop novel descriptions of diseased states. Novel genes and pathways, previously not implicated in the pathophysiology of a certain disease, may emerge from microarray studies to provide new theories regarding the disease process and potential therapeutic drug targets. The spread in use of the technology was unprecedented (Figure 2A), with exponential growth in the number of publications reporting results from its application in the early part of this decade. Parallel growth was initially observed in the lung biology and disease research community, predominantly by those focused on lung cancer (50–60% of publications each year), although over the last few years, its use in lung research has apparently lagged. Regardless of the sample studied, expression microarray application to disease can be classified under three broad topics: biomarker discovery or class prediction, disease subclassification or class discovery and uncovering the disease mechanism.
Expression arrays have been widely used to predict the state or ‘class’ of an unknown sample using pre-existing information. One of the most studied human diseases by expression-profiling technology is cancer (Figure 2). Initial microarray studies were heavily focused on identifying gene expression markers for human cancer phenotypes distinguishing them from normal samples. Golub et al. described the potential of expression array data to predict disease using the distinctions between human AML (acute myeloid leukaemia) and ALL (acute lymphoblastic leukaemia) as their experimental model. Although these diseases are capable of being discriminated by other means (such as cytology), this study served as proof-of-principle that the technology could define expression markers informative for the diseased state. This approach has been widely applied to many diseases or previously recognized disease subtypes including breast cancers, cutaneous malignant melanoma, diffuse large B-cell lymphoma, colon cancer, leukaemia and ovarian carcinomas (see  and references therein). In the pulmonary system, numerous human disease states and animal models of disease have been subjected to expression profiling in an effort to define disease biomarkers and/or class predictors.
Detection of genomic signatures from individual studies has provided a wealth of information, but these data are limited for pathological diagnosis unless validated externally. One major limitation to this goal is the small number of samples in individual studies, particularly for human studies where there is a high degree of both intra- and inter-population variability. This ultimately results in disease biomarkers that are population dependent, rather than having global applicability. This is true with regard to studies of lung diseases, where biomarkers for acute and chronic diseases, including severe asthma and COPD (chronic obstructive pulmonary disease), and environmental exposures have been identified. We have recently presented an analysis of lung tissue gene expression in subjects with COPD. We used a novel combination of discrete and quantitative variable analysis to determine differential expression. Inclusion of quantitative phenotypes helped in providing a set of robust COPD markers, which successfully predicted disease in an independent COPD population. Admittedly, this success is much more of an exception than the rule. However, with an increasing number of microarray datasets being deposited in the public domain [21,22] (Figure 2B), real opportunity exists for more reliable information to be generated through the integration of multiple, independently generated datasets focusing on the same biological paradigm. To facilitate this concept, MIAME (Minimum Information About a Microarray Experiment) was developed, which serves as the standard information required to accompany microarray data to ensure correct interpretation of the data and independent verification of analysis of results.
Many groups have proposed approaches for data integration across platforms and laboratories (for example see [24,25] and references therein). Meta-analysis approaches have also been applied to validate results from different studies, which, in certain cases, have proved to be successful. With a large number of microarray studies on breast cancer, great promise existed to leverage these data to identify a disease biomarker that could be used as a true diagnostic tool. Indeed, van't Veer et al. identified a gene expression signature strongly predictive of patients with either poor or good prognosis. The gene-expression biomarker has been used as a predictor of the outcome of disease in young patients with breast cancer in combination with standard prediction tools based on clinical and histological criteria. A 70-gene marker chip has been developed and tested for diagnosis of breast cancer that outperformed all clinical variables in predicting the likelihood of distant metastases within 5 years. This represents the first clinically approved gene expression microarray test for molecular-based therapy.
The identification of cancer subtypes can sometimes be a difficult process, as it relies on the subjective interpretation of both clinical and histopathological observations with the aim of classifying samples in currently accepted subtypes based on the tissue of origin of the tumour. This sometimes is hampered by a lack of clinical information or unclear classification of samples based on histology. Alternatively, some diseases present with similar histopathological features of clearly distinct origin or with variable prognoses. Most early microarray studies involved marker distinction between normal and aberrant (diseased) tissues. However, much enthusiasm has been generated regarding the ability of gene expression microarrays to clarify disease subclasses, and even identify previously unappreciated classes with distinct aetiologies, prognoses and/or therapeutic responses. This potential is particularly relevant for cancer, as was first described by Alizadeh et al. for B-cell lymphoma. Microarray analysis has led to the identification of molecular classification of several human malignant tumours based on pathological parameters, namely stage, recurrence, prognostic outcome or therapy response in breast cancer [26,27,30], cervical lymph node metastasis in oral squamous cell cancer, non-small-cell lung cancers and clear cell renal carcinoma. One of the most extensively analysed microarray datasets involves a comparison of leukaemia subtypes, which has also been used for identifying predictors for therapy response. Every new discovery of disease subtype molecular markers creates a path for the development of molecular-targeted therapies.
The prognostic and discovery potential of microarrays to perform disease subclassification is possibly best exemplified as applied to NSCLC (non-small-cell lung carcinoma), the most common form of the most frequent cause of cancer death in the world. Lung carcinomas are a heterogeneous collection of tumours characterized by a large number of chromosomal and structural abnormalities. NSCLCs can be subclassified into adenocarcinomas (the most common), squamous cell carcinomas and large-cell carcinomas. Bhattacharjee et al. demonstrated the ability to define NSCLC subtypes based on gene expression profiles and first identified ‘molecular’ subclasses of adenocarcinoma. These observations have been replicated in numerous other studies [37,38], with the common limitation of the identification of population-dependent markers (as described above). In the absence of robust markers, meta-analysis has been implemented, with limited success. Potti et al. identified meta-genes to predict therapy response in patients with early-stage non-small-cell lung cancer. In contrast, Ramaswamy et al. re-analysed multiple datasets of tumour expression profiles to identify a set of gene expression markers that can be used as a multiclass cancer diagnosis tool.
In addition to lung cancer, disease subtype classification has also been attempted in other lung disorders such as COPD, pulmonary fibrosis or asthma. For instance, two laboratories have recently described markers for severe drug-resistant asthma in lung epithelial cells and in peripheral blood cells. In an interesting study, Kho et al. used the organ-specific development transcriptome as a basis for subclass discovery. Molecular subclassification promises the hope of defining class-specific mechanisms and routes for therapeutic intervention.
Even though gene-based signature sets have been developed, the causal mechanisms are still unclear. Initially, it was hoped that defining expression biomarkers of disease would lead directly to causality. We now appreciate that analytical methods and experimental designs focusing on class prediction are typically not efficient at uncovering disease mechanisms. One means used to apply microarrays to uncover the disease mechanism was intuitive: the use of animal modelling. Although such models have been developed in an effort to determine the effects of candidate genes, they can also be used for gene discovery. Such studies are common throughout the literature including those concerning the lung, particularly for models of lung inflammation and allergic hypersensitivity. Many of these studies suffer from the limitations described above for studies of clinical specimens (namely, small study size, ambiguous analytical methods etc.), but have provided tremendous insights nonetheless. For instance, expression profiling from animal models has provided insights into COPD pathogenesis. Additionally, Novershtern et al. have recently developed a signature gene set for asthma by integrating information from genome-wide expression and protein studies of animal models. The integration of multiple genomic approaches, such as combining animal models and genetics, has been particularly useful in uncovering mechanistic genes/pathways. In terms of respiratory disease, we recognized the study by Karp et al. of an allergic hypersensitivity model of asthma as an early example of this approach.
Most microarray studies are limited to comparison analysis, aiming to identify genes with a change in expression between two classes. In contrast, DeRisi et al. published a study of the yeast life cycle that initiated the use of microarrays in time-series studies. These types of experiments have typically been used for biological discovery in ‘normal’ samples, and not for the identification of disease biomarkers. However, another successful approach for defining a causal mechanism has been through the integration of human genetics and genome-wide expression, such as that used by Blackshaw et al. Here, they used transcriptomic profiles of normal developing eye tissue to identify biologically relevant candidate causal genes within loci linked to human eye diseases. Collectively, results from these and similar studies have provided novel insights into the mechanisms of disease pathogenesis, and raise expectations of developing therapeutic targets from the application of DNA microarrays to complex diseases.
As described above, multiple groups have attempted to identify candidate genes for COPD, a complex human disease probably influenced by a genetic component (α1-antitrypsin deficiency), an environmental component (cigarette smoking) and gene-by-environment interactions [41,52–54]. We recently reported the identification of a COPD susceptibility gene through the integration of human genetics with gene expression profiling of normal lung development and diseased lung tissue. Like Blackshaw et al., we used transcriptomic profiles of organ development to inform us of biologically relevant candidate genes within a disease-linked locus. We identified SERPINE2 [serpin peptidase inhibitor, clade E (nexin, plasminogen activator inhibitor type 1), member 2], a homologue of the only known COPD susceptibility gene, and went on to show that expression of this gene was aberrant in lung tissue from COPD subjects. Of particular note, SERPINE2 expression was not a robust class prediction marker, but was highly correlated with multiple quantitative measures of lung function in multiple datasets where quantitative phenotypes for COPD were available (Figure 3). Within the disease-linked locus, this was uniquely true of SERPINE2. Multiple groups, including our own, have subsequently demonstrated significant associations between SERPINE2 gene variants and COPD phenotypes [55,56] and aberrant SERPINE2 expression in COPD lung tissue. We have further discovered that deficiency in SERPINE2 in the mouse leads to the development of COPD-related lung histopathology (S. Srisuma and T.J. Mariani, unpublished work). We believe that such an integrative genomics approach helps us to expedite candidate gene identification, and that the use of quantitative variables may be particularly useful to uncover causal disease mechanisms and pathways.
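The idea of relating expression to a quantitative phenotype, rather than to a case/control label, can be sketched with a plain Pearson correlation. The choice of Pearson correlation, the function name and all values below are illustrative assumptions; they are not the analysis or data from the studies described above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-subject values (illustrative only):
# log2 expression of one gene and a lung function measure (% predicted).
expression = [7.1, 7.4, 7.9, 8.3, 8.8, 9.0]
lung_function = [92.0, 85.0, 74.0, 66.0, 51.0, 48.0]

print(f"r = {pearson_r(expression, lung_function):.2f}")
```

A gene can show a strong correlation of this kind across the full range of disease severity even when a simple two-class comparison fails to separate cases from controls, which is why a quantitative analysis can recover candidates that class prediction misses.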
Using quantitative disease phenotypes for gene expression biomarker discovery
Gene expression microarrays have come a long way from being a complex technology (sometimes poorly applied) in biomedical research and are now a critical component of state-of-the-art research on disease discovery, therapeutic responsiveness and pathogenesis. The tools used for statistical analysis, data mining and archiving have steadily improved, and objective criteria demonstrate a high level of precision for the technology when applied appropriately. Numerous studies in respiratory medicine have provided a wealth of data describing biological paradigms from normal development to lung cancer. Although the primary application of the technology has been to identify biomarkers for disease, the technology has been successful in the identification of disease subtypes and molecular diagnostic predictions. We have listed here only a select few instances of the wide-ranging capabilities of this high-throughput technology. With proper planning and experimental design, it has been shown to discover or further resolve complex disease mechanisms to levels previously unimaginable. When used in combination with animal models and genetic studies, particularly focusing on quantitative variable analysis, it has provided unexpected power to identify disease mechanisms. Even though we have seen considerable advancement in the technology, in order to achieve its full potential, genome-wide expression studies have to co-evolve in a multidisciplinary approach that includes a combination of well-developed machine-learning algorithms and systems biology approaches. This can only be achieved when clinicians, surgeons, pathologists, epidemiologists, bioinformaticians and molecular biologists undertake a well-co-ordinated effort to properly plan and conduct every step of the process starting from experimental design to validation of the results.
Biochemical Basis of Respiratory Disease: Biochemical Society Focused Meeting held at AstraZeneca, Loughborough, U.K., 5–6 March 2009. Organized and Edited by Colin Bingle (Sheffield, U.K.) and Alan Wallace (AstraZeneca, U.K.).
This work was supported by grants from the National Institutes of Health [grant numbers HL071885 and ES014372] and the Flight Attendant Medical Research Institute.