Epigenome-450K-wide methylation signatures of active cigarette smoking: The Young Finns Study

Abstract Smoking as a major risk factor for morbidity affects numerous regulatory systems of the human body including DNA methylation. Most of the previous studies with genome-wide methylation data are based on conventional association analysis and earliest threshold-based gene set analysis that lacks sensitivity to be able to reveal all the relevant effects of smoking. The aim of the present study was to investigate the impact of active smoking on DNA methylation at three biological levels: 5′-C-phosphate-G-3′ (CpG) sites, genes and functionally related genes (gene sets). Gene set analysis was done with mGSZ, a modern threshold-free method previously developed by us that utilizes all the genes in the experiment and their differential methylation scores. Application of such method in DNA methylation study is novel. Epigenome-wide methylation levels were profiled from Young Finns Study (YFS) participants’ whole blood from 2011 follow-up using Illumina Infinium HumanMethylation450 BeadChips. We identified three novel smoking related CpG sites and replicated 57 of the previously identified ones. We found that smoking is associated with hypomethylation in shore (genomic regions 0–2 kilobases from CpG island). We identified smoking related methylation changes in 13 gene sets with false discovery rate (FDR) ≤ 0.05, among which is olfactory receptor activity, the flagship novel finding of the present study. Overall, we extended the current knowledge by identifying: (i) three novel smoking related CpG sites, (ii) similar effects as aging on average methylation in shore, and (iii) a novel finding that olfactory receptor activity pathway responds to tobacco smoke and toxin exposure through epigenetic mechanisms.

Smoking as a major risk factor for morbidity affects numerous regulatory systems of the human body including DNA methylation. Most of the previous studies with genome-wide methylation data are based on conventional association analysis and earliest threshold-based gene set analysis that lacks sensitivity to be able to reveal all the relevant effects of smoking. The aim of the present study was to investigate the impact of active smoking on DNA methylation at three biological levels: 5 -C-phosphate-G-3 (CpG) sites, genes and functionally related genes (gene sets). Gene set analysis was done with mGSZ, a modern threshold-free method previously developed by us that utilizes all the genes in the experiment and their differential methylation scores. Application of such method in DNA methylation study is novel. Epigenome-wide methylation levels were profiled from Young Finns Study (YFS) participants' whole blood from 2011 follow-up using Illumina Infinium Hu-manMethylation450 BeadChips. We identified three novel smoking related CpG sites and replicated 57 of the previously identified ones. We found that smoking is associated with hypomethylation in shore (genomic regions 0-2 kilobases from CpG island). We identified smoking related methylation changes in 13 gene sets with false discovery rate (FDR) ≤ 0.05, among which is olfactory receptor activity, the flagship novel finding of the present study. Overall, we extended the current knowledge by identifying: (i) three novel smoking related CpG sites, (ii) similar effects as aging on average methylation in shore, and (iii) a novel finding that olfactory receptor activity pathway responds to tobacco smoke and toxin exposure through epigenetic mechanisms.

Background
Tobacco use has been estimated to cause approximately 7 million yearly deaths globally through general morbidity, disability, non-communicable chronic diseases, and increased risk of communicable diseases [1]. The chronic diseases caused by smoking are in nature multifactorial, mediated by both genetic and environmental factors [2,3]. Despite the well-established link between smoking and associated diseases in literature, there is a knowledge gap in the causative mechanisms of the diseases.
Previous studies have led to identification of several transcriptomic signatures and their role in diseases related to smoking [4]. Epigenetic mechanisms have been understood to be modulated by environmental factors such as smoking and play a crucial role in disease development process by regulating gene expression [5]. Several studies have reported smoking-related methylation changes at CpG site, gene as well as gene set level [6][7][8][9][10][11]. Zeilinger et al. reported 972 CpG sites significantly associated with smoking from discovery data with 1793 participants and replicated 187 out of 972 CpG sites in independent cohort of 479 participants [6]. The three most significant CpG sites and corresponding genes were cg05575921:AHRR, cg21566642: ALPP/ALPPL2, and cg03636183:F2RL3. McCartney et al. [7] and Zeilinger et al. [6] suggested that smoking-induced DNA methylation changes are reversible. Shenker et al. developed methylation index by combining four smoking related CpG sites (one in AHRR, two intergenic in chromosome 2 and one intergenic in chromosome 6) that can predict former smoking status [8]. The study by Joehanes et al., one of the largest study of smoking-related DNA methylation changes, identified 18760 active smoking-related CpG sites annotated to 7201 genes with FDR < 0.05 [9]. Prince et al. investigated association of 2620 previously reported smoking-related CpG sites with different smoking behaviors (such as ever smoking, current weekly smoking, ever weekly smoking, and blood cotinine levels) and reported smoking behavior related methylation patterns in 11 CpG sites mapping to seven genes [10]. Several studies on impact of aging on DNA methylation, as summarized by Ciccarone et al., have established age-induced DNA methylation changes as hallmark of aging [12]. Recent studies such as by Yang et al. [13] and Lei et al. [14] have reported association between cigarette smoking and age-induced DNA methylation changes. Joehanes et al. also performed Gene Ontology (GO) [15] based gene set analysis of the list of statistically significant CpG sites and reported 99 gene sets that broadly included molecular processes such as signal transduction, protein metabolic process, and transcription pathways [9]. Another study performed gene set analysis of list of CpG sites with altered methylation levels due to exposure to maternal smoking during pregnancy and reported cell cycle, cancer, white blood cell differentiation, genotoxicity as major findings [16]. Bakulski et al. performed gene set analysis of lung-specific smoking related CpG sites and the most significant pathways included mRNA catabolic processes, protein targeting, angiogenesis, and mRNA translation [17].
Traditional epigenome-wide association studies might fail to identify relevant CpG sites due to noise and lack of statistical power. Further, long list of CpG sites discovered through such studies are harder to interpret in terms of underlying biological theme. Gene set analysis, also referred to as pathway analysis, addresses the problems by shifting the analysis from individual CpG sites to groups of related CpG sites. CpG sites are mapped to respective genes and grouped together based on shared biological features. The grouping of corresponding genes can be based on reference knowledge bases such as Reactome Pathway Database [18], BioCarta pathways [19], Kyoto Encyclopedia of Genes and Genomes (KEGG) [20], and GO [15]. The grouping approach averages out errors at single CpG site level and increases statistical power. Additionally, the outcome of gene set analysis is meaningful terms that provide insights into biological theme underlying the list of CpG sites.
Studies that have reported smoking-related methylation changes at biological pathway level have done so mostly with the earliest gene set analysis approach, over-representation analysis (ORA) [9,17,21]. ORA is a type of gene set analysis that is based on statistical tests like hypergeometric and chi-square. The approach requires users to provide a list of genes selected based on arbitrary threshold (for example, P-value<0.05). The arbitrariness of the threshold leads to unstable results that are difficult to replicate and massive reduction in sensitivity. Furthermore, the approach assumes independence between genes, which is violated in most cases resulting to false positive results [22].
Over a decade of development of gene set analysis approaches have provided several robust gene set analysis methods that are likely to provide improved insights into epigenetic consequences of smoking. A class of such methods includes threshold-free competitive gene set analysis methods that test whether genes in a given gene set are more differentially methylated than the other genes in the dataset. The approach can be used to test a large number of gene sets, such as those provided by Molecular Signatures Database (MSigDB) [23,24] in order to profile smoking-related epigenetic alterations at biological pathways or processes level. An example of one of such methods is mGSZ (modified Gene Set Z-score) that have been developed and improved over time by us [25][26][27] and have been shown to be robust and efficient by us [20][21][22] as well as another independent study [28]. Unlike over-representation analysis methods, mGSZ is threshold-free and thus analyzes all the genes in the data resulting into sensitive, comprehensive, and reproducible results. The mGSZ is different from other methods in the class in that it is based on asymptotic P-value estimation that is significantly more efficient than empirical P-value estimation [26,29].
The objective of the present study was to characterize DNA methylation differences between current smokers and non-smokers at different biological levels such as: (1) CpG site level with differential methylation analysis; (2) biological pathway level using a threshold-free gene set analysis method, mGSZ; and (3) average methylation level at different genomic regions with Wilcoxon rank-sum test. We show that application of threshold-free gene set analysis method for DNA methylation data has improved benefit over earliest threshold-based ORA methods used currently by most studies as illustrated by our novel findings such as smoking-induced altered methylation in olfactory system. Even though olfactory ability is known to be adversely affected by active cigarette smoking [30], we for the first time show the underlying epigenetic mechanism.

Study population
Young Finns Study (YFS) is one of the largest existing follow-up studies into cardiovascular health from childhood to adulthood, running in a longitudinal prospective setup with regular follow-ups from 1980 onwards [31]. The study began in 1980 with 3596 children and adolescents aged 3 to 18 years randomly selected from the areas of five university hospitals in Finland (Turku, Tampere, Helsinki, Kuopio, and Oulu). The participants have been followed up for over 40 years. The methylation measurements were performed on a subset of 192 individuals from whole-blood samples from 2011 follow up. The smoking history of the subjects was self-reported and belonged to six categories based on smoking frequency (1. active smoker or at least once a day, 2. once a week or more often, however not daily, 3. less often than once a week, 4. attempts to quit, 5. has quit, 6. has never smoked). The present study is based on a subsample of 125 participants, 40-49 years of age who were either active smoker (n=21) or have never smoked (n=104) [ Table 1]. Participants in the middle conditions such as those who smoked less frequently or has quit smoking were eliminated in order to obtain maximum possible smoking-related biological signal in DNA methylation.

DNA methylation assessment
DNA was extracted from EDTA-blood samples using Wizard ® Genomic DNA Purification Kit (Promega Corporation, Madison, WI, U.S.A.), according to the manufacturer's instructions. Genome-wide quantification of DNA methylation levels were done using Illumina Infinium HumanMethylation450 BeadChips [32] in the Core Facility at the Institute of Molecular Medicine Finland, University of Helsinki, following manufacturer's protocols. The HumanMethylation450 BeadChip measures DNA methylation at more than 485,000 CpG cites across the genome. The arrays were imaged with a high-precision scanner (iScan system, Illumina Inc.), and the signal intensities were extracted using a GenomeStudio software package (Illumina Inc.). The methylation data are available in Gene Expression Omnibus (GEO) under accession number GSE69270.

Data filtering and normalization
Data were obtained and processed from raw methylation image files using the minfi package in R/Bioconductor [33]. All the analyzed samples have sum of detection P-values across all the probes less than 0.05. Logged (log2) median of methylated and unmethylated intensities of the analyzed samples clustered within the default threshold (10.5) of getQC function in minfi. Further, samples for which real sex did not match the predicted sex obtained with getSex function in minfi were excluded. Background subtraction and dye-bias normalization were performed via noob method [34] implemented in minfi. Stratified quantile normalization was performed using preprocessQuantile function in minfi. Probes with detection P-value more than 0.01 in 99% of the samples were filtered out. CpG loci on sex chromosomes were excluded from the analysis to avoid gender-based methylation bias. Also, cross-reactive probes and probes with SNPs were excluded from the analysis. After quality control, the total number of autosomal CpGs was 429,773 in 125 samples (21 active smokers and 104 non-smokers). Batch correction was done by adding the first two principal components of control probes as covariates in multiple linear regression model.

Differential methylation analysis
All statistical analyses were performed using R statistical software (v.3.5.1) [35]. M-values, calculated as the log2 ratio of the intensities of methylated probe versus unmethylated probe were used as measures of methylation level. Differentially methylated CpG loci for smoking status were identified using CpGassoc package in R [36]. In order to keep the present study exploratory in nature, analyses were based on two different models adjusted with varying number of covariates. Model 1 involved adjustment for age, sex, body mass index, cell type proportions, and technical covariates (chip and array). Cell type proportions consisted of proportions of CD8T, CD4T, natural killer cells, B cells, monocytes, and granulocytes in white blood cells. Selection of covariates for model 1 was motivated from study by [9]. Model 2 involved adjustment with two additional covariates, alcohol usage and socioeconomic status, in addition to those in model 1. The addition of the two covariates in model 2 is based on previous reports suggesting effects of alcohol usage [37] and socioeconomic status [38] on DNA methylation. There was reduction in sample size for model 2 because 16 out of 125 participants did not have information on alcohol usage and socioeconomic status. The number of active smokers in the analysis was reduced from 21 to 16 because of missing information on socioeconomic status. Alcohol consumption was measured by asking participants to report their alcohol consumption during the previous week. One unit is equivalent to 14 g of alcohol [39]. Socioeconomic status was based on occupation and was categorized as manual, lower non-manual, and upper non-manual. Potential batch affects were addressed by including the first two principal components of array control probes into the regression models. The cell type proportions of white blood cells were estimated through the reference-based Houseman method [40] using the estimateCellCounts function in the minfi Bioconductor package in R. CpG sites were mapped to genes by using annotation database provided by Illumina [41]. Differentially methylated genes were identified by utilizing as a proxy the CpG site with maximum absolute t-score from any location in the genomic region of the gene, since the mechanism how the methylation influences gene expression is not completely understood. Statistical significance level was set to false discovery rate (FDR) of 0.05 in the CpG site and gene level analysis.

Gene set analysis
Biological relevance of the differentially methylated genes based on both model 1 and 2 was investigated using mGSZ method implemented in mGSZ R package [26]. mGSZ is a gene set analysis method based on robust gene set scoring function and efficient P-value estimation method. Unlike over-representation-based pathway analysis methods [42,43], this approach is threshold free and thus includes all the genes in the analysis irrespective of their effect size or significance level. This is particularly important in the present study as our goal is to identify set of genes contributing to a biological pathway that have milder but coherent smoking-related changes in methylation level. Analyzed gene sets were downloaded from MSigDB version 7.0. The database contained 22569 gene sets (as of September 21, 2019) divided into eight major collections: Hallmark gene sets, positional gene sets, curated gene sets, motif gene sets, computational gene sets, gene ontology (GO) gene sets, oncogenic signatures, and immunogenic signatures. Hallmark gene sets are generated computationally and represent well-defined biological processes [44]. Positional gene sets are generated based on genomic locations of genes. Curated gene sets are generated from knowledge sources such as pathway databases, biomedical literature, and domain experts. The gene sets are divided into two sub-categories: (i) canonical pathways curated from online databases [18][19][20] and (ii) chemical and genetic perturbations (CGP). CGP-based gene sets represent expression signatures of chemical and genetic perturbations mostly curated from biomedical literature [45]. Motif gene sets represent target genes for transcription factors or micro RNAs. Computational gene sets are generated from cancer related microarray gene expression data with data mining techniques. GO gene sets are derived from GO annotations and have three sub-categories-biological process (BP), molecular function (MF), and cellular component (CC).

Average methylation level at different genomic regions
We analyzed effect of smoking on the average methylation level at different genomic regions in relation to both gene region (TSS200, 0-200 bases upstream of the transcriptional start site; TSS1500, 200-1500 bases upstream of the TSS; 5 UTR, within the 5 untranslated region, between the TSS and the ATG start site; Body, between the ATG and stop codon irrespective of the presence of introns, exons, TSS, or promoters; 3 UTR, between the stop codon and poly A signal) and CpG islands (Shores, 0-2 kb from CpG island; Shelves, 2-4 kb from CpG island

Study participants' characteristics
Characteristics of the YFS cohort participants in the present study is summarized in Table 1.

Differential methylation analysis
We identified 60 differentially methylated CpG sites with respect to smoking habit with FDR < 0.05 from statistical model 1 [ Table 2]. There were 26602 nominally significant (P-value < 0.05) CpG sites ( Figure 1). The results identify three novel smoking-associated CpG sites: cg26038589 (CCDC55, Coiled-Coil Domain-Containing Protein 55), cg10385208 (CWC25, Spliceosome Associated Protein Homolog), and cg09355027 (no known gene) and replicate the findings of the previous studies, indicating that our results are technically robust. In addition, a site found in only one previous study, cg13898430 (RUNX3, Runt-related transcription factor 3) [46] was also replicated in the present study. Model 2 identified only 18 differentially methylated CpG sites with respect to smoking habit with FDR < 0.05. Seventeen of the CpG sites from model 2 that were also reported by model 1 are highlighted in Table 2 with bold font. The three novel sites identified in model 1 were not recovered with model 2.

Gene set analysis
Using the gene sets from MSigDB and mGSZ, we identified smoking induced altered methylation in a total of 13 gene sets with FDR ≤ 0.05. The results are presented in Table 3 and Figure 2. The figure serves the purpose of observing overall methylation pattern and we do not expect significant statistical differences with such approach due to sample size limitation. This is where our proposed state-of-the-art gene set analysis method, mGSZ, becomes useful as the approach increases the statistical power and identifies biologically relevant results even with smaller sample size. Among eight major gene set collections of MSigDB, significant results (FDR ≤ 0.05) were obtained for three-Hallmark, Curated, and GO gene sets [ Table 3]. The gene set, Genes down-regulated in response to ultraviolet (UV) radiation (FDR = 0.05) identified from Hallmark collection suggests that smoking induces hypermethylation of genes that are known to be down-regulated by ultraviolet radiation [ Table 3 and Figure 2]. We identified three canonical pathways with smoking related altered methylation (hypomethylation): two from reactome database and one from BioCarta. Gene sets based on reactome pathways are aryl hydrocarbon receptor signaling (FDR = 0.003) and rora activates gene expression (FDR = 0.03). Similarly, BioCarta based results suggest that smoking alters    Intergenic chromatin regions are called OpenSea. CpG sites retained in model 2 are highlighted with bold font. Statistical models: Model 1 involved adjustment for age, sex, body mass index, cell type proportions, and technical covariates (chip and array). Model 2 involved adjustment with two additional covariates, alcohol usage and socioeconomic status, in addition to those in model 1.

standard error, P-values, FDR, and genomic regions in relation to both gene region (TSS200, 0-200 bases upstream of the transcriptional start site; TSS1500, 200-1500 bases upstream of the TSS; 5 UTR, within the 5 untranslated region, between the TSS and the ATG start site; Body, between the ATG and stop codon irrespective of the presence of introns, exons, TSS, or promoters; 3 UTR, between the stop codon and poly A signal) and CpG islands (Shores, 0-2 kb from CpG island; Shelves, 2-4 kb from CpG island) (Continued)
methylation in genomic region responsible for RARRXR pathway (FDR = 0.05), which is linked to cancer. CGP-based gene set identified in the present study suggests altered methylation in genes that are up-regulated in glioma cell lines after knockdown of SPARC gene by RNAi (FDR = 0.003). GO annotation-based gene sets were analyzed separately for the three GO categories. We identified one BP and 7 MF related gene sets with FDR ≤ 0.05 [ Table 3]. Olfactory receptor activity, a novel finding of this study, belongs to GO MF molecular. Gene set analysis of differentially methylated CpG sites identified with model 2 reported seven gene sets with FDR < 0.25 [Supplementary Table S2]. The novel finding of the present study, smoking-induced altered methylation in olfactory system, was retained with FDR of 0.14. Also, the well-established effect of smoking on aryl hydrocarbon receptor signaling was retained with FDR of 0.004. Similarly, effect of smoking on UV sensitive genes as well as four other gene sets with gene regulatory effect was retained.

Average methylation level analysis specific to different genomic regions
We found that active smoking is significantly (Wilcoxon rank-sum test, P-value < 0.05) associated with hypomethylation in shore region among active smokers [Supplementary Table S1 and Figure 3]. Although the conventional P-value threshold of 0.05 was not reached for other CpG regions, perhaps due to lack of power, smoking seems to be associated with hypomethylation in most of the regions [Supplementary Table S1]. These results suggest that effect of active smoking on average methylation in genomic regions such as shore resembles to that of aging [12].

Discussion
To the best of our knowledge, this is the first study that implements a modern threshold free gene set analysis method to study smoking-induced alterations in methylation in genomic regions at biological pathway level. Studies of biological implications of CpG sites with altered methylation due to smoking have been limited to traditional over-representation-based pathway analysis [9,17,21,47]. We performed differential methylation analysis to first identify CpG sites that are significantly associated with smoking habit. We then implemented our robust gene set analysis method, mGSZ, on the whole methylation data at gene level to identify biological processes that might be affected due to smoking-induced altered methylation. The present study extended the current knowledge by identifying three novel smoking-associated CpG sites and altered methylation in genomic regions that play role in olfactory sensing system, immune response, cardiovascular disease, and cancer development. Furthermore, the results suggest that the global effect of smoking is similar to aging process that is known to be associated with hypomethylation in genomic regions with high frequency of CpG sites such as shores [12]. Hypomethylation in the genomic regions of active smokers is in line with findings at CpG site and gene set level. All except three CpG sites in the region of MYO1G gene are hypomethylated among active smokers. Similarly six out of seven significant gene sets based on model 2 are hypomethylated among active smokers. Previously, unidentified site cg26038589 maps to gene CCDC55. The protein product of the gene plays role in integrated regulation of gene expression in various ways and contribute to the development of cancers, viral infections, and neurological disorders [48]. CpG site cg10385208 is located in the genomic region of gene CWC25. This gene has been annotated to be overexpressed in lung and peripheral blood mononuclear cells in GeneCards-the human gene database [49]. The annotation is based on analysis of 69 proteomics datasets from Human Integrated Protein Expression Database resided within GeneCards. The gene has also been shown to play role in cognitive impairment [50]. Furthermore, the genes CCDC55 and CWC25 are linked to alternative splicing suggesting a role of DNA methylation on alternative splicing. The genes, therefore, should be interesting targets to follow up in the laboratory. The third novel site, cg09355027 has not been annotated to any gene and its biological significance is unknown. However, the CpG site is one of the numerous sites whose methylation is modulated by methyl-CpG-binding domain protein 2 (MBD2) [51]. MBD2 participates in gene silencing and has active role in shaping the cancer methylome. Site cg13898430 identified in our study has been identified previously to be associated with smoking by only one study [46]. The CpG site maps to gene RUNX3, a runt domain-containing transcription factor and a known tumor suppressor, frequently deleted or transcriptionally silenced in cancer according to the NCBI Gene database [52]. The gene is associated with numerous cancers, including lung cancer and several autoimmune diseases, including the inflammatory bowel disease. Overall, both the replicated and the discovered methylation sites map to genes that play roles in biological processes and pathways plausibly connected to smoking-induced disease.
Olfactory receptor activity, one of the significant gene sets identified in the present study, is responsible for initiating cell activity in response to smell detection. Surprisingly, despite the long-established knowledge that smoking is associated with olfactory and gustatory dysfunction, relevant pathways have not been reported in previous studies. Decline in the sense of smell has been proposed to originate from several mechanisms operating in parallel, including the tobacco-induced sinonasal inflammation and squamous cell metaplasia in the airways [53]. The olfactory epithelium is a highly dynamic structure going through constant renewal through the exfoliation of the aged cells and the generation of new ones from the stem cells, thus constantly regenerating the sensory neurons [54]. The division and differentiation of the stem cells are regulated by conserved epigenetic mechanisms, including DNA methylation [55]. As the cycle is constantly repeated, it is conceivable that the tobacco smoke exposure effects would be seen widely. In our study, these effects appear to be reflected indirectly in the epigenome measured from the blood. It is noteworthy that the genes related to olfactory system did not reach statistical significance threshold in the gene level differential methylation analysis, while their co-occurrence as a gene set (on pathway level) passes statistical significance threshold. This could partially explain why the olfactory pathways have been missed by many previous studies utilizing only the highest ranking genes in over-representation-based gene set analysis.
Consistent with the previous findings that tobacco exposure leads to neural tube defects, the present study identified altered methylation in genes that play role in regulation of smoothened signaling pathway involved in dorsal ventral neural tube [56]. Gene sets representing semaphoring receptor activity and G-protein-coupled receptor identified in our study have been annotated to the immune system. Altered methylation of these receptors due to smoking might deteriorate regulation of immunity [57]. We also identified altered methylation in genes involved in the thrombin signaling pathway that suggests that smoking plays role in mechanisms of coronary thrombosis [58]. Gene sets based on aryl hydrocarbon receptor signaling, RORA activates gene expression, and RARRXR pathway identified in our study are related to gene regulation and consequently have a role in cancer. Similarly, we identified a set of genes in glioma cells that are up-regulated after knockdown of SPARC (secreted protein acidic and cysteine rich) by RNAi. Previous study has shown that nicotine is associated with stimulation of malignant behavior of glioma cells [59]. To summarize, in addition to the novel findings such as smoking-induced alteration in methylation in olfactory system and ultraviolet radiation sensitivity genes, the previously identified biological processes and pathways in cardiovascular disease, immune response, and cancer were prominent in our study.
The main weaknesses of the present study are the small sample size, self-reported smoking status and alcohol usage information, and occupation-based SES assessment. However, nearly all of the significant hits replicated the previous findings indicating the robustness of our novel bioinformatics approach. Furthermore, we were able to extend the current knowledge by discovering novel sites (cg26038589, cg10385208, cg09355027) with plausible connections to the diseases associated with smoking.
Another acknowledged issue is the missing gene annotations of measured CpG sites, which affects nearly every study on the effects of smoking on the DNA methylation, the majority conducted using the Illumina HM450 methylation array. In the present study, 16 significant CpG sites were not mapped to any gene, leaving the regulation potential and the participation in the biological pathways completely unknown. CpGs were mapped to gene sets via genes. CpGs mapping to any genomic region of the corresponding genes (for example, coding or promoter regions) were considered. As the pooling of CpGs sites from different regions of different genes was done, speculation on whether the altered methylation activates or deactivates gene expression is inconclusive and thus outside the scope of this work. However, it is important to note that function of DNA methylation varies with different genomic contexts [60].
The next logical step would be to include analysis based on the transcriptomics of the smoking-associated gene sets, elucidating the regulatory potential of the DNA methylation with respect to gene expression.

Conclusions
The adverse health effects of smoking and the damage repair responses of the human body are presumably mediated by the epigenetic mechanisms regulating the gene expression. Extending previous works, the present study replicates 57 methylation sites and presents three novel sites (cg26038589, cg10385208, cg09355027) that potentially have roles in the cardiovascular disease, cancer, and immune response. As the most significant novel result, smoking alters methylation in the gene sets related to olfactory sensing system, which undergoes intense regeneration under tobacco smoke and toxin exposure. The other significant gene sets with smoking induced alteration in methylation related to cardiovascular disease, cancer, and immune response replicate the findings of the previous studies indicating the robustness of our novel bioinformatics analysis [26].