Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a method for identifying the genomic regions that are bound by specific proteins or marked by specific histone modifications. It uses antibodies against these targets and next-generation sequencing (NGS) technology to obtain the DNA sequences of the bound regions. Like other NGS-based methods, ChIP-seq requires basic data analysis skills. In this beginner’s guide, we explain the basic steps of the ChIP-seq experiment and of its data analysis. The goal of this article is to help readers better understand ChIP-seq experiments, the data analysis, and the insights that can be drawn from the results.

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a powerful method that allows researchers to map protein-DNA interactions and histone modification regions across the genome. DNA interacts with a variety of proteins, such as transcription factors and replication factors, through direct binding (Figure 1). In addition, chemical modifications of histone proteins, which wrap DNA to form chromatin, regulate the biological functions of genomic regions such as enhancers and promoters. Histone modifications are associated with the formation of euchromatin, an open and accessible chromatin region, and heterochromatin, a compact and less accessible region. Open chromatin regions are preferentially bound by proteins such as transcription factors (TFs) and other regulators. Genes close to open chromatin regions are regulated by TFs and are thought to have higher expression levels. Some of these protein-binding and histone modification sites change dynamically across biological conditions.

Figure 1

Workflow of ChIP-seq analysis.

Historically, the ChIP-qPCR method was used to identify protein-DNA binding in specific genomic regions of interest to researchers. However, with the development of ChIP-seq, researchers can generate data on binding sites across the entire genome in a single experiment. This advancement has significantly progressed epigenomic analysis, allowing researchers to study protein-DNA interactions in various species, cell types, developmental stages, and diseases. ChIP-seq has become a key technique in the field by offering a comprehensive genome-wide view of protein-DNA interactions.

The ChIP-seq workflow is typically divided into two major phases: wet experiments and dry analysis (Figure 1). Both require specific skill sets.

  • Wet experiment steps require expertise in molecular biology techniques. The experiment proceeds through cross-linking of protein and DNA, fragmentation of cross-linked DNA, immunoprecipitation of the target protein, DNA purification, library preparation, and sequencing. The output of the wet experiment is sequence data, which is then passed to the data analysis step.

  • Dry analysis requires a basic understanding of statistics and command-line software operations. This step involves identifying binding regions and performing comparative analyses between samples. The main outputs of dry analysis are lists of binding regions, of regions that overlap between samples, and of regions unique to each sample. Using these lists, downstream analyses can be performed to extract biological insights. One such downstream analysis is Gene Ontology (GO) enrichment analysis of peak regions, which provides insights into the biological functions associated with the peak regions. If the peak regions correspond to TF binding sites, motif analysis can reveal the nucleotide sequences that the target TF tends to bind. When performing epigenomic analysis targeting histone modifications, it is possible to annotate chromatin states, ranging from active transcription start site (TSS) to quiescent site.
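The idea behind GO enrichment analysis mentioned above can be sketched with a hypergeometric test, which asks whether genes near peaks carry a given GO term more often than expected by chance. This is a minimal illustration, not the procedure of any particular tool; the function name and all gene counts are hypothetical, and it assumes that peaks have already been assigned to nearby genes.

```python
from math import comb

def hypergeom_upper_tail(k, population, annotated, sample):
    """P(X >= k): chance of seeing at least k annotated genes in a random sample."""
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, min(annotated, sample) + 1)
    )
    return favourable / total

# Hypothetical numbers, for illustration only.
n_genes = 20000       # all annotated genes (background set)
n_term = 200          # genes carrying the GO term of interest
n_peak_genes = 500    # genes located near ChIP-seq peaks
n_overlap = 20        # peak-associated genes that carry the GO term

expected = n_peak_genes * n_term / n_genes
p_value = hypergeom_upper_tail(n_overlap, n_genes, n_term, n_peak_genes)
print(f"expected by chance: {expected:.1f}, observed: {n_overlap}, p = {p_value:.2e}")
```

In practice many GO terms are tested at once, so the resulting p-values would also need correction for multiple testing.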

The wet experiment in a ChIP-seq workflow involves the key steps outlined above. The process begins with the cross-linking of proteins and DNA using formaldehyde, which preserves the interactions between proteins and DNA in the cell. The fixation solution, temperature, and time must be optimized for each experiment. Extended fixation can lead to the unintended recovery of soluble proteins that are not typically associated with DNA, raising the likelihood of false positives due to protein-protein cross-linking. After cross-linking, the cross-linked DNA is fragmented by sonication or enzymatic digestion into pieces of about 200 to 600 base pairs. These DNA fragments, whose sequences are called “reads” once they have been determined by next-generation sequencing (NGS), are the material for identifying binding sites throughout the genome. Once the DNA has been fragmented, the next step is immunoprecipitation, in which genomic regions bound by the protein or carrying the histone modification of interest are isolated using target-specific antibodies. The antibody recognizes the target protein that has been cross-linked to DNA. After immunoprecipitation, the cross-links formed by formaldehyde are reversed by heating in a lysis buffer containing sodium dodecyl sulfate (SDS), which disrupts protein structure, and the DNA fragments are purified. The purified DNA fragments are then prepared for sequencing by adding adaptor sequences to each fragment, creating what is called a sequencing library. Finally, the next-generation sequencer reads the bases of the DNA fragments, which represent the genomic regions bound by the target protein or carrying the target histone modification. The quality of the resulting sequence data depends on both the specificity of the antibody used during immunoprecipitation and the effectiveness of the purification process.

ChIP-seq involves many sample preparation steps and is technically demanding, so data quality varies, even among published datasets. It is therefore important to understand the characteristics of a dataset before analyzing it. After the DNA library has been prepared, it can be sequenced in one of two ways: single-end or paired-end. Single-end sequencing reads each DNA fragment from one end only, which suits short fragments and is more cost-effective. Paired-end sequencing reads each fragment from both ends, which makes it better suited to complex genomic regions. In ChIP-seq, single-end reads are commonly used instead of paired-end reads because the DNA fragments are typically 200–600 base pairs long, shorter than those used in transcriptome sequencing. Early sequencers were limited to short read lengths (up to 36 bases), whereas current instruments routinely provide reads of 100 base pairs or more. The number of reads required for ChIP-seq analysis depends on the target and is typically on the order of tens of millions of reads per sample, with broad histone marks generally requiring more reads than sharp transcription factor peaks. Because DNA molecules can be lost under various conditions, such as low binding affinity of the target protein or reduced antibody activity or specificity, ChIP-seq experiments often require pooling one million to ten million cells, depending on the target. For example, targets such as RNA polymerase II, which is abundantly present in many regions of the genome, or localized histone modifications such as H3K4me3 can yield sufficient signal even with fewer cells. However, for low-abundance proteins or histone modifications that are widely distributed across the genome, larger numbers of cells are recommended to ensure accurate quantification. To ensure reliable results, appropriate negative controls during immunoprecipitation and input controls for read normalization are essential to account for background noise.

During library preparation, short fragments called spike-ins are frequently introduced to facilitate signal normalization and to enable quantitative comparison of samples on an absolute scale. This is particularly important when analyzing multiple samples simultaneously. To further enhance the signal, the DNA fragments are often amplified by PCR before sequencing. The sequencing approach using pooled cells is known as bulk sequencing. Recent technological advancements have enabled the development of single-cell resolution methods, which provide a higher level of detail in the analysis of protein-DNA interactions.
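As an illustration of how spike-ins can support normalization, the sketch below rescales each sample so that the number of spike-in reads becomes equal across samples. It assumes that the same amount of spike-in material was added to every sample; the function name, sample names, and read counts are hypothetical, and real pipelines may define the scale factor differently.

```python
def spikein_scale_factors(spikein_reads):
    """Scale factors that equalise spike-in read counts across samples."""
    reference = min(spikein_reads.values())
    return {sample: reference / n for sample, n in spikein_reads.items()}

# Hypothetical read counts, for illustration only.
spikein_reads = {"control": 200_000, "treated": 400_000}       # reads mapped to spike-in
target_reads = {"control": 10_000_000, "treated": 10_000_000}  # reads mapped to the target genome

factors = spikein_scale_factors(spikein_reads)
normalized = {s: target_reads[s] * factors[s] for s in target_reads}

for sample in target_reads:
    print(f"{sample}: scale factor {factors[sample]:.2f}, "
          f"normalized signal {normalized[sample]:,.0f}")
```

Here the treated sample has twice as many spike-in reads for the same number of target reads, so its signal is halved relative to the control after normalization.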

After obtaining the sequencing data, computational analysis is performed to extract meaningful biological insights from the raw sequences.

The dry analysis step extracts biological insights from the sequence data. First, sequence reads are mapped to the reference genome to determine the genomic region from which each DNA fragment originated. As an example, suppose the experiment aims to identify the binding sites of protein A. Because the reads obtained by chromatin immunoprecipitation are concentrated in the DNA regions where protein A was bound, the mapped reads pile up into “peaks” (Figure 2), which are commonly identified using peak callers such as MACS2. Identifying these peak regions yields the binding regions of protein A throughout the genome. Because some reads map to unbound regions as well, identifying peaks requires statistical processing: a read distribution model (background model) representing the number of mapped reads in non-peak regions (background) is estimated, and sites where reads are significantly concentrated compared with that model are detected as peaks.
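The background-model idea can be sketched as follows: reads are counted in fixed-size windows, the background is modelled as a Poisson distribution whose rate is the mean count per window, and windows with an unexpectedly high count are reported. This is a deliberately simplified illustration with hypothetical counts and function names, not the algorithm implemented in MACS2, which among other refinements estimates the background locally and can use a control sample.

```python
import math

def poisson_pmf(i, lam):
    """P(X = i) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if k <= lam:
        # At or below the mean the tail is large, so the complement is accurate.
        return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))
    # Above the mean, sum the tail directly so very small p-values keep precision.
    term = poisson_pmf(k, lam)
    total = 0.0
    i = k
    while term > total * 1e-16:
        total += term
        i += 1
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
    return min(1.0, total)

def call_enriched_windows(counts, p_threshold=1e-5):
    """Return (window index, read count, p-value) for windows above background."""
    lam = sum(counts) / len(counts)  # uniform background rate (simplifying assumption)
    hits = []
    for index, count in enumerate(counts):
        p = poisson_upper_tail(count, lam)
        if p < p_threshold:
            hits.append((index, count, p))
    return hits

# Hypothetical read counts in ten consecutive windows.
window_counts = [3, 2, 4, 3, 40, 2, 3, 5, 2, 3]
hits = call_enriched_windows(window_counts)
for index, count, p in hits:
    print(f"window {index}: {count} reads, p = {p:.2e}")
```

Only the window with 40 reads passes the threshold here. Note that the peak itself inflates the estimated background rate in this toy version, which is one reason real peak callers estimate the background more carefully.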

Figure 2

Visualization of ChIP-seq peak patterns. (a) Three types of peak shapes around the gene body. (b) The figure shows the results of a ChIP-seq analysis performed on ENCODE HepG2 histone modifications (GEO: GSE29611). The visualization represents 127 Mb to 128 Mb in chromosome 10 with annotated gene region. The figure was generated using the Churros pipeline with default settings. Detected peaks identified by MACS2 are shown in yellow.

The shape of peaks obtained by ChIP-seq analysis varies depending on the factor. Peaks are broadly classified into three categories (Figure 2a): sharp peaks that concentrate locally, broad peaks that spread thinly over a wide range, and mixed peaks that combine the features of sharp and broad peaks. Sharp peaks are typical of most transcription factors, which bind to specific genomic sites, and of histone modifications that mark promoter regions, such as H3K4me3. In contrast, histone modifications such as H3K36me3, which concentrates in the interior of active genes, H3K27me3, a marker of inactive genomic regions, and H3K9me3, a marker of heterochromatin, spread over long genomic regions of up to several million bases, forming broad peaks. Mixed peaks have a sharp, strong peak near the transcription start site and a weak, broad enrichment extending into the gene interior, as seen for RNA polymerase II (Pol2) and the super elongation complex. Different peak detection strategies are required for each shape, but since most ChIP-seq data correspond to sharp peaks, sharp-peak detection can be assumed to be the default unless otherwise stated.

Various plots are used to visualize peaks in ChIP-seq and to investigate their intensity and distribution. Commonly, the signal is displayed as a bar graph in which the x-axis represents the genomic position and the y-axis shows the density of mapped reads (Figure 2b). Confirming that known binding sites of the target protein appear as peaks helps validate the experiment’s accuracy and the antibody’s specificity. These quality control measures are crucial for robust ChIP-seq results.
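The bar-graph view described above is built from read counts along the genome. The following sketch bins hypothetical read start positions into fixed-width windows and prints a simple text histogram; the function name, positions, and bin size are illustrative assumptions, whereas genome browsers work from coverage files produced by dedicated tools.

```python
from collections import Counter

def binned_coverage(read_starts, bin_size, region_start, region_end):
    """Count reads whose start position falls in each fixed-width bin of a region."""
    counts = Counter(
        (pos - region_start) // bin_size
        for pos in read_starts
        if region_start <= pos < region_end
    )
    n_bins = -(-(region_end - region_start) // bin_size)  # ceiling division
    return [counts.get(i, 0) for i in range(n_bins)]

# Hypothetical read start positions within a 40-base region.
read_starts = [5, 12, 14, 18, 19, 33]
region_start, region_end, bin_size = 0, 40, 10

coverage = binned_coverage(read_starts, bin_size, region_start, region_end)
for i, n in enumerate(coverage):
    print(f"{region_start + i * bin_size:>4} {'#' * n}")
```

The second bin, which collects four of the six reads, is the position that would appear as a peak in the plot.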

After peak calling in ChIP-seq analysis, several downstream tools are essential for further exploration, and understanding their functionality is important for effective analysis. IGV is used to visualize peaks, allowing researchers to visually examine binding regions across the genome. HOMER provides tools for functional annotation of peak sites, helping to interpret the biological roles of binding sites. For comparative studies, BEDTools is used to identify overlapping peaks between samples, which is essential for cross-sample comparisons. ChromHMM is an algorithm based on hidden Markov models (HMMs) that predicts chromatin states from histone modification patterns, offering insights into the regulatory landscape of the genome (Figure 3a). The emission parameters of each state represent the probabilities of observing specific histone modifications in that chromatin state. In the visualization heatmap, darker blue indicates a higher probability that a histone modification is enriched in regions of the corresponding state. deepTools enables clustering of peaks, revealing patterns and trends in binding across multiple regions (Figure 3b). The plotHeatmap function in deepTools generates a heatmap to visualize signal intensity at peaks or at specific genomic regions (e.g., TSS and transcription end site (TES)) (Figure 3c). This plot is useful for comparing different samples and understanding overall peak patterns. The plotHeatmap function is also useful for quality control through comparison with input reads, which come from samples sequenced without immunoprecipitation and are therefore not enriched for specific protein-binding regions. For example, Figure 3c shows the peak intensity of several histone modifications in HepG2 cells across gene regions on chromosomes 19 and X. This plot shows that H3K4me3 and H3K27ac, which are enriched at the TSS of active genes, are located in similar gene regions. Additionally, H3K36me3, known to localize in the gene bodies of active genes, is also present in these gene regions. In contrast, the signal of H3K27me3, which is associated with gene silencing, is lower near the active TSSs.
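The cross-sample comparison of peaks comes down to testing whether genomic intervals overlap. The sketch below shows that logic in plain Python for two small, hypothetical peak lists. It is only meant to convey the idea behind an overlap query; BEDTools itself works on files and is far more efficient on real data.

```python
def overlapping_peaks(peaks_a, peaks_b):
    """Return peaks from peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples using 0-based, half-open
    coordinates, as in the BED format.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))

    shared = []
    for chrom, start, end in peaks_a:
        candidates = by_chrom.get(chrom, [])
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < b_end and b_start < end for b_start, b_end in candidates):
            shared.append((chrom, start, end))
    return shared

# Hypothetical peak lists from two samples.
sample_1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
sample_2 = [("chr1", 150, 250), ("chr2", 300, 400)]

shared = overlapping_peaks(sample_1, sample_2)
unique = [peak for peak in sample_1 if peak not in shared]
print("shared with sample 2:", shared)
print("unique to sample 1:", unique)
```

The same function, called with the arguments swapped, gives the peaks of the second sample that are shared with or unique relative to the first.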

Figure 3

Chromatin functional annotation and visualization of histone modification patterns. This figure shows an example of downstream ChIP-seq analysis of histone modifications (H3K4me3, H3K27ac, H3K27me3, H3K36me3) and a histone variant (H2A.Z) from HepG2 cells (GEO: GSE29611). (a) Functional annotation of chromatin regions in the sample genome using histone marks, performed with ChromHMM. The annotation is based on a pre-trained model that assigns functional states to genomic regions according to combinations of histone modifications. In this figure, the regions in state 1 are predicted to include active TSSs. (b) The similarity of the peak regions for each sample used in the analysis, visualized using Spearman correlation. (c) The plotHeatmap function in deepTools visualizes the intensity of each peak. In this plot, peak intensity was calculated as the log2 ratio of the histone modification signal over the input (control) sample. The x-axis represents the gene body, flanked by 3000 bases upstream of the transcription start site (TSS) and 5000 bases downstream of the transcription end site (TES). The y-axis represents genes on the chromosome. Plotted regions can be customized with a list of regions of interest. An average profile for each histone modification is displayed at the top.
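The log2 ratio of ChIP signal over input mentioned in the caption can be illustrated with a short calculation. The sketch below first converts read counts to counts per million to account for sequencing depth and then adds a pseudocount to avoid division by zero; the function name, the normalization choice, the pseudocount value, and all counts are assumptions for illustration and do not necessarily match the settings used to produce the figure.

```python
import math

def log2_enrichment(chip_count, input_count, chip_total, input_total, pseudocount=1.0):
    """log2 ratio of depth-normalised ChIP signal over input for one region."""
    chip_cpm = chip_count / chip_total * 1e6     # counts per million mapped reads
    input_cpm = input_count / input_total * 1e6
    return math.log2((chip_cpm + pseudocount) / (input_cpm + pseudocount))

# Hypothetical counts for a single region.
enriched = log2_enrichment(chip_count=300, input_count=50,
                           chip_total=20_000_000, input_total=10_000_000)
background = log2_enrichment(chip_count=100, input_count=50,
                             chip_total=20_000_000, input_total=10_000_000)
print(f"enriched region: {enriched:.2f}")
print(f"background region: {background:.2f}")
```

Positive values indicate regions where the ChIP sample has more signal than the input, and values near zero indicate background.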

In this section, we highlight recent advancements and future developments in ChIP-seq analysis, including data imputation and the application of ChIP-seq to emerging technologies such as single-cell approaches. Gene expression regulation by TFs is cell-type specific, and it is difficult to obtain data for all transcription factors owing to experimental cost. In such cases, it is useful to draw on publicly available data from existing studies for large-scale comparative analysis, for example through ChIP-Atlas, a database that provides ChIP-seq data for various transcription factors in various cell types. The problem of cell-type specificity also applies to histone modification data. For example, in human vascular endothelial cells, ChIP-seq results revealed distinct enhancer patterns between the upper and lower body, indicating region-specific gene expression in endothelial cells from different body parts. Histone modification data from various cell types can be collected from databases such as ENCODE, the Roadmap Epigenomics Consortium, and the International Human Epigenome Consortium (IHEC). Using these databases, we can collect chromatin state information to enhance genome functional annotation. However, these databases have not yet collected data on all histone modifications in all cell types, which can create bias in comprehensive analysis. To address this problem, imputation methods such as Avocado have been developed. These computational imputation methods can support machine learning studies that require a wide range of datasets.

Furthermore, ChIP-seq can be applied not only to cell lines but also to tissue and tumor samples. Emerging techniques such as single-cell ChIP-seq provide novel insights into tissue and tumor heterogeneity, uncovering regulatory elements specific to individual cell types. After cells with similar peak patterns have been grouped by clustering, the downstream analyses described in the previous section can also be performed. In addition to advancements in ChIP-seq technology itself, the reannotation of data enabled by progress in related fields is becoming increasingly important. The application of ChIP-seq to the T2T genome, the latest version of the human reference genome, is opening new regions for annotating functional elements that were previously left uncharacterized owing to repetitive sequences. However, it has been reported that users should pay attention to how the T2T genome affects ChIP-seq quality metrics and peak-calling results. These advancements illustrate how ChIP-seq continues to evolve, providing researchers with more comprehensive tools to explore regulatory mechanisms.

  • ChIP-seq is a method for identifying the genomic regions that are bound by proteins or marked by histone modifications.

  • ChIP-seq analysis can be divided into two steps: wet experiments and dry analysis.

  • In wet experiments, antibodies isolate genomic binding regions, which are prepared for sequencing.

  • In dry analysis, the sequenced DNA reads are mapped to the reference genome sequence to determine the locations of protein binding sites.

Further reading

  • Overview of the ChIP-seq analysis

  • Park, P. J. (2009) ChIP–seq: advantages and challenges of a maturing technology. Nat Rev Genet 10, 669–680. https://doi.org/10.1038/nrg2641 (Review of the overall ChIP-seq topic).

  • Nakato, R. & Shirahige, K. (2017) Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Briefings in Bioinformatics, 18, 279–290. https://doi.org/10.1093/bib/bbw023 (Review of ChIP-seq analysis, especially on the quality control and data analysis).

  • Jiang, S. & Mortazavi, A. (2018) Integrating ChIP-seq with other functional genomics data. Briefings in Functional Genomics, 17, 104–115. https://doi.org/10.1093/bfgp/ely002 (Review of integrating ChIP-seq data with other multi-omics data).

  • Nakato, R. et al. (2019) Comprehensive epigenome characterization reveals diverse transcriptional regulation across human vascular endothelial cells. Epigenetics & Chromatin 12, 77. https://doi.org/10.1186/s13072-019-0319-0 (Tissue-specific enhancer annotation in human endothelial cells using multi-omics analysis).

  • Grosselin, K. et al. (2019) High-throughput single-cell ChIP-seq identifies heterogeneity of chromatin states in breast cancer. Nat Genet 51, 1060–1066. https://doi.org/10.1038/s41588-019-0424-9 (Single-cell application).

  • ChIP-seq analysis tools

  • Ernst, J. & Kellis, M. (2017) Chromatin-state discovery and genome annotation with ChromHMM. Nat Protoc 12, 2478–2492. https://doi.org/10.1038/nprot.2017.124 (Histone modification ChIP-seq analysis tool).

  • Zhang, Y. et al. (2008) Model-based Analysis of ChIP-Seq (MACS). Genome Biol 9, R137. https://doi.org/10.1186/gb-2008-9-9-r137 (ChIP-seq peak estimation tool).

  • Robinson, J. T. et al. (2011) Integrative Genomics Viewer. Nature Biotechnology 29, 24–26. (ChIP-seq peak visualization tool).

  • Heinz, S. et al. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 38, 576–589. (HOMER: ChIP-seq peak annotation tool; tool page: http://homer.ucsd.edu/homer/).

  • Ramírez, F. et al. (2016) deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Research 44, W160–W165. https://doi.org/10.1093/nar/gkw257 (ChIP-seq data analysis and visualization tool).

  • Wang, J. & Nakato, R (2024) Churros: a Docker-based pipeline for large-scale epigenomic analysis. DNA Research, 31, dsad026. https://doi.org/10.1093/dnares/dsad026 (ChIP-seq analysis pipeline).

  • Schreiber, J. et al. (2020) Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome. Genome Biol 21, 81. https://doi.org/10.1186/s13059-020-01977-6 (ChIP-seq imputation).

Gina M. Oba is a project researcher at the Institute of Quantitative Biosciences, The University of Tokyo. She received her Ph.D. in Computational Biology from the University of Tokyo in 2024. Her research focuses on extracting biological insights from large-scale multi-omics data, utilizing metrics such as gene publication number and tissue specificity scores. She is also passionate about scientific communication and science education, actively engaging as an assistant in children’s science lectures, a research advisor for school students, and a volunteer at a science museum, where she interacts with visitors of all ages. Email: [email protected].

Ryuichiro Nakato is an assistant professor for Computational Genomics at the Institute of Quantitative Biosciences, the University of Tokyo. He earned a PhD in informatics at the Department of Intelligence Science and Technology, Kyoto University. He moved to the University of Tokyo as a research associate and started his career as a computational genomics researcher using NGS. He has developed multiple tools for NGS analysis, including ChIP-seq. In 2019, he became a principal investigator at the University of Tokyo. His main research interest is data-driven analysis using large-scale multi-omics NGS datasets. Email: [email protected]. Twitter: @RyuichiroNakato

Published by Portland Press Limited under the Creative Commons Attribution License 4.0 (CC BY-NC-ND)