Find peaks using MACS2

Description

This tool identifies genomic regions that are statistically significantly enriched in ChIP-seq and DNase-seq data using the MACS2 algorithm (Model-Based Analysis of ChIP-Seq).

Parameters

File format (ELAND, BAM, BED) [BAM]
Mappable genome size (hg18, hg19, mm9, mm10, rn5, user-specified) [hg19]
User-specified mappable genome size []
q-value cutoff (0..0.99) [0.01]
Read length (1..200) [0]
Keep duplicates (auto, all, 1) [auto]
Build peak model (yes, no) [yes]
Bandwidth (1..1000) [300]
Upper M-fold cutoff (1..100) [30]
Lower M-fold cutoff (1..100) [10]
Extension size (1..1000) [200]
Call broad peaks (yes, no) [no]
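
To make the parameters more concrete, the following Python sketch shows one way they could be translated into a macs2 callpeak command line. The function name, its argument names and the mapping itself are assumptions made for illustration and do not describe how the tool actually invokes MACS2. The "hs" genome size shortcut and the output name prefix are likewise assumed. As a simplification, the extension size is passed only when model building is switched off.

```python
def build_macs2_command(treatment, control=None, file_format="BAM",
                        genome_size="hs", q_value=0.01, read_length=0,
                        keep_dup="auto", build_model=True, bandwidth=300,
                        mfold=(10, 30), ext_size=200, broad=False,
                        name="macs2"):
    """Return a macs2 callpeak argument list (hypothetical parameter mapping)."""
    cmd = ["macs2", "callpeak", "-t", treatment]
    if control is not None:
        cmd += ["-c", control]
    cmd += ["-f", file_format,
            "-g", str(genome_size),
            "-q", str(q_value),
            "--keep-dup", str(keep_dup),
            "-n", name]
    # A read length of zero means "detect automatically", so no flag is passed.
    if read_length > 0:
        cmd += ["-s", str(read_length)]
    if build_model:
        # Bandwidth and M-fold cutoffs only matter when the model is built.
        cmd += ["--bw", str(bandwidth), "-m", str(mfold[0]), str(mfold[1])]
    else:
        # Without a model, reads are extended to a fixed size instead.
        cmd += ["--nomodel", "--extsize", str(ext_size)]
    if broad:
        cmd.append("--broad")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_macs2_command("treatment.bam", control="control.bam")))
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting problems with file names.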

Details

MACS2 performs several steps as described below, ranging from duplicate filtering and peak model building to the actual peak detection and multiple testing correction. It also has an option to link nearby peaks together in order to call broad peaks.

If the read length parameter is set to zero, MACS2 detects read length automatically. MACS2 then proceeds to filter out duplicate reads. By default it calculates the maximum number of duplicate reads in a single position warranted by the sequencing depth, and removes redundant reads in excess of this number. Alternatively, you can select to keep only one read, or all duplicates.
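
The duplicate limit described above can be sketched as follows. This is an illustration under stated assumptions rather than the MACS2 implementation: it assumes that the number of reads starting at one position follows a binomial distribution with success probability 1/genome size, and that the limit is the smallest count whose cumulative probability reaches 1 - 1e-5. The function names and the representation of reads as (chromosome, position, strand) tuples are hypothetical.

```python
import math
from collections import Counter


def max_duplicates(n_reads, genome_size, p_cutoff=1e-5):
    """Smallest k with P(X <= k) >= 1 - p_cutoff, X ~ Binomial(n_reads, 1/genome_size).

    Assumes genome_size > 1. At least one read per position is always allowed.
    """
    p = 1.0 / genome_size
    pmf = math.exp(n_reads * math.log1p(-p))  # P(X = 0)
    cdf = pmf
    k = 0
    while cdf < 1.0 - p_cutoff and k < n_reads:
        # Recurrence: P(X = k + 1) = P(X = k) * (n - k) / (k + 1) * p / (1 - p)
        pmf *= (n_reads - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cdf += pmf
    return max(k, 1)


def filter_duplicates(reads, limit):
    """Keep at most `limit` reads per identical (chromosome, position, strand)."""
    seen = Counter()
    kept = []
    for read in reads:
        seen[read] += 1
        if seen[read] <= limit:
            kept.append(read)
    return kept


if __name__ == "__main__":
    limit = max_duplicates(20_000_000, 2.7e9)
    reads = [("chr1", 100, "+")] * 5 + [("chr1", 200, "-")]
    print(limit, len(filter_duplicates(reads, limit)))
```

Setting the limit to 1 corresponds to keeping only one read per position, and skipping the filter corresponds to keeping all duplicates.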

MACS2 models the distance between the paired forward and reverse strand peaks from the data. It slides a window across the genome to find enriched regions, which have M-fold more reads than background. The size of the window is twice the bandwidth parameter. The expected background is the number of reads times their length divided by the mappable genome size. Note that the mappable genome size is always less than the real genome size because of repetitive sequence. By default, a region's fold enrichment must be higher than 10 (lower M-fold cutoff) and less than 30 (upper M-fold cutoff), but you can change these values if not enough regions are found. A smaller value for the lower cutoff provides more regions for model building, but it can also introduce spurious data into the model and thereby adversely affect the peak finding results. MACS2 uses 1000 enriched regions to model the distance d between the forward and reverse strand peaks.
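
As a worked example of the background and M-fold arithmetic, the sketch below computes the expected background from the formula above and checks whether a window falls between the two cutoffs. The way window enrichment is computed here (per-base coverage in the window divided by the expected background) is a simplifying assumption, and the function names and read counts are made up for illustration.

```python
def expected_background(n_reads, read_length, genome_size):
    """Expected per-base coverage: reads times read length over mappable genome size."""
    return n_reads * read_length / genome_size


def window_fold_enrichment(reads_in_window, read_length, bandwidth, background):
    """Coverage in a window of size 2 * bandwidth, relative to the background."""
    window_size = 2 * bandwidth
    coverage = reads_in_window * read_length / window_size
    return coverage / background


def usable_for_model(fold, lower=10, upper=30):
    """A region qualifies when its enrichment lies strictly between the cutoffs."""
    return lower < fold < upper


if __name__ == "__main__":
    # Illustrative numbers: 20 million 50 bp reads, 2.7e9 bp mappable genome.
    background = expected_background(20_000_000, 50, 2.7e9)
    fold = window_fold_enrichment(100, 50, 300, background)
    print(round(background, 4), round(fold, 2), usable_for_model(fold))
```

With these illustrative numbers a 600 bp window holding 100 reads is about 22.5-fold enriched, so it would qualify with the default cutoffs of 10 and 30.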

In the actual peak detection phase, MACS2 extends the reads in the 3' direction to the fragment length obtained from modeling. If the model building failed or if it was switched off, the reads are extended to the value of the extension size parameter. If a control sample is available, MACS2 scales the samples linearly to the same read number. It then selects candidate peaks by scanning the genome again, now using a window size which is twice the fragment length. MACS2 calculates a p-value for each peak using a dynamic Poisson distribution to capture local biases in read background levels. If a control sample is available, it is used to calculate the local background. Finally, q-values are calculated using the Benjamini-Hochberg correction.
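
The statistics of the peak detection phase can be sketched in a few lines of Python. This is a textbook illustration, not the MACS2 code. It assumes that the "dynamic" Poisson rate is the largest of the genome-wide background rate and several local rates (following the description in the MACS article), that all rates have already been scaled to the candidate peak's window, and that q-values come from the standard Benjamini-Hochberg procedure. All function names are hypothetical.

```python
import math


def local_lambda(genome_rate, local_rates):
    """Dynamic rate: the largest of the genome-wide and local background rates."""
    return max([genome_rate] + list(local_rates))


def poisson_pvalue(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0

    def pmf(i):
        return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))

    if k <= lam:
        # Lower tail is short here, so subtract it from one.
        return 1.0 - sum(pmf(i) for i in range(k))
    # Otherwise sum the upper tail directly to keep very small p-values accurate.
    total, i, term = 0.0, k, pmf(k)
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues


if __name__ == "__main__":
    lam = local_lambda(0.5, [0.2, 1.5, 0.9])
    pvals = [poisson_pvalue(k, lam) for k in (3, 8, 15)]
    print(benjamini_hochberg(pvals))
```

Taking the maximum rate makes the test conservative: a candidate peak in a region with locally high background needs more reads to reach the same p-value.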

Output

The analysis results consist of the following files:

macs2-peaks.tsv: List of peaks and their length, summit location and height, as well as their fold change, p- and q-value. Note that you can visualize this file in the Chipster genome browser.
macs2-narrowpeak.bed: List of peak locations with a q-value-based score, fold change, p-value, q-value and summit position relative to peak start, in narrowPeak format (BED6+4).
macs2-summits.bed: List of peak summits and q-values in BED format.
macs2-log.txt: A log file listing the output from the various steps, which can be useful for diagnostic purposes and to get to know the details of the process.
macs2-model.pdf: If the peak model building is successful, a plot of the model is generated. The shape of the modeled peaks allows you to assess the quality of the model.
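
For downstream scripting, a line of the narrowPeak file can be parsed as sketched below. The column order follows the ENCODE narrowPeak (BED6+4) layout. The field names are chosen for illustration, the example line is made up, and the assumption that the p-value and q-value columns are -log10 transformed should be checked against the MACS2 documentation for the version in use.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    fold_change: float
    neg_log10_p: float   # assumed to be -log10(p-value)
    neg_log10_q: float   # assumed to be -log10(q-value)
    summit_offset: int   # summit position relative to peak start


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak line into a NarrowPeak record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return NarrowPeak(fields[0], int(fields[1]), int(fields[2]), fields[3],
                      int(fields[4]), fields[5], float(fields[6]),
                      float(fields[7]), float(fields[8]), int(fields[9]))


def summit_position(peak):
    """Absolute summit coordinate on the chromosome."""
    return peak.start + peak.summit_offset


if __name__ == "__main__":
    # Made-up example line, for illustration only.
    example = "chr1\t1000\t1400\tmacs2_peak_1\t153\t.\t12.5\t18.2\t15.3\t210\n"
    peak = parse_narrowpeak_line(example)
    print(peak.chrom, summit_position(peak), peak.fold_change)
```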

References

Please cite the following article and the MACS2 website.

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based Analysis of ChIP-Seq (MACS). Genome Biology 2008;9(9):R137.