ChIP-seq / Find peaks using MACS, treatment vs. control
Description
This tool identifies statistically significantly enriched genomic regions in ChIP-seq data.
The analysis is performed on one treatment sample against a sample from a control experiment.
Parameters
- File format (ELAND, BAM, BED) [BAM]
- MACS version (1.4.2, 2) [1.4.2]
- Mappable genome size (hg18, hg19, mm9, mm10, rn5, user-specified) [hg19]
- User-specified mappable genome size []
- Read length (1..200) [25]
- Bandwidth (1..1000) [200]
- P-value cutoff (0..1) [0.00001]
- Peak model (yes, no) [yes]
- Upper M-fold cutoff (1..100) [30]
- Lower M-fold cutoff (1..100) [10]
- Input datasets (treatment data file, control data file)
Details
This tool is based on the Model-Based Analysis of ChIP-Seq (MACS) algorithm. The bandwidth, the p-value cutoff, the choice of whether to build a peak model and the M-fold cutoffs should be tuned to obtain optimal results.
A good starting point for the bandwidth, which determines the window size used by MACS to scan the genomic regions,
is a value roughly half the average fragment length of the DNA.
Lowering the p-value cutoff will increase the specificity of the found peak regions, but may result in a restrictively low number of peaks.
Conversely, applying a less stringent cutoff will improve sensitivity, but may yield too high a proportion of false positives.
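The effect of the cutoff can be illustrated with a minimal Python sketch. The peak names and p-values below are made up for illustration only, and the function names are hypothetical. MACS 1.4 reports peak significance as -10*log10(p-value); whether positive-peaks.tsv keeps that exact column is an assumption here.

```python
import math


def pvalue_to_score(pvalue: float) -> float:
    """Convert a p-value to the -10*log10(p-value) scale."""
    return -10.0 * math.log10(pvalue)


def filter_peaks(peaks, cutoff):
    """Keep only the peaks whose p-value is at or below the cutoff."""
    return [peak for peak in peaks if peak["pvalue"] <= cutoff]


# Toy data, NOT real MACS output.
toy_peaks = [
    {"name": "peak_1", "pvalue": 1e-12},
    {"name": "peak_2", "pvalue": 1e-6},
    {"name": "peak_3", "pvalue": 1e-4},
    {"name": "peak_4", "pvalue": 1e-2},
]

for cutoff in (1e-5, 1e-3):
    kept = filter_peaks(toy_peaks, cutoff)
    print(f"cutoff {cutoff:g} (score {pvalue_to_score(cutoff):.0f}): "
          f"{len(kept)} peaks kept")
```

The stricter cutoff keeps fewer of the toy peaks, which mirrors the trade-off between specificity and sensitivity described above.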
By default MACS tries to build a peak model, based on the bandwidth and the M-fold cutoff limits, that is later used to determine statistically
significantly enriched genomic regions. A small value for the lower cutoff will provide a greater number of peak regions for model building,
but may bring spurious data into the model, which would adversely affect the peak finding results.
A high value for the lower cutoff improves the quality of the model, but may exclude so many candidate model regions
that model building fails. If a good compromise setting is difficult to obtain, there is the option to turn off the model building altogether,
resulting in MACS simply using a shift-size of 100 bp when searching for peaks.
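For readers who want to relate the parameters above to a MACS 1.4 command line, the following Python sketch assembles an argument list for the macs14 executable. The mapping from this tool's parameters to MACS flags is an assumption and not necessarily how the tool calls MACS internally. The function name, the file names and the run name are hypothetical, and the genome size is passed through as a string (for example the MACS shortcut "hs" for human, or a number).

```python
from typing import List


def build_macs14_args(
    treatment: str,
    control: str,
    file_format: str = "BAM",
    genome_size: str = "hs",   # assumed shortcut for a human genome
    read_length: int = 25,
    bandwidth: int = 200,
    pvalue: float = 1e-5,
    build_model: bool = True,
    mfold_lower: int = 10,
    mfold_upper: int = 30,
    name: str = "macs_run",    # hypothetical output prefix
) -> List[str]:
    """Assemble a macs14 argument list from the parameters listed above."""
    if not 1 <= mfold_lower < mfold_upper <= 100:
        raise ValueError("M-fold cutoffs must satisfy 1 <= lower < upper <= 100")

    args = [
        "macs14",
        "-t", treatment,
        "-c", control,
        "-f", file_format,
        "-g", genome_size,
        "-s", str(read_length),
        "--bw", str(bandwidth),
        "-p", format(pvalue, "g"),
        "-n", name,
    ]
    if build_model:
        # Model building uses the lower and upper M-fold limits.
        args += ["-m", f"{mfold_lower},{mfold_upper}"]
    else:
        # Without a model, a fixed shift size of 100 bp is used.
        args += ["--nomodel", "--shiftsize", "100"]
    return args


print(" ".join(build_macs14_args("treatment.bam", "control.bam")))
print(" ".join(build_macs14_args("treatment.bam", "control.bam",
                                 build_model=False)))
```

The returned list can be passed to subprocess.run on a system where macs14 is installed. The sketch only builds the arguments and does not run MACS itself.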
Output
The analysis output consists of the following:
- positive-peaks.tsv: A table containing all the genomic regions, including a number of associated descriptive statistics, that have been found to be statistically significantly enriched at the p-value cutoff level defined by the user.
- negative-peaks.tsv: A table containing all the genomic regions, including a number of associated descriptive statistics, that have been found to be statistically significantly enriched upon performing a sample swap. These results can serve as a sort of "negative control".
- analysis-log.txt: A text file listing the output from the various steps of the MACS algorithm, which can be useful for diagnostic purposes and to get to know the details of the process.
- model-plot.png: If the parameter that enables model building is turned on and the model building is successful, a plot of the model is generated.
The shape of the modelled peaks, together with the estimated shift size, may help in assessing the quality of the model.
Smooth curves and a shift size compatible with the expected binding properties of the transcription factor in question are likely indicators of a successful model.
References
This tool uses the MACS package for peak detection. Please read the following article for more detailed information.
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. (2008) Model-based Analysis of ChIP-Seq (MACS). Genome Biology 9(9):R137.