Dimont

Description

Dimont is a universal tool for de-novo motif discovery. Dimont has successfully been applied to ChIP-seq, ChIP-exo and protein-binding microarray (PBM) data.

Inputs

An annotated FastA file with sequences, e.g., from ChIP-seq experiments; can be generated using the "Dimont data extractor" tool.

Parameters

Position tag: The tag for the position information in the FastA-annotation of the input file, default as generated by "Dimont data extractor".
Value tag: The tag for the value information in the FastA-annotation of the input file, default as generated by "Dimont data extractor".
Standard deviation: The standard deviation of the position distribution centered at the position specified by the position tag.
Weighting factor: The value for weighting the data, a value between 0 and 1. Recommended values: 0.2 for ChIP-seq/ChIP-exo, 0.01 for PBM data.
Starts: The number of pre-optimization runs. Default value is fine for most applications.
Initial motif width: The motif width that is used initially, may be adjusted during optimization.
Markov order of motif model: The Markov order of the model for the motif. A value of 0 (default) specifies a position weight matrix (PWM) or position-specific scoring matrix (PSSM), a value of 1 specifies a weight array matrix (WAM) model.
Markov order of background model: The Markov order of the model for the background sequence and the background sequence, -1 defines uniform distribution.
Equivalent sample size: Reflects the strength of the prior on the model parameters. Default value is fine for most applications.
Delete BSs from profile: A switch for deleting binding site positions of discovered motifs from the profile before searching for futher motifs.

Details

Input sequences must be supplied in an annotated FastA format as generated using the "Dimont sequence extractor" tool. In the annotation of each sequence, you need to provide a value that reflects the confidence that this sequence is bound by the factor of interest. Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. In case of ChIP data, this anchor position could for instance be the peak summit. For instance, an annotated FastA file for ChIP-seq data comprising sequences of length 1000 centered around the peak summit could look like:

> peak: 500; signal: 515 ggccatgtgtatttttttaaatttccac... > peak: 500; signal: 199 GGTCCCCTGGGAGGATGGGGACGTGCTG... ...
where the anchor point is given as 500 for the first two sequences, and the confidence amounts to 515 and 199, respectively. The FastA comment may contain additional annotations of the format key1 : value1; key2: value2;....code>

Accordingly, you would need to set the parameter "Position tag" to peak and the parameter "Value tag" to signal for the input file.

For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. However, you may want adjust these parameters to meet your prior information.

The parameter "Markov order of the motif model" sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to 0, you obtain a position weight matrix (PWM) model. If it is set to 1, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most 3.

The parameter "Markov order of the background model" sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. If this parameter is set to -1, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to 4 resulted in an increased prediction performance in our case studies. The maximum allowed value is 5.

The parameter "Weighting factor" defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of 0.2 typically works well. For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. 0.01.

The "Equivalent sample size" reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.

The parameter "Delete BSs from profile" defines if BSs of already discovered motifs should be deleted, i.e., "blanked out", from the sequence before searching for futher motifs.

Output

dimont-log.txt: Logfile, logfile of the Dimont run.
dimont-predictions-*.txt: Predictions, binding sites predicted by Dimont.
dimont-logo-rc-*.png: Sequence logo (rc), the sequence logo of the reverse complement of the motif discovered by Dimont.
dimont-logo-*.png: Sequence logo, the sequence logo of the motif discovered by Dimont.
dimont-model-*.xml: Dimont model, the model (as XML) of the motif discovered by Dimont. Can be used in "DimontPredictor".

Reference

If you use Dimont, please cite

J. Grau, S. Posch, I. Grosse, and J. Keilwagen. A general approach for discriminative de-novo motif discovery from high-throughput data. Nucleic Acids Research, 41(21):e197, 2013.