Dimont sequence extractor using own genome

Description

Dimont sequence extractor prepares the annotated FastA file required by Dimont from a genome (in FastA format) and a tabular file of genomic regions (e.g., BED, GTF, or narrowPeak).

This version of the tool allows you to use custom genomes.

Inputs

The genomic regions to be extracted, in a BED-like tabular format, e.g., BED, GTF, or narrowPeak.
The input genome to which the genomic regions refer, as a single FastA file.

Parameters

Chromosome column: The column of the Regions file that contains the chromosome information.
Start column: The column of the Regions file containing the start position of the genomic region.
Second coordinate: The column containing the second genomic coordinate, whose meaning is specified by the parameter "Meaning of second coordinate".
Meaning of second coordinate: The meaning of the second genomic coordinate. This may either be the position of the peak summit relative to the position in Start, or the end position of the peak.
Statistics column: The column containing the peak statistics information (or another measure of peak confidence).
Width: The width of the genomic region to be extracted. Recommended values: 1000 for ChIP-seq and 100 for ChIP-exo.
Regions file has row names: Select "yes" if the Regions file has row names in its first column, e.g., is in a general TSV-like format.
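
As an illustration of how the column parameters above map onto a Regions file, the following Python sketch reads one tab-separated line. It is not the tool's own code: the function name read_region is hypothetical, and 1-based column numbering is an assumption. The sample line follows the narrowPeak layout, in which column 7 holds the signal value and column 10 the summit offset relative to the start.

```python
def read_region(line, chrom_col, start_col, second_col, stat_col):
    """Pick the configured columns out of one tab-separated line.

    Column numbers are assumed to be 1-based (an assumption, not
    something the tool description states).
    """
    fields = line.rstrip("\n").split("\t")
    chrom = fields[chrom_col - 1]
    start = int(fields[start_col - 1])
    second = int(fields[second_col - 1])   # summit offset or end position
    statistic = float(fields[stat_col - 1])
    return chrom, start, second, statistic


# Made-up narrowPeak-style line, for illustration only.
example_line = "chr1\t1000\t1400\tpeak1\t900\t.\t515\t-1\t-1\t250"
region = read_region(example_line, chrom_col=1, start_col=2,
                     second_col=10, stat_col=7)
print(region)
```

Here the second coordinate (250) would be interpreted as the peak summit relative to the start position.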

Details

The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length, as specified by the parameter "Width".
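
The centering described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's implementation: the function name extraction_window is hypothetical, the midpoint is used as the center when the second coordinate is an end position, and the window is returned as a half-open interval. How the tool itself rounds odd widths or treats regions near chromosome ends is not specified here.

```python
def extraction_window(start, second, second_is_summit, width):
    """Return (left, right) of a half-open window of `width` bases.

    second_is_summit=True:  `second` is the summit offset relative to `start`.
    second_is_summit=False: `second` is the end position; the midpoint of
                            start and end is used as the center (assumption).
    """
    if second_is_summit:
        center = start + second
    else:
        center = (start + second) // 2
    left = center - width // 2
    return left, left + width


# Summit 250 bases after start 1000, ChIP-seq width 1000.
print(extraction_window(1000, 250, True, 1000))
# Region from 1000 to 2000, ChIP-exo width 100.
print(extraction_window(1000, 2000, False, 100))
```

Every window has exactly the requested width, regardless of the size of the original region.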

For ChIP data, the center position could, for instance, be the peak summit. An annotated FastA file for ChIP-seq data comprising sequences of length 1000 centered around the peak summit might look like:

> peak: 500; signal: 515
ggccatgtgtatttttttaaatttccac...
> peak: 500; signal: 199
GGTCCCCTGGGAGGATGGGGACGTGCTG...
...
where the anchor point is given as 500 for the first two sequences, and the confidence amounts to 515 and 199, respectively.
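
To make the header layout concrete, the following Python sketch reads such an annotated FastA text back into records. It is an illustration only: the function name parse_annotated_fasta is hypothetical, and the parser assumes that every header consists of "key: value" pairs separated by semicolons with numeric values, as in the example above. The sequences in the sample text are the truncated prefixes shown above, not full-length sequences.

```python
def parse_annotated_fasta(text):
    """Parse headers like '> peak: 500; signal: 515' plus their sequences."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            annotations = {}
            for field in line[1:].split(";"):
                key, _, value = field.partition(":")
                annotations[key.strip()] = float(value)
            records.append({"annotations": annotations, "sequence": ""})
        elif records:
            # Sequence lines may be wrapped; concatenate them.
            records[-1]["sequence"] += line
    return records


sample = """> peak: 500; signal: 515
ggccatgtgtatttttttaaatttccac
> peak: 500; signal: 199
GGTCCCCTGGGAGGATGGGGACGTGCTG
"""
records = parse_annotated_fasta(sample)
for record in records:
    print(record["annotations"], len(record["sequence"]))
```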

Output

extracted.fa: The sequences extracted from the given genome using the supplied region specifications, as an annotated FastA file.

Reference

If you use Dimont, please cite

J. Grau, S. Posch, I. Grosse, and J. Keilwagen. A general approach for discriminative de-novo motif discovery from high-throughput data. Nucleic Acids Research, 41(21):e197, 2013.