Seurat - Filtering, regression and detection of variable genes

Description

As input, give the Seurat R-object (Robj) from the Seurat setup tool.

This tool filters out cells, and regresses out uninteresting sources of variation. It then detects highly variable genes across the single cells, which are used for performing the principal component analysis in the next tool.

PLEASE NOTE that you might need to run the tool a couple of times, as setting the minimum and maximum limits for average expression and dispersion (the bottom three parameters) is an iterative process. Start with some values, check the result, and run the tool again with different parameters if needed.

Parameters

The bottom three parameters (x min, x max, y min) are used to select the variable genes. You might need to run the tool again if the initial guess with the parameters doesn't give a good number of variable genes.

Details

As input, give the Seurat R-object (Robj) from the Seurat setup tool.

You can use the QC-plots.pdf to estimate the cut-offs for this tool: the upper limits for number of genes per cell and mitochondrial transcript percentage.

First, the tool filters cells based on the number of genes detected per cell, to get rid of possible multiplets. Cells can also be filtered based on the percentage of mitochondrial transcripts present. Based on the QC plots, the user sets the upper limit for unique gene counts per cell and the upper limit for mitochondrial transcript percentage.
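The filtering step described above can be illustrated with a minimal Python sketch. This is not the tool's own code (the tool itself runs Seurat in R); the function name `filter_cells`, the dictionary keys and the example cut-offs are hypothetical, and the use of strict "less than" comparisons is an assumption.

```python
def filter_cells(cells, max_genes, max_percent_mito):
    """Keep cells that fall below both upper limits.

    Assumption: strict comparisons; the tool's exact behaviour at the
    boundary value is not stated in this manual.
    """
    return [
        c for c in cells
        if c["n_genes"] < max_genes and c["percent_mito"] < max_percent_mito
    ]


# Hypothetical per-cell QC values, as one might read them from the QC plots.
cells = [
    {"id": "cell_a", "n_genes": 1200, "percent_mito": 2.1},
    {"id": "cell_b", "n_genes": 5200, "percent_mito": 1.5},   # possible multiplet
    {"id": "cell_c", "n_genes": 900, "percent_mito": 12.0},   # high mitochondrial content
    {"id": "cell_d", "n_genes": 2400, "percent_mito": 4.9},
]

kept = filter_cells(cells, max_genes=2500, max_percent_mito=5.0)
kept_ids = [c["id"] for c in kept]
print(kept_ids)
```

With these example cut-offs, cell_b is removed for having too many genes and cell_c for its mitochondrial percentage.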

Next, the "uninteresting" sources of variation are regressed out to improve downstream dimensionality reduction and clustering. Seurat implements a basic regression by constructing linear models to predict gene expression based on user-defined variables. This tool regresses on the number of detected molecules per cell as well as the percentage of mitochondrial transcripts.
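To show what "regressing out" means in practice, here is a minimal Python sketch for a single gene: an ordinary least squares model predicts the gene's expression from the two per-cell variables, and the residuals are what remains once their effect is removed. This is an illustration under assumptions, not Seurat's implementation; the helper names and the example numbers are hypothetical, and any further scaling of the residuals is left out.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def regress_out(expression, covariates):
    """Return the residuals of an OLS fit of expression on covariates.

    An intercept column is added to the design matrix; the coefficients
    come from the normal equations (X^T X) beta = X^T y.
    """
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, expression)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bk * xk for bk, xk in zip(beta, r)) for r in design]
    return [y - f for y, f in zip(expression, fitted)]


# Hypothetical values for five cells:
# (detected molecules in thousands, mitochondrial transcript percentage)
covariates = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
expression = [0.9, 1.0, 1.6, 1.5, 2.1]  # one gene, one value per cell

residuals = regress_out(expression, covariates)
print([round(r, 3) for r in residuals])
```

By construction the residuals sum to zero and are uncorrelated with both variables, which is why they no longer carry these "uninteresting" signals into the downstream analysis.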

Next, the variable genes across the single cells are detected. These highly variable genes are used in the downstream analysis (in the next tools). The detection is done by calculating the average expression and dispersion for each gene, placing these genes into bins, and then calculating a z-score for dispersion within each bin. The dispersion plot is drawn in Dispersion.pdf, together with the number of variable genes obtained with the user-defined cut-offs. Based on this plot, the user defines the cut-offs for expression (x-axis) and dispersion (y-axis) to mark visual outliers. Note that the plot can be a bit misleading, as the variable genes (selected based on the x min, x max and y min parameters) are drawn with their gene names, whereas the other, unselected genes are drawn only as dots.
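The binning and z-score calculation described above can be sketched in Python as follows. This is a simplified illustration, not Seurat's implementation: the equal-width binning, the function names and the toy values are assumptions, and the per-gene average expression and dispersion are taken as already computed.

```python
from statistics import mean, stdev


def dispersion_zscores(avg_expr, dispersion, n_bins=20):
    """Z-score each gene's dispersion within its average-expression bin.

    Assumption: equal-width bins over the range of average expression.
    Genes in bins with fewer than two genes, or with zero spread, get 0.
    """
    lo, hi = min(avg_expr), max(avg_expr)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all genes share one expression value: a single bin
    bin_of = [min(int((x - lo) / width), n_bins - 1) for x in avg_expr]

    z = [0.0] * len(avg_expr)
    for b in set(bin_of):
        idx = [i for i, k in enumerate(bin_of) if k == b]
        if len(idx) < 2:
            continue
        vals = [dispersion[i] for i in idx]
        mu, sd = mean(vals), stdev(vals)
        if sd > 0:
            for i in idx:
                z[i] = (dispersion[i] - mu) / sd
    return z


def select_variable_genes(genes, avg_expr, z, x_min, x_max, y_min):
    """Apply the x min, x max (expression) and y min (dispersion) cut-offs."""
    return [g for g, x, s in zip(genes, avg_expr, z) if x_min < x < x_max and s > y_min]


# Toy data: eight hypothetical genes falling into two expression bins.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
avg_expr = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
dispersion = [1.0, 1.0, 1.0, 3.0, 2.0, 2.0, 5.0, 2.0]

z = dispersion_zscores(avg_expr, dispersion, n_bins=2)
selected = select_variable_genes(genes, avg_expr, z, x_min=0.05, x_max=3.0, y_min=1.0)
print(selected)
```

Because the z-score is computed within each bin, a gene is compared only against genes with similar average expression. Lowering x max to 2.0 in this toy example would drop g7, which shows how the three cut-offs together determine the number of selected genes.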

Note that this is an iterative process: you first have to draw the image with one set of parameters and then run the tool again if the parameters need to be changed. These settings vary based on the data type, the heterogeneity of the sample, and the normalization strategy. For example, for UMI data normalized to a total of 10,000 molecules, one would expect ~2,000 variable genes.

For more details, please check the Seurat tutorials.

Output