Seurat - Filtering, regression and detection of variable genes


This tool filters out cells and regresses out uninteresting sources of variation. It then detects highly variable genes across the single cells, which are used for performing the principal component analysis in the next tool.

PLEASE NOTE that you might need to run the tool a couple of times, because setting the limits for average expression and dispersion (the bottom three parameters: x min, x max and y min) is an iterative process. These parameters are used to select the variable genes. Start with some values, check how many variable genes you get, and run the tool again with different values if needed.


The tool performs the following three steps. As input, give the Seurat R-object (Robj) from the Seurat setup tool.

  1. Filtering is performed in order to remove multiplets and broken cells. You can use the QC-plots.pdf to estimate the upper limits for the number of genes per cell and for the mitochondrial transcript percentage.
  2. Uninteresting sources of variation are regressed out in order to improve dimensionality reduction and clustering later on. Seurat implements a basic regression by constructing linear models to predict gene expression based on user-defined variables. This tool regresses on the number of detected molecules per cell as well as the percentage of mitochondrial transcripts.
    You can also choose to regress out cell cycle differences (by default, they are not regressed out). If you choose all differences, the tool removes all signal associated with the cell cycle. In some cases this can negatively affect the downstream analysis, particularly in differentiating processes, where stem cells are quiescent and differentiated cells are proliferating (or vice versa). As an alternative, you can regress out only the difference between the G2M and S phase scores.
  3. Genes which are highly variable across the single cells are selected for downstream analysis in the next tools. The detection is done by calculating the average expression and dispersion for each gene, placing the genes into bins, and then calculating a z-score for dispersion within each bin. The result file Dispersion.pdf indicates the number of variable genes obtained with the user-defined cut-offs. Based on this plot, you can define the cut-offs for expression (x-axis) and dispersion (y-axis) to mark visual outliers. Two plots are drawn: a scaled and a non-scaled one.
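
Steps 1 and 2 above can be illustrated with a small sketch. It is written in plain Python for illustration only; the tool itself runs Seurat in R, and this is not Seurat's implementation. The function names, the field names (n_genes, n_umi, percent_mito, expr) and the toy values are all hypothetical. The sketch keeps cells below two upper limits and then fits an ordinary least-squares model for one gene on the number of molecules and the mitochondrial percentage, keeping the residuals as the "regressed out" expression. It assumes the covariates are not collinear.

```python
def filter_cells(cells, max_genes, max_percent_mito):
    """Step 1: keep cells at or below both upper limits."""
    return [c for c in cells
            if c["n_genes"] <= max_genes and c["percent_mito"] <= max_percent_mito]


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def regress_out(expression, covariates):
    """Step 2: residuals of one gene's expression after a linear fit.

    expression: one value per cell.
    covariates: one list per cell, e.g. [n_umi, percent_mito].
    """
    # Design matrix with an intercept column.
    design = [[1.0] + list(c) for c in covariates]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, expression))
           for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    return [y - f for y, f in zip(expression, fitted)]


# Toy data (made up): the last two cells exceed the upper limits.
cells = [
    {"n_genes": 1200, "n_umi": 3000, "percent_mito": 2.0, "expr": 1.1},
    {"n_genes": 1500, "n_umi": 4200, "percent_mito": 3.5, "expr": 1.6},
    {"n_genes": 900, "n_umi": 2100, "percent_mito": 1.0, "expr": 0.7},
    {"n_genes": 2000, "n_umi": 6000, "percent_mito": 4.0, "expr": 2.2},
    {"n_genes": 1100, "n_umi": 2800, "percent_mito": 6.0, "expr": 1.3},
    {"n_genes": 1700, "n_umi": 5000, "percent_mito": 2.5, "expr": 1.7},
    {"n_genes": 5200, "n_umi": 21000, "percent_mito": 3.0, "expr": 6.0},
    {"n_genes": 400, "n_umi": 800, "percent_mito": 35.0, "expr": 0.2},
]
kept = filter_cells(cells, max_genes=2500, max_percent_mito=10.0)
residuals = regress_out(
    [c["expr"] for c in kept],
    [[c["n_umi"], c["percent_mito"]] for c in kept],
)
```

The residuals no longer carry any linear trend with the two covariates, which is the sense in which these sources of variation are "regressed out". In a real analysis the same fit is repeated for every gene.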

Note that step 3 is an iterative process: you first have to draw the image with one set of parameters and then run the tool again if the parameters need to be changed. Suitable settings vary based on the data type, the heterogeneity of the sample, and the normalization strategy. For example, for UMI data normalized to a total of 10,000 molecules, one would expect ~2,000 variable genes.
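
The selection logic of step 3 can be sketched as follows. This is a simplified plain-Python illustration, not Seurat's own code: it assumes an arithmetic mean for the average expression, log(variance / mean) for the dispersion, and equal-width bins, all of which may differ from what Seurat does internally. The function name, the n_bins argument and the toy values are hypothetical; x_min, x_max and y_min play the role of the tool's bottom three parameters.

```python
import math
import statistics


def select_variable_genes(expr, x_min, x_max, y_min, n_bins=20):
    """Select genes by mean expression and binned dispersion z-score.

    expr: dict mapping gene name -> list of expression values, one per cell.
    Returns (selected gene names, mean per gene, z-score per gene).
    """
    means = {g: statistics.fmean(v) for g, v in expr.items()}
    # Dispersion as log(variance / mean); genes with zero mean or
    # zero variance are skipped because the log is undefined for them.
    disp = {}
    for g, v in expr.items():
        var = statistics.pvariance(v)
        if means[g] > 0 and var > 0:
            disp[g] = math.log(var / means[g])

    # Place the genes into equal-width bins along the mean-expression axis.
    lo = min(means[g] for g in disp)
    hi = max(means[g] for g in disp)
    width = (hi - lo) / n_bins or 1.0
    bins = {}
    for g in disp:
        idx = min(int((means[g] - lo) / width), n_bins - 1)
        bins.setdefault(idx, []).append(g)

    # z-score of the dispersion within each bin.
    z = {}
    for genes in bins.values():
        values = [disp[g] for g in genes]
        mu = statistics.fmean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        for g in genes:
            z[g] = (disp[g] - mu) / sd if sd > 0 else 0.0

    # Apply the three cut-offs: x min, x max (expression) and y min (z-score).
    selected = sorted(g for g in z
                      if x_min <= means[g] <= x_max and z[g] >= y_min)
    return selected, means, z


# Toy data (made up): four genes with the same mean but different spread.
toy = {
    "g1": [0, 2, 0, 2],
    "g2": [0.9, 1.1, 0.9, 1.1],
    "g3": [0.8, 1.2, 0.8, 1.2],
    "g4": [0.9, 1.1, 1.0, 1.0],
}
selected, means, z = select_variable_genes(
    toy, x_min=0.5, x_max=2.0, y_min=1.0, n_bins=1)
```

Raising y_min or narrowing the x_min to x_max window shrinks the selection, and loosening them enlarges it. This is why the cut-offs are typically adjusted over a few runs until the number of variable genes looks reasonable.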

For more details, please check the Seurat tutorials.