Seurat -Filter, normalize, regress and detect variable genes


This tool filters out cells, normalizes gene expression values, and regresses out uninteresting sources of variation. It then detects highly variable genes across the cells, which are used for performing principal component analysis in the next step. You can also choose to filter out the differences caused by the cell cycle stage. Before normalization, the tool filters out potential empties, multiplets and broken cells based on the parameters.
Note that this tool and the Seurat -SCTransform: Filter, normalize, regress and detect variable genes tool do the same thing using different methods, so you can choose between the two.



The tool performs the following four steps. As input, give the Seurat R-object (Robj) from the Seurat -Setup and QC tool.

  1. Filtering is performed in order to remove empties, multiplets and broken cells. You can use the QC-plots.pdf from the Seurat -Setup and QC tool to estimate the upper limits for the number of genes per cell and for the mitochondrial transcript percentage.

  2. Expression values are normalized across the cells using global scaling normalization: each gene's expression value in a cell is divided by the total number of transcripts in that cell, and the ratio is multiplied by a scale factor (10,000 by default) and log-transformed.

  3. Uninteresting sources of variation in the expression values are regressed out in order to improve dimensionality reduction and clustering later on. Seurat implements a basic regression by constructing linear models to predict gene expression based on user-defined variables. This tool regresses on the number of detected molecules per cell as well as the percentage of mitochondrial transcripts.

    You can also choose to regress out cell cycle differences. If you choose all differences, the tool removes all signal associated with the cell cycle. In some cases this method can negatively impact downstream analysis, particularly in differentiating processes, where stem cells are quiescent and differentiated cells are proliferating (or vice versa). Alternatively, you can regress out the difference between the G2M and S phase scores. This means that signals separating non-cycling cells and cycling cells will be maintained, but differences in cell cycle phase amongst proliferating cells (which are often uninteresting) will be regressed out of the data.
    For more information about cell cycle filtering, check out the vignette here.

    In the current Seurat version, a list of cell cycle markers (from Tirosh et al., 2015) is loaded with Seurat. The first list below contains the S phase markers and the second the G2/M phase markers:
    "MCM5" "PCNA" "TYMS" "FEN1" "MCM2" "MCM4" "RRM1" "UNG" "GINS2" "MCM6" "CDCA7" "DTL" "PRIM1" "UHRF1" "MLF1IP" "HELLS" "RFC2" "RPA2" "NASP" "RAD51AP1" "GMNN" "WDR76" "SLBP" "CCNE2" "UBR7" "POLD3" "MSH2" "ATAD2" "RAD51" "RRM2" "CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1" "BLM" "CASP8AP2" "USP1" "CLSPN" "POLA1" "CHAF1B" "BRIP1" "E2F8"
    "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2" "TOP2A" "NDC80" "CKS2" "NUF2" "CKS1B" "MKI67" "TMPO" "CENPF" "TACC3" "FAM64A" "SMC4" "CCNB2" "CKAP2L" "CKAP2" "AURKB" "BUB1" "KIF11" "ANP32E" "TUBB4B" "GTSE1" "KIF20B" "HJURP" "CDCA3" "HN1" "CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1" "NCAPD2" "DLGAP5" "CDCA2" "CDCA8" "ECT2" "KIF23" "HMMR" "AURKA" "PSRC1" "ANLN" "LBR" "CKAP5" "CENPE" "CTCF" "NEK2" "G2E3" "GAS2L3" "CBX5" "CENPA"

  4. Genes that are highly variable across the cells are detected by calculating the average expression and dispersion for each gene, placing these genes into bins, and then calculating a z-score for dispersion within each bin. These settings vary based on the data type, heterogeneity in the sample, and normalization strategy. For example, for UMI data normalized to a total of 10,000 molecules, one would expect ~2,000 variable genes. These genes will be used in downstream analysis, such as PCA. The procedure used in Seurat3 is described in detail here.
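
Steps 1 and 2 above can be sketched in a few lines of Python. This is only a minimal illustration, not the code the tool runs: the function names, the toy counts, the thresholds and the "MT-" prefix used to recognise mitochondrial genes are all assumptions made for the example, and the log transform is taken here to be the natural logarithm of one plus the scaled value.

```python
import math


def filter_cells(cells, min_genes, max_genes, max_percent_mito):
    """Keep cells whose detected-gene count and mitochondrial share are within limits."""
    kept = {}
    for name, counts in cells.items():
        total = sum(counts.values())
        # Number of genes with at least one transcript in this cell.
        n_genes = sum(1 for c in counts.values() if c > 0)
        # Assumption: mitochondrial genes are recognised by an "MT-" name prefix.
        mito = sum(c for gene, c in counts.items() if gene.startswith("MT-"))
        percent_mito = 100.0 * mito / total if total > 0 else 0.0
        if min_genes <= n_genes <= max_genes and percent_mito <= max_percent_mito:
            kept[name] = counts
    return kept


def log_normalize(counts, scale_factor=10_000):
    """Divide by the cell total, multiply by the scale factor, then log-transform."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cell has no transcripts")
    return {gene: math.log1p(c / total * scale_factor) for gene, c in counts.items()}


# Hypothetical toy data: raw counts per gene for four cells.
cells = {
    "cell_ok": {"ACTB": 6, "GAPDH": 3, "MT-CO1": 1},
    "cell_empty": {"ACTB": 1, "GAPDH": 0, "MT-CO1": 0},         # too few genes
    "cell_broken": {"ACTB": 1, "GAPDH": 1, "MT-CO1": 8},        # high mito share
    "cell_multiplet": {"ACTB": 5, "GAPDH": 5, "CD3E": 5, "MS4A1": 5},  # too many genes
}

# Toy thresholds chosen only so that each filter removes one cell.
kept = filter_cells(cells, min_genes=2, max_genes=3, max_percent_mito=20.0)
normalized = {name: log_normalize(counts) for name, counts in kept.items()}
print(sorted(kept))
```

With real data the limits would be read from the QC plots rather than set by hand as in this toy example.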

For more details, please check the Seurat tutorials.
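
The idea behind step 3 can be illustrated with a simplified sketch. It fits a straight line to one gene's expression values against a single covariate and keeps the residuals. Here the covariate is the difference between the S and G2M phase scores. The function name, the scores and the expression values are hypothetical, and the real tool regresses on several variables at once, so this shows the principle rather than the actual implementation.

```python
def regress_out(values, covariate):
    """Return residuals of a least-squares fit: values ~ intercept + slope * covariate."""
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    # If the covariate is constant there is nothing to regress on.
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# Hypothetical per-cell phase scores: three cycling cells, two non-cycling cells.
s_scores = [0.9, 0.1, 0.8, 0.0, 0.05]
g2m_scores = [0.1, 0.9, 0.2, 0.0, 0.05]
cc_difference = [s - g for s, g in zip(s_scores, g2m_scores)]

# Hypothetical normalized expression of one gene in the same five cells.
expression = [2.0, 1.0, 1.9, 0.2, 0.3]

residuals = regress_out(expression, cc_difference)
print(residuals)
```

Because non-cycling cells have a score difference close to zero, the fit mainly removes the variation that follows the phase of the cycling cells, while the overall gap between cycling and non-cycling cells stays in the residuals.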