Seurat v3 -SCTransform: Filter, normalize, regress and detect variable genes
Description
This tool uses the SCTransform method for normalisation, scaling and finding variable features.
You can also choose to filter out the differences caused by the cell cycle stage.
Before normalisation, the tool filters out potential empties, multiplets and broken cells based on the
parameters.
Note that this tool and the Seurat -Filter, normalize, regress and detect variable genes
tool do the same thing using different methods: you can choose between the two.
Parameters
- Filter out cells which have less than this many genes expressed [200]
- Filter out cells which have higher unique gene count [2500]
- Filter out cells which have higher mitochondrial transcript percentage [5]
- Regress out cell cycle differences [no]
- Number of variable features to return [3000]
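The three filtering parameters can be read as a simple per-cell test. The following is a minimal sketch in plain Python, not the tool's own code: the cell records, field names and the `keep_cell` helper are hypothetical, and keeping cells that sit exactly on a threshold is an assumption.

```python
# Hypothetical per-cell QC summary; barcodes and values are illustrative only.
cells = [
    {"barcode": "cell_a", "n_genes": 150, "percent_mt": 2.0},   # too few genes: likely empty
    {"barcode": "cell_b", "n_genes": 1200, "percent_mt": 3.1},  # passes all three checks
    {"barcode": "cell_c", "n_genes": 4100, "percent_mt": 1.5},  # too many genes: possible multiplet
    {"barcode": "cell_d", "n_genes": 900, "percent_mt": 12.4},  # high mitochondrial share: possibly broken
]


def keep_cell(cell, min_genes=200, max_genes=2500, max_percent_mt=5.0):
    """Return True if the cell passes all three thresholds.

    Assumption: cells exactly at a threshold are kept; the tool may
    treat boundary values differently.
    """
    return (
        cell["n_genes"] >= min_genes
        and cell["n_genes"] <= max_genes
        and cell["percent_mt"] <= max_percent_mt
    )


# Barcodes of the cells that survive filtering with the default values.
kept = [c["barcode"] for c in cells if keep_cell(c)]
print(kept)
```

The defaults in the sketch mirror the default parameter values listed above; suitable values for real data depend on the sample.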
Details
The tool performs the following steps: (1) filtering of problematic cells,
followed by the three steps included in SCTransform: (2) normalisation, (3) scaling and (4) finding variable
features.
As input, give the Seurat R-object (Robj) from the Seurat setup
tool. The R-object output can be used as input for the Seurat -PCA tool.
Compared to basic Seurat normalisation + scaling + variable gene finding,
the SCTransform workflow performs more effective normalisation,
strongly removing technical effects from the data.
This allows more PCs to be selected in the PCA step, as the higher PCs are more likely to represent
subtle, but biologically relevant, sources of heterogeneity instead of variation in sequencing depth.
- Filtering is performed in order to remove empties, multiplets and broken cells.
You can use the QC-plots.pdf to estimate the parameters for this step.
- Expression values are normalised using the SCTransform normalisation method, which uses
Pearson residuals from “regularized negative binomial regression,”
where cellular sequencing depth is utilized as a covariate in a generalized linear model (GLM).
The parameters for the model are estimated by pooling information across genes that are expressed at
similar levels.
This should remove the technical characteristics but preserve the biological heterogeneity,
and avoid overfitting the model to the data.
- Uninteresting sources of variation in the expression values are regressed out in order to improve
dimensionality reduction and clustering later on.
This tool regresses on the number of detected molecules per cell as well as the percentage
of mitochondrial transcripts.
You can also choose to regress out cell cycle differences.
If you choose all differences, the tool removes all signal associated with cell cycle.
In some cases this method can negatively impact downstream analysis,
particularly in differentiating processes, where stem cells are quiescent and differentiated cells are
proliferating (or vice versa).
Alternatively you can regress out the difference between the G2M and S phase scores.
This means that signals separating non-cycling cells and cycling cells will be maintained, but differences
in cell cycle phase amongst proliferating cells (which are often uninteresting), will be regressed out of
the data.
For more information about cell cycle regression, see the Seurat cell cycle vignette.
In the current Seurat version, a list of cell cycle markers (from Tirosh et al., 2015) is loaded with
Seurat:
s.genes
"MCM5" "PCNA" "TYMS" "FEN1" "MCM2"
"MCM4" "RRM1" "UNG" "GINS2" "MCM6"
"CDCA7" "DTL" "PRIM1" "UHRF1" "MLF1IP"
"HELLS" "RFC2" "RPA2" "NASP" "RAD51AP1"
"GMNN" "WDR76" "SLBP" "CCNE2" "UBR7"
"POLD3" "MSH2" "ATAD2" "RAD51" "RRM2"
"CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1"
"BLM" "CASP8AP2" "USP1" "CLSPN" "POLA1"
"CHAF1B" "BRIP1" "E2F8"
g2m.genes
"HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5"
"TPX2" "TOP2A" "NDC80" "CKS2" "NUF2"
"CKS1B" "MKI67" "TMPO" "CENPF" "TACC3"
"FAM64A" "SMC4" "CCNB2" "CKAP2L" "CKAP2"
"AURKB" "BUB1" "KIF11" "ANP32E" "TUBB4B"
"GTSE1" "KIF20B" "HJURP" "CDCA3" "HN1"
"CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1"
"NCAPD2" "DLGAP5" "CDCA2" "CDCA8" "ECT2"
"KIF23" "HMMR" "AURKA" "PSRC1" "ANLN"
"LBR" "CKAP5" "CENPE" "CTCF" "NEK2"
"G2E3" "GAS2L3" "CBX5" "CENPA"
- Genes which are highly variable across the cells are detected using the Pearson residuals
computed in the normalisation step.
The highly variable genes will be used in downstream analysis, like PCA.
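To make the link between Pearson residuals and variable gene detection concrete, here is a small sketch in plain Python. It is a simplification and not the SCTransform implementation: the expected count of a gene in a cell is assumed to be proportional to the cell's sequencing depth, and a single assumed dispersion value (`THETA`) is used for all genes, whereas SCTransform fits and regularises the model parameters per gene. The count matrix and gene names are made up.

```python
import math
from statistics import pvariance

# Toy count matrix: one list of per-cell counts per gene (illustrative values).
counts = {
    "gene_a": [2, 4, 8, 16],
    "gene_marker": [0, 20, 0, 10],
    "gene_b": [1, 1, 3, 5],
}
THETA = 100.0  # assumed fixed negative binomial dispersion parameter

n_cells = 4
# Sequencing depth per cell = total counts in that cell.
depth = [sum(counts[g][c] for g in counts) for c in range(n_cells)]
total = sum(depth)


def pearson_residuals(gene_counts):
    """Pearson residuals (x - mu) / sqrt(mu + mu^2 / theta) for one gene.

    Assumption: mu is the gene's overall share of counts times the cell depth.
    """
    frac = sum(gene_counts) / total
    residuals = []
    for x, d in zip(gene_counts, depth):
        mu = frac * d                   # expected count given depth
        variance = mu + mu ** 2 / THETA  # negative binomial variance
        residuals.append((x - mu) / math.sqrt(variance))
    return residuals


residuals = {g: pearson_residuals(x) for g, x in counts.items()}
# Genes whose residuals vary most across cells are the most "variable".
residual_variance = {g: pvariance(r) for g, r in residuals.items()}


def top_variable(n):
    """Return the n genes with the highest residual variance."""
    return sorted(residual_variance, key=residual_variance.get, reverse=True)[:n]


print(top_variable(2))
```

A gene whose counts simply follow sequencing depth gets residuals close to zero in this sketch, so it ranks low. A gene that is high in some cells and absent in others, regardless of depth, ranks high. The "Number of variable features to return" parameter corresponds to `n` in `top_variable`.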
For more details, please check the Seurat tutorials.
The SCTransform function used in Seurat is described in Hafemeister, C., Satija, R.
Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial
regression. Genome Biol 20, 296 (2019). DOI: https://doi.org/10.1186/s13059-019-1874-1
The use of the SCTransform function is demonstrated in the Seurat SCTransform vignette.
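The alternative cell cycle option described above, regressing out the difference between the S and G2M phase scores, can be illustrated with a short plain-Python sketch. The scores and expression values are made up, the `regress_out` helper is hypothetical, and the per-gene ordinary least-squares fit is only an illustration of the idea, not the procedure Seurat runs internally.

```python
from statistics import mean

# Hypothetical per-cell phase scores (illustrative values, not real data).
s_score = [0.9, 0.1, -0.2, 0.0]
g2m_score = [0.1, 0.8, -0.1, -0.3]

# Difference between the S and G2M scores: cells that are not cycling score
# low on both, so their difference stays near zero and they are barely
# affected by the regression.
cc_difference = [s - g for s, g in zip(s_score, g2m_score)]


def regress_out(values, covariate):
    """Return the residuals of a least-squares fit of values on covariate."""
    x_bar, y_bar = mean(covariate), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


expression = [2.5, 0.4, 1.1, 1.6]  # hypothetical normalised values for one gene
corrected = regress_out(expression, cc_difference)
print(corrected)
```

After the fit, the corrected values no longer have a linear association with the phase difference. Choosing all differences instead corresponds to regressing on both scores separately.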
Output
- seurat_obj.Robj: The Seurat R-object to pass to the next Seurat tool, or to import to R. Not viewable in
Chipster.
- Dispersion.pdf: The variation vs average expression plots (in the second plot, the 10 most highly variable
genes are labeled).
If you chose to regress out cell cycle differences, PCA plots on cell cycle genes before and after the
regression
are added at the end of this pdf. The pdf also lists the number of highly variable genes.