Skip to contents

Overview

Description

Cellular phenotyping broadly refers to all procedures that aim to identify the identity and/or state of cells, as well as the relationship between cells. In this section, we discus challenges and the analytical focus associated with cellular phenotyping. The overall learning goals of the workshop is to understand trajectory analysis, a common approach to phenotype continuous cell states or timecourse data. We will also discuss the general analytical strategies associated with phenotyping single cells and future perspectives.


Preparation and pre-reading


Learning objectives

  • identify different workflows to build trajectory including calculate pseudo-times
  • generate appropriate graphics to visualise the trajectories and expression of genes
  • understand the different trajectory characterisation approaches including differentially expressed genes across trajectories


Schedule

Activity Time
Introduction 5m
How to infer the a trajectory 20m
Visualisation of the trajectory 20m
Evaluating the trajectory 30m
Characterisation of the trajectory 60m
Concluding remarks


Introduction

Many biological processes manifest as a continuum of dynamic changes and diverse cellular states. One of the major goals of single cell data analysis is to identify and characterize these cell types and states that are induced from external stimulation (such as environmental insults or perturbations), the spatial context, the point along the differentiation system in which a cell lies and more. Collectively, these biological, genetic and environmental factors contribute towards the diversity of the phenotypes we see among cells.

We define the process of phenotyping as any procedure that attempts to further characterize the identity and/or state of cells and the relationship between cells and is distinct from procedures relating to the pre-processing or clustering of the data. One of the most common and basic phenotyping procedure is cell type identification, which is a highly researched topic in single cell data analysis. We can further explore various characteristics of the cells by

  1. investigating where a cell lies along a trajectory, if the data demonstrates a pseudo-temporally ordered state;
  2. determining if there are multiple branches in the trajectory (which may associate with different cell type lineages);
  3. if the previous is true, distinguishing if certain gene expression programs are associated with a certain trajectory through differential expression analysis or gene set enrichment studies; and lastly;
  4. investigating if a certain signalling patterns can be observed between cell types. Collectively, these efforts contribute towards phenotyping the data and obtaining a more descriptive characterization of the data at hand.

Thus, the question becomes how do we identify the diverse phenotypes? Whilst single cell data enables us to interrogate the heterogeneity among the phenotypes and tease out different and distinct states, a cell’s phenotype is the manifestation of a collective and integrated output of a set of molecules or networks that span across multiple modalities (transcriptome, genome, epigenome, and proteome etc). In this workshop, we begin to investigate this using single-cell transcriptomes, acknowledging at the same time that a single modality alone may not be sufficient to accurately pinpoint all the phenotypes present in the data.

In this workshop, we will demonstrate how to do trajectory analysis using a mouse scRNA-seq dataset of the hippocampus Hochgerner et al. (2018)

Loading libraries and the data

load(file="../data/hippocampus_minimal.rda")
sce <- hippocampus_minimal
rm(hippocampus_minimal)

sce <- scuttle::logNormCounts(sce)


Initial visualisation of the data

We will visualise the data using PCA, t-SNE and UMAP calculated from the previous section of the workshop.

p1 <- scater::plotPCA(sce, colour_by = "cell_cluster", point_size=0.5) + ggtitle("PCA")
p2 <- scater::plotTSNE(sce, colour_by = "cell_cluster", point_size=0.5) + ggtitle("TSNE")
p3 <- scater::plotUMAP(sce, colour_by = "cell_cluster", point_size=0.5) + ggtitle("UMAP")

patchwork::wrap_plots(list(p1 + p2 + p3), guides="collect")

The visualisations suggests the presence of a clear trajectory. Starting from the progenitor cells (neural intermediate progenitor cells (“nIPC”), there is bifurcation of the trajectory into the two major lineages: the immature dentate gyrus granule neurons (“immature-GC”) and immature pyramidal neurons (“Immature-Pyr”).


Trajectory inference: How can I infer the trajectory for my data?

This section contains two parts. The first part focuses on the four key steps of building a trajectory and the second part focuses on quality assessment of the built trajectory.

Building a trajectory

Step 1: Feature selection

One of the first step in building a trajectory is to perform feature selection. Given the high dimensional nature of high throughput single cell data, oftentimes it is advantageous to select for genes that provide biologically relevant information and remove those that add noise to the data. There are many ways to perform feature selection in single cell data, reviewed here.

In this workshop, we demonstrate feature selection using differential stability from Cepo to choose the top 150 cell identity genes for each cell type.

cepo = Cepo::Cepo(logcounts(sce), sce$cell_cluster, exprsPct = 0.05)
cepoGenes = unique(unlist(Cepo::topGenes(cepo, n = 150)))

Step 2: Dimensionality reduction

The code below demonstrate using Cepo selected genes to perform PCA.

set.seed(2024)
sce = scater::runPCA(sce, scale = TRUE, subset_row = cepoGenes, name="PCA_Cepo")
scater::plotReducedDim(sce, dimred = "PCA_Cepo", colour_by = "cell_cluster", point_size=0.5) +
  ggplot2::ggtitle("Cepo_PCA")

Similarly, there are different dimensionality reduction approaches and we again illustrate the use of t-SNE and UMAP to achieve the same purposes.

set.seed(2024)
sce = scater::runTSNE(sce, dimred = "PCA", name="TSNE_Cepo", BPPARAM= BiocParallel::MulticoreParam(workers = 4))

sce = scater::runUMAP(sce, dimred = "PCA", name="UMAP_Cepo", min_dist=0.5,  BPPARAM=BiocParallel::MulticoreParam(workers = 4))

p2 = scater::plotReducedDim(sce, dimred = "TSNE_Cepo", colour_by = "cell_cluster", point_size=0.5) + ggplot2::ggtitle("Cepo_TSNE")
p3 = scater::plotReducedDim(sce, dimred = "UMAP_Cepo", colour_by = "cell_cluster", point_size=0.5) + ggplot2::ggtitle("Cepo_UMAP")

patchwork::wrap_plots(list(p2, p3), guides="collect")

While the PCA looks similar to our previous visualisation using HVGs, we can see from the t-SNE and UMAP that there is greater refinement of the two potential subpopulations within the immature pyramidal cells.


Step 3: Creating a lineage structure and pseudo-time inference

Building the trajectory is to infer and construct the trajectory and steps 3 and 4 are often combined in one function in many methods. Development of trajectory inference methods is actively evolving, and the effectiveness of current trajectory inference tools for transcriptomics data has been extensively reviewed in this benchmarking paper.

In this workshop, we will run slingshot on UMAP reduced dimension space as an example, for which more details of the implementation can be found here. The basic input of slingshot is:

  • the expression matrix (as an SCE object),
  • cell type labels, and
  • the coordinates for the reduced dimensions.

Details: Slingshot implements a two-stage procedure whereby in the first step it infers the global lineage structure using the multi-spanning tree (MST) and then infers the pseudo-time for cells along each lineage by fitting simultaneous “principal curves” across multiple lineages. When known, we can also assign the starting node/cell type (i.e., “nIPC”).

slingshot_umap <- slingshot(sce,
                            clusterLabels = sce$cell_cluster,
                            reducedDim = "UMAP_Cepo",
                            start.clus = "nIPC")

Step 4: Visualisation of the trajectory

Visualisation with base R

Let us visualise the slingshot results using the base R

rd <- reducedDim(slingshot_umap, "UMAP_Cepo")
plotcol <- tableau_color_pal("Classic 10")(5)[as.factor(slingshot_umap$cell_cluster)]

# set the aspect ratio for base R plot as 1
par(pty = "s")
plot(rd, col = plotcol,
     pch = 16, asp = 1, xlab = "UMAP1", ylab = "UMAP2")
lines(SlingshotDataSet(slingshot_umap), lwd=2, col='black')

Visualisation the same results using ggplot2.
df_toPlot_umap <- data.frame(UMAP1 = rd[, 1],
                        UMAP2 = rd[, 2],
                        ClusterName = slingshot_umap$cell_cluster)
pseudotime_umap <- slingshot::slingPseudotime(slingshot_umap)
df_toPlot_umap <- cbind(df_toPlot_umap, pseudotime_umap)

curves <- slingCurves(slingshot_umap, as.df = TRUE)

ggplot() +
  geom_point(data = df_toPlot_umap, aes(x = UMAP1, y = UMAP2, color = ClusterName), size=0.5) +
  geom_path(data = curves,
            aes(x = UMAP1, y = UMAP2, group = Lineage)) +
  scale_color_tableau(palette = "Classic 10") +
  theme(aspect.ratio = 1)

We can see a total of two lineages that has been identified by slingshot, where we have a trajectory from nIPCs to the Neuroblast and then it bifurcates at Neuroblast to give rise to two terminal cell types, the Immature Granule Cells and Immature Pyramidal Cells


Visualisation of the inferred pseudotime lineages

A more interesting visualisation would be to see how the pseudotimes have been inferred for the individual branches. By using the slingshot::slingPseudotime function, we can access the inferred pseudotimes for the individual cells for each of the lineages identified.

## visualising the pseuotime
df_toPlot_umap2 <- reshape2::melt(df_toPlot_umap, id = c("UMAP1", "UMAP2", "ClusterName"))
colnames(df_toPlot_umap2) <- c("UMAP1", "UMAP2", "ClusterName", "Lineages", "Pseudotime")

ggplot() +
  geom_point(data = df_toPlot_umap2, aes(x = UMAP1, y = UMAP2, color = Pseudotime)) +
  geom_path(data = curves,
            aes(x = UMAP1, y = UMAP2, group = Lineage)) +
  scale_color_viridis_c() +
  theme(aspect.ratio = 1) +
  facet_wrap(~Lineages)


Visualisation of the density of cells along the pseudotime

Using the frequency of cells identified along the above pseudotime, we can also visualise the density of cells for each cell type along the two lineages using the geom_density_ridges2 function.

ggplot() +
  geom_density_ridges2(data = df_toPlot_umap2,
                       aes(x = Pseudotime, y = ClusterName, fill = ClusterName),
                       alpha = 0.5) +
  scale_fill_tableau(palette = "Classic 10") +
  theme(aspect.ratio = 1, legend.position = "bottom") +
  facet_wrap(~Lineages)

For users wanting to explore alternative methods to build the trajectory, please refer to the dynverse framework, which provides a uniform interface to run more than 50 trajectory inference methods.

Discussion:

  • Do these orderings make sense given the cluster labels? Why?
  • Would specifying starting cluster (start.clus) or not impact our trajectory results?
  • As an additional exercise, would performing dimensionality reduction with a different set/number of selected features impact the trajectory results? What if you try a different dimensionality reduction technique?


Is my trajectory appropriate or sensible?

In this section, we will look at the critical thinking questions that a researcher will ask about the trajectory.

How well do the pseudotime orderings correlate from different trajectory methods?

As an example to illustrate the concordance between the results from running two different trajectory inference methods, here we will use monocle3 to construct the trajectory on the same UMAP space and compare it against the results from slingshot.

The black lines show the structure of the graph. The circles with numbers in them denote special points within the graph. Each leaf, denoted by light gray circles, corresponds to a different outcome (i.e. cell fate) of the trajectory. Black circles indicate branch nodes, in which cells can travel to one of several outcomes. You can control whether or not these are shown in the plot with the label_leaves and label_branch_points arguments to plot_cells. Note that numbers within the circles are provided for reference purposes only.

library(monocle3)
# Construct cell data set object for monocle3
rowData(sce) <- DataFrame(rowData(sce), gene_short_name = rownames(sce))
cds <- as(sce, "cell_data_set")
colData(cds)[, 1] <- NULL
cds <- cluster_cells(cds, cluster_method = "leiden")
# plot_cells(cds, color_cells_by = "partition")

cds <- learn_graph(cds, use_partition = FALSE, close_loop=FALSE, verbose=TRUE) # set use_partition to FALSE so only a single graph is learnt

plot_cells(cds, reduction_method = "UMAP",
           color_cells_by = "cell_cluster", label_groups_by_cluster=FALSE,
           group_label_size = 4, graph_label_size = 4) +
  scale_color_tableau(palette = "Classic 10") +
  theme(aspect.ratio = 1)

Interpretation:

Similar to the approach using slingshot, visualisation of the monocle3 approach shows two distinct lineages, replicating the bifurcation point at the Neuroblast.


Are the trajectory methods robust to the choice of “root cells”?

We can now calculate the pseudotimes. Note that it is important to set the cell type of origin (i.e., the cell type denoting the earliest temporal point of the differentiation when calculating pseudotimes. For monocle3, we will need to provide the root_cells, which requires us to specify a set of cells from the root cell type. By creating a customised function get_earliest_principal_node to do this, we can arbitrarily assign a set of cell from the cell type of origin as our root_cells.

It’s often desirable to specify the root of the trajectory programmatically, rather than manually picking it. The function below does so by first grouping the cells according to which trajectory graph node they are nearest to. Then, it calculates what fraction of the cells at each node come from the earliest time point. Then it picks the node that is most heavily occupied by early cells and returns that as the root.

get_earliest_principal_node <- function(cds, coldata_name, start.clust){
  cell_ids <- which(colData(cds)[, coldata_name] == start.clust)
  closest_vertex <- cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
  root_pr_nodes <- igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names(which.max(table(closest_vertex[cell_ids,]))))]
  root_pr_nodes
}

# Set the root cells
root_cells <- get_earliest_principal_node(cds, coldata_name = "cell_cluster",
                                          start.clust = "nIPC")
# Order the cells to calculate pseudotime
cds <- order_cells(cds, root_pr_nodes = root_cells)

plot_cells(cds,
           color_cells_by = "pseudotime",
           label_cell_groups = FALSE,
           # label_leaves = FALSE,
           # label_branch_points = FALSE,
           graph_label_size = 4, group_label_size = 4) +
  theme(aspect.ratio = 1)


monocle3_pseudotime <- cds@principal_graph_aux[["UMAP"]]$pseudotime

We can evaluate the concordance in pseudotimes inferred across the two trajectory inference methods, slingshot and monocle3 by calculating a simple pearson’s correlation. Are you satisfied with the agreement?

par(pty = "s")
plot(monocle3_pseudotime, pseudotime_umap[, 1], xlab = "Monocle 3", ylab = "Slingshot (Lineage 1)")

cor(monocle3_pseudotime, pseudotime_umap[, 1], use = "complete.obs")
#> [1] 0.9776447

Discussion:

  • Can we use order_cells in monocle3 if we do not know what the root cell type is?


How can I characterise the trajectory of the data?

So far, we have built a trajectory from our mouse hippocampus data using the transcriptomes of single cells. During this process, we determined the presence of two distinct lineages in the trajectory that was in line with the underlying biological system, and investigated the impact of using different dimensionality reduction techniques and different methods of trajectory inference.

We can now being to further phenotype the trajectories by investigating the changes in the specific gene expressions programs across and between lineages. We will ask the following questions:

  • Within a lineage, which genes are differentially expressed across a pseudotime?
  • Which genes exhibit different patterns of gene expression across the pseudotime between lineages?
  • Can we group the genes in terms of their gene expression patterns across the pseudotime?
  • Are specific pathways or gene sets enriched in these gene groups?
  • Which gene-pair correlations change across lineages?

Within a lineage, which genes are differentially expressed across a pseudotime?

Background: A suitable statistical model for determining which gene expression is related to pseudotime is the generalise additive model. A generalised additive model (GAM) in statistics is a generalised linear model in which the linear response variable is linearly dependent on unknown smooth functions of certain predictor variables. GAMs can be thought of as a bridge between simple linear models and advanced machine-learning techniques, allowing researchers to model non-linear data or more complex systems. In GAM, inference on smooth functions is frequently the main focus. The GAM model is typically written as \[ y = \beta_0 + f(t) + \epsilon; \epsilon \sim N(0,\sigma^2)\] where \(t\) represents the pseudotime.

The approach tradeSeq Berge et al., 2020 implement a negative bionomial-GAM (NA-GAM) model to detect genes that are differentially expressed along one or multiple lineages in a trajectory. The NB-GAMS model can be fitted using the functions in tradeSeq. Here the choice of knot number is an important model parameters. This parameter specifies how complex the data expression patterns are, with a higher number of knots allowing for more intricate expression patterns. The most optimal choice of knots can be derived from using the evaluateK function. If your data contains many conditions, use the conditions option to fit individual smoothers for each condition.

Computation: In the interest of time, we suggest performing this analysis using

- the set of differentially stable genes (450 genes) derived earlier in this workshop; and
- set `knots = 6` when fitting the NB-GAM. 

This component is computationally intensive, and the ‘fitGAM’ process, as seen below, will take about ten minutes to complete. As an alternative, we recommend completing the rest of this tutorial with the pre-run output.

library(tradeSeq)
slingshot_umap <- slingshot_umap[!duplicated(rownames(slingshot_umap)),]

if (!file.exists("../data/slingshot_umap.rda")) {
  slingshot_umap <- fitGAM(slingshot_umap,
                           genes = cepoGenes,
                           nknots = 6,
                           verbose = TRUE,
                           parallel = FALSE,
                           sce = TRUE,
                           family = "nb",
                           gcv = FALSE)
  save(slingshot_umap, file = "../data/slingshot_umap.rda")
} else {
  load(file="../data/slingshot_umap.rda")
}

table(rowData(slingshot_umap)$tradeSeq$converged)
#> 
#> TRUE 
#>  567

Test for association

Next, using the fitted gene expressions, we can assess for significant changes in gene expression (i.e., differential expression) as a function of pseudotime within each lineage using the associationTest function. Importantly, setting lineages = TRUE will specify that the differential expression analysis should be performed for each lineage separately, as well as performing the test across all lineages simultaneously. Here, we will only focus on perform the association test on the Cepo selected genes.

assoRes <- associationTest(slingshot_umap[cepoGenes, ], global = TRUE, lineages = TRUE)
assoRes <- assoRes[order(assoRes$waldStat, decreasing = TRUE),]

dt <- DT::datatable(assoRes, options = list(scrollX=TRUE)) %>%
  formatSignif(which(sapply(assoRes, is, 'numeric')), digits = 3)
dt

## extracting candidate from each lineage
lineage1Genes <-  rownames(assoRes)[which(p.adjust(assoRes$pvalue_1, "fdr") <= 0.05)]
lineage2Genes <-  rownames(assoRes)[which(p.adjust(assoRes$pvalue_2, "fdr") <= 0.05)]
allGenes <-  rownames(assoRes)[which(p.adjust(assoRes$pvalue, "fdr") <= 0.05)]

DEgenes = list(
  lineage1 = lineage1Genes,
  lineage2 = lineage2Genes
)
sapply(DEgenes, length)
#> lineage1 lineage2 
#>      282      287

We will use the UpSetR package to visualise the number of intersecting and exclusive genes.

UpSetR::upset(UpSetR::fromList(DEgenes))


length(setdiff(lineage1Genes, lineage2Genes))
#> [1] 142
length(setdiff(lineage2Genes, lineage1Genes))
#> [1] 147
length(intersect(lineage1Genes, lineage2Genes))
#> [1] 140
plotSmoothers(slingshot_umap, counts(slingshot_umap), gene = c("RUNX1T1"))

Interpretation:

  • Based on the upset graphics, we see there is a large overlap in DE genes between the three lineages (around 93%).
  • From the pseudotime graph, we can check whether the expression of an identified DE gene indeed change over pseudotime within each lineage. As an example, we select RUNX1T1, a gene that has been found to be differentially expressed for the two lineages.


Which genes exhibit different patterns of gene expression across the pseudotime between lineages?

Whilst the associationTest looks for genes that show differential expression across a pseudotime within a lineage, we can also look for genes that demonstrate different differential expression patterns across lineages through the patternTest function. This is accomplished by testing whether the expression pattern of each gene varies between lineages. As previously, we will only focus on running the pattern test on the Cepo-selected genes.

patternRes <- patternTest(slingshot_umap[cepoGenes,])
patternRes <- patternRes[order(patternRes$waldStat, decreasing = TRUE),]
datatable(patternRes) %>%
  formatSignif(which(sapply(patternRes, is, 'numeric')), digits = 3)

## determine the number of candidate genes
patternGenes <-  rownames(patternRes)[which(p.adjust(patternRes$pvalue, "fdr") <= 0.05)]
length(patternGenes)
#> [1] 454

## Assessing our results through visualization
plotSmoothers(slingshot_umap, counts(slingshot_umap), gene = c("SYT4"))

Interpretation:

  • We have 454 of 567 genes demonstrate differential gene expression patterns between lineages.
  • It is critical, like with any omics data study, to double-check our results using visualisation. For this question, we will visualise the change in the expression of the most important gene, SYT4.


Can we group the genes in terms of their gene expression patterns across the pseudotime?

Hypothesis testing (‘patternTest’) generates a value describe the degree to which a gene’s expression as a function of pseudotime is distinct across lineages; however, it does not capture the diversity of pseudotime patterns among these genes. To this end, we can further cluster genes according to their expression pattern along the pseudotime for each lineage using the clusterExpressionPattern function from the tradeSeq package. The resulting output is a clustering label that denotes gene groups based on like gene expression profiles.

Computation:

The main arguments of the function are

1) the slingshot output, 
2) `nPoints` which denotes the number of points along each lineage to cluster the genes, and 
3) the `genes` to use in the clustering. More details of the implementation can be found in [this vignette](https://statomics.github.io/tradeSeq/articles/tradeSeq.html).

Note that the clustering uses the clusterExperiment package (Risso et al. 2018), hence the user may select any clustering algorithm that’s built into that package (check the output of clusterExperiment::listBuiltInFunctions() on the command line to see available methods). clusterExpressionPatterns take a few minutes to run in our case, you can load our pre-run results instead.

library(clusterExperiment)
nPointsClus <- 20

if (!file.exists("../data/clusPat.rda")) {
  clusPat <- tradeSeq::clusterExpressionPatterns(slingshot_umap, nPoints = nPointsClus,
                                       genes = rownames(patternRes))
  save(clusPat, file="../data/clusPat.rda")
} else {
  load(file="../data/clusPat.rda")
}

clusterLabels <- clusterExperiment::primaryCluster(clusPat$rsec)

The clustering results generated from running the clusterExpressionPatterns function shows a total 52 clustering groups with 98 genes that have not been assigned to a cluster and may demonstrate unique gene expression patterns. We will also visualise the normalised expression patterns of genes from three clusters.

table(clusterLabels)
#> clusterLabels
#> -1  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
#> 89 70 33 28 28 25 23 17 31 14 14 13 13 11 10  9  9  8  8  8  8  8  7  7  7  7 
#> 26 27 28 29 30 31 32 33 34 35 36 
#>  6  6  6  6  6  6  6  5  5  5  5

ggCluster <- list()
for (xx in 1:4) {
  cId <- which(clusterLabels == c(1,2,3,4)[[xx]])
  p <- ggplot(data = data.frame(x = 1:nPointsClus,
                                y = rep(range(clusPat$yhatScaled[cId, ]),
                                        nPointsClus / 2)),
              aes(x = x, y = y)) +
    geom_point(alpha = 0) +
    labs(title = paste0("Cluster", c(1,2,3,4)[[xx]]),  x = "Pseudotime", y = "Normalized expression") +
    theme_classic() +
    theme(plot.title = element_text(hjust = 0.5))
  for (ii in 1:length(cId)) {
    geneId <- rownames(clusPat$yhatScaled)[cId[ii]]
    p <- p +
      geom_line(data = data.frame(x = rep(1:nPointsClus, 2),
                                  y = clusPat$yhatScaled[geneId, ],
                                  lineage = rep(1:2, each = nPointsClus)),
                aes(col = as.character(lineage), group = lineage), lwd = 1.5)
  }
  ggCluster[[xx]] <- p + guides(color = "none") +
    scale_color_manual(values = c("orange", "darkseagreen3"),
                       breaks = c("1", "2"))
}

patchwork::wrap_plots(ggCluster, ncol = 4, nrow = 1)

Discussion:

  • Can you describe the various expression patterns? Note that there are two layers of differences: genes show (1) differential expression across a pseudotime within a lineage and (2) these differential expression patterns are significantly different between lineages.

  • Would setting a greater number of nPoints (e.g., 40, 60, 80) change the clustering results?


What is the functional annotation for the various gene sets?

To establish the functional annotation of the various clusters of genes obtained, it is similar to identifying whether specific pathways or gene sets is enriched in these clusters of genes. Here, we perform gene set over-representation (GO) analysis and illustrate this analysis by identify gene ontology gene sets that are over-represented in cluster 1 (70 genes). The results is shown in a table.

names(clusterLabels) <- rownames(clusPat$yhatScaled)
names(clusterLabels) <- rowData(sce)$gene_name[match(names(clusterLabels), rownames(sce))]

library(clusterProfiler)
library(org.Mm.eg.db)
eg <- bitr(names(clusterLabels)[clusterLabels == 1],
           fromType = "SYMBOL",
           toType = "ENTREZID",
           OrgDb = "org.Mm.eg.db")


geneList <- bitr(rowData(sce)$gene_name,
                 fromType = "SYMBOL",
                 toType = "ENTREZID",
                 OrgDb = "org.Mm.eg.db")

ego_res <- enrichGO(gene = eg$ENTREZID,
                    universe = geneList$ENTREZID,
                    OrgDb = org.Mm.eg.db,
                    ont = "BP",
                    pAdjustMethod = "BH",
                    minGSSize = 50,
                    maxGSSize = 200,
                    pvalueCutoff = 1,
                    qvalueCutoff = 1,
                    readable = TRUE)

## generate results
head(as.data.frame(ego_res))
#>                    ID                                               Description
#> GO:0000070 GO:0000070                      mitotic sister chromatid segregation
#> GO:1902850 GO:1902850 microtubule cytoskeleton organization involved in mitosis
#> GO:0007051 GO:0007051                                      spindle organization
#> GO:0051983 GO:0051983                      regulation of chromosome segregation
#> GO:0051310 GO:0051310                            metaphase chromosome alignment
#> GO:0007052 GO:0007052                              mitotic spindle organization
#>            GeneRatio   BgRatio       pvalue     p.adjust       qvalue
#> GO:0000070     26/65 169/20311 8.629563e-38 2.934051e-35 1.916671e-35
#> GO:1902850     22/65 152/20311 3.246372e-31 5.518832e-29 3.605181e-29
#> GO:0007051     22/65 182/20311 2.113668e-29 2.395490e-27 1.564856e-27
#> GO:0051983     20/65 127/20311 3.923134e-29 3.334664e-27 2.178372e-27
#> GO:0051310     17/65  86/20311 1.338809e-26 9.103901e-25 5.947131e-25
#> GO:0007052     18/65 119/20311 6.867068e-26 3.891338e-24 2.542020e-24
#>                                                                                                                                                           geneID
#> GO:0000070 Aurkb/Birc5/Ccnb1/Cdca8/Cdk1/Cenpe/Cenpk/Cit/Fbxo5/Incenp/Kif11/Kif22/Kif23/Kif2c/Kifc1/Mad2l1/Ncapg/Nuf2/Nusap1/Plk1/Prc1/Spag5/Spc24/Spc25/Tpx2/Ttk
#> GO:1902850                    Aurka/Aurkb/Birc5/Ccnb1/Cdca8/Cdk1/Cenpe/Cenph/Gpsm2/Incenp/Kif11/Kif23/Kifc1/Mad2l1/Nuf2/Nusap1/Plk1/Prc1/Sapcd2/Spc25/Tacc3/Tpx2
#> GO:0007051                       Aspm/Aurka/Aurkb/Birc5/Ccnb1/Ccnb2/Cdca8/Cenpe/Cenph/Fbxo5/Gpsm2/Incenp/Kif11/Kif23/Kifc1/Nuf2/Plk1/Prc1/Spag5/Spc25/Tacc3/Tpx2
#> GO:0051983                                    Aurkb/Birc5/Ccnb1/Cdca2/Cdca8/Cdk1/Cenpe/Cit/Fbxo5/Incenp/Kif2c/Mad2l1/Mki67/Ncapg/Nuf2/Plk1/Spc24/Spc25/Tacc3/Ttk
#> GO:0051310                                                 Aurkb/Birc5/Ccnb1/Cdca8/Cenpe/Cenpf/Cenpq/Ect2/Fam83d/Incenp/Kif22/Kif2c/Kifc1/Nuf2/Spag5/Spc24/Spc25
#> GO:0007052                                              Aurka/Aurkb/Birc5/Ccnb1/Cdca8/Cenpe/Cenph/Gpsm2/Incenp/Kif11/Kif23/Kifc1/Nuf2/Plk1/Prc1/Spc25/Tacc3/Tpx2
#>            Count
#> GO:0000070    26
#> GO:1902850    22
#> GO:0007051    22
#> GO:0051983    20
#> GO:0051310    17
#> GO:0007052    18

The clusterProfiler::dotplot function can be used to visualise the results for the top significant pathways as a dot plot where the size of the circle represents the number of genes identified for the pathway of interest and the color of the function denotes the level of significance.

clusterProfiler::dotplot(ego_res)

Discussion:

  • What are the top enriched pathways from the gene set over-representation analysis?
  • Are the pathways in line with what you expect?


[Optional] Which gene-pair correlations change across lineages?

In the previous four questions, the focus is to identify a candidate gene interest. Here we aim to identify candidate “gene-pair”. This is achieve using an algorithm called scHOT Ghzanfar et al., 2020, where we can further test the change in gene expression of a pair of genes across a pseudotime. In the R package, there are two statistical higher order testing we can perform, controlled by the higherOrderFunction argument in scHOT. We can test for differential variability or differential correlation between pairs of genes across the pseudotime. 1) variability: higherOrderFunction = matrixStats::weightedVar 2) correlation: higherOrderFunction = matrixStats::weightedSpearman

Computation: To reduce computation time, we will select the top 10 genes identified from associationTest.

library(scHOT)

first_branch_cells <- !is.na(slingshot_umap$slingPseudotime_1)
assoRes_lingeage1_top10 <- rownames(assoRes[order(assoRes$waldStat_1, decreasing = TRUE),])[1:10]
gene_to_test <- t(combn(assoRes_lingeage1_top10, 2))

To identify gene pairs that are differential correlated across the pseudotime for each lineage, we visualise the results of the top 5 significant gene pairs that demonstrate differential correlation across our pseudotime.

if(!file.exists("../data/scHOT_obj.rda")) {
  scHOT_obj <- scHOT_buildFromMatrix(
    mat = logcounts(slingshot_umap)[,first_branch_cells],
    cellData = list(pseudotime = slingshot_umap$slingPseudotime_1[first_branch_cells]),
    positionType = "trajectory",
    positionColData = "pseudotime")
  scHOT_obj
  
  scHOT_obj <- scHOT(scHOT_obj,
                     testingScaffold = gene_to_test,
                     nrow.out = 100, # The number of weightings to include for testing, a smaller value is faster for computation
                     higherOrderFunction = weightedSpearman,
                     higherOrderFunctionType = "weighted",
                     numberPermutations = 50)
  
  scHOT_obj@scHOT_output <- scHOT_obj@scHOT_output[order(scHOT_obj@scHOT_output$FDREstimated),]
  save(scHOT_obj, file ="../data/scHOT_obj.rda")
} else {
  load(file="../data/scHOT_obj.rda")
}

plotHigherOrderSequence(scHOT_obj, rownames(scHOT_obj@scHOT_output)[1:5])

Reference and other useful resources

  • dynverse, a collection of R packages aimed at supporting the trajectory inference: https://dynverse.org