UMAP Analysis of hESC scRNA-seq Data

Setting up: loading libraries and data of all time points
- Merging seurat objects for integrated analysis
UMAP Plot for all conditions
- Checking batch effects
- Batch correction via Harmony
Gene-Specific UMAPs

This document presents the UMAP (Uniform Manifold Approximation and Projection) analysis of single-cell RNA sequencing (scRNA-seq) data from the published study E-MTAB-12598. UMAP is a dimensionality reduction technique that captures high-dimensional transcriptional variation, allowing visualization of cellular heterogeneity and lineage transitions.

The dataset consists of human embryonic stem cells (hESCs) subjected to different durations of Wnt signaling activation via GSK3β inhibition (CHIR99021).
Cells were collected at 4 hours, 24 hours, and 3 days after Wnt activation for analysis.

In the next phase, we will conduct a time-resolved study of cardiomyocyte differentiation from hESCs, analyzing both Wnt inhibition and differentiation stages, with cells collected at various time points.

The filtered feature barcode matrix (filtered_feature_bc_matrix.h5) is used for analysis, as it contains high-quality cells.

Setting up: loading libraries and data of all time points

# Set working directory 
setwd("/home/telomere/Documents/degree_project_AG_Cantulab/poster/UMAPs")

# Load necessary libraries
library("Seurat")
library("ggplot2")
library("umap")
library("patchwork") 
library("Cairo")

# Define time points
time_points <- c("hESC_WntOFF", "hESC_Chir4h", "hESC_Chir24h", "hESC_Chir3d")

# Create an empty list to store Seurat objects
seurat_list <- list()

# Loop through each time point and process it
for (time in time_points) {
  # Define file path
  file_path <- paste0(time, "_filtered_feature_bc_matrix.h5")
  
  # Load data
  data <- Read10X_h5(file_path)
  
  # Create Seurat object
  seurat_obj <- CreateSeuratObject(counts = data, min.cells = 3, min.features = 200)
  
  # Assign time point identity
  seurat_obj$orig.ident <- time  # <-- This line ensures each time point is labeled
  
  # Normalize and find variable features
  seurat_obj <- NormalizeData(seurat_obj)
  seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 2000)
  
  # Scale data and perform PCA
  seurat_obj <- ScaleData(seurat_obj)
  seurat_obj <- RunPCA(seurat_obj, npcs = 30)
  
  # Run UMAP
  seurat_obj <- FindNeighbors(seurat_obj, dims = 1:20)
  seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)
  seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)
  
  # Store processed object in list
  seurat_list[[time]] <- seurat_obj
}

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2377
## Number of edges: 99249
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8505
## Number of communities: 5
## Elapsed time: 0 seconds

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 4000
## Number of edges: 156048
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8707
## Number of communities: 8
## Elapsed time: 0 seconds

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3231
## Number of edges: 122425
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8202
## Number of communities: 6
## Elapsed time: 0 seconds

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3181
## Number of edges: 122107
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8418
## Number of communities: 7
## Elapsed time: 0 seconds

The original dataset labeled hESC_WntOFF to represent cells without CHIR99021 (CHIR) treatment. However, since Wnt signaling is already active in hESCs, this label may be misleading. To better reflect the experimental setup, we rename it hESC_D0 (No CHIR), indicating cells collected before CHIR exposure.

CHIR was added to enhance Wnt activation, promoting mesoderm induction alongside Activin A. The time points correspond to different durations of CHIR exposure, capturing early lineage transitions.

To implement this change, we update the factor levels accordingly.

To analyze all time points together while preserving their individual identities, we merge the Seurat objects.

Merging seurat objects for integrated analysis

# Merge all Seurat objects and keep their identities
merged_seurat <- merge(seurat_list[[1]], y = seurat_list[2:length(seurat_list)], add.cell.ids = names(seurat_list))

# Ensure `orig.ident` is correctly assigned
table(merged_seurat$orig.ident)  # should show counts per time point

## 
## hESC_Chir24h  hESC_Chir3d  hESC_Chir4h  hESC_WntOFF 
##         3231         3181         4000         2377

# Rename 'hESC_WntOFF' to 'hESC_D0 (No CHIR)' in the metadata
merged_seurat$orig.ident <- as.character(merged_seurat$orig.ident)  # Convert to character
merged_seurat$orig.ident[merged_seurat$orig.ident == "hESC_WntOFF"] <- "hESC_D0 (No CHIR)" 

# Convert back to factor and define levels in ascending order
merged_seurat$orig.ident <- factor(
  merged_seurat$orig.ident,
  levels = c("hESC_D0 (No CHIR)", "hESC_Chir4h", "hESC_Chir24h", "hESC_Chir3d")
)

# Run PCA and UMAP again after merging
merged_seurat <- NormalizeData(merged_seurat)
merged_seurat <- FindVariableFeatures(merged_seurat)
merged_seurat <- ScaleData(merged_seurat)
merged_seurat <- RunPCA(merged_seurat, npcs = 30)
merged_seurat <- FindNeighbors(merged_seurat, dims = 1:20)
merged_seurat <- FindClusters(merged_seurat, resolution = 0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 12789
## Number of edges: 468178
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9258
## Number of communities: 14
## Elapsed time: 1 seconds

merged_seurat <- RunUMAP(merged_seurat, dims = 1:20)

UMAP Plot for all conditions

Checking batch effects

# Check if `orig.ident` correctly assigns batch identity
table(merged_seurat$orig.ident)

## 
## hESC_D0 (No CHIR)       hESC_Chir4h      hESC_Chir24h       hESC_Chir3d 
##              2377              4000              3231              3181

# Plot UMAP to visualize batch effects
DimPlot(merged_seurat, group.by = "orig.ident", reduction = "umap", cols = c("firebrick", "gold", "seagreen", "steelblue"))

Batch effect assessment in UMAP
This UMAP visualizes cells by time points (orig.ident) to check for batch effects before correction.

Distinct clusters suggest transcriptional differences across time points, but strong separation may indicate batch effects.
Overlap between CHIR 4h (yellow) and CHIR 24h (green) suggests a gradual transition, while D0 (No CHIR) (red) and CHIR 3d (blue) remain distinct.
If differences are biological, batch correction may not be needed. However, if separation is driven by technical artifacts, Harmony or another correction method should be applied.

Batch correction via Harmony

library(harmony)

# Run Harmony to correct batch effects
merged_seurat <- RunHarmony(
  object = merged_seurat,
  group.by.vars = "orig.ident"  # Adjust batch based on time points
)

# Run UMAP on Harmony-corrected embeddings
merged_seurat <- RunUMAP(merged_seurat, reduction = "harmony", dims = 1:20)

# Plot corrected UMAP
DimPlot(merged_seurat, reduction = "umap", group.by = "orig.ident", cols = c("firebrick", "gold", "seagreen", "steelblue"))

Interpretation of Batch-corrected UMAP
This UMAP plot visualizes all time points after batch correction, ensuring that clustering reflects biological variation rather than technical artifacts.

Stronger overlap among D0 (No CHIR) (red), CHIR 4h (yellow), and CHIR 24h (green) suggests better alignment of transcriptionally similar populations, reducing artificial separation.
CHIR day 3 (blue) remains distinct, likely representing true biological differences rather than batch effects, as later differentiation stages should diverge.
Batch correction has minimized artificial clustering, preserving meaningful transitions in gene expression dynamics.
This confirms that Harmony has effectively integrated the datasets, allowing a clearer analysis of differentiation trajectories.

The trajectory from Day0 (No CHIR) → CHIR 4h → CHIR 24h → CHIR 72h suggests a progressive shift in gene expression.
Overlapping regions between adjacent time points indicate transitional states, where cells are gradually shifting toward a Wnt-activated phenotype.
The distinct spatial separation of the clusters indicates that Wnt signaling significantly alters gene expression dynamics over time.

This UMAP suggests that Wnt activation drives gradual and distinct transcriptional changes in hESCs, likely influencing their pluripotency, differentiation potential, or lineage commitment. Further gene-specific UMAPs (e.g., stemness, mesoderm, cardiomyocyte markers) will help clarify how Wnt influences differentiation pathways.

Gene-Specific UMAPs

# Check presence of genes in all conditions
gene_list <- c("AXIN2", "SP5", "NANOG", "SOX2", "POU5F1", "FOXA2", "EOMES", "MESP1", "TBXT", "GATA4", "HAND1", "NKX2-5", "ISL1", "TBX5")

# Apply function to check if each gene is present in all conditions
gene_presence <- sapply(gene_list, function(g) sapply(seurat_list, function(x) g %in% rownames(x)))

# Print results
print(gene_presence)

##              AXIN2   SP5 NANOG SOX2 POU5F1 FOXA2 EOMES MESP1  TBXT GATA4 HAND1
## hESC_WntOFF   TRUE FALSE  TRUE TRUE   TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## hESC_Chir4h   TRUE  TRUE  TRUE TRUE   TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## hESC_Chir24h  TRUE  TRUE  TRUE TRUE   TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## hESC_Chir3d   TRUE  TRUE  TRUE TRUE   TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
##              NKX2-5 ISL1 TBX5
## hESC_WntOFF   FALSE TRUE TRUE
## hESC_Chir4h   FALSE TRUE TRUE
## hESC_Chir24h  FALSE TRUE TRUE
## hESC_Chir3d    TRUE TRUE TRUE

UMAPs for stemness markers

Stemness markers → AXIN2, NANOG, SOX2, POU5F1

# Stemness markers
FeaturePlot(merged_seurat, features = c("AXIN2"), reduction = "umap", cols = c("lightgray", "blue")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("NANOG"), reduction = "umap", cols = c("lightgray", "blue")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("SOX2"), reduction = "umap", cols = c("lightgray", "blue")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("POU5F1"), reduction = "umap", cols = c("lightgray", "blue")) +
  theme(plot.title = element_text(size = 10))

UMAPs for mesoderm markers

Mesoderm markers → EOMES, MESP1, TBXT, GATA4

# Mesoderm markers
FeaturePlot(merged_seurat, features = c("EOMES"), reduction = "umap", cols = c("lightgray", "darkgreen")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("MESP1"), reduction = "umap", cols = c("lightgray", "darkgreen")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("TBXT"), reduction = "umap", cols = c("lightgray", "darkgreen")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("GATA4"), reduction = "umap", cols = c("lightgray", "darkgreen")) +
  theme(plot.title = element_text(size = 10))

UMAPs for cardiomyocyte markers

Cardiomyocyte markers → NKX2.5, ISL1, TBX5, HAND1

# Cardiomyocyte markers
FeaturePlot(merged_seurat, features = c("NKX2-5"), reduction = "umap", cols = c("lightgray", "red")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("ISL1"), reduction = "umap", cols = c("lightgray", "red")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("TBX5"), reduction = "umap", cols = c("lightgray", "red")) +
  theme(plot.title = element_text(size = 10))

FeaturePlot(merged_seurat, features = c("HAND1"), reduction = "umap", cols = c("lightgray", "red")) +
  theme(plot.title = element_text(size = 10))

Interpreting gene-specific UMAPs in hESC differentiation
Each UMAP plot visualizes the expression of a specific gene across all time points. The color gradient indicates expression levels, with gray representing low or no expression and blue/red/yellow showing high expression.

Cells are positioned based on transcriptional similarity—closer clusters suggest similar expression profiles.

Stemness Markers (AXIN2, NANOG, SOX2, POU5F1)
These UMAP plots visualize the expression of key stemness markers (POU5F1, NANOG, SOX2, AXIN2) across all conditions.

POU5F1 (OCT4): A master regulator of pluripotency, essential for maintaining self-renewal in embryonic stem cells. Its widespread expression in earlier time points indicates active pluripotency.
SOX2: Functions alongside OCT4 to sustain pluripotency and prevent differentiation, showing a similar expression pattern.
NANOG: Supports stem cell identity and self-renewal, but its expression is weaker, suggesting a gradual loss of pluripotency as cells transition towards differentiation.
AXIN2: A Wnt signaling target involved in regulating self-renewal and differentiation balance, showing moderate expression.
The expression of POU5F1 (OCT4), SOX2, and NANOG is prominent in the left clusters, corresponding to early time points, and diminishes in the right cluster (CHIR 3d), indicating a transition away from pluripotency. This suggests that as Wnt signaling persists, cells progressively lose their pluripotent identity, aligning with early differentiation events.
Mesoderm Markers (EOMES, MESP1, TBXT, GATA4)
TBXT (Brachyury): A key mesodermal marker, essential for early mesoderm and primitive streak formation. Its expression is enriched in distinct clusters, indicating mesodermal commitment in a subset of cells.
GATA4: A transcription factor involved in mesoderm and cardiac lineage specification. Its low expression across time points suggests that cardiogenic commitment is not yet prominent.
MESP1: A master regulator of cardiovascular progenitor development, showing minimal expression, consistent with an early differentiation state before mesoderm-derived cardiac fate specification.
EOMES: Critical for mesoderm induction and primitive streak formation. Its expression is observed in specific clusters, aligning with cells transitioning toward mesodermal fate.

The expression patterns of these markers suggest early mesodermal specification, with TBXT and EOMES showing the highest expression, marking early-stage mesoderm induction. The low expression of GATA4 and MESP1 reinforces that cells are in early mesodermal stages and have not yet fully committed to mesoderm differentiation.

Cardiomyocyte Markers (NKX2-5, ISL1, TBX5, HAND1)
Since these cells have not yet been subjected to a cardiomyocyte differentiation protocol, it is expected that cardiac markers like NKX2-5, ISL1, HAND1, and TBX5 would show little to no expression at these time points.

NKX2-5: A crucial cardiac transcription factor essential for heart development, showing sparse expression.
ISL1: A marker of cardiac progenitors, expected at early stages of heart specification, but largely absent here.
TBX5: Regulates cardiac lineage commitment and chamber formation, with low expression, consistent with the early time points.
HAND1: Plays a role in ventricular development, but its minimal presence further confirms that these cells have not initiated cardiac differentiation.
The lack of strong expression indicates that cardiomyogenic commitment has not yet begun, aligning with the experimental setup where Wnt has not been inactivated to promote cardiac differentiation.

Future datasets, including Wnt inhibition phases, will be necessary to observe definitive cardiomyocyte commitment.