Generic case

In this tutorial, we demonstrate how to perform total mRNA-aware analysis with TOMAS using a generic scRNA-seq dataset that contains multiple cell types and unknown cell type labels.

To illustrate the concept, we utilized a simulation dataset with known ground truth. In this simulation scenario, we generated four cell types with mRNA ratios set to 1:2:4:6. The Dirichlet-Multinomial probability signatures of these four cell types were derived from a human PBMC dataset. This example dataset is available for download.

[1]:

import scanpy as sc
import pandas as pd
import numpy as np
import tomas as tm
import subprocess
import seaborn as sns

print(tm.__version__)

sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=60, facecolor='white')
sns.set_style("ticks")

0.1.26

[2]:

adata = sc.read_h5ad('../../datasets/adata_mRNAonly.h5ad') # set the path to match how you save the downloaded data.
adata.obs.head(3)  # ground truth of droplet labels and total-mRNA ratios are stroed in adata.obs

[2]:

	danno	trueR
d_5120	A	NaN
d_5201	A	NaN
d_5156	A	NaN

Computational doublet detection

Here we use R package DoubletFinder for doublet detection. For the sake of compatibility, we demonstrate here the process of running DoubletFinder directly in Python using the subprocess module, and retrieving the identified doublet results for TOMAS.

Please note that the R package DoubletFinder and its dependancy Seurat are required. Before moving on, please download the R script runDblFinder.R first and call it by subprocess.call like below. (Or you could also run DoubletFinder commands in R following its tutorial and save the results for TOMAS.)

Three input arguments include the path of data stored in mtx format, the doublet rate and the ouput path.

If available, the ‘mtx_path’ could be set as the path to the ‘filtered_feature_bc_matrix’ folder in standard output of CellRanger.

[3]:

mtx_path = './data_mtx'
tm.auxi.write_h5ad_to_mtx(adata,adata.obs_names,adata.var_names,mtx_path) # write .h5ad file to mtx file

We suggest using a moderately high doublet rate to run DoubletFinder.

[4]:

doubletrate = 0.15
dblFinder_out = './dbl'

To maintain the cleanliness and readability of the tutorial, we have cleared the output of this command.

[5]:

subprocess.call(['Rscript', 'runDblFinder.R', mtx_path, dblFinder_out, str(doubletrate)])

Read the output and save it in adata.obs.

[6]:

dblFinder = pd.read_csv(dblFinder_out+'/DoubletFinder_out.csv', header=0, index_col=0)
pdbl_barcodes = dblFinder.index[dblFinder.iloc[:,-1]=='Doublet'].values.tolist()

adata.obs['dblFinder'] = 'Others'
adata.obs.loc[dblFinder.index,'dblFinder'] = dblFinder.iloc[:,-1].values
# match the terminologies used in TOMAS
dbl2tomas = dict(zip(['Doublet','Singlet','unknown'],['Heterotypic','Homotypic','unknown']))
adata.obs['droplet_type'] = [dbl2tomas[f] for f in adata.obs['dblFinder']]

Homotypic droplet population identification

Perform clustering on homotypic droplets.

[7]:

adata_psgl = adata[adata.obs['droplet_type']=='Homotypic'].copy()
sc.pp.filter_cells(adata_psgl, min_genes=1)
sc.pp.filter_genes(adata_psgl, min_cells=1)
sc.pp.log1p(adata_psgl)
sc.pp.highly_variable_genes(adata_psgl, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata_psgl.raw = adata_psgl
adata_psgl = adata_psgl[:, adata_psgl.var.highly_variable]
sc.pp.scale(adata_psgl, max_value=10)
sc.tl.pca(adata_psgl, svd_solver='arpack')
sc.pp.neighbors(adata_psgl, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata_psgl)

filtered out 2460 genes that are detected in less than 1 cells
extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)
... as `zero_center=True`, sparse input is densified and may lead to large memory consumption
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:56)
computing neighbors
    using 'X_pca' with n_pcs = 40
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:03)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:05)

Please adjust the resolution parameter based on the results to avoid significant over-subdivision or under-subdivision.

[8]:

sc.tl.leiden(adata_psgl, resolution=0.1)
sc.pl.umap(adata_psgl, color=['leiden'])

running Leiden clustering
    finished: found 4 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)

(optional) In real case, you could first annotate the clusters with known cell-type markers before moving on. Here, as an example, we directly apply TOMAS to those clusters to showcase how to infer total-mRNA ratios with only scRNA-seq data.

Save identified cell cluters/types in adata.obs and perform DE analysis.

[9]:

adata.obs['danno_RNAonly'] = 'unknown'
adata.obs.loc[adata_psgl.obs_names,'danno_RNAonly'] = adata_psgl.obs['leiden'].values
sc.tl.rank_genes_groups(adata_psgl, 'leiden', method='wilcoxon')
degene_sorted = tm.auxi.extract_specific_genes(adata_psgl, 'leiden')

ranking genes
    finished: added to `.uns['rank_genes_groups']`
    'names', sorted np.recarray to be indexed by group ids
    'scores', sorted np.recarray to be indexed by group ids
    'logfoldchanges', sorted np.recarray to be indexed by group ids
    'pvals', sorted np.recarray to be indexed by group ids
    'pvals_adj', sorted np.recarray to be indexed by group ids (0:00:04)

Hetero-doublets identification and refinement

Identify hetero-dbls based on co-expression of marker genes of two cell types.

The naming convention of hetero-doublets is to use an underscore to connect the names of two constitutive cell types/clusters. (Underscore is not allowed in naming convention of pure cell types/clusters.)

Identify hetero-dbls composed by cluster ‘0’ and cluster ‘1’

[10]:

## try default parameters first
tm.infer.heteroDbl_bc(adata,
                      '0_1',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      de_sorted = degene_sorted,
                      dpi=60)

[11]:

# adjust the cutoff by setting 'threshold_x', 'threshold_y' or 'threshold' if necessary
tm.infer.heteroDbl_bc(adata,
                      '0_1',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      threshold_x = 1.8,
                      de_sorted = degene_sorted,
                      dpi=60)

Identify hetero-dbls composed by cluster ‘0’ and cluster ‘2’

[12]:

## try default parameters first
tm.infer.heteroDbl_bc(adata,
                      '0_2',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      de_sorted = degene_sorted,
                      dpi=60)

Identify hetero-dbls composed by cluster ‘0’ and cluster ‘3’

[13]:

## try default parameters first
tm.infer.heteroDbl_bc(adata,
                      '0_3',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      de_sorted = degene_sorted,
                      dpi=60)

[14]:

# adjust the cutoff by setting 'threshold_x', 'threshold_y' or 'threshold' if necessary
tm.infer.heteroDbl_bc(adata,
                      '0_3',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      threshold_y = 2.5,
                      de_sorted = degene_sorted,
                      dpi=60)

Identify hetero-dbls composed by cluster ‘1’ and cluster ‘2’

[15]:

## try default parameters first
tm.infer.heteroDbl_bc(adata,
                      '1_2',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      de_sorted = degene_sorted,
                      dpi=60)

Identify hetero-dbls composed by cluster ‘1’ and cluster ‘3’

[16]:

## try default parameters first
tm.infer.heteroDbl_bc(adata,
                      '1_3',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      de_sorted = degene_sorted,
                      dpi=60)

[17]:

# adjust the cutoff by setting 'threshold_x', 'threshold_y' or 'threshold' if necessary
tm.infer.heteroDbl_bc(adata,
                      '1_3',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      threshold_y = 2.8,
                      de_sorted = degene_sorted,
                      dpi=60)

Identify hetero-dbls composed by cluster ‘2’ and cluster ‘3’

[18]:

## try default parameters first
tm.infer.heteroDbl_bc(adata,
                      '2_3',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      de_sorted = degene_sorted,
                      dpi=60)

[19]:

# adjust the cutoff by setting 'threshold_x', 'threshold_y' or 'threshold' if necessary
tm.infer.heteroDbl_bc(adata,
                      '2_3',
                      d_groupby = 'droplet_type',
                      ct_groupby = 'danno_RNAonly',
                      threshold_y = 2.8,
                      de_sorted = degene_sorted,
                      dpi=60)

Summarize droplet populations

[20]:

tm.infer.heterodbl_summary(adata,ct_groupby='danno_RNAonly')

Counts of each kind of droplets:
        Hetero-dbl-type  counts
0                     0    2443
0_1                 0_1     159
0_2                 0_2     141
1                     1    1892
2                     2    1371
2_1                 2_1     120
3                     3    1094
3_0                 3_0     125
3_1                 3_1      65
3_2                 3_2      69
unknown         unknown     521

[ ]:

Fit signature distribution for each cell type

[21]:

tm.fit.logNormal(adata,groupby='danno_RNAonly')

tm.fit.dmn(adata,
           groupby='danno_RNAonly',
           groups=['0','1','2','3'],
           verbose=2,
           verbose_interval = 100)

Initialization 0
  Iteration 100  time lapse 31.15464s    ll change 0.01474
Initialization converged: True   time lapse 53.09061s    ll -22083.59051
3 is done!
Initialization 0
  Iteration 100  time lapse 62.58861s    ll change 0.03337
  Iteration 200  time lapse 68.57251s    ll change 0.00225
Initialization converged: True   time lapse 166.77823s   ll -22659.08424
1 is done!
Initialization 0
  Iteration 100  time lapse 48.29173s    ll change 0.06491
  Iteration 200  time lapse 38.97226s    ll change 0.00822
  Iteration 300  time lapse 70.11431s    ll change 0.00205
Initialization converged: True   time lapse 186.56300s   ll -20376.31360
2 is done!
Initialization 0
Initialization converged: True   time lapse 63.20879s    ll -20824.03802
0 is done!

[22]:

adata_mgdic = tm.infer.get_dbl_mg(adata, groupby = 'danno_RNAonly')

[23]:

for dbl in adata_mgdic:
    print(dbl)

    tm.infer.ratios_bc(adata_mgdic[dbl],dbl,verbose=1)
    w_best = adata_mgdic[dbl].uns['ratio']['w_best']
    tm.vis.logRatio_dist((1-w_best)/w_best)

3_0
Initialized.
Iteration 2     time lapse 7.464699983596802s   ll change 986.740858053905

Iteration 4     time lapse 7.5402398109436035s  ll change 1.627344752108911

Iteration 6     time lapse 7.284444332122803s   ll change 0.19341495129629038

Iteration 8     time lapse 7.35881495475769s    ll change 0.22153924047597684

3_2
Initialized.
Iteration 2     time lapse 4.016129016876221s   ll change 501.9426881689724

Iteration 4     time lapse 3.8351340293884277s  ll change 0.6582975641649682

Iteration 6     time lapse 3.8424510955810547s  ll change 0.0386034477269277

Iteration 8     time lapse 3.987287998199463s   ll change 0.02481315698241815

Iteration 10    time lapse 4.088726997375488s   ll change 0.0035839457996189594

3_1
Initialized.
Iteration 2     time lapse 3.9632608890533447s  ll change 1085.2957540539064

Iteration 4     time lapse 3.7715671062469482s  ll change 2.7137881421658676

Iteration 6     time lapse 3.600537061691284s   ll change 0.011093866094597615

Iteration 8     time lapse 3.563723087310791s   ll change 0.0011783692752942443

0_2
Initialized.
Iteration 2     time lapse 8.239360809326172s   ll change 2562.3200844880776

Iteration 4     time lapse 8.28295087814331s    ll change 0.3022694913961459

Iteration 6     time lapse 8.9981529712677s     ll change 0.09861751084099524

Iteration 8     time lapse 7.94610595703125s    ll change 0.025448093510931358

Iteration 10    time lapse 8.159227132797241s   ll change 0.0016252986679319292

0_1
Initialized.
Iteration 2     time lapse 8.979889869689941s   ll change 3930.2558828322217

Iteration 4     time lapse 8.725122928619385s   ll change 5.201312209479511

Iteration 6     time lapse 8.938920974731445s   ll change 0.37982264140737243

Iteration 8     time lapse 8.75301194190979s    ll change 0.08157373909489252

2_1
Initialized.
Iteration 2     time lapse 6.419339895248413s   ll change 7985.879368977185

Iteration 4     time lapse 6.3484368324279785s  ll change 7.526343683304731

Iteration 6     time lapse 6.2439188957214355s  ll change 3.0303274058387615

Iteration 8     time lapse 6.381319999694824s   ll change 0.15699274133658037

Iteration 10    time lapse 6.293601989746094s   ll change 3.559465659491252

Iteration 12    time lapse 6.3302342891693115s  ll change 0.0681253346556332

Iteration 14    time lapse 6.408145189285278s   ll change 0.48068093752954155

Iteration 16    time lapse 6.477036952972412s   ll change 0.0

Compare with simulation ground truth

[25]:

from matplotlib import pyplot as plt

dbl_groups = list(adata_mgdic.keys())

plt.figure(figsize=(9,3*np.ceil(len(dbl_groups) / 3)), dpi=72)
for i in range(len(dbl_groups)):
    ax = plt.subplot(int(np.ceil(len(dbl_groups)/3)),3,i+1)

    dbl = dbl_groups[i]

    log2R_true = np.log2(adata.obs.loc[adata.obs['danno_RNAonly']==dbl,'trueR'])
    w_best = adata_mgdic[dbl].uns['ratio']['w_best']
    log2R_pred = np.log2((1-w_best)/w_best)

    plt.scatter(log2R_true, log2R_pred,s=5)
    plt.plot([np.min(log2R_true),np.max(log2R_true)],[np.min(log2R_true),np.max(log2R_true)],c='r')

    plt.xlabel('True log2R',fontsize=12)
    plt.ylabel('Estimated log2R',fontsize=12)
    plt.title(dbl.split('_')[1]+' : '+dbl.split('_')[0],fontsize=15)

    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.yaxis.set_ticks_position('left')

plt.tight_layout()
plt.show()

Save estimated ratios in AnnData object of the whole dataset

[26]:

tm.infer.save_ratio_in_anndata(adata_mgdic, adata)

[27]:

adata.write('../../datasets/adata_mRNAonly_tomas.h5ad')

[ ]:

[ ]: