TOMAS Basics
TOMAS
is a computational framework that derives total mRNA content ratios between cell populations via deconvoluting their heterotypic doublets.
In this tutorial, we will begin with a simple scenario involving two cell types to demonstrate how TOMAS
works. The general case with multiple cell types could be broken into several two-cell-type cases. For simplicity, we assume the droplet labels are known.
We utilized a simulation dataset with known ground truth as an illustration, setting the mean of total-mRNA ratios between the two cell types at 4. The Dirichlet-Multinomial probability signatures of the two cell types are derived from our in-house T cell activation dataset. This example dataset is available for download.
[1]:
import tomas as tm
import scanpy as sc
import numpy as np
import pandas as pd
import time
from matplotlib import pyplot as plt
[2]:
adata = sc.read_h5ad('../../datasets/adata_basic.h5ad') # set the path to match how you save the downloaded data.
adata.obs.head(3) # ground truth of droplet labels and total-mRNA ratios are stroed in adata.obs
[2]:
danno | trueR | |
---|---|---|
d_1705 | naive | NaN |
d_2401 | naive | NaN |
d_1319 | naive | NaN |
There are totally 1000 droplets with 50 doublets (doublet rate is 5%) in the simulation dataset.
The label of droplets should be stored in adata.obs
. The naming of heterotypic doublets must be the concatenation of names of two component cell types by underscore(’_’). Avoid undersocre in names of homotypic droplets.
[3]:
np.unique(adata.obs['danno'],return_counts=True)
[3]:
(array(['active', 'naive', 'naive_active'], dtype=object),
array([475, 475, 50]))
Display UMI amount distribution if necessary.
[4]:
tm.vis.UMI_hist(adata,groupby='danno')
Step1: Fit Homotypic Droplet Populations
Fit total UMI amounts with log-Normal distribution. The parameters are stored in adata.uns['para_logUMI']
.
[5]:
tm.fit.logNormal(adata,groupby='danno')
Fit gene probability signatures with Dirichlet-Multinomial (DMN) model. The optimizated paremeters Alpha
of each cell types are stroed in adata.varm['para_diri']
.
[6]:
tm.fit.dmn(adata,
groupby='danno',
groups=['naive','active'],
verbose=2,
verbose_interval = 100)
Initialization 0
Iteration 100 time lapse 12.34517s ll change 0.10193
Initialization converged: True time lapse 23.97408s ll -163680.93666
active is done!
Initialization 0
Iteration 100 time lapse 9.94024s ll change 0.06904
Iteration 200 time lapse 9.56807s ll change 0.00602
Iteration 300 time lapse 9.88499s ll change 0.00104
Initialization converged: True time lapse 29.68939s ll -37940.07350
naive is done!
Step2: Estimate Total-mRNA Ratios by Synthesizing Doublets
We merge metagenes according to probability signatures of two component cell types to speed up. Metagenes preserving the key difference of the two cell types warrant efficient estimation of total-mRNA ratios.
[7]:
adata_dbl_mg = tm.infer.get_dbl_mg_bc(adata,
groupby = 'danno',
dblgroup = 'naive_active')
Estimate \(R\) by optimizing the likelihood of heterotypic doublet with Monte Carto sampling
[8]:
tm.infer.ratios_bc(adata_dbl_mg,
dblgroup='naive_active',
max_iter=30,
verbose=1,verbose_interval=2)
w_best = adata_dbl_mg.uns['ratio']['w_best']
tm.vis.logRatio_dist((1-w_best)/w_best)
Initialized.
Iteration 2 time lapse 2.7430717945098877s ll change 1189.861348478502
Iteration 4 time lapse 2.8204591274261475s ll change 0.29203234720625915
Iteration 6 time lapse 2.77731990814209s ll change 0.004154046822804958
Compare estimated total-mRNA ratios of each heterotypic doublet with ground truth
[10]:
log2R_pred = np.log2((1-w_best)/w_best)
log2R_true = np.log2(adata.obs.loc[adata.obs['danno']=='naive_active','trueR'])
[11]:
plt.figure(figsize=(3,3))
plt.scatter(log2R_true, log2R_pred,s=5)
plt.plot([np.min(log2R_pred),np.max(log2R_pred)],[np.min(log2R_pred),np.max(log2R_pred)],'--',c='r')
plt.xlabel('True $log_{2}$R',fontsize=12)
plt.ylabel('Estimated $log_{2}$R',fontsize=12)
plt.tight_layout()
plt.show()
[ ]: