TOMAS Basics

TOMAS is a computational framework that derives total mRNA content ratios between cell populations via deconvoluting their heterotypic doublets.

In this tutorial, we will begin with a simple scenario involving two cell types to demonstrate how TOMAS works. The general case with multiple cell types could be broken into several two-cell-type cases. For simplicity, we assume the droplet labels are known.

We utilized a simulation dataset with known ground truth as an illustration, setting the mean of total-mRNA ratios between the two cell types at 4. The Dirichlet-Multinomial probability signatures of the two cell types are derived from our in-house T cell activation dataset. This example dataset is available for download.

[1]:
import tomas as tm
import scanpy as sc
import numpy as np
import pandas as pd
import time
from matplotlib import pyplot as plt
[2]:
adata = sc.read_h5ad('../../datasets/adata_basic.h5ad') # set the path to match how you save the downloaded data.
adata.obs.head(3)  # ground truth of droplet labels and total-mRNA ratios are stroed in adata.obs
[2]:
danno trueR
d_1705 naive NaN
d_2401 naive NaN
d_1319 naive NaN

There are totally 1000 droplets with 50 doublets (doublet rate is 5%) in the simulation dataset.

The label of droplets should be stored in adata.obs. The naming of heterotypic doublets must be the concatenation of names of two component cell types by underscore(’_’). Avoid undersocre in names of homotypic droplets.

[3]:
np.unique(adata.obs['danno'],return_counts=True)
[3]:
(array(['active', 'naive', 'naive_active'], dtype=object),
 array([475, 475,  50]))

Display UMI amount distribution if necessary.

[4]:
tm.vis.UMI_hist(adata,groupby='danno')
_images/TOMAS_basics_10_0.png

Step1: Fit Homotypic Droplet Populations

Fit total UMI amounts with log-Normal distribution. The parameters are stored in adata.uns['para_logUMI'].

[5]:
tm.fit.logNormal(adata,groupby='danno')

Fit gene probability signatures with Dirichlet-Multinomial (DMN) model. The optimizated paremeters Alpha of each cell types are stroed in adata.varm['para_diri'].

[6]:
tm.fit.dmn(adata,
           groupby='danno',
           groups=['naive','active'],
           verbose=2,
           verbose_interval = 100)
Initialization 0
  Iteration 100  time lapse 12.34517s    ll change 0.10193
Initialization converged: True   time lapse 23.97408s    ll -163680.93666
active is done!
Initialization 0
  Iteration 100  time lapse 9.94024s     ll change 0.06904
  Iteration 200  time lapse 9.56807s     ll change 0.00602
  Iteration 300  time lapse 9.88499s     ll change 0.00104
Initialization converged: True   time lapse 29.68939s    ll -37940.07350
naive is done!

Step2: Estimate Total-mRNA Ratios by Synthesizing Doublets

We merge metagenes according to probability signatures of two component cell types to speed up. Metagenes preserving the key difference of the two cell types warrant efficient estimation of total-mRNA ratios.

[7]:
adata_dbl_mg = tm.infer.get_dbl_mg_bc(adata,
                                      groupby = 'danno',
                                      dblgroup = 'naive_active')

Estimate \(R\) by optimizing the likelihood of heterotypic doublet with Monte Carto sampling

[8]:
tm.infer.ratios_bc(adata_dbl_mg,
                  dblgroup='naive_active',
                  max_iter=30,
                  verbose=1,verbose_interval=2)

w_best = adata_dbl_mg.uns['ratio']['w_best']
tm.vis.logRatio_dist((1-w_best)/w_best)
Initialized.
Iteration 2     time lapse 2.7430717945098877s  ll change 1189.861348478502

Iteration 4     time lapse 2.8204591274261475s  ll change 0.29203234720625915

Iteration 6     time lapse 2.77731990814209s    ll change 0.004154046822804958

_images/TOMAS_basics_20_1.png

Compare estimated total-mRNA ratios of each heterotypic doublet with ground truth

[10]:
log2R_pred = np.log2((1-w_best)/w_best)
log2R_true = np.log2(adata.obs.loc[adata.obs['danno']=='naive_active','trueR'])
[11]:
plt.figure(figsize=(3,3))
plt.scatter(log2R_true, log2R_pred,s=5)
plt.plot([np.min(log2R_pred),np.max(log2R_pred)],[np.min(log2R_pred),np.max(log2R_pred)],'--',c='r')
plt.xlabel('True $log_{2}$R',fontsize=12)
plt.ylabel('Estimated $log_{2}$R',fontsize=12)
plt.tight_layout()
plt.show()
_images/TOMAS_basics_23_0.png
[ ]: