
Median transformation?

Posted: Mon Feb 19, 2024 4:25 pm
by bumpCyto

I'm new to CyTOF but have a lot of experience analyzing single cell RNA data.

In single cell RNA, it is common to summarize the expression of a gene in a cell type using the mean (or sum) of expression across cells. However, in CyTOF, I've noticed that everyone uses the median.

Looking at the distribution of markers in my CyTOF data, it's clear to me that in general the data have a very long right tail (even after arcsinh(x/5) transformation). So, my guess is that people use the median because they are worried that this fat right tail may cause the mean to be very sensitive to points in that tail. However, what is less clear to me is if these points in the right tail are really "outliers", or if they are just cells with high expression being correctly measured. Is there a reason to believe that these cells are technical artifacts? Or, is there a biological reason to care less about cells with very high expression relative to the rest of the population (which would cause one to prefer the median to the mean)?

Thanks for your help!

Re: Median transformation?

Posted: Wed Feb 21, 2024 6:09 pm
by mleipold
Hi Eric,

There are a number of Cytoforum readers who regularly work with both CyTOF and RNAseq; hopefully some of them will chime in.

But, to get started, I think there are a couple points to make.

1. If I'm understanding you correctly, you're looking at the entire sample distribution (ie, all the junk gated out and you're looking at total LiveIntactSinglets or something similar). If so, then yes, the cells at very high expression relative to the rest of the population may indeed be biologically relevant (and correctly measured).

An example of this could be Plasmablasts, which are often considered to be CD27hi CD38hi relative to the rest of the CD3- CD19+ population. Add in the fact that Plasmablasts are often rare in a regular healthy donor PBMC sample, and they would be considered to be outliers by some definitions (especially if compared to the rest of the LiveIntactSinglets).

2. As you alluded to, Median vs Mean (vs Geometric Mean and others) each carry their own assumptions about the shape of the data: unimodal vs bi/multimodal, symmetric vs asymmetric, presence or absence of long tails, etc.

Some of that is discussed in these links (to get you started): ... at-is-mfi/ ... n-mod.html ... Techniques

3. However, I don't know how useful these terms are at the level of a heterogeneous cell sample like total PBMC LiveIntactSinglets. In flow/mass cytometry, you usually report Median or Mean only once you have gated down to some subset that's relatively homogeneous in the marker you're interested in.

So, for instance, you might report MFI of pSTAT3 expression on total CD4+ after a stim in a Phospho experiment: a lot of phospho signals are pretty homogeneous at the level of a population like total CD4+.
However, reporting something like MFI of CD45RA expression on the same total CD4+ population wouldn't be very informative: Naive CD4+ have a very different CD45RA compared to Effector Memory CD4+.
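To illustrate why a single median says little about a mixed population, here is a minimal sketch with simulated values. The numbers are made up (they are not from any real CD45RA staining), and the helper name `simulate_cd4` is hypothetical. Two samples share the same per-subset expression but differ in the Naive/Memory proportions, and the median of the total population jumps from one mode to the other.

```python
import random
import statistics

def simulate_cd4(n_naive, n_memory, seed):
    """Simulate transformed CD45RA-like values (hypothetical numbers)."""
    rng = random.Random(seed)
    naive = [rng.gauss(4.0, 0.3) for _ in range(n_naive)]    # CD45RA high
    memory = [rng.gauss(0.5, 0.3) for _ in range(n_memory)]  # CD45RA low
    return naive, memory

# Same per-subset expression, different subset proportions.
naive_a, memory_a = simulate_cd4(600, 400, seed=2)
naive_b, memory_b = simulate_cd4(400, 600, seed=3)

# Median over the whole (heterogeneous) parent population.
median_total_a = statistics.median(naive_a + memory_a)
median_total_b = statistics.median(naive_b + memory_b)

# Median within one (homogeneous) subset.
median_naive_a = statistics.median(naive_a)
median_naive_b = statistics.median(naive_b)

print(f"total median:  A={median_total_a:.2f}  B={median_total_b:.2f}")
print(f"naive median:  A={median_naive_a:.2f}  B={median_naive_b:.2f}")
```

In this sketch the total-population median mostly reflects which subset is in the majority, while the per-subset medians stay close to each other.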

4. I would also point out that in many (though certainly not all) cases, scRNAseq experiments are done on pre-sorted cells. So, before the scRNAseq even begins, you already have a relatively homogeneous population like total CD4+. Therefore, this may also be one of the differences that your eyes aren't quite trained to see yet.


Re: Median transformation?

Posted: Wed Feb 21, 2024 7:44 pm
by sgranjeaud

I don't know if it helps, but here are a few points that come to my mind.

I have never done a scRNAseq analysis on my own; I have only looked at some results, as I am surrounded by 2 dedicated analysts here. I remember boxplots or violin plots of log2 cpm. I don't know exactly what you call "the expression of a gene". If I am correct, a transcript's expression is a count of reads in a cell, so I feel it has to be related to the total number of reads in the cell. This is why I mentioned counts per million, i.e. cpm.

The median is a well-adopted measure and, as you already stated, is less sensitive to outliers. Moreover, it is easy to scale: since log is monotonically increasing, median(log(intensity)) = log(median(intensity))
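A quick check of that identity, as a sketch on simulated intensities (the lognormal parameters are arbitrary assumptions). It uses arcsinh(x/5) rather than log, since the identity holds for any monotonically increasing transform. Note that it is exact only for an odd number of cells; with an even count the median averages the two middle values, so the two sides agree only approximately. The mean does not commute with the transform in this way.

```python
import math
import random
import statistics

def arcsinh5(x):
    """arcsinh(x / 5), the cofactor-5 transform mentioned in this thread."""
    return math.asinh(x / 5.0)

rng = random.Random(0)
# Hypothetical raw intensities; odd count so the median is an observed value.
raw = [rng.lognormvariate(2.0, 1.5) for _ in range(1001)]
transformed = [arcsinh5(x) for x in raw]

# Median: the order of summarizing and transforming does not matter.
median_then_transform = arcsinh5(statistics.median(raw))
transform_then_median = statistics.median(transformed)

# Mean: the order matters, because arcsinh is concave for positive values.
mean_then_transform = arcsinh5(statistics.fmean(raw))
transform_then_mean = statistics.fmean(transformed)

print(median_then_transform, transform_then_median)
print(mean_then_transform, transform_then_mean)
```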

Most markers in cytometry are at the surface of the cells. Proteomics has a wider dynamic range than transcriptomics. Proteomics and transcriptomics are not directly related.


Re: Median transformation?

Posted: Wed Feb 21, 2024 10:07 pm
by bumpCyto
Thanks for the thoughtful replies.

For context, I am looking at the data from the paper "Multipanel mass cytometry reveals anti–PD-1 therapy–mediated B and T cell compartment remodeling in tumor-draining lymph nodes" by Ho et al.

I am looking at the gated files, and I have also performed cell type clustering with FlowSOM, based on the markers listed in the paper's supplement (a colleague who is very experienced with CyTOF has done the same).

I'm attaching the distribution of arcsinh(x/5) transformed TNFA in Treg cells for one mouse. You can see the really long right tail, and how the mean and the median will lead to very different summaries of this data.

In my mind, there would essentially be two reasons to use the median:

(1) You care about the population mean of expression for a given set of cells, but because of concern over data contamination e.g. technical artifacts in the tails, you think that the median will be a better estimator.

(2) You actually care about the population median of expression for some biological reason, and so you prefer the median to the mean.

It's not clear to me that either of these apply in CyTOF, but I'm not sure.
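To make the mean-versus-median gap concrete, here is a small simulation with made-up values (not the TNFA data from the paper): 950 low-expressing cells plus 50 cells in a far right tail, on an already-transformed scale. Whether those 50 cells are artifacts or real biology is exactly the open question; the sketch only shows how differently the two summaries react to them.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical transformed values: a low bulk and a small high tail.
bulk = [abs(rng.gauss(0.5, 0.3)) for _ in range(950)]
tail = [rng.gauss(5.0, 0.5) for _ in range(50)]

def summarize(values):
    """Return (mean, median) of a list of values."""
    return statistics.fmean(values), statistics.median(values)

mean_bulk, median_bulk = summarize(bulk)
mean_all, median_all = summarize(bulk + tail)

# How far each summary moves when the 5% tail is included.
mean_shift = mean_all - mean_bulk
median_shift = median_all - median_bulk

print(f"mean shift:   {mean_shift:.3f}")
print(f"median shift: {median_shift:.3f}")
```

If the tail cells are real high expressers, the mean tracks them and the median largely ignores them, which maps onto reasons (1) and (2) above.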

Re: Median transformation?

Posted: Wed Feb 21, 2024 11:17 pm
by mleipold
Hi Eric,

In flow/mass cytometry, we do use histogram/distribution plots of single markers (in your case, TNFa), but it's probably more common to look at bivariate plots. This is often more informative, as it often allows you to figure out what might be legitimate smear/non-bimodal staining (eg, CD38, CD45RA, HLADR) vs a smear due to a technical artifact (over titered antibody, high debris, etc).

Unlike many phospho markers, many cytokines such as TNFa are reasonably bimodal: after a stim, a cell either expresses the cytokine or it doesn't. Since they're intracellular markers, the background can sometimes be higher, but by comparing an Unstim sample to a Stim sample, you should be able to figure out where to draw the gate for TNFa+ (particularly in a bivariate plot).

In that case, it would really only be the TNFa+ that you would be calculating the Median/mean for (along with Freq Parent TNFa+).
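As a rough sketch of that workflow on simulated values: the threshold rule here (99th percentile of the Unstim sample) is only an assumption for illustration. In practice the gate is usually drawn by eye on a bivariate plot, and all numbers below are made up.

```python
import random
import statistics

rng = random.Random(4)
# Hypothetical transformed TNFa values in the parent population.
unstim = [rng.gauss(0.5, 0.3) for _ in range(2000)]
stim = ([rng.gauss(0.5, 0.3) for _ in range(1400)]     # non-responders
        + [rng.gauss(4.0, 0.6) for _ in range(600)])   # TNFa+ responders

# Assumed gate: 99th percentile of Unstim. quantiles(n=100) returns
# 99 cut points; the last one is the 99th percentile.
threshold = statistics.quantiles(unstim, n=100)[-1]

positive = [x for x in stim if x > threshold]
freq_of_parent = len(positive) / len(stim)      # Freq of Parent for TNFa+
median_positive = statistics.median(positive)   # median within TNFa+ only
median_all_stim = statistics.median(stim)       # median of the whole parent

print(f"gate at {threshold:.2f}")
print(f"Freq Parent TNFa+: {freq_of_parent:.1%}")
print(f"median TNFa+: {median_positive:.2f}, median all: {median_all_stim:.2f}")
```

Reporting the pair (Freq Parent, median of TNFa+) separates "how many cells respond" from "how strongly responders express", which a single median or mean over the whole parent mixes together.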


Re: Median transformation?

Posted: Wed Feb 21, 2024 11:21 pm
by mleipold
For one example: see Figure 2 Unstim vs PMA+I stim here:

While this data is human rather than mouse, the first plots are IFNg vs TNFa (CD8+ T cells).