Now, first, allow me to make a clarification about the protocol:

The non-randomized data were preprocessed following the pipeline shown in Figs. 1(A)-(D). That is, IMD -> FCS (randomization off) -> normalize -> gate.

Then, the default randomizations were applied: uniform (referred to as Type 1) and Gaussian with sigma = 1 (referred to as Type 2).

For the extreme case, the non-randomized data were transformed (arcsinh) and then randomized with Gaussian noise with sigma = 0.3.

As Samuel very nicely shows on CD45, this is exactly the case.

For completeness, I re-quote from the text:

Cells were acquired on a CyTOF2 (Fluidigm) mass cytometer, CyTOF software version 6.0.626 with noise reduction, a lower convolution threshold of 200, event length limits of 10–150 pushes, a sigma value of 3

and

Data were preprocessed having the automatic randomization disabled. For data normalization we used the algorithm of Finck et al. whereas to remove cell doublets and dead cells, the data were manually pre‐gated (Supporting Information Fig. S13). Both Type 1 and Type 2 randomization were applied on the data. For Type 2 randomization we substituted every value x with a random number drawn from a Gaussian density centered at x and a standard deviation of 1, which is the default in the device software. Then, each multivariate analysis algorithm was applied in turn on both randomized and non‐randomized data using its default input parameter values. Prior to each analysis the data were transformed using the hyperbolic arcsin with a cofactor of five. To investigate the reasonable limits of the effect of randomization we also employ data preprocessed for bivariate visualization and gating as shown in Figure 1. For this, we applied Type 2 randomization with a standard deviation of 0.3. We will refer to this randomization scheme as the maximal randomization type.

Then, regarding Samuel’s questions:

Why does the non-randomized sample not consist solely of integers?

This is because of bead normalization (as noted in the FlowRepository comments: "Please note that the non-randomized data are already normalized and the maximal randomized data are already scaled using arcsinh with co-factor 5.")

Why advocate for releasing IMD files while not releasing the IMD?

This PBMC sample is a control sample in a larger study where several multiplexed samples were run. Because our group is still working on these data we cannot publish their raw version yet.

Data processing has to be fair, not extreme, in that I will never add that amount of noise to any of my data and expect them to "speak to me" intelligibly.

Agreed, and let me re-quote from the Discussion of the paper:

As expected, the Type 1 randomization we employed conferred the smallest effect among the three cases we examined while the maximal randomization case the largest ones. Regarding maximal randomization, we should emphasize that Supporting Information Figures S1 and S2 clearly depict that there is no consensus neither as to what type of randomization is typically used nor as to how many times it is applied during data preprocessing. Therefore, even if the maximal randomization is not default as Type 1 is, it stands as a reasonable case to illustrate the limits of the potential effects.

Statistics: Posted by geopap — Fri Nov 15, 2019 10:20 am


Otherwise, I think it will be easier to refer to them by explicit names (uniform, Gaussian, etc.) rather than introduce additional terminology that does not exist outside of this paper.

Statistics: Posted by dtelad11 — Thu Nov 14, 2019 7:03 pm


As for Axel and Jim, it's not clear to me which randomization was done exactly in the figures: whether randomization took place before or after transformation, and whether Type 2 was done with sd = 1 or 0.3. It's good to have the code of all the analyses, but the randomization code is missing... No matter, let's play detective! I looked at CD45 and tried to compute the difference between the non-randomized data and the three randomizations. Here are some figures that show the randomization distributions and their parameters. The full code (last figure) makes it possible to determine at which step of the pipeline each randomization occurs.

Type 1: uniform -1..0 before asinh

Type 2: gaussian (sd = 1) before asinh

Type 3: gaussian (sd = 0.3) after asinh
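This reverse-engineered scheme can be sketched as follows. To be clear, this is my reconstruction from the CD45 differences, not the authors' actual code (which is not published), and the function names and toy counts are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def asinh_transform(x, cofactor=5):
    """Hyperbolic arcsinh transform with the cofactor used in the paper."""
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)

def randomize_type1(counts, rng):
    """Type 1: uniform noise on [-1, 0), applied to counts before asinh."""
    return counts + rng.uniform(-1, 0, size=counts.shape)

def randomize_type2(counts, rng, sd=1.0):
    """Type 2: Gaussian noise (sd = 1), applied to counts before asinh."""
    return counts + rng.normal(0, sd, size=counts.shape)

def randomize_maximal(counts, rng, sd=0.3):
    """'Type 3' / maximal: Gaussian noise (sd = 0.3), applied after asinh."""
    return asinh_transform(counts) + rng.normal(0, sd, size=counts.shape)

counts = np.array([0.0, 1.0, 5.0, 100.0, 1000.0])   # toy ion counts
t1 = asinh_transform(randomize_type1(counts, rng))
t2 = asinh_transform(randomize_type2(counts, rng))
t3 = randomize_maximal(counts, rng)
```

Note that only Type 3 adds noise of constant width on the final (arcsinh) scale; Types 1 and 2 add noise on the count scale, which the transform then compresses for bright signals.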

Let's appreciate the amount of information in CD45 relative to the added noise.

For comparison purposes, here are two Gaussians on top of the CD45 histogram (non-randomized data):

- the blue one fits the non-randomized, transformed CD45

- the orange one shows the amount of noise that will be added (aligned at the apex of CD45) during Type 3 randomization

Looking at the lowest values of the reference non-randomized PBMC data:

This puzzles me, because I was expecting raw integer counts, i.e., bars at each integer 1, 2, 3, ..., 9. It indicates that bead normalization was applied.
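The non-integer values are indeed consistent with bead normalization: the Finck et al. algorithm rescales each channel by time-varying, bead-derived factors, so integer ion counts become fractional. A toy illustration (the factor 0.87 is hypothetical, not taken from any real bead file):

```python
import numpy as np

raw_counts = np.array([1.0, 2.0, 3.0, 9.0, 250.0])  # raw ion counts: integers
bead_factor = 0.87                                   # hypothetical normalization factor
normalized = raw_counts * bead_factor                # no longer integer-valued
```

This is why "randomization off" alone does not guarantee integer values in the published FCS files.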

The following is my very personal view. I feel I have spent enough time on this article, considering that I am not even a reviewer; I read half of the article and won't read the second half yet. What is the protocol used to process the data from IMD to the FlowRepository FCS? I can't tell clearly. Why does the non-randomized sample not consist solely of integers? I don't know. The M&M presents a generic pipeline but does not pinpoint what was actually done for the available data. Why advocate for releasing IMD files while not releasing the IMD? No idea.

Sorry for being a little angry and disappointed; maybe I was expecting too much. The article sounds negative but is not crystal clear to me. IMO, when an article is negatively constructive (nevertheless constructive, because it raises points that should be clarified and standardized in the community), protocols and data have to be double-checked and representative. Data processing has to be fair, not extreme, in that I will never add that amount of noise to any of my data and expect them to "speak to me" intelligibly.

Currently, I feel this article does a disservice to the CyTOF technology and worries users, as David stated. Users: your protocol probably matches the Type 1 randomization, not the most extreme one in this article. Fluidigm: could you share your point of view/recommendations about setting up your software, please?

Cheers.

[Attached figure: overall analysis]

Statistics: Posted by sgranjeaud — Thu Nov 14, 2019 6:50 pm


Using randomization introduces artificial noise into the measurements. This noise may be further amplified in subsequent preprocessing steps (e.g., transformations). In addition, it may negatively affect the data analysis methods. Of course, this is not inevitable: there may be no subsequent preprocessing steps, or the data analysis method used may be robust to this noise. However, why should we keep adding noise and artifacts to our measurements before we analyze the data? It makes no real sense. If some method requires "noisy" continuous measurements instead of count data, then we should adapt it to work on the actual measurements, not the perturbed ones. Or we can use another method.

On this basis, this work presents results for many different scenarios (four preprocessing schemes, one without and three with randomization, across nine multivariate methods). Indeed, the most important difference is whether randomization is performed before (Type 1 and Type 2 cases) or after (maximal case) the arcsinh transformation. But, as said in the Discussion and above, the data transformation (arcsinh) may itself affect the underlying distributions.
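The before/after distinction can be made concrete with a small simulation (a sketch under my own assumptions: Gaussian sd = 1 before the transform for Type 2, sd = 0.3 after it for the maximal case, cofactor 5 as in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
COFACTOR = 5

def displacement_sd(counts, before):
    """Std of the displacement on the final (arcsinh) scale caused by
    Gaussian noise added before (sd = 1) or after (sd = 0.3) the transform."""
    clean = np.arcsinh(counts / COFACTOR)
    if before:   # Type 2: noise added to the raw counts
        noisy = np.arcsinh((counts + rng.normal(0, 1.0, counts.shape)) / COFACTOR)
    else:        # maximal: noise added to the transformed values
        noisy = clean + rng.normal(0, 0.3, counts.shape)
    return float((noisy - clean).std())

bright = np.full(10_000, 1000.0)          # a bright channel, e.g. high CD45
d_before = displacement_sd(bright, before=True)   # compressed by the transform
d_after = displacement_sd(bright, before=False)   # stays at sd ~ 0.3 everywhere
```

For bright signals, noise added before the transform is squeezed to a tiny displacement on the arcsinh scale, while noise added after the transform keeps its full width, which is why the maximal scheme perturbs the data so much more.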

It also serves as a cautionary tale: researchers should be as aware as possible of how their data are generated, in cytometry as in any other field.

Statistics: Posted by geopap — Thu Nov 14, 2019 12:14 pm


"The third task is data rescaling transformation to allow their adequate representation across the full range of abundance values (Fig. 1E)..."

"Fourth is randomization that it is performed to avoid the poor display of low abundance distributions (Fig. 1E)..."

Statistics: Posted by AxelOlin — Thu Nov 14, 2019 9:02 am


AxelOlin wrote:

Another thing that's important to point out, if I understood the paper correctly, is that they perform randomization after scaling/transformation. My guess is that most labs do it the other way around as the randomization is built into the CyTOF software whereas any transformation you apply is normally done either in FlowJo or as part of whatever analysis pipeline you're running.

Totally agree, but I think they're doing randomisation before transformation / scaling / analysis:

Data were preprocessed having the automatic randomization disabled. For data normalization we used the algorithm of Finck et al. whereas to remove cell doublets and dead cells, the data were manually pre‐gated (Supporting Information Fig. S13). Both Type 1 and Type 2 randomization were applied on the data. For Type 2 randomization we substituted every value x with a random number drawn from a Gaussian density centered at x and a standard deviation of 1, which is the default in the device software. Then, each multivariate analysis algorithm was applied in turn on both randomized and non‐randomized data using its default input parameter values. Prior to each analysis the data were transformed using the hyperbolic arcsin with a cofactor of five.

However, I agree that the wording of this could be improved. I interpret it as follows:

1) IMD -> FCS without randomisation.

2) Data is normalised + gated.

3) Randomisation.

4) Transform.

5) Analysis / visualisation.

Statistics: Posted by jimbomahoney — Thu Nov 14, 2019 8:01 am


As they discuss in the paper, the arcsinh transformation is probably the most commonly applied transformation in mass cytometry data analysis. In my opinion, randomization should never be performed after arcsinh-transformation. The reason is that the effect of the randomization will vary depending on where you are on the scale. If you introduce an error of, let's say 0.1, around 0 on the arcsinh-scale, the error on the linear scale is relatively small and can be calculated by taking the inverse of the arcsinh function:

sinh(0.1) - sinh(0) = 0.100...

However, at higher values, the introduced error grows. For example:

sinh(3.1) - sinh(3) = 1.05...

sinh(6.1) - sinh(6) = 21.21...

This means that any error you introduce around the value 6 on the arcsinh-scale (which is not uncommon in mass cytometry data) is >200 times larger than an error introduced around 0. This also depends on your randomization algorithm, the larger the error you introduce, the bigger this effect becomes.
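Axel's numbers can be reproduced directly (plain sinh here, i.e. cofactor 1; with a cofactor c the linear-scale error simply scales by c):

```python
import math

def linear_error(center, eps=0.1):
    """Linear-scale error caused by a fixed error eps on the arcsinh scale,
    at a given position on the arcsinh scale."""
    return math.sinh(center + eps) - math.sinh(center)

for c in (0, 3, 6):
    print(f"sinh({c} + 0.1) - sinh({c}) = {linear_error(c):.4f}")

# The same 0.1 error at 6 is amplified >200x relative to the same error at 0.
ratio = linear_error(6) / linear_error(0)
```

In short, a constant-width perturbation on the transformed scale corresponds to a multiplicative perturbation on the count scale, growing roughly like e^x.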

Sorry to get a bit mathematical, but I think it's an important point. It would be interesting to see the same analysis as in the paper but with the randomization step before the transformation step. This type of confusion also highlights the point of the paper that this step should be described carefully in the methods-section.

Statistics: Posted by AxelOlin — Wed Nov 13, 2019 3:41 pm


Briefly, they compare clustering between randomized and non-randomized data. The clusters look different. That is not surprising; we expect them to look different. I would want them to compare the clustering to some gold standard (such as traditional gating) and show loss of accuracy there, which they do not (this goes back to Samuel's point about biologically meaningful difference). Furthermore, the authors compare t-SNE maps via "visual inspection of the respective low dimensional geometries". This is not a proper way to assess dimensionality reduction techniques -- in fact, as a reviewer I have sent back several manuscripts that still do this, and I'm disappointed Cytometry A permitted this. Instead, some sort of distance- or neighbor-based metric should be used. "Visual inspection" is not an acceptable metric.
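A minimal version of such a neighbor-based metric (my own sketch, not anything from the paper): for each cell, compare its k nearest neighbors in the two embeddings and average the Jaccard overlap.

```python
import numpy as np

def knn_overlap(embedding_a, embedding_b, k=10):
    """Mean Jaccard overlap of k-nearest-neighbor sets between two
    embeddings of the same cells (1.0 = identical neighborhoods).
    Brute-force distances, fine for a few thousand cells."""
    def knn_sets(X):
        X = np.asarray(X, dtype=float)
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)              # exclude each cell itself
        return np.argsort(d, axis=1)[:, :k]
    na, nb = knn_sets(embedding_a), knn_sets(embedding_b)
    overlaps = [len(set(a) & set(b)) / len(set(a) | set(b))
                for a, b in zip(na, nb)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2))                  # toy 2D embedding
perturbed = pts + rng.normal(scale=0.5, size=pts.shape)
score_same = knn_overlap(pts, pts)               # identical maps agree fully
score_pert = knn_overlap(pts, perturbed)         # perturbation lowers the score
```

Applied to t-SNE maps of randomized vs. non-randomized data, a score like this would quantify how much the local geometry actually changed, instead of eyeballing it.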

I think that the authors raise a valid point, and as with any computational method, you should validate your analysis results via orthogonal means (such as traditional gating or a follow-up experiment). With that said, completely rejecting randomization seems unnecessary to me.

Statistics: Posted by dtelad11 — Wed Nov 13, 2019 2:52 pm


Further to the recent paper posted on the effects of randomisation, is there any further consensus on whether we should be continuing with the defaults presented in CyTOF software?

The paper summary states:

... our recommendation is to avoid as much as possible the use of randomization when analyzing CyTOF data or when developing new multivariate analysis tools.

The only other opinion I could find was presented in a previous discussion here on the forums:

1) If your high-dimensional algorithm of choice performs Z-score normalization, like ACCENSE, you should turn randomization off

2) If your high-dimensional algorithm of choice does not perform Z-score normalization, like vISNE/SPADE/PCA, you should leave randomization on

3) If you are visualizing the data in 2D plots, you should leave randomization on

I would guess, and the authors of this paper further suggest, that 99% of CyTOF systems are just running the acquisition default, which includes randomisation:

Is the issue of randomisation a mountain or a molehill?

For what it's worth, I'm inclined towards the latter, particularly as the default ("Type 1" / "Uniform Negative Gaussian") randomisation is not particularly "aggressive" and doesn't seem to impact the results significantly (from the plots in the paper).

However, it does seem a valid point that randomisation is rarely mentioned in papers, and it is possible that users may (knowingly or unknowingly) apply "Type 2" (Gaussian with user-defined sigma) or "Type 3" (Gaussian Negative Half-Zero / GNHZ), which would then more significantly affect the downstream analysis.

Statistics: Posted by jimbomahoney — Wed Nov 13, 2019 12:49 pm


https://lma-cytof.dana-farber.org/core-videos.html

Statistics: Posted by EHaasDFCI — Wed Nov 06, 2019 2:04 pm


I am starting a data analysis seminar series here at DFCI to look at best practices for analyzing CyTOF data. The first session will be next Tuesday, October 29 and will be held here at DFCI. If you're in the Boston Area, please feel free to come out! The first session will be looking at how we use the Gaussian parameters to clean data and some important sample preparation/quality control issues and how this impacts the data.

Best,

Eric Haas

Statistics: Posted by EHaasDFCI — Fri Oct 25, 2019 8:56 pm


https://youtu.be/47u4-vGXePY

Statistics: Posted by dtelad11 — Wed Oct 16, 2019 4:32 pm


Just for fun, we are comparing gating XML files from different software packages, e.g., FlowJo and FACS DIVA, but we lack an XML file from Cytobank. Could any of you share one of your gating XML files from Cytobank?

Statistics: Posted by komal2000 — Tue Oct 08, 2019 7:38 am
