
Randomization / randomisation consensus

Forum rules
Please be as geeky as possible. Reference, reference, reference.
Also, please note that this is a mixed bag of math-gurus and mathematically challenged, so choose your words wisely :-)

jimbomahoney

Master

Posts: 73

Joined: Wed Feb 27, 2019 11:21 am

Post Wed Nov 13, 2019 12:49 pm

Randomization / randomisation consensus

Hi all,

Further to the recent paper posted on the effects of randomisation, is there any further consensus on whether we should continue with the defaults presented in the CyTOF software?

The paper summary states:

... our recommendation is to avoid as much as possible the use of randomization when analyzing CyTOF data or when developing new multivariate analysis tools.


The only other opinion I could find was presented in a previous discussion here on the forums:

1) If your high-dimensional algorithm of choice performs Z-score normalization, like ACCENSE, you should turn randomization off
2) If your high-dimensional algorithm of choice does not perform Z-score normalization, like vISNE/SPADE/PCA, you should leave randomization on
3) If you are visualizing the data in 2D plots, you should leave randomization on


I would guess, and the authors of this paper further suggest, that 99% of CyTOF systems are just running the acquisition default, which includes randomisation:

[Screenshot: CyTOF acquisition software showing the default randomisation setting]

Is the issue of randomisation a mountain or a molehill?

For what it's worth, I'm inclined towards the latter, particularly as the default ("Type 1" / "Uniform Negative Gaussian") randomisation is not particularly "aggressive" and doesn't seem to impact the results significantly (from the plots in the paper).

However, it does seem a valid point that randomisation is rarely mentioned in papers, and it is possible that users may (knowingly or unknowingly) apply "Type 2" (Gaussian with user-defined sigma) or "Type 3" (Gaussian Negative Half-Zero / GNHZ), which would then more significantly affect the downstream analysis.

sgranjeaud

Master

Posts: 79

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Wed Nov 13, 2019 2:45 pm

Re: Randomization / randomisation consensus

Numerically, there is a difference. Biologically, is there a difference? This paper does not address that question, which, in my humble opinion, is the most important one it should have addressed. Many validation articles comparing mass to fluorescence did not notice any effect, so I will not worry yet. But this article is interesting from a MIFlow point of view.

dtelad11

Master

Posts: 107

Joined: Mon Oct 31, 2016 6:26 pm

Post Wed Nov 13, 2019 2:52 pm

Re: Randomization / randomisation consensus

Over the past couple of days I've been contacted by several researchers who were concerned by the conclusion of the paper. Personally, I disagree with their statement that "randomization should be used only for visualization purposes, but not in conjunction with high‐dimensional analytical tools". I actually posted about this on LinkedIn, where my conclusion was that "randomization might lead to analysis artifacts", emphasis on the might.

Briefly, they compare clustering between randomized and non-randomized data. The clusters look different. That is not surprising; we expect them to look different. I would want them to compare the clustering to some gold standard (such as traditional gating) and show loss of accuracy there, which they do not (this goes back to Samuel's point about biologically meaningful differences). Furthermore, the authors compare t-SNE maps via "visual inspection of the respective low dimensional geometries". This is not a proper way to assess dimensionality reduction techniques -- in fact, as a reviewer I have sent back several manuscripts that still do this, and I'm disappointed Cytometry A permitted it. Instead, some sort of distance- or neighbor-based metric should be used. "Visual inspection" is not an acceptable metric.

I think that the authors raise a valid point, and as with any computational method, you should validate your analysis results via orthogonal means (such as traditional gating or a follow-up experiment). With that said, completely rejecting randomization seems unnecessary to me.

AxelOlin

Participant

Posts: 3

Joined: Tue Aug 12, 2014 9:24 am

Post Wed Nov 13, 2019 3:41 pm

Re: Randomization / randomisation consensus

Another thing that's important to point out, if I understood the paper correctly, is that they perform randomization after scaling/transformation. My guess is that most labs do it the other way around as the randomization is built into the CyTOF software whereas any transformation you apply is normally done either in FlowJo or as part of whatever analysis pipeline you're running.

As they discuss in the paper, the arcsinh transformation is probably the most commonly applied transformation in mass cytometry data analysis. In my opinion, randomization should never be performed after arcsinh transformation. The reason is that the effect of the randomization will vary depending on where you are on the scale. If you introduce an error of, let's say, 0.1 around 0 on the arcsinh scale, the error on the linear scale is relatively small and can be calculated by applying sinh, the inverse of the arcsinh function:

sinh(0.1) - sinh(0) = 0.100...

However, at higher values, the introduced error grows. For example:

sinh(3.1) - sinh(3) = 1.05...
sinh(6.1) - sinh(6) = 21.21...

This means that any error you introduce around the value 6 on the arcsinh scale (which is not uncommon in mass cytometry data) is >200 times larger, on the linear scale, than an error introduced around 0. This also depends on your randomization algorithm: the larger the error you introduce, the bigger this effect becomes.
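If you want to check these numbers yourself, here is a quick Python sketch (the function name is mine, and I'm leaving out the cofactor for simplicity; with a cofactor of 5 the absolute numbers change, but the scale dependence remains):

```python
import math

def linear_error(y, delta=0.1):
    """Linear-scale error caused by a fixed perturbation `delta`
    applied at position `y` on the arcsinh scale (cofactor omitted)."""
    return math.sinh(y + delta) - math.sinh(y)

for y in (0, 3, 6):
    print(f"perturbation of 0.1 at arcsinh-scale {y}: "
          f"{linear_error(y):.3f} on the linear scale")
```

The same 0.1 perturbation grows from roughly 0.1 on the linear scale near 0 to roughly 21 near 6.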

Sorry to get a bit mathematical, but I think it's an important point. It would be interesting to see the same analysis as in the paper but with the randomization step before the transformation step. This type of confusion also highlights the point of the paper that this step should be described carefully in the methods section.

jimbomahoney

Master

Posts: 73

Joined: Wed Feb 27, 2019 11:21 am

Post Thu Nov 14, 2019 8:01 am

Re: Randomization / randomisation consensus

AxelOlin wrote:Another thing that's important to point out, if I understood the paper correctly, is that they perform randomization after scaling/transformation. My guess is that most labs do it the other way around as the randomization is built into the CyTOF software whereas any transformation you apply is normally done either in FlowJo or as part of whatever analysis pipeline you're running.


Totally agree, but I think they're doing randomisation before transformation / scaling / analysis:

Data were preprocessed having the automatic randomization disabled. For data normalization we used the algorithm of Finck et al. whereas to remove cell doublets and dead cells, the data were manually pre‐gated (Supporting Information Fig. S13). Both Type 1 and Type 2 randomization were applied on the data. For Type 2 randomization we substituted every value x with a random number drawn from a Gaussian density centered at x and a standard deviation of 1, which is the default in the device software. Then, each multivariate analysis algorithm was applied in turn on both randomized and non‐randomized data using its default input parameter values. Prior to each analysis the data were transformed using the hyperbolic arcsin with a cofactor of five.


However, I agree that the wording of this could be improved. I interpret it as follows:

1) IMD -> FCS without randomisation.
2) Data is normalised + gated.
3) Randomisation.
4) Transform.
5) Analysis / visualisation.
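If that interpretation is right, steps 3-5 could be sketched like this in Python (a minimal sketch of my reading, assuming NumPy; the function names and noise distributions are illustrative, not the paper's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(counts, scheme="type1", sigma=1.0):
    """Randomization as I read the quoted Methods: Type 1 adds uniform
    noise in (-1, 0); Type 2 replaces each value x with a Gaussian draw
    centred at x (default sigma = 1)."""
    counts = np.asarray(counts, dtype=float)
    if scheme == "type1":
        return counts + rng.uniform(-1.0, 0.0, size=counts.shape)
    if scheme == "type2":
        return rng.normal(loc=counts, scale=sigma)
    raise ValueError(f"unknown scheme: {scheme}")

def transform(values, cofactor=5.0):
    """Arcsinh transform with the cofactor of 5 quoted above."""
    return np.arcsinh(np.asarray(values) / cofactor)

# Steps 3-5: randomise -> transform -> analyse / visualise
counts = np.array([0.0, 1.0, 50.0, 200.0])  # already normalised + gated
ready = transform(randomize(counts, "type1"))
```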

AxelOlin

Participant

Posts: 3

Joined: Tue Aug 12, 2014 9:24 am

Post Thu Nov 14, 2019 9:02 am

Re: Randomization / randomisation consensus

Hmm, yes, you're probably right! I based that assumption primarily on the order of events illustrated in Fig. 1 and the accompanying text:

"The third task is data rescaling transformation to allow their adequate representation across the full range of abundance values (Fig. 1E)..."

"Fourth is randomization that it is performed to avoid the poor display of low abundance distributions (Fig. 1E)..."

geopap

Participant

Posts: 3

Joined: Mon Nov 05, 2018 11:28 am

Post Thu Nov 14, 2019 12:14 pm

Re: Randomization / randomisation consensus

Hi all. Thank you for the discussion.

Using randomization introduces artificial noise into the measurements. This noise may be further amplified in subsequent preprocessing steps (e.g., transformations), and it may negatively affect the data analysis methods. Of course, this is not necessarily the case: there may be no subsequent preprocessing steps, or the data analysis method used may be robust to this noise. However, why should we keep adding noise and artifacts to our measurements before we analyze the data? It makes no real sense. If some method requires "noisy" continuous measurements instead of count data, then we should adapt it to work on the actual measurements, not the perturbed ones. Or we can use another method.

On this basis, this work presents results for many different scenarios (4 preprocessing schemes, 1 without + 3 with randomization, on 9 different multivariate methods). Indeed, the most important difference is whether randomization is performed before (Type 1 and Type 2 cases) or after arcsinh transformation (maximal case). But, as said in the discussion and above, data transformation (arcsinh) may itself affect the underlying distributions.

It also serves as a cautionary tale: researchers should be as aware as possible regarding how their data are generated, and this is true in cytometry as well as in any other field.

sgranjeaud

Master

Posts: 79

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Thu Nov 14, 2019 6:50 pm

Re: Randomization / randomisation consensus

In the typical workflow that I know from a few core facilities, randomization is done using the Fluidigm software to fill the gaps between the integer counts. What Fluidigm told me is to activate the uniform randomization (called Type 1 in the article) during IMD-to-FCS conversion, and no other randomization. This randomization is reversible (although the authors stated that none is), as shown in Mike's article (see the section "Dealing with randomized values" at http://cytof.biosurf.org/#!/cytof#transformations -- copy/paste the link). Of course, bead normalization renders the uniform randomization irreversible, because the scaling due to normalization depends on time.
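To see why the uniform scheme is reversible, here is a minimal NumPy sketch of my understanding (not Fluidigm's actual code): each integer count receives uniform noise in (-1, 0), so taking the ceiling recovers the original integers exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Original integer ion counts, as they would sit in the IMD
counts = rng.integers(0, 500, size=10_000)

# Type 1 / uniform randomization: add noise drawn from (-1, 0)
randomized = counts + rng.uniform(-1.0, 0.0, size=counts.shape)

# Reversal: the ceiling restores the original integers
recovered = np.ceil(randomized).astype(int)
assert np.array_equal(recovered, counts)
```

Any time-dependent rescaling applied in between, such as bead normalization, breaks this, because the noise is no longer confined to a unit interval below each count.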

As for Axel and Jim, it's not clear to me exactly which randomization was done in the figures: whether randomization took place before or after transformation, and whether Type 2 was done with sd = 1 or 0.3. It's good to have the code of all the analyses, but the randomization code is missing... No matter, let's play detective! I looked at CD45 and computed the difference between the non-randomized data and the three randomizations. Here are some figures that show the randomization distributions and their parameters. The full code (last figure) makes it possible to determine at which step of the pipeline each randomization occurs.

Type 1: uniform -1..0 before asinh
histo-type1.png
Histogram of the difference between the reference and type 1 randomized data


Type 2: gaussian (sd = 1) before asinh
histo-type2.png
Histogram of the difference between the reference and type 2 randomized data


Type 3: gaussian (sd = 0.3) after asinh
histo-maximal.png
Histogram of the difference between the reference after asinh and maximal randomized data


Let's appreciate the amount of information in CD45 versus the added noise.
For comparison purposes, here are two Gaussians on top of the CD45 histogram (non-randomized data):
- the blue one fits the non-randomized, transformed CD45
- the orange one shows the amount of noise that will be added during Type 3 randomization (aligned at the apex of CD45)
histo-amount.png
Comparison of CD45 signal and randomization noise


Finally, I looked at the lowest values of the non-randomized reference PBMC data.
histo-nonrand.png
Non randomized reference is not made of integers solely


This puzzles me, because I was expecting raw integer counts, i.e. bars at each integer 1, 2, 3...9. This suggests that bead normalization was applied.

The following is my very personal view. I feel I have spent enough time on this article (and I am not even a reviewer). I read half of the article and won't read the second half yet.

What protocol was used to process the data from IMD to the FlowRepository FCS? I can't tell clearly. Why does the non-randomized sample not consist solely of integers? I don't know. The M&M presents a generic pipeline, but does not pinpoint what was really done for the available data. Why advocate for releasing IMD and then not release the IMD? No idea.

Sorry for being a little bit angry and disappointed. Maybe I was expecting too much. The article sounds negative but is not crystal clear to me. IMO, when an article is negatively constructive (nevertheless constructive, because it raises points that should be clarified and standardized in the community), the protocols and data have to be double-checked and representative. Data processing has to be fair, not extreme; I would never add that amount of noise to any of my data and expect it to "speak to me" intelligibly. Currently, I feel this article does a disservice to the CyTOF technology and worries users, as David stated. Users: your protocol probably matches the Type 1 randomization, not the most extreme one in this article. Fluidigm, could you give your point of view/recommendations about setting up your software, please?

Cheers.

Overall analysis.
pipeline.png
Pipeline

dtelad11

Master

Posts: 107

Joined: Mon Oct 31, 2016 6:26 pm

Post Thu Nov 14, 2019 7:03 pm

Re: Randomization / randomisation consensus

Somewhat off-topic: do the terms Type 1/2/3 randomization exist elsewhere in the literature? I'm afraid I'm not familiar with them; is that an accepted taxonomy?

Otherwise, I think it will be easier to refer to them via explicit names (uniform, Gaussian, etc.) rather than introducing additional terminology that does not exist outside of this paper.

geopap

Participant

Posts: 3

Joined: Mon Nov 05, 2018 11:28 am

Post Fri Nov 15, 2019 10:20 am

Re: Randomization / randomisation consensus

This is indeed a very fruitful discussion. Hope that this skepticism will increase people’s awareness. Thanks again.

Now, first, allow me to make a clarification about the protocol:

The nonrandomized data were preprocessed as the pipeline in Figs 1 (A) – (D) shows. That is, IMD -> FCS (randomize off) -> normalize -> gate.

Then, the default randomizations were applied: uniform (referred to as Type 1) and Gaussian with sigma = 1 (referred to as Type 2).
For the extreme case, the non-randomized data were transformed (arcsinh) and then randomized with Gaussian noise with sigma = 0.3.

As Samuel very nicely shows on CD45, this is exactly the case.
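For anyone who wants to replay Samuel's check, the three schemes as just described can be sketched in NumPy (the function names are illustrative, and this is not the code used for the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

def arcsinh5(x):
    """Arcsinh transform with cofactor 5, as used in the paper."""
    return np.arcsinh(np.asarray(x, dtype=float) / 5.0)

def type1(x):
    """Uniform noise in (-1, 0) on the counts, before transformation."""
    return x + rng.uniform(-1.0, 0.0, size=np.shape(x))

def type2(x, sd=1.0):
    """Gaussian centred on each count (sigma = 1), before transformation."""
    return rng.normal(x, sd)

def maximal(x, sd=0.3):
    """Gaussian with sigma = 0.3 on already arcsinh-transformed data."""
    return rng.normal(arcsinh5(x), sd)

counts = np.array([0.0, 10.0, 200.0])      # non-randomized, normalized data
after_type1   = arcsinh5(type1(counts))    # randomize, then transform
after_type2   = arcsinh5(type2(counts))    # randomize, then transform
after_maximal = maximal(counts)            # transform, then randomize
```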

For completeness, I re-quote from the text:
Cells were acquired on a CyTOF2 (Fluidigm) mass cytometer, CyTOF software version 6.0.626 with noise reduction, a lower convolution threshold of 200, event length limits of 10–150 pushes, a sigma value of 3

and
Data were preprocessed having the automatic randomization disabled. For data normalization we used the algorithm of Finck et al. whereas to remove cell doublets and dead cells, the data were manually pre‐gated (Supporting Information Fig. S13). Both Type 1 and Type 2 randomization were applied on the data. For Type 2 randomization we substituted every value x with a random number drawn from a Gaussian density centered at x and a standard deviation of 1, which is the default in the device software. Then, each multivariate analysis algorithm was applied in turn on both randomized and non‐randomized data using its default input parameter values. Prior to each analysis the data were transformed using the hyperbolic arcsin with a cofactor of five. To investigate the reasonable limits of the effect of randomization we also employ data preprocessed for bivariate visualization and gating as shown in Figure 1. For this, we applied Type 2 randomization with a standard deviation of 0.3. We will refer to this randomization scheme as the maximal randomization type.


Then, regarding Samuel’s questions:

Why does the non-randomized sample not consist solely of integers?


This is because of bead normalization, as indicated in the FlowRepository comments: "Please note that the non-randomized data are already normalized and the maximal randomized data are already scaled using arcsinh with co-factor 5."

Why advocate for releasing IMD and then not release the IMD?


This PBMC sample is a control sample in a larger study where several multiplexed samples were run. Because our group is still working on these data we cannot publish their raw version yet.

Data processing has to be fair, not extreme; I would never add that amount of noise to any of my data and expect it to "speak to me" intelligibly.


Agreed, and let me requote from the Discussion of the paper,
As expected, the Type 1 randomization we employed conferred the smallest effect among the three cases we examined while the maximal randomization case the largest ones. Regarding maximal randomization, we should emphasize that Supporting Information Figures S1 and S2 clearly depict that there is no consensus neither as to what type of randomization is typically used nor as to how many times it is applied during data preprocessing. Therefore, even if the maximal randomization is not default as Type 1 is, it stands as a reasonable case to illustrate the limits of the potential effects.
