Fri Nov 06, 2020 12:33 am by tomash
Hi Chasman,
It is indeed the *range* that is split into 1024 uniformly sized bins (as opposed to bins containing equal numbers of cells). The range, in this case, is simply the minimum to maximum values that are _plotted_ by FlowJo (e.g. for CyTOF data, I think the defaults would be something like -10^1 to 2x10^4, or close to it). The binning is performed on all markers that are chosen for whatever function is being used (e.g. tSNE etc in FlowJo, or when exporting data as CSV channel values). The helpful thing is that if you have already optimised the axis settings for each channel (especially the extent of compression of the low-end values), then that adjustment is captured in the channel values for each marker individually. The binning is done regardless of whether samples are combined or not -- the actual data doesn't play a role in determining the binning, only the plotted range does.
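To make the binning idea concrete, here is a minimal Python sketch. It is not FlowJo's actual implementation: the function name `to_channel_value` is made up for illustration, and it assumes the value has already been put on the display scale (i.e. the axis transform has been applied), so that the bins are uniform along the plotted axis. It also indexes the bins from 0 to 1023, which may not match exactly how FlowJo numbers them.

```python
def to_channel_value(display_value, plot_min, plot_max, n_bins=1024):
    """Map a display-scale value to a bin index in [0, n_bins - 1].

    Hypothetical helper: assumes uniform bins across the plotted range
    and that display_value is already on the plotted (transformed) axis.
    """
    if plot_max <= plot_min:
        raise ValueError("plot_max must be greater than plot_min")

    # Fractional position of the value within the plotted range.
    fraction = (display_value - plot_min) / (plot_max - plot_min)

    # Anything outside the plotted range is pushed to the nearest edge.
    fraction = min(max(fraction, 0.0), 1.0)

    # Uniform binning; the upper edge itself falls into the last bin.
    return min(int(fraction * n_bins), n_bins - 1)


# Only the plotted range matters, not the data in the sample:
print(to_channel_value(5.0, plot_min=0.0, plot_max=10.0))   # mid-range value
print(to_channel_value(50.0, plot_min=0.0, plot_max=10.0))  # above plot max
```

Note that the function never looks at the rest of the data, which is why combining samples makes no difference to the result.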
In terms of the differences between your FlowSOM results from FlowJo and R, are these differences in the fundamental structure of the results, or just different cluster ID numbers for the same groups of cells? The latter is quite common and would be expected -- as you have said, the simple fact that the values are going to be very different between the two datasets (e.g. one is 0 - 1024, and one is ~0 - 5), together with the stochastic nature of clustering runs, means that the cluster assignments might look quite different. It's possible that the actual cell groupings are comparable, just with different labels (e.g. T cells in FlowJo were metacluster 1, but metacluster 5 in R, etc). However, if there are serious structural differences between the R and FlowJo runs with FlowSOM, there could be a few causes:
1. Because the level of low-end compression will be different between the channel value and arcsinh transformed data, you may have different levels of background signal in each channel, which will affect the FlowSOM grid and clustering results.
2. Even if we assume that the compression of low-end values was similar in the channel value and arcsinh transformed data, there is still a difference. If the plot max in FlowJo was 5x10^3, for example, then any values _above_ 5x10^3 will simply be converted to 1024. In arcsinh transformation there is no high-value capping, so you may have some high expression values that stay proportionally high after arcsinh transformation.
3. The default settings being used for each FlowSOM run may be different. The grid size determines how many first level clusters are generated, and this can impact things, especially if you have more subtle low frequency populations, or are trying to capture phenotypic 'landscapes' more than distinct populations. Are you specifying a target number of metaclusters to be generated in each, or are you letting FlowSOM choose for you?
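Before chasing any of these causes, it may be worth checking whether the two runs really differ in structure or only in label numbers. One simple way is to cross-tabulate the two sets of cluster labels. The Python sketch below is only an illustration: the function names and the toy label lists are made up, and it assumes you have the labels from both runs for the same cells in the same order.

```python
from collections import Counter


def contingency(labels_a, labels_b):
    """Count how many cells fall into each (cluster in A, cluster in B) pair."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same cells")
    return Counter(zip(labels_a, labels_b))


def best_match(labels_a, labels_b):
    """For each cluster in A, find the B cluster holding most of its cells.

    Returns {a_cluster: (b_cluster, fraction_of_a_cells_in_b)}.
    """
    counts = contingency(labels_a, labels_b)
    sizes = Counter(labels_a)
    result = {}
    for (a, b), n in counts.items():
        fraction = n / sizes[a]
        if a not in result or fraction > result[a][1]:
            result[a] = (b, fraction)
    return result


# Toy example (made-up labels): same groupings, different ID numbers.
flowjo_labels = [1, 1, 1, 2, 2, 3]
r_labels = [5, 5, 5, 1, 1, 2]
print(best_match(flowjo_labels, r_labels))
```

If every cluster in one run maps almost entirely onto a single cluster in the other, the runs agree and only the numbering differs. If clusters are split across several partners, the difference is structural.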
There is an obvious question here as well: given that the binning reduces the 'resolution' of the data (i.e. a range of 0 - 10,000 is compressed to 0 - 1024), does the quality of clustering suffer? I will say that I have not formally benchmarked this, but my experience is that both perform pretty well in clustering, and I've not noticed substantial differences in how well populations are resolved. I'm very willing to be proven wrong on that, but so far so good. In some cases I've found the channel values have performed better, but this is probably because of the capping of the high-end values, which is something that can be done to the arcsinh transformed values as well.
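As a rough illustration of capping the high end of arcsinh transformed values, here is a small Python sketch. The function name and the `cap` parameter are made up for this example, and the cofactor of 5 is just an assumption (it is a commonly used value for CyTOF data, but use whatever you used in R). The cap is given in raw units, in the same spirit as the plot max in FlowJo.

```python
import math


def arcsinh_transform(value, cofactor=5.0, cap=None):
    """Arcsinh-transform a raw value, optionally capping the high end.

    Hypothetical helper: `cap` is in raw (untransformed) units; any value
    above it ends up at the transformed value of the cap itself.
    """
    transformed = math.asinh(value / cofactor)
    if cap is not None:
        transformed = min(transformed, math.asinh(cap / cofactor))
    return transformed


# Without a cap, a very bright event stays well above the rest.
print(arcsinh_transform(20000))
# With a cap at 5x10^3, it is pulled down to the same value as 5x10^3.
print(arcsinh_transform(20000, cap=5000))
print(arcsinh_transform(5000))
```

Values below the cap are untouched, so only the extreme high-end events are affected.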