Settings for clustering arbitrary? (Phenograph/FlowSOM)

<<

jper33

Participant

Posts: 2

Joined: Tue Mar 31, 2020 4:45 pm

Post Wed Apr 01, 2020 12:39 am

Settings for clustering arbitrary? (Phenograph/FlowSOM)

Dear Cytof community,

I am currently analysing a dataset in which I would like to perform unsupervised clustering to identify cell populations.
Therefore, I am currently using mainly FlowJo with plugins such as Phenograph and FlowSOM. I've played around with the Phenograph settings and the number of clusters that are detected depends on the number of events of my input population and on k (nearest neighbours) that I can specify. Here are 3 examples:
1. If I downscale the data set to 5000 events and I use k=30 I detect 13 clusters.
2. If I use 55,000 events of the same data set and k=30 I detect 32 clusters.
3. If I use 55,000 events and change the setting to k=500, then I detect 15 clusters which is again similar to the first test.

These observations lead to some questions:
1. How do you decide what the appropriate Phenograph setting for the data set is? How do you make sure you do not under/overcluster your data?
2. Do you usually downsample the files for your analysis (e.g. 5000-10000 events)? Or would you recommend using all available events (e.g. 55,000)? I am asking because the Phenograph calculation takes around 2-3 hours with k=500.
3. Does downsampling introduce a bias for your analysis? E.g. in regard to big vs small populations?

PS: Has anybody tried the ConsensusClusterPlus R package to estimate the appropriate number of clusters for their data set? The test data set from the tutorial consists of microarray data (5000 samples x 128 genes) and this runs within a few seconds when using the test settings (max clusters = 6, repetitions = 25). I tried to run cytometry data of a similar size (5000 events x 45 markers) with settings that should take even less computing power (max clusters = 6, reps = 2), but the calculation time is >15 min. This is concerning as I would like to scale this test up to more than 5000 events, max clusters = 20 and around 1000 reps. I am wondering why a cytometry data matrix of similar size to the microarray data matrix would take so much longer to compute. I even tested different transformations of my cytometry data (e.g. linear vs biexponential), but it did not affect calculation time.

Thank you very much for your help in advance,
Kind regards,
JH
<<

jimbomahoney

Master

Posts: 83

Joined: Wed Feb 27, 2019 11:21 am

Post Wed Apr 01, 2020 1:49 pm

Re: Settings for clustering arbitrary? (Phenograph/FlowSOM)

Hi JH,

Whilst I can't answer your questions directly, I have done a lot of (amateur / beginner) work with various clustering algorithms.

The following papers are extremely useful. I think they will answer your questions and also provide a lot of good background.

https://www.jimmunol.org/content/200/1/3

https://genomebiology.biomedcentral.com ... 019-1917-7

https://www.jimmunol.org/content/195/3/773

https://onlinelibrary.wiley.com/doi/ful ... to.a.23030

One thing I will say, and am happy to be corrected, is that I'm pretty sure FlowSOM uses ConsensusClusterPlus as part of its pipeline (as I understand it, FlowSOM first clusters using its own method, then applies CCP to produce what the user sees as clusters).
<<

mleipold

Guru

Posts: 5796

Joined: Fri Nov 01, 2013 5:30 pm

Location: Stanford HIMC, CA, USA

Post Wed Apr 01, 2020 4:12 pm

Re: Settings for clustering arbitrary? (Phenograph/FlowSOM)

Hi JH,

I'm sure El-ad will chime in with his experiences with different settings with the different algorithms.

In my opinion:
1. I don't like downsampling, and avoid it whenever I can. I generally work with relatively diverse cell samples like PBMCs, and without even trying, I can easily hand-gate 30-40 populations that no one would argue with immunologically based on my panel. However, some of them like Basophils, pDCs, mDCs, Plasmablasts, TCRgd, Transitional B cells, or CD16+ Monocytes have a high degree of donor-dependent variability and/or biological rarity. Many of these regularly wind up under 1% of total Live Intact Singlets, and therefore many algorithms (using default settings) will miss some or all of them. This is one reason I like FlowSOM: I don't have to downsample.

2. I personally think that it's best to over-cluster and then pare down, rather than under-cluster and wind up missing populations (more precisely, have more than one population wind up in the same cluster).

3. I think you also need to be careful in understanding your parameter selection. When you use the cytofkit R package v1.11.3 build, you can select a k value for Rphenograph and a k value for FlowSOM. However, if you click on the "?" buttons for each, you'll find that the Rphenograph k is the number of nearest neighbors, while the FlowSOM k is the number of clusters for metaclustering. In other words, "k" may mean different things depending on the algorithm. You're using FlowJo plugins, which I'm not as familiar with, so it may be clearer there.
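
Point 1 above can be made concrete with a small sketch of what uniform random downsampling does to a rare population. Only the 5000 and 55,000 event counts come from the original question; the 0.5% frequency and the 50-cell threshold are illustrative assumptions, and treating the subsample as binomial draws is an approximation. The expected frequency is unchanged by downsampling, but the absolute number of rare cells shrinks, which is what makes small populations easy to lose.

```python
from math import exp, lgamma, log

def binom_pmf(n, k, p):
    """Binomial probability of exactly k successes in n draws (log-space for stability)."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def prob_fewer_than(n_events, freq, min_cells):
    """Probability that a subsample of n_events holds fewer than min_cells rare cells."""
    return sum(binom_pmf(n_events, k, freq) for k in range(min_cells))

RARE_FREQ = 0.005  # assumption: a population at 0.5% of total events
MIN_CELLS = 50     # assumption: arbitrary count below which a cluster is hard to resolve

results = {}
for n_events in (5000, 55000):
    results[n_events] = {
        "expected": n_events * RARE_FREQ,
        "p_below": prob_fewer_than(n_events, RARE_FREQ, MIN_CELLS),
    }
    print(f"{n_events} events: expect {results[n_events]['expected']:.0f} rare cells, "
          f"P(fewer than {MIN_CELLS}) = {results[n_events]['p_below']:.3g}")
```

Changing RARE_FREQ and MIN_CELLS to match your own panel and populations gives a quick feel for how aggressive a downsampling step can be before a given population becomes unreliable.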


Mike
<<

dtelad11

Master

Posts: 129

Joined: Mon Oct 31, 2016 6:26 pm

Post Wed Apr 01, 2020 9:44 pm

Re: Settings for clustering arbitrary? (Phenograph/FlowSOM)

Ha, I was starting to type my reply, but Mike got most of the points I wanted to convey.

I will add a bit of self-promotion for my Analysis Pipeline webinar, which you might want to check out: https://www.youtube.com/watch?v=L7jPYJouwsI

W.r.t. ConsensusClusterPlus, check out https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5473464.2/. With that said, I would not trust it (or any other existing heuristics) for deciding on cluster count. In my opinion, the criterion should be whether the clustering algorithm identifies the populations you're expecting. As Mike pointed out, if you know your panel is supposed to find basophils and pDCs, and you can't find that cluster, you're in trouble.

> I'm pretty sure FlowSOM uses ConsensusClusterPlus as part of its pipeline

The FlowSOM paper uses consensus hierarchical clustering from ConsensusClusterPlus for the meta-clustering. It is not run by default, only if you instruct the package to do so.

> I am just wondering why cytometry data matrix of similar size as the microarray data matrix would take so much longer to compute?

Is it possible that the microarray data is sparse? It's hard to say more without seeing the code you used and the data you included. There are a bunch of R traps that could really amp up your run time. For example, some packages require a matrix, and if you provide them with a data.frame the run time goes up 10x. I'm not saying that this is what's happening here, just that it might be an R artifact or some other weird reason.
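
One further possibility, stated as an assumption to verify against the package documentation rather than something established in this thread: consensus clustering tracks how often each pair of items lands in the same cluster, so its cost grows with the number of items being clustered, not with the total number of values in the matrix. If the tutorial run ends up clustering the 128-long dimension of its matrix while the cytometry run clusters 5000 events, the two inputs are not of similar size from the algorithm's point of view. A rough sketch of the difference in pair counts:

```python
def n_item_pairs(n_items):
    """Number of distinct item pairs in an n x n consensus matrix."""
    return n_items * (n_items - 1) // 2

# Assumption: 128 items clustered in the tutorial run, 5000 in the cytometry run.
tutorial_pairs = n_item_pairs(128)
cytometry_pairs = n_item_pairs(5000)
ratio = cytometry_pairs / tutorial_pairs

print(f"tutorial:  {tutorial_pairs:,} pairs")
print(f"cytometry: {cytometry_pairs:,} pairs")
print(f"ratio:     about {ratio:,.0f}x")
```

If that assumption holds, a jump of three orders of magnitude in pairwise bookkeeping would be enough to turn seconds into many minutes, before any R-specific overhead is considered.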
<<

jper33

Participant

Posts: 2

Joined: Tue Mar 31, 2020 4:45 pm

Post Thu Apr 02, 2020 2:15 pm

Re: Settings for clustering arbitrary? (Phenograph/FlowSOM)

Dear all,

Thank you very much for the responses. The information provided was very helpful. I had already come across some of the linked material during my online search, and it is indeed very valuable.

From your responses I take it that I should avoid downsampling if possible, and that it is better to overcluster and then pare down.
1. Does this mean you would publish an overclustered result and discuss e.g. why two clusters actually represent only one population, or do you play around with cluster settings until you receive the result you would expect?
2. If you have different samples (donors) and cluster each donor individually, would it be important to keep the cluster settings (i.e. Phenograph k value or FlowSOM metacluster value) the same across all donors? Or is it better to choose different values for each donor so you could make sure the clustering is optimal for each donor separately?
3. After playing around with the settings of Phenograph and FlowSOM, it seems like the Phenograph results fit better to the clusters I would expect (detecting the main cell populations). FlowSOM identifies some very small clusters of the size of just a few cells. Is FlowSOM picking up on some artefacts in the dataset here? Have you made similar observations?

Thank you very much for your help and efforts,
Kind regards,
JH
<<

mleipold

Guru

Posts: 5796

Joined: Fri Nov 01, 2013 5:30 pm

Location: Stanford HIMC, CA, USA

Post Thu Apr 02, 2020 8:04 pm

Re: Settings for clustering arbitrary? (Phenograph/FlowSOM)

Hi JH,

1. Regarding overclustering: in some ways, I think this is philosophical. To the most finely grained point, if you have 10M cells, you could argue that you have 10M clusters, as there's very little chance that any 2 cells would have *precisely* the same values for *every* parameter. I believe Burkhard Becher made this point at the Berlin meeting in Jan 2020.

But, that's not terribly *useful*. What you're trying to determine is how to have enough clusters to make each cluster a highly-homogeneous population (as measured by every-parameter CVs, SDs, etc), but unfortunately it's difficult to define exactly where "enough" clusters is.
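
As a minimal sketch of the homogeneity check described above, the following computes a per-marker CV (SD divided by mean) for each cluster. The events, labels, and two-marker layout are made-up toy values, and on transformed data with means near zero the CV becomes unstable, so SD alone may be the safer number in that case.

```python
from statistics import mean, stdev

def cluster_cvs(events, labels):
    """Return {cluster: [CV per marker]}, where CV = sample SD / mean."""
    by_cluster = {}
    for row, label in zip(events, labels):
        by_cluster.setdefault(label, []).append(row)
    result = {}
    for label, rows in by_cluster.items():
        cvs = []
        for marker_values in zip(*rows):  # iterate over marker columns
            m = mean(marker_values)
            sd = stdev(marker_values) if len(marker_values) > 1 else 0.0
            cvs.append(sd / m if m else float("nan"))
        result[label] = cvs
    return result

# Toy data: cluster 0 is tight, cluster 1 mixes two populations on marker 0.
events = [(10.0, 1.0), (11.0, 1.2), (9.0, 0.8), (10.0, 1.0),
          (2.0, 5.0), (12.0, 5.2), (3.0, 4.8), (11.0, 5.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

cvs = cluster_cvs(events, labels)
for label, values in cvs.items():
    print(label, [round(v, 3) for v in values])
```

A cluster whose CV is high on one marker but low on the others is a candidate for splitting; where to draw that line is still the judgment call described above.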


2. Generally, I perform clustering on *all* the files at once, rather than clustering each file individually and then "mapping back" File A's Cluster X to File B's Cluster Y. Therefore, all the files are clustered using the same parameters. I'm not sure how wise it is to vary your parameters on a file-by-file basis and then try to compare; it seems like you'd be trying to compare the results of "viewpoints" from different "map altitudes" or "pixel resolutions". Maybe some informatics people (El-ad, Lars, Samuel, Mark, Sofie, Zach, etc) could chime in on that?
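
The pooled approach in point 2 can be sketched as follows: concatenate all files while remembering each event's source, cluster once, then split the labels back out per file. The `toy_cluster` function is a hypothetical stand-in for a real clustering call, and the donor names and values are invented for illustration.

```python
from collections import Counter

def pool(files):
    """Concatenate events from all files, recording each event's source file."""
    pooled, origin = [], []
    for name, events in files.items():
        pooled.extend(events)
        origin.extend([name] * len(events))
    return pooled, origin

def toy_cluster(events):
    """Hypothetical stand-in for a real clustering step: threshold on marker 0."""
    return [0 if event[0] < 5.0 else 1 for event in events]

def per_file_frequencies(labels, origin):
    """Fraction of each file's events that fall into each cluster."""
    counts = Counter(zip(origin, labels))
    totals = Counter(origin)
    return {(f, c): n / totals[f] for (f, c), n in counts.items()}

files = {
    "donor_A": [(1.0,), (2.0,), (8.0,), (9.0,)],
    "donor_B": [(1.5,), (7.5,), (8.5,), (9.5,)],
}
pooled, origin = pool(files)
labels = toy_cluster(pooled)  # one clustering run, one set of parameters
freqs = per_file_frequencies(labels, origin)
for key in sorted(freqs):
    print(key, freqs[key])
```

Because every file shares the same cluster definitions, Cluster 0 in one donor is directly comparable with Cluster 0 in another, with no mapping step needed.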


3. Depending on the fineness of your settings, it's certainly possible to have clusters with very few cells. However, it's not straightforward to say that that's "just noise", based off of Count/Number of cells.
- For example, imagine that you are doing a vaccination response study and comparing a Day 0 sample vs Day 7 sample, and you found that Cluster 10 has very few cells in Day 0. Well, that cluster might be plasmablasts, which you'd have very few of on Day 0 but significantly more of on Day 7.
- Or imagine you're doing an antigen-specific T cell study, and Donor A has been exposed to the antigen but Donor B hasn't.

Note: this is formally a different concept than whether you have enough N to compare them *rigorously in a statistical sense* (ie, have enough events for a good Median, CV, etc). But I'll let statisticians chime in on what they feel the smallest N can be for that.
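
One way to act on point 3, sketched under assumed numbers: before dismissing a small cluster, look at its count in every sample rather than its total. The cluster names, counts, and the `min_cells` cutoff of 20 are all hypothetical.

```python
def classify_small_clusters(counts, min_cells=20):
    """counts: {cluster: {sample: n_cells}}. Returns {cluster: description}."""
    out = {}
    for cluster, per_sample in counts.items():
        small = [n < min_cells for n in per_sample.values()]
        if all(small):
            out[cluster] = "small in every sample"
        elif any(small):
            out[cluster] = "small in some samples only"
        else:
            out[cluster] = "never small"
    return out

# Hypothetical counts for a Day 0 vs Day 7 comparison.
counts = {
    "cluster_10": {"day0": 4, "day7": 310},
    "cluster_11": {"day0": 3, "day7": 5},
    "cluster_12": {"day0": 2500, "day7": 2400},
}
summary = classify_small_clusters(counts)
for cluster, description in summary.items():
    print(cluster, "->", description)
```

A cluster that is small in only some samples, like the plasmablast example above, is more likely to be biology than noise. One that is small everywhere deserves a closer look, though this check alone cannot settle it.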


Mike
<<

dtelad11

Master

Posts: 129

Joined: Mon Oct 31, 2016 6:26 pm

Post Fri Apr 03, 2020 1:30 pm

Re: Settings for clustering arbitrary? (Phenograph/FlowSOM)

Answers below. I suspect that we are going down a rabbit hole that might not be helpful to you with analyzing your project. What is your goal with clustering? More often than not, it's immunophenotyping. If that is the case, I suggest investing the minimal amount of work that will get you to that goal. I get the sense that you're looking for some clear theory behind clustering parameterization and annotation; sadly, such a theory does not exist.

> 1. Does this mean you would publish an overclustered result and discuss e.g. why two clusters actually represent only one population, or do you play around with cluster settings until you receive the result you would expect?

Researchers often do the latter. Personally, I do not agree with this strategy, as it is not reproducible. It seems to be the norm, however. I have never seen a paper that actually discusses the annotation decisions for the clustering.

> 2. If you have different samples (donors) and cluster each donor individually, would it be important to keep the cluster settings (i.e. Phenograph k value or FlowSOM metacluster value) the same across all donors? Or is it better to choose different values for each donor so you could make sure the clustering is optimal for each donor separately?

In the past, I approached it by clustering each batch individually and using meta-clustering to align them (see webinar). In that case I kept the parameters the same between samples. While it might not be ideal, it's a good-enough solution that kept me sane.
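
For illustration only, here is one simple way clusters from separately clustered batches could be aligned. This is not necessarily the method from the webinar: it swaps in plain nearest-centroid matching on per-cluster marker medians, and the batches, labels, and values are made up.

```python
from math import dist
from statistics import median

def centroids(events, labels):
    """Median marker profile of each cluster: {cluster: (median per marker)}."""
    by_cluster = {}
    for row, label in zip(events, labels):
        by_cluster.setdefault(label, []).append(row)
    return {label: tuple(median(col) for col in zip(*rows))
            for label, rows in by_cluster.items()}

def align(ref, other):
    """Map each cluster in `other` to the closest centroid in `ref` (Euclidean)."""
    return {label: min(ref, key=lambda r: dist(ref[r], profile))
            for label, profile in other.items()}

# Two toy batches, clustered separately with the same (unspecified) settings.
batch_a = [(1.0, 9.0), (1.2, 8.8), (7.0, 2.0), (7.2, 2.2)]
labels_a = [0, 0, 1, 1]
batch_b = [(6.9, 2.1), (7.1, 1.9), (0.9, 9.1), (1.1, 9.3)]
labels_b = ["x", "x", "y", "y"]

mapping = align(centroids(batch_a, labels_a), centroids(batch_b, labels_b))
print(mapping)
```

Nearest-centroid matching lets several clusters map to the same reference cluster and never reports "no match", so batch effects or a population present in only one batch would need extra handling.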

> 3. After playing around with the settings of Phenograph and FlowSOM, it seems like the Phenograph results fit better to the clusters I would expect (detecting the main cell populations). FlowSOM identifies some very small clusters of the size of just a few cells. Is FlowSOM picking up on some artefacts in the dataset here? Have you made similar observations?

Based on my experience the two methods are generally comparable post-annotation. With that said, it is possible that FlowSOM is detecting outliers. Alternatively, your cluster count might be too high, so FlowSOM is forced to break down otherwise coherent clusters. Since Phenograph works for you I would just stick to that (unless your project is about benchmarking clustering methods, in which case we should have another conversation on the proper way to do it).
<<

vtosevski

Contributor

Posts: 44

Joined: Wed Nov 20, 2013 12:50 pm

Location: Zurich, Switzerland

Post Mon Apr 06, 2020 9:57 pm

Re: Settings for clustering arbitrary? (Phenograph/FlowSOM)

Dear JH,

Lots of sound advice from James, Mike and El-ad. I don't have much to add, apart from picking up two comments you made initially that I believe have not been addressed.

An increase in the number of "detected" clusters as the number of events increases is predictable and expected behavior, especially for graph-based methods.
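
A rough heuristic for seeing this in the three runs from the original question is to express k as a fraction of the input events, since a fixed k covers a smaller share of the data as the event count grows. This is back-of-the-envelope arithmetic on the numbers reported in the thread, not a property derived from the Phenograph algorithm itself.

```python
# Event counts, k values and cluster counts as reported in the original question.
runs = [
    {"events": 5000, "k": 30, "clusters": 13},
    {"events": 55000, "k": 30, "clusters": 32},
    {"events": 55000, "k": 500, "clusters": 15},
]

for run in runs:
    # Share of the data set that each cell's neighbourhood spans.
    run["k_fraction"] = run["k"] / run["events"]
    print(f"events={run['events']:>6}  k={run['k']:>3}  "
          f"k/events={run['k_fraction']:.3%}  clusters={run['clusters']}")
```

The two runs where the neighbourhood spans a similar share of the data gave similar cluster counts, while the run with a much smaller share gave more than twice as many. Three data points cannot establish a rule, but the pattern is consistent with the expected behavior described above.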

The "feeling" you had about FlowSOM "identifying some very small clusters" is something that has been communicated and reported before, see also here, in addition to the references James already shared.

Lastly, to reiterate what has been said and also paraphrase another colleague - "clustering is a '100 years old' problem and we won't solve it today". It's very useful but it isn't a magic bullet and a lot depends on the method of choice, parametrisation, structure of the data etc.

Best,
Vinko
