Are clustering settings arbitrary? (Phenograph/FlowSOM)
I am analysing a data set on which I would like to perform unsupervised clustering to identify cell populations.
For this I mainly use FlowJo with plugins such as Phenograph and FlowSOM. I have experimented with the Phenograph settings, and the number of clusters detected depends both on the number of events in my input population and on the value of k (nearest neighbours) that I specify. Here are three examples:
1. If I downsample the data set to 5,000 events and use k=30, I detect 13 clusters.
2. If I use 55,000 events of the same data set and k=30, I detect 32 clusters.
3. If I use 55,000 events and change the setting to k=500, I detect 15 clusters, which is again similar to the first test.
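To make the comparison between the three runs concrete, here is a small Python sketch that expresses k as a fraction of the number of input events. The k/N ratio and the `scaled_k` helper are my own illustrative heuristic, not a Phenograph setting or a recommendation from its authors. Under this view, runs 1 and 3 use neighbourhoods of a broadly similar relative size, while run 2 uses one roughly ten times smaller, which would be consistent with it producing more clusters.

```python
# The three runs described above: (label, number of events, k, clusters detected).
runs = [
    ("run 1", 5_000, 30, 13),
    ("run 2", 55_000, 30, 32),
    ("run 3", 55_000, 500, 15),
]


def neighbourhood_fraction(n_events, k):
    """Fraction of all input events covered by one k-nearest-neighbour set."""
    return k / n_events


def scaled_k(k_ref, n_ref, n_new):
    """Hypothetical helper: the k that keeps k/N constant when the number
    of events changes from n_ref to n_new. A heuristic only."""
    return round(k_ref * n_new / n_ref)


for label, n_events, k, n_clusters in runs:
    frac = neighbourhood_fraction(n_events, k)
    print(f"{label}: N={n_events}, k={k}, k/N={frac:.3%}, clusters={n_clusters}")

# k that would match run 1's k/N ratio on the full 55,000 events.
print("k matching run 1 at N=55,000:", scaled_k(30, 5_000, 55_000))
```

Whether keeping k/N constant is the right way to choose k is exactly what I am unsure about; the sketch only shows that the three results are at least consistent with that idea.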
These observations lead to some questions:
1. How do you decide which Phenograph settings are appropriate for a given data set? How do you make sure you neither under- nor overcluster your data?
2. Do you usually downsample the files for your analysis (e.g. to 5,000-10,000 events)? Or would you recommend using all available events (e.g. 55,000)? I am asking because the Phenograph calculation takes around 2-3 hours with k=500.
3. Does downsampling introduce a bias into the analysis, for example with regard to large vs small populations?
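To illustrate the concern behind question 3, here is a minimal Python sketch of how many events of a population would be expected to survive downsampling from 55,000 to 5,000 events, and how likely it is that fewer than k=30 of them remain. It assumes uniform random downsampling without replacement; the population sizes (10%, 1% and 0.1% of the file) are hypothetical values chosen for illustration, not measurements from my data.

```python
from math import comb


def expected_events(pop_events, total_events, sample_size):
    """Expected number of events from one population after uniform
    random downsampling."""
    return pop_events * sample_size / total_events


def prob_fewer_than(m, pop_events, total_events, sample_size):
    """Probability that the subsample contains fewer than m events of the
    population (hypergeometric, i.e. sampling without replacement)."""
    total = comb(total_events, sample_size)
    hits = sum(
        comb(pop_events, x) * comb(total_events - pop_events, sample_size - x)
        for x in range(m)
    )
    return hits / total


TOTAL = 55_000   # events in the full file
SAMPLE = 5_000   # events kept after downsampling
K = 30           # nearest neighbours used for clustering

# Hypothetical populations at 10%, 1% and 0.1% of the full file.
for pop in (5_500, 550, 55):
    exp = expected_events(pop, TOTAL, SAMPLE)
    p = prob_fewer_than(K, pop, TOTAL, SAMPLE)
    print(f"population of {pop} events: expected {exp:.0f} after downsampling, "
          f"P(fewer than {K}) = {p:.3g}")
```

In this sketch the relative frequencies are preserved on average, but the smallest population ends up with fewer events than k, so its cells would necessarily have neighbours from other populations.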
PS: Has anybody tried the ConsensusClusterPlus R package to estimate the appropriate number of clusters for a data set? The test data set from the tutorial consists of microarray data (5000 samples x 128 genes), and it runs within a few seconds with the test settings (max clusters = 6, repetitions = 25). I tried to run cytometry data of a similar size (5000 events x 45 markers) with settings that should need even less computing power (max clusters = 6, reps = 2), but the calculation takes more than 15 minutes. This is concerning because I would like to scale the test up to more than 5000 events, max clusters = 20 and around 1000 reps. Why would a cytometry data matrix of similar size to the microarray matrix take so much longer to compute? I also tested different transformations of my cytometry data (e.g. linear vs biexponential), but this did not affect the calculation time.
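One thing that may be worth checking is the orientation of the two matrices. The Python sketch below assumes that ConsensusClusterPlus clusters the columns of its input and builds an item-by-item matrix, and that the tutorial matrix is in fact oriented so that only 128 items are clustered while every cytometry event is an item. Both the orientation and the item counts are assumptions for illustration, not something I have verified. Under those assumptions the run time would be driven by the number of items rather than by the total size of the matrix.

```python
def pairwise_entries(n_items):
    """Entries in a full item-by-item matrix (e.g. distance or consensus)."""
    return n_items * n_items


def unique_pairs(n_items):
    """Number of distinct item pairs that have to be compared."""
    return n_items * (n_items - 1) // 2


# Assumed item counts (see the caveats above).
tutorial_items = 128     # items clustered in the tutorial example
cytometry_items = 5000   # one item per cytometry event

for label, n in (("tutorial", tutorial_items), ("cytometry", cytometry_items)):
    print(f"{label}: {n} items, {pairwise_entries(n)} matrix entries, "
          f"{unique_pairs(n)} unique pairs")

ratio = pairwise_entries(cytometry_items) / pairwise_entries(tutorial_items)
print(f"item-by-item matrix is about {ratio:.0f} times larger for the cytometry data")
```

If this is what is happening, the number of markers and the choice of transformation would matter much less than the number of events, which would fit the fact that changing the transformation made no difference.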
Thank you very much in advance for your help.
Kind regards,
JH