Hi James,
Thanks for sharing a plot of your data -- it definitely helps for everyone to see what you're seeing. You've asked some great questions that address issues that I believe many others have run into. I'll try to tackle each of the points/questions you raised, in order:
Phenograph ... generated 29 clusters (default I think)
It looks like you used the default settings for PhenoGraph (k=30), and got 29 clusters. These clusters seem reasonable based on your data -- they are not segmenting the tSNE map into absurd or uninformative regions. However, they might be missing some smaller populations by lumping them into larger clusters. This seems to be happening with Cluster 2, which spans several well-resolved islands in tSNE space that were aggregated into a single cluster. You can try lower values for the Rphenograph_k parameter, which will produce more clusters. I would try k=25, 20, 15, and 10. I wouldn't go lower than 10.
the attached file is the combined tSNE for all 20 files - when separated they actually look quite similar.
That's a good sign and indicates your staining, data collection and normalization were consistent. You don't want to see batch effects where each patient falls in a different part of the plot.
my tSNE plots look ‘busy’
The plot you shared doesn't look busy at all to me, but maybe I'm just numb to these plots by now.
Increasing the tSNE iterations to 5000 may tighten up some of the more spread-out islands, but I think you're getting good separation here. I'd be very happy with a plot like this!
I wonder if having more samples might make it very difficult to do this analysis in future
Now this is the real challenge. If you double the number of cells per sample, or if you try this pipeline on twice as many files, it's going to take much longer to run the analysis. The plot will probably also look "busier". In that case, you'll have to run more iterations to make the plot "tighter", which further increases the computation time. It's a vicious cycle.
This is the reality of using tSNE for big datasets -- it struggles with a large number of points. Newer algorithms like UMAP show promise for visualizing larger numbers of cells, but these aren't built into Cytofkit yet.
You can skip tSNE altogether and just run clustering (e.g. PhenoGraph or FlowSOM_meta) on the data. Unfortunately there appears to be a bug in Cytofkit_GUI where it will run tSNE even if you deselect it. As a workaround, I run the functions individually with an R script. You can then monitor changes in cluster occupancy between your samples.
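As a minimal sketch of what "cluster occupancy" means here: for each sample, compute the fraction of its cells that land in each cluster, then compare those fractions across samples. This is plain Python rather than Cytofkit code; the function name, sample names, and toy labels are all made up for illustration, and it assumes you have exported one sample ID and one cluster label per cell.

```python
from collections import Counter, defaultdict


def cluster_occupancy(sample_ids, cluster_labels):
    """Return {sample: {cluster: fraction of that sample's cells}}.

    Clusters with no cells in a given sample are simply absent
    from that sample's dictionary (i.e. their occupancy is zero).
    """
    if len(sample_ids) != len(cluster_labels):
        raise ValueError("need one sample ID and one cluster label per cell")

    # Count cells per (sample, cluster) pair.
    counts = defaultdict(Counter)
    for sample, cluster in zip(sample_ids, cluster_labels):
        counts[sample][cluster] += 1

    # Convert counts to per-sample fractions.
    occupancy = {}
    for sample, per_cluster in counts.items():
        total = sum(per_cluster.values())
        occupancy[sample] = {c: n / total for c, n in per_cluster.items()}
    return occupancy


# Toy data (hypothetical): 2 samples, 4 cells each, 3 clusters.
sample_ids = ["patient_A"] * 4 + ["patient_B"] * 4
cluster_labels = [1, 1, 2, 3, 1, 2, 2, 2]

occupancy = cluster_occupancy(sample_ids, cluster_labels)
for sample in sorted(occupancy):
    for cluster in sorted(occupancy[sample]):
        print(f"{sample} cluster {cluster}: {occupancy[sample][cluster]:.1%}")
```

Working with fractions rather than raw counts keeps samples comparable even when they contribute different numbers of cells.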
Since you're looking for populations (clusters) that differ between outcomes, you should probably try out some tools that are built for this purpose. Cydar, Statistical SCAFFOLD, and CITRUS are good options. tSNE is great for interactive data exploration, but it's not well suited to A vs. B comparisons in large datasets.
Is 10000 cells okay? (I picked this as my smallest file had just over 10000 cells - CD45+ defined events).
I think this is a useful number to get a sense of the landscape, the markers that are working well (and those that aren't), and if you need to do any clean-up gates. However, it sounds like you're interested in a cell type that's present at 1/1000 frequency, so 10,000 cells per sample will be underpowered to monitor changes in that population.
Should I define a different number of clusters beforehand?
You can't set the number of clusters with PhenoGraph. You can do this with FlowSOM_meta, however. I'm not sure what you mean by "beforehand" in this context, but one general tip is to clean up the data as much as you can before running clustering/tSNE.
Should I try a different seed?
I don't do this, personally. It doesn't have a big effect if the data is robust, and you're looking for changes/trends that are robust.
Is it worth downsampling before (as discussed on the forum)?
If your goal is to look at iNKT cells, then you may want to make them a greater proportion of the data. Do you really need T cells, B cells and monocytes in the same analysis? Can you analyze the iNKT cells separately?
In your experience will it identify low frequency cells (surely with only 10000 cells, iNKT cells [0.1% of PBMCs] are going to be difficult to find?
Let's say you run clustering on a dataset of n=20 samples * 10,000 cells each = 200,000 cells total. If they each have exactly 0.1% iNKT cells, then you've given only 200 iNKT cells to the clustering/tSNE algorithms. Is that enough to define a cluster? Sometimes it is. If the cells have a very distinct phenotype (e.g. endothelia), then a rare but consistent group of cells can form its own cluster. However, if they are pretty similar to other cell types in your dataset (in terms of the markers that you measured and revealed to the clustering algorithm), then I wouldn't be surprised if they got merged into a similar cluster. In some cases, PhenoGraph will merge clusters, but tSNE keeps them separate. This is what's happening for Cluster 2 in the example dataset you provided. The safer route would be to enrich the iNKT cells by removing the more abundant cell types (T / B / mono). You can always measure those in a different clustering run.
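The arithmetic above can be sketched in a few lines of plain Python. The 20 samples, 10,000 cells per sample, and 1-in-1000 frequency come from the example above; the target of 100 iNKT cells per sample and the assumption that T / B / mono make up 90% of events are made-up numbers for illustration only, so substitute your own.

```python
def expected_rare_cells(n_samples, cells_per_sample, one_in):
    """Expected number of rare cells if they occur at 1-in-`one_in` frequency."""
    return n_samples * cells_per_sample / one_in


def cells_per_sample_needed(target_rare_per_sample, one_in):
    """Cells to sample per file to expect `target_rare_per_sample` rare cells."""
    return target_rare_per_sample * one_in


def frequency_after_removal(rare_cells, total_cells, removed_cells):
    """Rare-cell frequency after gating out `removed_cells` abundant cells."""
    return rare_cells / (total_cells - removed_cells)


# Numbers from the example above: 20 samples x 10,000 cells, iNKT at 1 in 1000.
total_cells = 20 * 10_000
inkt_cells = expected_rare_cells(20, 10_000, 1000)
print(f"Expected iNKT cells in the pooled data: {inkt_cells:.0f}")

# Hypothetical target: 100 iNKT cells per sample.
print(f"Cells per sample needed: {cells_per_sample_needed(100, 1000):,}")

# Hypothetical: T / B / mono account for 90% of events and are gated out.
removed = int(0.9 * total_cells)
enriched = frequency_after_removal(inkt_cells, total_cells, removed)
print(f"iNKT frequency after enrichment: {enriched:.1%}")
```

The point of the last step is that gating out the abundant lineages does not add any iNKT cells, but it does make them a much larger share of what the clustering algorithm sees.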
I think this is off to a great start. It certainly looks like you have "structure" in your data. Best of luck with your comparison!