Re: Keep the same viSNE plot when clustering...
Hi James,
You raise a good point about total cell number. When I recommended using "all" events for the clustering, I was assuming that Gael was working with a low number of events in the tSNE plot (roughly <1e6 cells, total). Cytobank's built-in viSNE applet makes it easy to subsample and normalize the number of cells across samples, and I believe it has a hard upper limit of 4e6 cells total.
Including "all" cells has two upsides:
(1) It gives the clustering algorithm (PhenoGraph) the same information that was available to the tSNE algorithm when it was run in Cytobank, so the clusters are more likely to make sense when overlaid in tSNE space.
(2) The final tSNE plot will have the same density as what you see on Cytobank, except now you can see the cluster assignments (this was Gael's goal, from what I understood).
There's no hard and fast rule about using all events. You could certainly run PhenoGraph on a subset of the cells exported from Cytobank. Just be aware that when you re-upload those files, the tSNE plots will look sparser than they did before. Also, it's a little bit of an apples / oranges comparison because you're not giving PhenoGraph the same data that was provided to the tSNE algorithm. In an extreme example, suppose you subsampled to a very low number of cells (e.g. 100). This would not be enough information for PhenoGraph to distinguish outliers from rare but robust cell populations. In that case, the clusters would look terrible when overlaid on the tSNE plot.
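If you do subsample before clustering, it helps to draw the same number of events from every sample and to keep the row indices, so the cluster labels can be matched back to the exported events. Here is a minimal sketch in plain Python. The function name `subsample_equal` and the in-memory layout (a dict of sample name to a list of events) are assumptions made for illustration; this is not part of Cytobank or cytofkit.

```python
import random


def subsample_equal(samples, n_per_sample, seed=0):
    """Draw up to n_per_sample events from every sample, without replacement.

    samples: dict mapping sample name -> list of events, where each event is
             a list of marker values (hypothetical in-memory layout).
    Returns a dict mapping sample name -> sorted list of kept row indices.
    Samples with fewer than n_per_sample events are kept whole, so they end
    up smaller than the others.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    kept = {}
    for name, events in samples.items():
        k = min(n_per_sample, len(events))
        kept[name] = sorted(rng.sample(range(len(events)), k))
    return kept


if __name__ == "__main__":
    demo = {"sample_A": [[0.0, 1.0]] * 50, "sample_B": [[2.0, 3.0]] * 5}
    picked = subsample_equal(demo, n_per_sample=10)
    for name, idx in picked.items():
        print(name, len(idx))
```

Returning indices rather than copies of the events keeps the link to the original rows, which is what you need if you want to write the cluster assignments back as a new column before re-uploading.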
The downside of all of the workflows we have discussed so far on this thread is that they require subsampling when dealing with large numbers of samples. This is a big compromise -- on the one hand, it makes it easy to visualize the clusters in tSNE space (i.e. you can view cell-level data) and to compare across samples (i.e. cluster names are preserved). On the other hand, you lose some resolution of rare cell subsets.
A different way of using PhenoGraph is to run it on each sample separately, and then assemble metaclusters from the clusters (see Levine et al., Cell 2015). Cytofkit's FlowSOM-meta implementation also works this way (see Van Gassen et al., Cytometry A 2015). With this type of workflow, you can use many more cells per sample. I've done some recent analyses with 2e5 cells per sample x 80 samples = 1.6e7 cells total. However, the downside is that you can't view the cell-level data in tSNE (1.6e7 cells would take forever to produce a tSNE plot). An alternative is to view the centroids of the first-round clusters in tSNE space, but that requires some custom coding -- cytofkit isn't built for that. I still use Cytofkit / PhenoGraph / tSNE for a quick look at sets of 1-20 samples (10,000 cells per sample), but I've moved on to a custom-built metaclustering pipeline for bigger datasets of 20+ samples.
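To make the centroid idea concrete, here is a rough sketch of the first step: collapsing each sample's first-round clusters to their mean marker vectors and pooling them into one small matrix. It is an illustration only, not the custom pipeline mentioned above; the function names and the `per_sample` layout are assumptions, and the first-round labels are taken as given (e.g. from a per-sample PhenoGraph run).

```python
def cluster_centroids(events, labels):
    """Mean marker vector of each first-round cluster within one sample.

    events: list of events, each a list of marker values.
    labels: one cluster label per event (assumed to come from a per-sample
            clustering run; not computed here).
    """
    sums, counts = {}, {}
    for row, lab in zip(events, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(row)
            counts[lab] = 0
        for j, value in enumerate(row):
            sums[lab][j] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def pooled_centroids(per_sample):
    """Stack the centroids of all samples into one matrix.

    per_sample: dict mapping sample name -> (events, labels), a hypothetical
                layout chosen for this sketch.
    Returns (keys, matrix), where keys[i] = (sample, cluster) for matrix[i].
    """
    keys, matrix = [], []
    for sample in sorted(per_sample):
        events, labels = per_sample[sample]
        centroids = cluster_centroids(events, labels)
        for lab in sorted(centroids):
            keys.append((sample, lab))
            matrix.append(centroids[lab])
    return keys, matrix


if __name__ == "__main__":
    data = {
        "s1": ([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]], [0, 0, 1]),
        "s2": ([[0.0, 0.0], [2.0, 2.0]], [0, 0]),
    }
    keys, matrix = pooled_centroids(data)
    for key, centroid in zip(keys, matrix):
        print(key, centroid)
```

The pooled matrix has one row per first-round cluster rather than one row per cell, so it is small enough to embed with tSNE or to feed to a second round of clustering to form the metaclusters. The `keys` list lets you map each point back to its sample and cluster.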
ES