
database of all available analysis tools

Forum rules
Please be as geeky as possible. Reference, reference, reference.
Also, please note that this is a mixed bag of math-gurus and mathematically challenged, so choose your words wisely :-)

lwagar

Participant

Posts: 4

Joined: Fri Jan 08, 2016 7:32 pm

Post Wed Jun 20, 2018 5:54 pm

database of all available analysis tools

In my years as a CyTOF end-user, a common question I hear from new users is "what's the best clustering algorithm for data analysis?" This has become an increasingly difficult question to answer as new methods continue to be developed rapidly and people have particular needs for their datasets. I think it would be beneficial to the community if we could come up with a database of all publicly available clustering tools, along with some end-user feedback on their advantages and disadvantages. This would also be an excellent platform for developers to make their particular tool better known to us wet-lab people. Some features I think are of broad interest include:

-name and link to the tool
-introduction, basic features of the clustering (hierarchical, tSNE-based, etc.)
-customizability and ease of use
-relative speed (users could report time for a given number of cells and their computing power)
-whether users have been able to get the tool up and running independent of the original publishing group
-ability to discriminate small populations
-outputs (population frequencies, statistics, etc.)
-pros and cons

I am happy to volunteer to compile this list and share it with the community if users are willing to provide their experiences.
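To make this concrete, an entry might look something like the sketch below (the tool name and all of the values are made up purely to illustrate the format):

Tool: ExampleClusterTool (hypothetical)
Link: https://example.org/ExampleClusterTool
Approach: hierarchical metaclustering on a self-organising map
Customizability / ease of use: R package with a simple GUI wrapper
Speed: ~5 min for 1 million cells on a standard laptop (user-reported)
Runs independently of the original group: yes
Rare populations: struggles below ~0.5% frequency
Outputs: per-cell cluster assignments, population frequencies, MFI tables
Pros/cons: fast and reproducible / needs some parameter tuning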

Lisa

tomash

Contributor

Posts: 25

Joined: Sun Oct 19, 2014 10:15 pm

Post Thu Jun 21, 2018 1:21 am

Re: database of all available analysis tools

Hi Lisa,

This is an excellent idea! This will be really helpful for a lot of people who, as you say, find it difficult to navigate this space. Would you prefer that people just send you information, or would you use some kind of form (like a Google form or similar)?

I would like to make three comments regarding this endeavour.

Firstly, there are some published benchmarking studies comparing different computational tools -- the findings and methodologies from these studies will be invaluable here. Here is one comparing clustering tools that was published in Cytometry A in 2016 (https://www.ncbi.nlm.nih.gov/pubmed/27992111) and another for trajectory inference methods on bioRxiv earlier this year (https://www.biorxiv.org/content/biorxiv ... 7.full.pdf). There are also the FlowCAP studies (http://flowcap.flowsite.org ; https://www.nature.com/articles/nmeth.2365 ; https://www.ncbi.nlm.nih.gov/pubmed/26447924).

Secondly, in addition to a 'database' of tools and their pros/cons etc, it would be worth adding some form of wiki/guidance on deciding which tool is best. As you say, the question of 'which tool is best' is pretty challenging because it depends on the desired outcome. Having some guidance on figuring out which tools to choose from, depending on the questions being asked, would be helpful as people try to navigate the array of options out there. This also extends to some aspects which aren't always considered -- such as the size of the dataset -- where analysing tens of millions of cells might not only impact the overall run-time (some tools, such as FlowSOM, scale well to datasets of this size) but might also run up against an inherent limitation of the tool (we find that we can only run <100,000 cells in a single tSNE analysis before we start getting data crowding). For a helpful example, the trajectory inference benchmarking study above includes a flow chart which guides users to specific groups of tools depending on the nature of the data and the desired outcome.

Thirdly, any database or discussion should also include some notion of how the tools fit into 'discovery' (looking for everything/new/unknown features) or 'diagnosis' (reproducing a static analysis over many samples over time). I have found that some researchers have a poor view of automated clustering/dimensionality reduction tools because they have not fully understood their purpose and what they are used for. Many researchers are looking to reproduce an analysis approach over hundreds of patient samples collected over long periods of time. Sometimes the automated clustering tools are hindered by batch effects or sample heterogeneity, and will not always find the same populations in each dataset**, especially rare subsets (whereas this might be achieved through an approach such as manual gating). As a result, some will take the view that such clustering tools aren't worthwhile. That view then gets applied to all clustering tools, because the selected tool doesn't fit the use case (or, more often, the WAY the tool is being used doesn't fit the use case), rather than the tool(s) having no valuable function or contribution. Having some comment, or discussion, about how these tools fit with these kinds of projects would give some helpful guidance to users in this regard.

** Just to be clear, there are ways to use automated clustering tools to get consistent analysis over many samples acquired over time, but in some cases such studies might be better suited to pre-defined gating or identification framework tools, rather than clustering.

One final point, for our two cents: in our group, after having some experience with a variety of tools (and consulting the benchmarking studies), we've ended up sticking with a couple of 'go-to' options that fit the majority of our analysis questions, as this was far easier and more scalable for us than looking at different tools for every new question (with some exceptions, such as trajectory inference). For most of our analysis -- i) population identification, ii) number of cells per cluster, iii) MFI of markers on different clusters, iv) relationship of clusters to other clusters, and v) tracking these changes across different groups/conditions -- we've been using FlowSOM to cluster the large dataset (very quick run times, including for datasets of tens of millions of cells), then subsampling this to generate tSNE plots to visualise the cells/clusters at a single-cell level. We do this either by running our own pipeline in RStudio (https://github.com/sydneycytometry/CAPX) to maintain adaptability and customisability, or through iterative runs in cytofkit (https://bioconductor.org/packages/relea ... ofkit.html), which is more user-friendly.
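For anyone who wants to see roughly what that workflow looks like, here is a minimal sketch in R. To be clear, this is not the CAPX pipeline itself, just an illustration of the FlowSOM-then-subsampled-tSNE idea using the current FlowSOM/flowCore/Rtsne APIs; the file name, marker columns, cofactor, cluster number and subsample size are placeholders to adapt to your own panel.

library(flowCore)
library(FlowSOM)
library(Rtsne)

ff <- read.FCS("my_sample.fcs", transformation = FALSE)    # placeholder file name
markers <- 5:40                                            # placeholder: your phenotyping channels
exprs(ff)[, markers] <- asinh(exprs(ff)[, markers] / 5)    # usual CyTOF arcsinh transform, cofactor 5

# Cluster the full dataset -- FlowSOM stays fast even for millions of cells
fsom <- FlowSOM(ff, colsToUse = markers, nClus = 20, seed = 42)
meta <- GetMetaclusters(fsom)                              # one metacluster label per cell

# Subsample before tSNE to avoid the crowding mentioned above (~100k cell limit)
idx  <- sample(nrow(ff), min(50000, nrow(ff)))
tsne <- Rtsne(exprs(ff)[idx, markers], perplexity = 30, check_duplicates = FALSE)
plot(tsne$Y, col = as.integer(meta[idx]), pch = 16, cex = 0.3,
     xlab = "tSNE 1", ylab = "tSNE 2")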

vtosevski

Contributor

Posts: 44

Joined: Wed Nov 20, 2013 12:50 pm

Location: Zurich, Switzerland

Post Thu Jun 21, 2018 2:33 pm

Re: database of all available analysis tools

Hi Lisa and Tom,

While I appreciate the challenge for a bench scientist to navigate this space, I am not immediately convinced that this effort (which is not negligible) would give the kind of results one anticipates (clarity, guidance, understanding of the strengths and weaknesses of various methods, etc). I am a bench scientist myself so I invite others to chime in on the algorithm side of things, but let me give you an example.

In https://www.ncbi.nlm.nih.gov/pubmed/27992111 the authors found FlowSOM to have the highest F1 score (in comparison to expert gating) on the Levine_32dim dataset, with a mean F1 score across all populations of almost 0.8.

If we took that information naively, we could say that FlowSOM is "the best", since its output, on average, fits best with what the gating result would be, thus making our life easier when it comes to interpreting the clustering results (leaving aside speed, etc. for now).

In https://academic.oup.com/bioinformatics ... 82/4860364, however, FlowSOM's performance was apparently much worse, with a reported mean F1 score over all considered populations of 0.53. The authors say that's similar to Weber et al., but the value doesn't strike me as such; it appears worse. If FlowSOM had a 0.5 F1 score in Weber et al., maybe other tools would take the title there.
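(For anyone wondering what these F1 numbers actually measure: the benchmarks score each manually gated population against a matched cluster using an F1 score, i.e. the harmonic mean of precision and recall, and then average over populations. Here is a toy version in R -- my own simplification that just takes the best-matching cluster per population, not the exact matching procedure from either paper.)

f1_score <- function(truth, pred, population, cluster) {
  tp        <- sum(truth == population & pred == cluster)   # cells in both the gated population and the cluster
  precision <- tp / sum(pred == cluster)
  recall    <- tp / sum(truth == population)
  2 * precision * recall / (precision + recall)
}

# truth = manual gating labels, pred = cluster assignments from an algorithm
mean_f1 <- function(truth, pred) {
  per_pop <- sapply(unique(truth), function(pop) {
    max(sapply(unique(pred), function(cl) f1_score(truth, pred, pop, cl)), na.rm = TRUE)
  })
  mean(per_pop)   # the single "mean F1 across populations" number reported in these studies
}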

To make a long story short, I don't think there will ever be a definitive resource to tell you which algorithm is best for you. (Un)fortunately, we might be past that point and there's no turning back...

Of note, A. Tomic presented her work at CYTO2018, titled "SIMON: An Automated Machine Learning System Reveals a Subset of CD8+ T Cells Responsible for Improved Antibody Responses to Seasonal Influenza Vaccine". If I remember her talk correctly, her framework systematically applied and compared dozens of clustering algorithms on the same dataset in an effort to decide on the best one. I really don't recall further details, but maybe something along those lines becomes the way to go when deciding, as a first step, which tools to put in your toolbox for a particular project.

Best,
Vinko

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Thu Jun 21, 2018 4:31 pm

Re: database of all available analysis tools

I agree with Vinko. It's unlikely that any one algorithm will be appropriate for every experiment. Lots of fields are trending towards the approach that SIMON takes: throw everything at the data and see what gives good results. This is by no means a fruitless effort, but in my opinion, a better thing to do would be to create a list of review articles that cover these algorithms. That's where I look first whenever I'm approached with a new analysis. Even in well-established fields like RNA-seq, new algorithms are being published all the time, and I still don't use the same one for every experiment.
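Just to illustrate the "throw everything at it" idea, here's a toy sketch in R; this has nothing to do with SIMON's actual pipeline, and the data, algorithms and settings below are all made up for illustration.

library(mclust)                                      # for adjustedRandIndex()

set.seed(1)
grp  <- sample(1:4, 2000, replace = TRUE)            # stand-in for manual gating labels
expr <- matrix(rnorm(2000 * 10), ncol = 10) + grp    # toy "expression" matrix with group structure

# Run a few candidate clusterings on the same data
candidates <- list(
  kmeans_k8 = kmeans(expr, centers = 8)$cluster,
  ward_k8   = cutree(hclust(dist(expr), method = "ward.D2"), k = 8)
)

# Compare each result against the reference labels; higher = closer agreement
sapply(candidates, adjustedRandIndex, y = grp)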

sgranjeaud

Master

Posts: 123

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Thu Jun 21, 2018 7:34 pm

Re: database of all available analysis tools

Dear Lisa and previous posters,

I think the effort is worthwhile, but it will be very difficult to test and report on every algorithm.

As stated by Vinko, the benchmark articles are a good start.

As stated by Brian, probably no algorithm will fit the very different objectives of a cytomics experiment.

But I guess that Brian sets up a different pipeline for every RNA-seq experiment because each of his experiments really is different. My colleagues do reuse standardised pipelines when processing RNA-seq, and I feel that's OK.

Samuel

bc2zbUVA

Contributor

Posts: 22

Joined: Thu Nov 19, 2015 4:23 pm

Post Thu Jun 21, 2018 8:14 pm

Re: database of all available analysis tools

Sorry for the confusion. I wasn't implying that I use a different algorithm for every single analysis, just that there are always going to be edge cases, and sometimes people use the tool they are most familiar with instead of the most appropriate tool.

lwagar

Participant

Posts: 4

Joined: Fri Jan 08, 2016 7:32 pm

Post Thu Jun 21, 2018 9:07 pm

Re: database of all available analysis tools

Hi all, I don't foresee the purpose of this database being to identify the "best" option out there for every possible scenario. Indeed, everyone is doing something different with their data and has a different experimental goal. I still think it would be useful to be able to, at minimum, identify all the tools out there (so that people at least know what their options are) and get an idea of actual user experience with those tools. There's nothing more frustrating than identifying a cool tool only to find out it isn't compatible with some aspect of your experimental setup or data analysis requirements. I can prepare a Google form for user contributions - these can be anonymous, so no one has to feel bad if they had a not-great experience with a particular algorithm for their specific question.

mleipold

Guru

Posts: 5792

Joined: Fri Nov 01, 2013 5:30 pm

Location: Stanford HIMC, CA, USA

Post Fri Jun 22, 2018 7:08 pm

Re: database of all available analysis tools

Hi all,

Like Lisa, I get this question all the time. There definitely are some algorithms I prefer, or steer people to as a "first pass"......like Tom, I usually reach for FlowSOM as my own first pass nowadays.

I definitely agree that not every algorithm is going to be appropriate for every study/dataset. This *is* part of the reason I recently posted about compiling a database of publicly available CyTOF datasets (viewtopic.php?f=3&t=1047): not only so that algorithm developers have a variety of things to use to try to "break" (or at least strongly test) their algorithms, but also as a resource for people generating data, who can then look at other datasets that might be "similar" enough (panel, cell type, staining conditions, cell rarity, etc.), see what's been used successfully, and use that as a place to *start*. Obviously, newer algorithms will have come out since, but I find a lot of my users have an analysis paralysis of "where do I even start?!"

Regarding the benchmarking studies: both FlowCAP and some of the others seem to use a lot of the same datasets. Even a lot of the newer algorithms are demonstrated on those same datasets. This is good in some ways, as it gives a way to relate the different results. However, I think it is also a limitation: as good as, say, the Bendall et al. Science paper dataset is, it's not the only type of experiment that people are doing. Therefore, if your experiment is designed differently from that type, you may not get a lot of information from the current benchmarking articles.


Let me be clear: I don't think every algorithm is good for every study. And I also don't think it's bad to *demonstrate* both the strengths *and* weaknesses of new algorithms; I personally love it when authors (in the paper, or at least on the GitHub page for the code) give examples of the limitations of their approaches!

That also leads me to another topic: documentation. I think algorithm developers (of new and old code) can do a better job of documenting how to use their software. One example of this is the original documentation for Citrus on GitHub: the paper was published in PNAS in July 2014, but the "Getting Started" GitHub wiki didn't explicitly state that Citrus requires 8 or more samples in each experimental group in order to work as expected until Nov 2014.

I don't mean to single out the Citrus developers on this: a lot of (I would personally say most) algorithm documentation suffers in this respect. From the developers' POV, much of this type of thing may be obvious based on the way they wrote their code. However, from the end-user POV, it makes it a lot harder for us to immediately exclude algorithms for which our dataset (N, cell rarity, etc.) isn't appropriate.

Or, put another way, the users don't want to misuse your algorithms any more than the developers *want* them to be misused.


So, Lisa, Tom, and others: are there some newer or more troublemaking datasets in particular that you would especially recommend authors keep in mind for future benchmarking studies? Developers, are there datasets you would like to *see*, to *try* to break your algorithms on (viewtopic.php?f=3&t=874&p=2538)?

Ones I might recommend are some datasets from Amir Horowitz or Catherine Blish, where there's a highly cell-type-focused panel that additionally has some rare cells in it.....I think that's the kind that a number of algorithms will have trouble with (particularly those that strongly downsample or have frequency thresholds), but it would be relevant to people trying to find things at the rarity level of, say, tetramer-positive cells.


Mike

lwagar

Participant

Posts: 4

Joined: Fri Jan 08, 2016 7:32 pm

Post Wed Jun 27, 2018 6:20 pm

Re: database of all available analysis tools

Here is a Google form that I hope will help us address some of the questions that came up. Please distribute it widely! I will post an update once we start receiving responses. And a big thanks to Mike for reviewing the survey to make sure we get some useful data.

https://goo.gl/forms/KtuDuegqISXuQ32b2
