
Large analysis of tSNE and UMAP parameters

<<

sgranjeaud

Master

Posts: 123

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Fri Jun 14, 2019 9:30 am

Large analysis of tSNE and UMAP parameters

Hi,

James Melville https://github.com/jlmelville did a great job porting UMAP to an R package, which removes the need for a Python installation. His package, uwot https://cran.r-project.org/package=uwot, is available on CRAN. The umap function offers many parameters that can be tuned. There is also a t-UMAP function (tumap) and an lvish method inspired by the largeVis package https://github.com/elbamos/largeVis, which was previously on CRAN.

James also made an even more in-depth analysis of the influence of the parameters of tSNE and UMAP. He wrote the smallvis package https://github.com/jlmelville/smallvis and tried many variations of those two main algorithms. Note that, in his own words, "this package is therefore not suitable for large scale visualization. Hence the name smallvis."

I just came across this huge resource https://jlmelville.github.io/smallvis/ (see the section Explorations of smallvis). I started by looking at the results for the MNIST dataset, as it usually seems to me to be close to cytometry data, although all its groups are balanced (same number of events in each population).

I hope some of you will find answers,
Samuel
<<

dtelad11

Master

Posts: 129

Joined: Mon Oct 31, 2016 6:26 pm

Post Fri Jun 14, 2019 1:04 pm

Re: Large analysis of tSNE and UMAP parameters

Thank you for sharing! Any resource that makes these methods more accessible is welcome in my book.

One comment from the GitHub page that I found confusing:

> LargeVis and UMAP are of particular interest because they seem to give visualizations which are very competitive with t-SNE, while being more amenable to scaling up to large datasets.

I am baffled by the outdated misconception that t-SNE cannot be run on large data sets -- the optSNE preprint has been out there for long enough (https://www.biorxiv.org/content/10.1101/451690v1). Anna Belkina is presenting at CYTO next week and will hopefully share some of her data.
<<

AnnaBelkina

Contributor

Posts: 23

Joined: Sat Nov 01, 2014 7:07 pm

Post Fri Jun 14, 2019 1:42 pm

Re: Large analysis of tSNE and UMAP parameters

Thanks for the plug, El-Ad :)
That's right, I am presenting our opt-SNE (the most up to date version of the paper is V3, https://www.biorxiv.org/content/10.1101/451690v3) in the parallel sessions on Wednesday (10:30-12, Parallel Session 17: Application of Algorithms).
Also, earlier, on Saturday (10:30-12), Josef Spidlen and I are giving a scientific tutorial on t-SNE optimizations, and he has some neat data on opt-SNE and/vs UMAP. We will be talking about opt-SNE, FIt-SNE, some tweaks to nearest neighbor approximations, and parameter titration in UMAP. Please come over!

To the point - James's smallvis page is a goldmine, please make sure you see it in full: https://jlmelville.github.io/smallvis. It's very up to date, and he tests pretty much every flavor of dimensionality reduction method similar to t-SNE and beyond. His explanations of the methodology are also excellent.

BTW, he has implemented opt-SNE in smallvis and published (very recently, I believe!) his step-by-step analysis of the optimization strategy here:
https://jlmelville.github.io/smallvis/optsne.html

I guess the line that El-Ad pointed to just needs to be synchronized with the more recently updated parts of the smallvis project.
<<

sedejong

Participant

Posts: 7

Joined: Wed Mar 04, 2015 3:42 pm

Post Tue Jun 18, 2019 6:10 pm

Re: Large analysis of tSNE and UMAP parameters

I would like to bring your attention to HSNE, which is a tSNE version/alternative that can handle millions of cells as well. HSNE is incorporated in Cytosplore, which does not require programming knowledge, has many visualization and analysis options (Approximated-tSNE, SPADE, clustering, linked heatmaps), and can be downloaded at https://www.cytosplore.org/.
Last edited by sedejong on Tue Jun 18, 2019 9:44 pm, edited 1 time in total.
<<

sgranjeaud

Master

Posts: 123

Joined: Wed Dec 21, 2016 9:22 pm

Location: Marseille, France

Post Tue Jun 18, 2019 7:28 pm

Re: Large analysis of tSNE and UMAP parameters

Thank you for the reminder about Cytosplore. It is definitely an easy way to analyze the data.
Did you look at the code before claiming there are no parameters? I didn't, but if there are any, I guess Thomas chose them wisely.
IMHO, most algorithms have parameters. Sometimes they can be adjusted in the interface, sometimes only some of them are presented, sometimes they are hidden.
Best,
Samuel
<<

AnnaBelkina

Contributor

Posts: 23

Joined: Sat Nov 01, 2014 7:07 pm

Post Tue Jun 18, 2019 9:54 pm

Re: Large analysis of tSNE and UMAP parameters

I don't think the Cytosplore code is open. In the GUI package, I do not recall having detailed t-SNE/HSNE or A-tSNE parameters exposed. Of course they are still there even if not explicitly available as user-accessible knobs - in that case, they are hardcoded in Cytosplore and we cannot see them.

In the Cytosplore paper (https://www.thomashollt.com/assets/pape ... splore.pdf) they say that "Scalability (in terms of data points) is still limited by the input size for A-tSNE. In our tests, tSNE is not only a limiting factor in terms of computational performance, but the embedding quality also quickly degenerates when going beyond a few million data points". I guess this can be helped now. Also, Cytosplore is way more than a tSNE wrapper and deserves even more attention for the other things it does.
<<

PaulNL

Participant

Posts: 5

Joined: Thu Sep 13, 2018 9:54 am

Post Fri Jun 21, 2019 9:09 am

Re: Large analysis of tSNE and UMAP parameters

HSNE most definitely has parameters:

Beta threshold for landmark selection (default: 1.5).
Number of neighbors considered in the first level, aka perplexity*3 (default: 30).
Number of trees used for the AKNN computation (default: 4).
Number of checks used for the AKNN computation (default: 1024).
Pruning threshold for FMC computation, aka saves memory if greater than 0 (default: 1.5, be careful in increasing it, probably max 5).
Number of walks used for the selection of landmarks (default: 200).
Number of walks used for the selection of landmarks (default: 200). (area of influence values)
Number of scales (default: log10(num_data_points )).
Apply a min-max normalization.
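
To make two of the defaults above concrete, here is a minimal Python sketch of how the "perplexity*3" neighbor count and the "log10(num_data_points)" number of scales could be computed. The function names are hypothetical and are not taken from the HSNE or Cytosplore code, and truncating log10 to an integer (with a minimum of 1 scale) is an assumption, since the list does not say how the value is rounded.

```python
import math


def neighbors_from_perplexity(perplexity: float) -> int:
    """Neighbor count for the first level, read as perplexity * 3.

    Hypothetical helper: the factor 3 comes from the parameter list,
    rounding to the nearest integer is an assumption.
    """
    if perplexity <= 0:
        raise ValueError("perplexity must be positive")
    return int(round(3 * perplexity))


def default_num_scales(num_data_points: int) -> int:
    """Default number of scales, read as log10(num_data_points).

    Assumption: the value is truncated to an integer and never drops
    below 1; the actual implementation may round differently.
    """
    if num_data_points < 1:
        raise ValueError("num_data_points must be at least 1")
    return max(1, int(math.log10(num_data_points)))


if __name__ == "__main__":
    # A perplexity of 10 corresponds to the listed default of 30 neighbors.
    print(neighbors_from_perplexity(10))
    # Under the truncation assumption, 5 million events give 6 scales.
    print(default_num_scales(5_000_000))
```

Under these assumptions, a default of 30 neighbors corresponds to a perplexity of 10, and the number of scales grows by only one for each tenfold increase in the number of events.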

I cannot check right now as I am on Ubuntu, but I believe some of these parameters are exposed to the user via an "advanced" button or tab. I'm also not sure whether these defaults are the same ones Cytosplore uses.
