Update of Rphenograph and Rtsne
I hope everybody is safe and back at work.
I have finished implementing some improvements in Rphenograph; the new features are listed below.
For the geeks who want to test and report problems, the code is available at:
https://github.com/i-cyto/Rphenograph
I implemented the really delicious optsne in Rtsne some weeks ago and I am waiting for the approval of the author/maintainer of Rtsne.
For the geeks who want to test and report problems, the code is available at:
https://github.com/SamGG/Rtsne
Please note that I am not a maestro of C coding or OpenMP, so use these packages at your own risk.
Installation commands are at the end of this message.
I will integrate them in a new release of cytofkit by the end of June.
On my laptop, with a dataset of 300 k datapoints and 11 dimensions (part of the flow18 dataset from Belkina et al), using 30 nearest neighbors:
optsne is computed in 960 sec, similar to Python.
Rphenograph is computed in 210 sec, similar to PARC (195 sec), whereas the original Phenograph implementations take 440 sec in R and 1096 sec in Python. The original Python Phenograph carries out 32 Louvain iterations, but this information is not available in the igraph implementation of Louvain, as far as I know. PARC carries out 5 iterations of the Leiden algorithm by default. When Python Phenograph is set up to carry out 5 iterations of Leiden, it takes nearly the same amount of time as PARC. I leave you the pleasure of checking which result you prefer.
Note that Jan Stuchly wrote an Rphenograph implementation using the Annoy library in which the whole process is fully parallelized, which is not the case for mine.
https://github.com/stuchly/Rphenoannoy
I thank Etienne, Tom, Josef Spidlen, Chris Ciccolella, James Melville and Jan Stuchly for exchanges, and I thank my employer, Inserm.
Rphenograph new features
* S. Thomas Kelly added pruning and graph clustering methods. This now relies on igraph functions to simplify the graph.
* Etienne K. Becht added approximate HNSW nearest neighbors for speed. The RcppHNSW package is as fast as Python on 1 core, and multi-core support is currently being developed.
* Etienne K. Becht noticed that some points are not reported because they don't share any neighbors with their neighbors. This is now integrated in the C code.
* Louvain is the default graph clustering method. Any clustering function of the (r)igraph package can be specified. Leiden is not yet available.
* S. Granjeaud improved the Jaccard_coefficient function by pre-sorting nearest indices in the C code. Now the computation takes only a few seconds for a dataset of 300 k datapoints and 30 nearest neighbors.
* The original implementation of the Jaccard coefficient removes the two cells when looking at the intersection of their neighbors. You can decide to keep them.
* A parameter allows reporting only some of the k NN instead of all of them. If k is set to 30, the Jaccard is still computed on 30 kNN, but only the 10 nearest could be reported. This lowers the number of edges in the graph, which speeds up the clustering and allows a finer clustering.
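
To make the Jaccard-related points above more concrete, here is a minimal sketch in plain R. It is not the package code: the function names (count_shared_sorted, jaccard_edges), the arguments (keep_self, n_report) and the toy kNN matrix are all hypothetical and only illustrate the ideas. It assumes that each row of the kNN matrix lists the neighbors of a point by increasing distance, without the point itself.

```r
# Toy kNN index matrix (hypothetical data): row i lists the 3 nearest
# neighbors of point i, ordered by increasing distance, self excluded.
knn <- rbind(
  c(2L, 3L, 4L),
  c(1L, 3L, 4L),
  c(1L, 2L, 4L),
  c(1L, 2L, 3L),
  c(6L, 1L, 2L),
  c(5L, 1L, 2L)
)

# Count the elements shared by two SORTED integer vectors with a single
# linear merge pass; sorting beforehand is what makes this cheap.
count_shared_sorted <- function(a, b) {
  i <- 1L; j <- 1L; shared <- 0L
  while (i <= length(a) && j <= length(b)) {
    if (a[i] == b[j]) {
      shared <- shared + 1L; i <- i + 1L; j <- j + 1L
    } else if (a[i] < b[j]) {
      i <- i + 1L
    } else {
      j <- j + 1L
    }
  }
  shared
}

# Build a weighted edge list from the kNN matrix.
# keep_self: if TRUE, each point is added to its own neighbor set, so the
#            two points of an edge can count in the intersection.
# n_report:  number of nearest neighbors reported as edges; the Jaccard
#            coefficient is still computed on all k neighbors.
jaccard_edges <- function(knn, keep_self = FALSE, n_report = ncol(knn)) {
  n <- nrow(knn)
  # Sort every neighbor set once, before any pairwise comparison.
  sets <- lapply(seq_len(n), function(i) {
    s <- knn[i, ]
    if (keep_self) s <- c(i, s)
    sort(s)
  })
  from <- rep(seq_len(n), each = n_report)
  to <- as.vector(t(knn[, seq_len(n_report), drop = FALSE]))
  weight <- mapply(function(i, j) {
    u <- count_shared_sorted(sets[[i]], sets[[j]])
    u / (length(sets[[i]]) + length(sets[[j]]) - u)
  }, from, to)
  data.frame(from = from, to = to, weight = weight)
}

edges_all  <- jaccard_edges(knn)                    # k edges per point
edges_self <- jaccard_edges(knn, keep_self = TRUE)  # keep the two cells
edges_top1 <- jaccard_edges(knn, n_report = 1)      # 1 edge per point
```

This sketch keeps every edge, including those with a weight of zero. Dropping zero-weight edges is what can make a point vanish from the graph when it shares no neighbor with its neighbors. An edge list of this shape could then be turned into a graph and clustered with igraph functions such as graph_from_data_frame(), simplify() and cluster_louvain().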
Windows installations
To ease Windows installation and avoid the need to install Rtools, I released Windows binaries for 64-bit installations of R (3.6.x, 4.0.x).
install.packages("https://github.com/i-cyto/Rphenograph/releases/download/Rphenograph_0.99.1.9003/Rphenograph_0.99.1.9003.zip", repos = NULL, type = "win.binary")
install.packages("https://github.com/SamGG/Rtsne/releases/download/v0.15.0.9001/Rtsne_0.15.0.9001.zip", repos = NULL, type = "win.binary")
Linux/Mac installations
The following commands should work. If not, open an issue on GitHub. I don't have access to such machines, so help will be limited.
devtools::install_github("i-cyto/Rphenograph")
devtools::install_github("SamGG/Rtsne")