How UMAP Works (towardsdatascience.com)
68 points by anigbrowl on Oct 6, 2019 | 14 comments


Nice article.

We recently used t-SNE and UMAP in the field of radiation damage, for visualising the damage shapes (point defect clusters of different shapes). The results were interesting in many ways.

We found that the general layout of classes of shapes is more or less the same in UMAP and t-SNE, so the global vs. local relationship argument didn't really apply to our data. Since the data was around 1000 points of roughly 50-dimensional histograms, efficiency was also not a big distinction. For us, the real advantage of UMAP turned out to be how well it works with HDBSCAN and with embedding new test data. We are excited to use it for further categorising cascade and sub-cascade shapes (bigger damage areas).

To check results see the following:

- https://haptork.github.io/csaransh/presentation/index.html#/... : select the t-SNE or UMAP on left pane.

- https://haptork.github.io/csaransh/ : Go to last pane "Cluster Classes", click on a point to see the shape on right. Select between t-SNE or UMAP on the left pane. (might take time to load)

- https://arxiv.org/abs/1811.10923 : arXiv paper

- https://github.com/haptork/csaransh : GitHub repo

Ideas and suggestions are welcome.


UMAP is awesome, and we usually recommend it to people who are struggling with t-SNE performance issues. If you have access to a GPU, it's worth checking out cuML's UMAP implementation (https://rapidsai.github.io/projects/cuml/en/0.10.0/api.html#...), which is closely based on McInnes' original Python code but is much faster.


How do you people who work with high-dimensional data outside of biology feel about t-SNE and/or UMAP?

Some of the points against t-SNE feel like comments that only non-computer scientists would make (e.g., t-SNE must be run on a cluster/needs a lot of RAM - despite the fact that rather few genetic datasets can be analyzed on the commodity laptops most common among biologists).


We've been doing it increasingly w/ our users at Graphistry. Think security events, fraud accounts, etc. The reason we do graph to begin with is to help explore relationships & correlations in data in an explainable & interactive way, and GPU-accelerated UMAP over interactive GPU visual analytics helps us bring a bit of that to metadata that doesn't work as well as a graph (continuous dimensions vs. categorical). Loading in 1M flagged accounts or events and visually drilling in is pretty magical :)

Explainability is still worse than what users get with our graph and temporal methods, and both t-SNE and UMAP struggle on categorical data (the most important aspects of most operational data). The good news is neither problem seems inherent. UMAP is definitely on the shortlist of methods we've been tracking here.

I'm hoping to demo some externally accessible IT examples & workflows here by end-of-month. This stuff is very cool, so we're aiming to make tools here easier / more reliable / more practical for operational teams.


UMAP is basically one of the most innovative things I've ever seen. It's widely used in NLP with extremely good results.


Example? Anything besides visualizing embedding spaces?


Any type of clustering


UMAP has a reputation for being less "fussy" with parameter choices.


Very interesting read. My interpretation is that clustering (e.g. HDBSCAN) on UMAP-projected data makes at least some sense (unlike with t-SNE). Are there any differing opinions on this? Interesting related discussion: https://stats.stackexchange.com/questions/263539/clustering-...


Here’s a pretty comprehensive answer on the topic from the original UMAP author: https://github.com/lmcinnes/umap/issues/25

Clustering the output of UMAP is also given a nice tutorial in the docs: https://umap-learn.readthedocs.io/en/latest/clustering.html

Basically, the answer is yes you can do this, but verify and analyze the output to ensure it makes sense (e.g. coloring points by known features/labels). For example, if you have a small number of points in the dataset (<1000), UMAP tends to display a dense cluster that is quite separated from the remaining data. However, this apparent cluster is spurious and contains noisy data points that UMAP couldn’t “figure out what to do with” (they are similar in their dissimilarity to the other data).
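
A numeric counterpart to "coloring points by known labels" is to check how mixed each cluster is. The sketch below is one simple way to do that, assuming you already have cluster ids (for example from HDBSCAN, which marks noise points as -1) and a known label per point. The helper name `cluster_purity` and the toy inputs are made up for illustration; they are not from the UMAP docs or the linked issue.

```python
from collections import Counter, defaultdict

NOISE = -1  # HDBSCAN's label for points it assigns to no cluster


def cluster_purity(cluster_ids, known_labels):
    """Return {cluster_id: (size, majority_label, purity)} for non-noise clusters.

    Purity is the fraction of a cluster's points that share its most common
    known label; 1.0 means the cluster contains a single label.
    """
    if len(cluster_ids) != len(known_labels):
        raise ValueError("cluster_ids and known_labels must have the same length")

    # Count known labels inside each cluster, skipping noise points.
    members = defaultdict(Counter)
    for cid, label in zip(cluster_ids, known_labels):
        if cid != NOISE:
            members[cid][label] += 1

    report = {}
    for cid, counts in members.items():
        size = sum(counts.values())
        majority_label, majority_count = counts.most_common(1)[0]
        report[cid] = (size, majority_label, majority_count / size)
    return report


# Toy, made-up inputs: cluster 0 is clean, cluster 1 mixes three labels.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, -1, -1]
known_labels = ["a", "a", "a", "a", "a", "b", "c", "b", "c", "a"]

report = cluster_purity(cluster_ids, known_labels)
for cid in sorted(report):
    size, label, purity = report[cid]
    print(f"cluster {cid}: size={size} majority={label} purity={purity:.2f}")
```

A well-separated cluster with low purity is a candidate for the kind of spurious group described above, though a low score alone does not prove it; it is only a prompt to look closer.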


Mediumwalled - I can't read this without signing in to something. Does anyone happen to know if there are cookies I can clear to work around this?


Outline is another great option - https://outline.com/5MNPHn


Me too. I used Firefox Focus to read it. You could use private mode on your browser too.


You can use private browsing (or at least this has been my experience so far).



