UMAP is awesome, and we usually recommend it to people who are struggling with t-SNE performance issues. If you have access to a GPU, it's worth checking out cuML's UMAP implementation (https://rapidsai.github.io/projects/cuml/en/0.10.0/api.html#...), which is closely based on McInnes' original Python code but is much faster.
How do you people who work with high-dimensional data outside of biology feel about t-SNE and/or UMAP?
Some of the points against t-SNE feel like comments that only non-computer scientists would make (e.g., t-SNE must be run on a cluster/needs a lot of RAM - despite the fact that rather few genetic datasets can be analyzed on the commodity laptops most common among biologists).
We've been doing it increasingly w/ our users at Graphistry. Think security events, fraud accounts, etc. The reason we do graph to begin with is to help explore relationships & correlations in data in an explainable & interactive way, and GPU-accelerated UMAP over interactive GPU visual analytics helps us bring a bit of that to metadata that doesn't work as well as a graph (continuous dimensions vs. categorical). Loading in 1M flagged accounts or events and visually drilling in is pretty magical :)
Explainability is still worse than what users get with our graph and temporal methods, and both t-SNE and UMAP struggle on categorical data (the most important aspect of most operational data). The good news is that neither problem seems inherent. UMAP is definitely on the shortlist of methods we've been tracking here.
I'm hoping to demo some externally accessible IT examples & workflows here by end-of-month. This stuff is very cool, so we're aiming to make tools here easier / more reliable / more practical for operational teams.
Very interesting read. I take this to mean that clustering (e.g. HDBSCAN) on UMAP-projected data makes at least some sense (unlike with t-SNE). Are there any differing opinions on this? Interesting related discussions: https://stats.stackexchange.com/questions/263539/clustering-...
Basically, the answer is yes you can do this, but verify and analyze the output to ensure it makes sense (e.g. coloring points by known features/labels). For example, if you have a small number of points in the dataset (<1000), UMAP tends to display a dense cluster that is quite separated from the remaining data. However, this apparent cluster is spurious and contains noisy data points that UMAP couldn’t “figure out what to do with” (they are similar in their dissimilarity to the other data).
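One way to do the label check described above without plotting is to cross-tabulate cluster assignments against the known labels and look at how pure each cluster is. The following is only a minimal sketch: the function name `cluster_purity` and the toy data are made up for illustration, and what counts as "too impure" is a judgment call for your own data.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, known_labels):
    """Summarise each cluster as (size, majority_label, purity).

    purity is the fraction of points in the cluster that carry the
    majority label; a low value hints that the cluster may be a grab-bag
    of unrelated points rather than a real group.
    """
    by_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, known_labels):
        by_cluster[cid][label] += 1

    report = {}
    for cid, counts in by_cluster.items():
        size = sum(counts.values())
        majority_label, majority_count = counts.most_common(1)[0]
        report[cid] = (size, majority_label, majority_count / size)
    return report


# Toy example (hypothetical data): cluster 2 mixes three labels equally.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
labels = ["a", "a", "a", "b", "b", "b", "a", "a", "b", "c"]
for cid, (size, label, purity) in sorted(cluster_purity(clusters, labels).items()):
    print(f"cluster {cid}: n={size}, majority={label}, purity={purity:.2f}")
```

A tight, well-separated blob in the embedding that scores low on a check like this is a candidate for the spurious "leftover" cluster mentioned above.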
We recently used t-SNE and UMAP in the field of radiation damage, for visualising the damage shapes (point defect clusters of different shapes). The results were interesting in many ways.
We found that the general layout of the classes of shapes is more or less the same in UMAP and t-SNE, so the global vs. local relationship argument didn't really apply to our data. Since the data was around 1000 points of around 50-dimensional histograms, efficiency was also not a big distinction. For us, the real advantage of UMAP turned out to be how well it works with HDBSCAN and its ability to embed new test data. We are excited to use it for further categorising cascade and sub-cascade shapes (bigger damage areas).
To check the results, see the following:
- https://haptork.github.io/csaransh/presentation/index.html#/... : select t-SNE or UMAP in the left pane.
- https://haptork.github.io/csaransh/ : go to the last pane, "Cluster Classes", and click on a point to see the shape on the right. Select between t-SNE and UMAP in the left pane. (might take time to load)
- https://arxiv.org/abs/1811.10923 : arXiv paper
- https://github.com/haptork/csaransh : GitHub repo
Ideas and suggestions are welcome.