t-SNE: data visualisation using t-distributed Stochastic Neighbour Embedding

Maria Boryslawska
Aug 25, 2021

Overview

T-distributed Stochastic Neighbour Embedding (t-SNE) is a probabilistic algorithm used for dimensionality reduction. It computes a measure of similarity between pairs of points in the original high-dimensional space, and another between the corresponding points in the low-dimensional embedding. It then optimises the embedding so that these two sets of similarities match as closely as possible, by minimising the Kullback-Leibler divergence between them. Put simply, t-SNE gives you an intuition of how the data is distributed in space.
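In slightly more detail, the standard formulation from the original paper defines Gaussian-based similarities in the high-dimensional space, Student-t based similarities in the embedding, and minimises the KL divergence between the two:

```latex
% High-dimensional similarities (Gaussian kernel; sigma_i is set by the perplexity):
p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}
               {\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)},
\qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities (Student-t kernel with one degree of freedom):
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}
              {\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

% Cost function minimised by gradient descent:
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```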

The method was introduced in 2008 by Laurens van der Maaten and Geoffrey Hinton. You may ask, “Why should I care? I already know PCA!” And that would be a great question.

t-SNE, unlike PCA, preserves local pairwise similarities rather than global variance, so it can capture complex non-linear relationships between features. It allows us to separate data that cannot be separated by any straight line, for example:

Linearly inseparable data, Source: https://distill.pub/2016/misread-tsne/ CC-BY 2.0

On the other hand, t-SNE is computationally expensive. For large samples (> 100,000 points) with many dimensions (> 100), computing a t-SNE embedding can take several hours, whereas PCA finishes in seconds or minutes.
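As a rough illustration, here is a minimal timing sketch on synthetic data (the sample size is kept small so the snippet finishes quickly; absolute runtimes will of course vary with hardware):

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic data: 5,000 samples with 50 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 50))

start = time.perf_counter()
PCA(n_components=2).fit_transform(X_demo)
print(f"PCA:   {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
TSNE(n_components=2, random_state=0).fit_transform(X_demo)
print(f"t-SNE: {time.perf_counter() - start:.2f} s")
```

Even at this modest size, t-SNE is typically orders of magnitude slower than PCA, and the gap grows quickly with the number of samples.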

It is also worth mentioning that t-SNE can help suggest the number of clusters when exploring a clustering problem. However, bear in mind that t-SNE is not a clustering method itself and should be used for exploration only.

Example in Python

For this example, the Breast Cancer Wisconsin dataset from the sklearn library was used. It’s a very easy binary classification set with 569 samples in 2 classes, distinguishing malignant from benign breast cancer diagnoses.
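A minimal way to load it, using sklearn’s built-in loader:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target   # X has shape (569, 30); y is 0/1
print(data.target_names)        # ['malignant' 'benign']
```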

The sklearn class TSNE has a couple of adjustable hyperparameters. The most important ones, tested here, are perplexity and n_components. n_components sets the dimension of the lower-dimensional space into which you want to embed your dataset. Perplexity loosely controls how many nearest neighbours each point takes into account when the similarities are computed.

In this example, the random state was set to 0. random_state seeds the internal random number generator that controls the random initialisation of the embedding, which makes the results reproducible between runs.

It is generally advisable to choose a perplexity value between 5 and 50. The right value depends on the dataset and should be tuned on a case-by-case basis. Here, four perplexity values were chosen: 5, 10, 30 and 50.
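A minimal sketch of the experiment, assuming the X and y loaded above (the exact figure styling in the original may differ):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, perplexity in zip(axes.flat, [5, 10, 30, 50]):
    # n_components=2 embeds into two dimensions; random_state=0 fixes the initialisation
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(X)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="coolwarm", s=8)
    ax.set_title(f"perplexity = {perplexity}")
plt.tight_layout()
plt.show()
```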

It can be seen that with smaller perplexities, local variation dominates, whereas with larger perplexity values, global structure dominates. Make sure you take perplexity into account when plotting your t-SNE graphs!

Now we’re going to run t-SNE four times with exactly the same parameters, but without fixing the random seed.
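Continuing from the snippet above, a sketch of the repeated runs (init="random" is passed explicitly, since newer sklearn versions default to the deterministic PCA initialisation):

```python
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for run, ax in enumerate(axes.flat, start=1):
    # No random_state here: each run starts from a different random initialisation
    embedding = TSNE(n_components=2, perplexity=30,
                     init="random").fit_transform(X)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="coolwarm", s=8)
    ax.set_title(f"run {run}")
plt.tight_layout()
plt.show()
```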

We can see that not setting the random seed produces 4 different embeddings of the same dataset.

The random initialisation and t-SNE’s non-convex cost function mean that every run can converge to a different local minimum and so produce a different result. This highlights the stochastic nature of the algorithm.

Maria Boryslawska

Data Scientist and NLP Researcher. Experienced in conducting transdisciplinary research across the field of Machine Learning and building various models.