Clustering and how to begin

Important steps to consider

Athina B
3 min read · May 7, 2020

Can one talk about properties that are sufficient (or both necessary and sufficient) for natural algorithms to succeed?

Before diving into the details of how to optimize the clustering, let's step back and think about the problem we need to solve. “Clustering” is about grouping similar or related objects.

Cluster tendency

First of all, even before we start thinking about clustering, the question is: do the data have a tendency to form clusters at all? If the answer is no, we stop; if the answer is yes, we can move on to look for meaningful clusters.

Let’s look at an example. We used a dataset with the geolocations of 300 places in the US. As a first step of the analysis, we would like to see whether there is a tendency to form clusters, such that locations in the same cluster are geographically close to each other. Since our data already live in a low-dimensional space (2-D), we can plot the points directly in a contour plot. What we observe is 3 distinct, roughly spherical probability distributions, so we already have an intuition that our dataset can form 3 clusters. In the case of a high-dimensional space, we can use T-distributed Stochastic Neighbor Embedding (t-SNE), a machine learning algorithm for visualization, to model each high-dimensional object by a two-dimensional point. A sketch of this tendency check is given below.

Dataset information
Kernel density plot of the 2 features
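
Here is a minimal sketch of this check with pandas and seaborn; the file name places.csv and the column names latitude and longitude are assumptions for illustration, not the actual dataset.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file and column names standing in for the 300 US geolocations
df = pd.read_csv("places.csv")

# 2-D kernel density (contour) plot of the two coordinates
sns.kdeplot(data=df, x="longitude", y="latitude", fill=True)
plt.show()

# For high-dimensional data we could first embed the points into 2-D with t-SNE
from sklearn.manifold import TSNE
# embedding = TSNE(n_components=2).fit_transform(high_dim_features)
```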

Algorithm selection and tuning

But before proceeding to find clusters, we need to answer two critical questions: What do we mean by “similar”? And how many groups are there? We must also be aware that there will always be data points that do not fit into the determined groups and yet are important, so we need an algorithm or a structure that carries information about them. In that case we can start thinking of soft clustering, where each point receives a degree of membership in every cluster instead of a single hard assignment (a sketch follows below).
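
If we did decide that soft clustering is needed, a Gaussian mixture model is one common option. A minimal sketch with scikit-learn, reusing the coordinates loaded above (the column names remain assumptions):

```python
from sklearn.mixture import GaussianMixture

# Coordinates as a NumPy array (hypothetical column names)
X = df[["longitude", "latitude"]].values

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)

# One row per point, one column per cluster: soft membership probabilities
membership_probs = gmm.predict_proba(X)
```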

In our example, K-Means is probably a good solution: we are dealing with a low-dimensional dataset, roughly normal distributions, and points on the geographical coordinate system (latitude and longitude) projected onto the plane. K-Means can only detect linearly separable, convex clusters. To be able to detect non-convex clusters, we can use kernel K-Means; in fact, spectral clustering is a variant of kernel K-Means. Since K-Means relies on a distance metric, and on what represents a good measure of distance, it is usually necessary to scale the features appropriately. In our example, however, we use latitude and longitude to cluster cities based on east/west and north/south distances, so we may be happy to use unscaled distances in either kilometres or miles (though we might want to adjust degrees of longitude and latitude for the curvature of the Earth). In our case, we do not scale the features.

K-means algorithm with k=3
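
A minimal sketch of this step with scikit-learn, fitting K-Means with k=3 on the unscaled coordinates (the column names are assumptions); a kernelised alternative is noted in the comments.

```python
from sklearn.cluster import KMeans

# Unscaled coordinates (hypothetical column names)
X = df[["longitude", "latitude"]].values

kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(X)

# For non-convex clusters, a kernelised variant such as spectral clustering
# could be used instead:
# from sklearn.cluster import SpectralClustering
# labels = SpectralClustering(n_clusters=3).fit_predict(X)
```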

Now let’s visualize the clusters.
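
A minimal sketch of the map plot with matplotlib, reusing df, kmeans and labels from the sketches above:

```python
import matplotlib.pyplot as plt

# Colour each place by its cluster label and mark the cluster centers
plt.scatter(df["longitude"], df["latitude"], c=labels, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", label="cluster centers")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.legend()
plt.show()
```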

Clusters on the map

Cluster validity

Once we have found the clusters, the last question is: do we like these clusters? Is this a useful solution for us or not? This is the cluster validity step. If the answer to these questions is no, we can go back and look for clusters in a different way: we can try different features, different algorithms, a different number of clusters, and so forth.
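
One way to quantify this judgement is an internal validity index such as the silhouette score, where values close to 1 indicate compact, well-separated clusters. A minimal sketch, reusing X and labels from the sketches above:

```python
from sklearn.metrics import silhouette_score

score = silhouette_score(X, labels)
print(f"silhouette score: {score:.3f}")
```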

Different features contribute differently: some are important for forming the clusters, while others may hinder the clustering task. An efficient way of handling this is to select a subset of important features. But how do we know which ones are important? The traditional approach is to use Principal Component Analysis (PCA).
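
A minimal sketch of PCA with scikit-learn; the feature matrix below is a random stand-in (our toy example only has two features), and the number of retained components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

X_features = np.random.rand(300, 10)   # stand-in for a real, higher-dimensional feature matrix
pca = PCA(n_components=2)              # number of retained components is an assumption
X_reduced = pca.fit_transform(X_features)
print(pca.explained_variance_ratio_)   # variance explained by each retained component
```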

The quality of PCA’s results, however, depends on the data distribution, and the presence of noisy features degrades the performance of learning algorithms. For this reason, deep neural networks such as autoencoders (AE) can be used to learn better representations of the data.
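
A minimal sketch of such an autoencoder in Keras; the stand-in data, layer sizes and bottleneck dimension are all illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

X_features = np.random.rand(300, 20).astype("float32")  # stand-in feature matrix

inputs = tf.keras.Input(shape=(20,))
encoded = layers.Dense(8, activation="relu")(inputs)
bottleneck = layers.Dense(2, activation="relu")(encoded)   # compact representation
decoded = layers.Dense(8, activation="relu")(bottleneck)
outputs = layers.Dense(20, activation="linear")(decoded)

autoencoder = tf.keras.Model(inputs, outputs)
encoder = tf.keras.Model(inputs, bottleneck)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_features, X_features, epochs=20, batch_size=32, verbose=0)

X_representation = encoder.predict(X_features)  # representation to feed the clustering step
```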
