Clustering Tales

A simple introduction to clustering

Athina B
6 min read · Oct 16, 2019
Don Morris Art, UnOrganized Chaos

“Data, information and knowledge: have we got it right?”
Max Boisot, Agustí Canals

Clustering relates data to knowledge and is a basic human activity: human brains are naturally good at identifying groups and patterns. It affects knowledge representation and discovery, and it is pervasive.

Cluster Analysis, Data Clustering, or simply Clustering is the task of organizing data into sensible groupings (called clusters) in such a way that objects in the same group are more similar to each other, according to measured or perceived intrinsic characteristics, than to those in other groups. The aim of clustering is to find structure in data; it is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simplest clustering algorithms, K-means, dates back to the mid-1950s; the term “clustering” itself first appeared in 1954, in an article dealing with anthropological data.
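To make this concrete, here is a minimal K-means sketch in Python. It uses scikit-learn and synthetic blob data, both my own choices for illustration rather than anything the article prescribes:

```python
# Minimal K-means sketch on synthetic data (scikit-learn; illustrative only).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 two-dimensional points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-means with K=3 and assign each point to a cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])                    # cluster index (0, 1, or 2) per point
print(kmeans.cluster_centers_.shape)  # one 2-D center per cluster
```

Note that K-means needs K as an input and only finds the groups; deciding what those groups mean is still up to the analyst.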

In machine learning, clustering is an unsupervised learning approach. The difficulty with unsupervised clustering is that there is a huge number of possibilities for what can be done with the result, and no abstraction akin to a loss function that distills the end user's intent. Depending on the use to which a clustering is put, the same clustering can be either helpful or useless. It is often presumed that for any situation where clustering may be used there is a single “right” clustering. Some take this further and maintain that the right answer is determined by the data alone, without reference to intended use: “the data should vote for their preferred model type and model complexity”. This view seems to rest on the notion that categories exist independently of human experience and intent, and such absolute approaches have been convincingly discredited.

It’s worth mentioning that clustering algorithms cannot be evaluated in a problem-independent way: whether a clustering of a specific data set is good or bad (useful or not) cannot be judged without taking into account what we want to do with the clustering once we have it. And how do we define similarity? Most of the time the ideal grouping is subjective, that is, in the eye of the beholder, and its significance and interpretation require domain knowledge. Still, several techniques exist with which we can evaluate the quality of clusters.

Clustering algorithms have been widely used in many applications such as Business Intelligence, Image Pattern Recognition, Web Search, Biology, and Security. As a data mining function, clustering can be used as a standalone tool to gain insight into the distribution of data and to observe the characteristics of each cluster, or it may serve as a preprocessing step for other algorithms, which then operate on the detected clusters.

Clustering algorithms can be categorized based on their cluster model:

  1. Partitioning methods
    They find all the clusters simultaneously by partitioning the data.
  2. Hierarchical methods
    They recursively find nested partitions (clusters) either in agglomerative mode (starting with each data point in its own cluster and then merge these atomic clusters into larger and larger clusters) or in divisive mode (starting with all the data points in one cluster and recursively dividing each cluster into smaller clusters).
  3. Density-based methods
They search the data space for regions of varying point density, identifying clusters as dense regions separated by sparse regions.
  4. Grid-based methods
    They quantize the data space into a finite number of cells that form a grid structure and then merge the cells to build clusters.
  5. Distribution methods
They are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, a Gaussian distribution).
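Four of these five families can be sketched side by side with scikit-learn (a library choice of mine, not the article's), running one representative algorithm per family on the same synthetic data; grid-based methods are omitted because scikit-learn does not ship an implementation of them:

```python
# One representative algorithm per family, on the same synthetic data
# (scikit-learn; parameters are illustrative, not tuned recommendations).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)

partitioning = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hierarchical = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # agglomerative mode
density = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)              # label -1 marks noise
distribution = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

for name, labels in [("partitioning", partitioning), ("hierarchical", hierarchical),
                     ("density-based", density), ("distribution", distribution)]:
    print(f"{name:14s} found {len(set(labels) - {-1})} clusters")
```

Note the practical difference: K-means, agglomerative clustering, and the Gaussian mixture all take the number of clusters as input, while DBSCAN infers it from the density parameters and may flag some points as noise.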

A good clustering algorithm is expected to have the following features:

  1. The ability to detect clusters with various shapes and different distributions.
  2. The capability of finding clusters with considerably different sizes.
  3. The ability to work when outliers are present.
  4. Few or no parameters needed as input.
  5. Scalability in both the size and the dimensionality of the data.

Also, the following fundamental challenges associated with clustering need to be addressed:

  1. What features should be used?
    Data representation is one of the most important factors that influence the performance of a clustering algorithm. If the representation — choice of features — is good, the clusters are likely to be compact and isolated. Unfortunately, there is no universally good representation. The representation must go hand in hand with the end goal of the user.
  2. Should the data be normalized?
    Normalization gives the same importance to all the variables. Normalizing before clustering is specifically needed for distance metrics, like the Euclidean distance, that are sensitive to differences in the magnitude or scale of the attributes.
  3. Does the data contain any outliers?
    Clustering and outlier detection are strongly coupled tasks in data mining. Outliers are defined with respect to the concept of a cluster: they are the points that belong to none of the clusters. For example, the K-means algorithm updates each cluster center by taking the average of all the data points that are closest to it, and the average, as a statistical measure, is generally sensitive to outliers. When outliers are present, they distort this average and pull the cluster center towards themselves. Depending on your application, it may be worth using a different approach than the K-means algorithm.
  4. How many clusters are present in the data?
    Automatically determining the number of clusters has been one of the most difficult problems in data clustering. Usually, clustering algorithms are run with different values of K, and the best value of K is then chosen based on a predefined criterion.
  5. Which clustering method should be used?
    Different clustering algorithms often produce entirely different partitions, even on the same data. There is no single best clustering algorithm: each clustering algorithm imposes a structure on the data, either explicitly or implicitly. When there is a good match between the model and the data, good partitions are obtained. Since the structure of the data is not known a priori, one needs to try competing and diverse approaches to determine an appropriate algorithm for the clustering task at hand.
  6. Does the data have any clustering tendency?
    Clustering algorithms tend to find clusters in the data irrespective of whether or not any clusters are present, so it is worth assessing the clustering tendency of the data first, for example with the Hopkins statistic.
  7. Are the discovered clusters and partition valid?
    Evaluating the performance of a clustering algorithm is not trivial. Any evaluation metric should take into account whether the clustering defines separations of the data similar to some ground-truth set of classes, or whether it satisfies some assumption such that members of the same class are more similar to each other than to members of different classes, according to some similarity metric.
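Two of the challenges above, normalizing the data and choosing the number of clusters, can be sketched together: standardize the features, run K-means for several values of K, and keep the K with the best silhouette score. This is only one possible predefined criterion, shown here with scikit-learn on synthetic data:

```python
# Normalize first, then pick K by silhouette score
# (scikit-learn; synthetic data, illustrative criterion).
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

# Try several candidate values of K and score each partition.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)  # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
print(f"best K by silhouette: {best_k}")
```

The silhouette score rewards compact, well-separated clusters; other criteria (the elbow method, gap statistic, information criteria for mixture models) can prefer different values of K on the same data.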

It is often advisable to be careful when confronted with strongly correlated variables; some data analysts may decide to simply remove one of the variables. This is not always the best option, as important information may thereby be discarded. Instead, it is suggested that Principal Component Analysis (PCA) be applied, so that the common variability in correlated predictors is translated into a set of uncorrelated principal components.
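The PCA-before-clustering idea can be sketched as follows; the two strongly correlated features and the use of scikit-learn are my own illustrative assumptions:

```python
# Replace two strongly correlated features with uncorrelated principal
# components before clustering (scikit-learn; synthetic data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# Second feature is almost a copy of the first (correlation near 1).
X = np.column_stack([x, 0.9 * x + rng.normal(scale=0.1, size=500)])

# Principal components are uncorrelated by construction.
components = PCA(n_components=2).fit_transform(X)
corr = np.corrcoef(components[:, 0], components[:, 1])[0, 1]
print(f"correlation between principal components: {corr:.3f}")

# Cluster on the uncorrelated components instead of the raw features.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(components)
```

If interpretability matters, note the trade-off: the components are linear combinations of the original variables, so clusters found in component space are harder to explain in terms of the original features.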

Is clustering art, or science, or both?

The tendency to classify and categorize objects is a deeply ingrained aspect of human nature. It’s a survival skill that allows our brains to make sense of an endless flow of data with the least possible amount of detail.

When we observe a person or an object, we latch on to a few overt details and then use that information to assign that person or thing to a place in our mental library of categories. For example, if we see a man in a swimsuit in a swimming pool, we assign him to the category of “athlete”. If we don’t have the time or opportunity to interact further, this generality is all we will know.

The whole is other than the sum of the parts.

- Kurt Koffka

This categorization process seems effortless to us, but it is actually a highly sophisticated skill.

Why categorize?

Categorization helps us to organize our world and our thoughts, to remember things, to navigate our world.
