Clusters are everywhere. In school, students are placed in different grades and classes. In business, employees belong to different departments. How do we decide who goes where? Shared characteristics such as age, subject matter, or skill tell us who belongs together.
In the same way, data analysts cluster data based on similarities among data points. Though data clustering is more complex than “clustering” students or employees, the goal is the same. Data clusters show which data points are closely related so we can structure, analyze, and understand the dataset better.
But what exactly are data clusters? And how do we create them? This article defines data clusters, provides examples, and explains how we make them.
Note: data clusters are a slightly advanced topic. If this article is challenging for you, I recommend reading our free Intro to Data Analysis eBook to cover the basics first.
Written formally, a data cluster is a subpopulation of a larger dataset in which each data point is closer to the cluster center than to other cluster centers in the dataset — a closeness determined by iteratively minimizing squared distances in a process called cluster analysis.
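Written as a formula (using the k-means notation that the rest of this article walks through), the quantity being minimized is the total squared distance between each data point and its nearest cluster center:

```latex
\min_{\mu_1, \dots, \mu_k} \; \sum_{i=1}^{n} \min_{j} \, \lVert x_i - \mu_j \rVert^2
```

Here each x_i is a data point and each mu_j is a cluster center; the analysis searches for the set of centers that makes this total as small as possible.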
In addition to the above definition, it’s imperative to keep in mind the following truths about data clusters:
It’s important to note that the term “cluster” can also refer to data that are stored close together in a dataset. For example, SQL analysts may refer to a row in a dataset as a data cluster because it groups related data points.
Database engineers often group multiple datasets together for ease of access, and they refer to these as data clusters as well. If you’ve ever seen a data model, you can get a good idea of why data engineers call these clusters. Here’s an example data model:
Data Table A could be considered a cluster. Moreover, each of the data models and the database as a whole could also be considered “clusters.”
In most cases, a “cluster” refers to data points whose values are close together, but you should always keep in mind that professionals in various fields apply the word to their jobs in special ways — such as a database analyst who refers to a row as a cluster.
Imagine you have a pig farm with 15 pigs. You want to cluster them based on age and weight. This means you want to minimize the distance from each pig to a cluster center. Let's imagine you have a graph of the pigs' age and weight that looks like this:
You can already see how some of the points fit closely together. Visually, we can group some of them together, thereby creating data clusters. Here’s the same data with circles around the clusters.
It’s as simple as that. Each circle is an example of a data cluster. Keep in mind, however, that this is a relatively easy example for the following reasons:
So data clusters are pretty easy, right? We've created them visually, and the groupings look clear. However, we can't prove they're correct, and if our example were harder, we wouldn't be able to group the points by eye at all. So let's talk about formal cluster analysis now.
To really understand data clusters, we need to know how they're created: through cluster analysis. Cluster analysis is the process of creating data clusters by minimizing the distance between data points and a reference point, such as a cluster center.
There are several types of cluster analysis:
Each type of analysis has its advantages and disadvantages, but in industry the most common and most useful one is k-means clustering. Let's look at the data clusters in our pig example to understand it better.
K-means clustering starts from a presupposed number of clusters, then minimizes the distance from each data point in the set to the nearest of that many cluster centers. The key concept to understand in k-means clustering is that only the number of cluster centers is predetermined. It's only when a computer algorithm starts to minimize distances that we find out where those centers are located.
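To make this concrete, here is a minimal sketch of the standard k-means loop in Python with NumPy. The function name and defaults are my own for illustration, not something taken from the Excel workbook used later in this article; the point is simply that only k is fixed up front, and the center locations emerge from repeated assignment and averaging.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Bare-bones k-means: only k is fixed in advance; the centers emerge."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the average of the points assigned to it.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

With the pig data loaded as a two-column array of ages and weights, calling kmeans(pig_data, k=4) would return the 4 fitted centers and a cluster label for each pig.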
Disclaimer: the analysis below only starts to make sense when you get to the last step, so don't worry if it's not clear until you've gone through it entirely.
Let's look again at our example of the pigs to understand. The going-in number of clusters we'll look for is 4. Let's lay out the details here:
Once you have set up this layout, you need to add a cell that finds the minimum distance between each point and the cluster centers. We can do this using Excel's MIN() function. In addition, we want another cell to easily identify which cluster each point belongs to once Excel runs its calculation. We can do this using Excel's MATCH() function. Here's what these formulas look like:
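If you would rather see the same two lookups outside of Excel, here is a rough Python equivalent. The pig and cluster-center values below are placeholders, but the min and argmin calls play the roles of MIN() and MATCH():

```python
import numpy as np

# Placeholder values: one row per pig (age, weight) and one row per candidate cluster center.
pigs = np.array([[1.0, 30.0], [2.0, 50.0], [3.0, 70.0]])
centers = np.array([[1.5, 35.0], [2.5, 60.0], [4.0, 90.0], [3.0, 55.0]])

# Distance from every pig to every candidate center.
dists = np.linalg.norm(pigs[:, None, :] - centers[None, :, :], axis=2)

min_dist   = dists.min(axis=1)     # like Excel's MIN(): the closest distance for each pig
cluster_id = dists.argmin(axis=1)  # like Excel's MATCH(): which center that minimum belongs to
```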
Now that you've set this up, we're ready to let Excel minimize the distances. To do so, we need to use an Excel add-in called Solver. You can install it using this guide. Solver works by optimizing a target cell, changing a single cell or a range of cells, given a set of constraints.
In the below example, we'll set the target cell as the sum of the minimum distance cells in row 9. To do so, we'll tell Excel to modify the orange cluster cells. Let's assume our only constraint is that the values Solver produces must be less than our largest known value, which is cell P3: the weight of the pig named Kim.
In addition, you have to tell Solver whether the optimization will be linear or non-linear. Since Euclidean distance is non-linear, we need to use the Evolutionary setting you see in the picture below.
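For readers following along in code rather than Excel, the same setup can be sketched with an evolutionary optimizer from SciPy. The pig values and the single loose bound below are placeholders standing in for the spreadsheet cells and the "less than cell P3" constraint, and SciPy's differential evolution stands in for Excel's Evolutionary engine rather than being the exact same algorithm:

```python
import numpy as np
from scipy.optimize import differential_evolution

# Placeholder data: one row per pig, columns are age and weight.
pigs = np.array([[1, 30], [2, 45], [2, 55], [3, 70], [4, 90]], dtype=float)
K = 4  # the presupposed number of cluster centers

def total_min_distance(flat_centers):
    """The target cell Solver minimizes: the sum of each pig's Euclidean
    distance to its nearest cluster center (non-linear, hence Evolutionary)."""
    centers = flat_centers.reshape(K, 2)
    dists = np.linalg.norm(pigs[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).sum()

# A single loose constraint, analogous to "every value must be below the largest known value".
upper = pigs.max()
bounds = [(0.0, upper)] * (K * 2)

result = differential_evolution(total_min_distance, bounds, seed=0)
best_centers = result.x.reshape(K, 2)
```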
Now we have our data clusters. Solver calculated the optimal cluster centers by minimizing the distance between all of the data points and these 4 centers. Let's check how it worked graphically by creating a new scatter plot:
But wait, there's a problem here. Two of our 4 cluster centers are placed as outliers, which is obviously not correct. What's happened? Excel could not minimize around all of the clusters because the original data points are too close together.
Remember: we decide somewhat arbitrarily how many clusters we want to use at the start of a k-means analysis. In this case, it seems like we chose too many. Let's try the analysis again using only 3 clusters and see if that helps:
We still have an outlying cluster center. We could remove another, but to me, three cluster centers seem reasonable… we're missing something else here. The range of possible cluster center values is too wide. We're letting Solver choose values for the cluster centers that fall outside our range of ages. We need to scope down the value constraints for the age variables to the range between the minimum and maximum ages. Likewise, let's scope down the weight variables to their minimum and maximum values. It looks like this:
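In the Python sketch from earlier (reusing the same pigs array, total_min_distance function, and differential_evolution import), the equivalent change is simply tightening the bounds to the observed minimums and maximums and dropping to three centers:

```python
K = 3  # three cluster centers this time

# Constrain each candidate center to the observed range of the data.
age_lo, age_hi = pigs[:, 0].min(), pigs[:, 0].max()
wt_lo,  wt_hi  = pigs[:, 1].min(), pigs[:, 1].max()
bounds = [(age_lo, age_hi), (wt_lo, wt_hi)] * K  # one (age, weight) pair of bounds per center

result = differential_evolution(total_min_distance, bounds, seed=0)
best_centers = result.x.reshape(K, 2)
```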
Let’s check the output in a new graph. This looks much better:
Now our data clusters look more reasonable. Not surprisingly, they are different from the clusters we created visually at the beginning of the article. To our eyes, the whole core was best as one cluster, but Excel's Solver has determined a better way to organize the clusters, and has better minimized the distances in doing so.
It's easy to incorrectly group observations based on visuals and intuition. Some even argue that data clusters are defined as the result of this statistical approach. That would mean our visual clusters at the beginning of the article were not data clusters at all, just circles on a graph. To get trustworthy data clusters, we need to perform a statistical analysis.
K-Means is the most popular type of clustering because it is the most intuitive. However, it’s far from the only technique. Here are four other popular ones:
Mean-Shift Clustering works very much like k-means, but instead of creating new values to serve as cluster centers, mean-shifting starts from the existing data points and shifts candidate centroids toward dense regions of the data until they settle.
In density-based spatial clustering (DBSCAN), each data point is examined as a potential starting point for a cluster. A distance allowance, epsilon, defines a neighborhood around the point; the number of other points falling inside that neighborhood determines whether the point counts as a dense, core part of a cluster. If it does, its neighbors are added to the cluster and examined in turn, and the cluster grows until no more points qualify. This process then repeats across the whole dataset.
Expectation-Maximization is similar to k-means clustering, except that it adds the spread (standard deviation) of each cluster to the calculation on top of the averaging. This allows the clusters to take on more dynamic, elliptical forms instead of following a circular structure.
Agglomerative Hierarchical Clustering starts with many small clusters and repeatedly merges the closest ones until the whole dataset becomes one "big" cluster. This approach allows the analyst to choose the number of clusters they want based on the hierarchy of merges, a welcome flexibility for analysis.
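If you want to experiment with these techniques without building them by hand, all four have ready-made implementations in libraries such as Python's scikit-learn. Here is a minimal, illustrative sketch; the data and parameter values are placeholders, not tuned recommendations:

```python
import numpy as np
from sklearn.cluster import MeanShift, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Placeholder data: 100 observations with 2 variables.
X = np.random.default_rng(0).normal(size=(100, 2))

labels_ms  = MeanShift().fit_predict(X)                                       # mean-shift clustering
labels_db  = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)                    # density-based; eps is the epsilon distance allowance
labels_em  = GaussianMixture(n_components=3, random_state=0).fit_predict(X)   # expectation-maximization
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)             # agglomerative hierarchical clustering
```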
In our clustering exercise, we only examined two dimensions: weight and age. However, we could have looked at 5, 10, 20, or more dimensions. This is much harder to picture.
If we introduce more than 3 dimensions, data clusters are no longer a graphical, visual exercise. Instead, they're abstract by nature. It's nearly impossible for a human to "visualize" or "imagine" a space with more than the x, y, and z axes, but such dimensions exist nevertheless.
A good way to think about 4th and 5th dimensions is to imagine space and color as dimensions on a graph. You have points on the x, y, and z axes. Then imagine that those axes move through space — this would be a 4th dimension. On top of that, imagine that each point is a shade or color — this would be a 5th dimension. Not easy, huh?
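The algorithms, however, handle extra dimensions without any change: the distance calculation simply gains more terms. As a quick illustration, here is a hedged sketch with made-up data and scikit-learn's KMeans standing in for our Excel exercise:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up dataset: 200 observations described by 5 variables instead of 2.
X = np.random.default_rng(0).normal(size=(200, 5))

# Exactly the same call you would make in two dimensions; only the column count changes.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```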
Good data clusters are able to provide valuable insights based on as many variables as possible. The more variables there are at play, the more information we have feeding the analysis, and the less we can rely on our eyes alone. This is why data clusters are only true subsets when they're based on statistical analysis.