Data Cluster: Definition, Example, & Cluster Analysis

Clusters are everywhere. In school, students are placed in different grades and classes. In business, employees belong to different departments. How do we decide who goes where? Shared characteristics, such as age, subject matter, or skill, tell us who belongs together.

In the same way, data analysts cluster data based on similarities among data points. Though data clustering is more complex than “clustering” students or employees, the goal is the same. Data clusters show which data points are closely related so we can structure, analyze, and understand the dataset better.

But what exactly are data clusters? And how do we create them? This article defines data clusters, provides examples, and explains how we make them.

Note: data clusters are a slightly advanced topic. If this article is challenging for you, I recommend reading our free Intro to Data Analysis eBook first to cover the basics.

Data Cluster Definition

Written formally, a data cluster is a subpopulation of a larger dataset in which each data point is closer to its own cluster center than to any other cluster center in the dataset. That closeness is determined by iteratively minimizing squared distances in a process called cluster analysis.
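To make the "closer to its own center than to any other" condition concrete, here is a minimal Python sketch. It assumes Euclidean distance and two numeric features per data point; the function name nearest_center and the sample values are hypothetical, made up purely for illustration.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

# Hypothetical cluster centers and data point (illustrative values only).
centers = [(2.0, 20.0), (10.0, 110.0)]
point = (3.0, 30.0)

# The point belongs to whichever cluster's center is nearest.
cluster_index = nearest_center(point, centers)
print(cluster_index)
```

Under this reading of the definition, a point is a member of the cluster whose index the function returns. Other distance measures are possible; Euclidean distance is only the assumption made here.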

In addition to the above definition, it’s imperative to keep in mind the following truths about data clusters:

Data “Clusters” in SQL Databases

It’s important to note that the term “cluster” can also refer to data that are stored close together in a dataset. For example, SQL analysts may refer to a row in a dataset as a data cluster because it groups related data points.

Database engineers often group multiple datasets together for ease of access, and they refer to these as data clusters as well. If you’ve ever seen a data model, you can get a good idea of why data engineers call these clusters. Here’s an example data model:

Data Table A could be considered a cluster. Moreover, each of the data models and the database as a whole could also be considered “clusters.”

In most cases, a “cluster” refers to data points whose values are close together, but you should always keep in mind that professionals in various fields apply the word to their jobs in special ways, such as a database analyst who refers to a row as a cluster.

Data Clusters Example

Imagine you have a pig farm with 15 pigs. You want to cluster them based on age and weight. This means you want to minimize the distance from each pig to a cluster center. Let’s imagine you have a graph of the pigs’ age and weight that looks like this:

You can already see how some of the points fit closely together. Visually, we can group some of them together, thereby creating data clusters. Here’s the same data with circles around the clusters.

It’s as simple as that. Each circle is an example of a data cluster. Keep in mind, however, that this is a relatively easy example for the following reasons:

  1. For starters, we only have two dimensions. More complex analysis may have a larger number of dimensions.
  2. In addition, the number of observations (pigs in our case) is small, making it easy to conceptualize the results. If we had 10,000 pigs, it would be much harder to visually determine data clusters.
  3. Finally, we’re starting and stopping with 4 data clusters. In a more thorough analysis, we would need to evaluate the optimal number of clusters. The ultimate goal is to keep the distance between points and cluster centers small without using more clusters than the data calls for, so a thorough analysis compares several candidate numbers of clusters.
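Why the number of clusters needs evaluating can be seen with a little arithmetic. The sketch below is an illustration under stated assumptions, not the method used in this article: it measures fit as the sum of squared distances from each point to the mean of its cluster, and it uses four made-up points that are grouped by hand. The helper names centroid and within_cluster_ss are hypothetical.

```python
def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def within_cluster_ss(clusters):
    """Sum of squared distances from each point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        cx, cy = centroid(points)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return total

# Hypothetical data: two tight pairs of points, far apart from each other.
pts = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (9.0, 11.0)]

one_cluster = within_cluster_ss([pts])                 # everything in one group
two_clusters = within_cluster_ss([pts[:2], pts[2:]])   # the two visible pairs
four_clusters = within_cluster_ss([[p] for p in pts])  # every point on its own

print(one_cluster, two_clusters, four_clusters)  # 132.0 4.0 0.0
```

Going from one cluster to two cuts the total from 132 to 4, while going from two to four only removes the last 4. Giving every point its own cluster always drives the total to zero, so the smallest total alone cannot pick the number of clusters. Looking for the point where the improvement levels off is the intuition behind what is often called the elbow heuristic.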

So data clusters are pretty easy, right? We’ve created them visually, and the result looks clear. However, we can’t prove that these groupings are the best ones, and if our example were harder, we wouldn’t be able to do it visually. So let’s talk about formal cluster analysis now.

Cluster Analysis: How to Create Data Clusters

To really understand data clusters, we need to know how they’re created: through cluster analysis. Cluster analysis is the process of creating data clusters by minimizing the distance between data points and a reference.

There are several types of cluster analysis:

Each type of analysis has its advantages and disadvantages, but in industry the most common and most useful one is k-means clustering. Let’s look at the data clusters in our pig example to understand it better.

Most Popular Clustering Analysis: K-Means Clustering Example

K-means clustering starts from a preset number of clusters, then minimizes the distance from each data point in the whole set to the nearest of that many centers. The key concept to understand in k-means clustering is that only the number of cluster centers is predetermined. It’s only when a computer algorithm starts to minimize distances that we find out where those centers are located.
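The idea that only the number of centers is fixed in advance, and that their locations emerge from the minimization, can be sketched in a few lines of plain Python. This is a minimal version of the standard iterative procedure (often called Lloyd's algorithm), assuming Euclidean distance and random starting centers picked from the data. The function name k_means, its parameters, and the sample points are hypothetical and are not the pig data from this article.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Minimal k-means sketch: returns (centers, clusters from the last assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups (illustrative values only).
pts = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(pts, k=2)
print(centers)
print(clusters)
```

We pass in only k; the centers are found by repeating the assign-and-update steps until nothing moves. On harder data the result can depend on the random starting centers, which is why production libraries typically run several restarts and keep the best one.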

Disclaimer: the analysis below only starts to make sense when you get to the last step, so don’t worry if it’s not clear until you’ve gone through it entirely.

Let’s look again at our example of the pigs to understand. The number of clusters we’ll look for going in is 4. Let’s lay out the details here: