An Introduction to Unsupervised Learning.

Blessing Magabane
7 min read · Dec 31, 2020

Kmeans vs Hierarchical clustering.

Photo: Blessing Magabane

There are different types of machine learning: supervised learning, unsupervised learning and reinforcement learning. Supervised learning makes use of labels to make predictions; a model needs to be trained on labelled input. With unsupervised learning, however, labels are not needed; the models make use of centroids and geometric techniques to create clusters. Reinforcement learning, on the other hand, makes use of an agent and an optimal objective function.

In this article the focus is on unsupervised learning; unlike supervised learning, there is no need for labels or an objective function. The use of clusters and centroids is central to unsupervised learning. The process of grouping data points with similar characteristics is called clustering. There are different approaches to clustering, but the main idea is the same: the segmentation of data points with similar features into distinct groups or clusters. Unsupervised learning is used for many purposes, notable among them dimensionality reduction and pattern recognition.

There are two commonly used clustering approaches; they differ in how they segment data, but the objective is the same. The more widely used one is kmeans, which groups data points into clusters of roughly equal variance. On the other hand, there is agglomerative clustering, which is a hierarchical technique. Another useful technique worth mentioning is principal component analysis, or simply PCA, used to reduce the dimensionality of a dataset.
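To make the PCA mention concrete, here is a minimal sketch of dimensionality reduction with scikit-learn; the small array is illustrative and not part of the Instacart data.

```python
import numpy as np
from sklearn.decomposition import PCA

# A tiny two-feature dataset (illustrative values only).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Reduce the two features down to a single principal component.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (5, 1)
```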

The basic idea behind the use of unsupervised learning, particularly clustering, is to find complex relationships between data points. In this article clustering is employed to identify patterns in shopping data.

Dataset:

In this article data from Instacart online is analysed. Clustering is used on the data to find patterns and other interesting features. The data can be found at the following link,

The Instacart data is in the following format: the order id is a unique identifier associated with the order made; the user id is another unique identifier, related to the user; the eval set is a field indicating the historical status of the order; the order number shows the number of orders a user has made; the order hour of the day shows the time of day the order was made; and days since prior order indicates how long an order took since the previous one.

There are 3 214 874 rows in the Instacart dataset. However, in this article 5 000 rows are randomly sampled from the dataset. A larger dataset affects the clustering process and causes the data points around the respective centroids to overlap. The other reason to reduce the data size is the computational cost associated with clustering.
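The sampling step can be sketched with pandas as follows; the DataFrame here is a small synthetic stand-in for the Instacart orders table (the column names follow the dataset description above, the values are made up), and `random_state` is set so the sample is reproducible.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Instacart orders table (hypothetical values;
# the real file has 3 214 874 rows).
rng = np.random.default_rng(42)
orders = pd.DataFrame({
    "order_number": rng.integers(1, 100, size=20_000),
    "order_hour_of_day": rng.integers(0, 24, size=20_000),
    "days_since_prior_order": rng.integers(0, 31, size=20_000),
})

# Draw a reproducible random sample of 5 000 rows.
sample = orders.sample(n=5_000, random_state=42)
print(len(sample))  # 5000
```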

Model:

In this section, kmeans and agglomerative clustering are used on the Instacart data to segment it into different clusters. The screen print below shows the definition of the kmeans algorithm.
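The original screen print is not reproduced here; a minimal sketch of how the kmeans model might be defined with scikit-learn follows (the random feature matrix stands in for the two clustered columns, here days since prior order and order number).

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative two-column feature matrix standing in for
# (days_since_prior_order, order_number).
rng = np.random.default_rng(0)
X = rng.random((200, 2))

# Define and fit kmeans with 3 centroids; each point gets a cluster label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(len(set(labels)))  # 3
```

The number of centroids (`n_clusters`) is the `K` varied in the results section below.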

The next screen print shows the agglomerative algorithm,
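Again the screen print itself is not reproduced; a minimal sketch of the agglomerative model with scikit-learn might look like this (Ward linkage is scikit-learn's default, and the data is the same kind of two-column stand-in as before).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative two-column feature matrix.
rng = np.random.default_rng(0)
X = rng.random((200, 2))

# Agglomerative (hierarchical) clustering with 3 clusters.
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
print(len(set(labels)))  # 3
```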

The screen print above shows the algorithm applied to the days since prior order and order number columns. Similarly, the algorithms were also applied to the order hour of day and order number columns as shown below,

Both kmeans and agglomerative clustering allow the user to define the number of centroids. Because the results of clustering only make sense when visualised, a scatter plot function was built, as shown by the screen print below,

For data points where the value of the cluster is zero, one, two, three, four and five the colour code is blue, green, orange, gray, magenta and black respectively.
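The plotting code in the screen print is not reproduced; a minimal sketch of such a function with matplotlib, using the colour code just described, might look like this (the function name and signature are assumptions, not the article's exact code).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary inside a notebook
import matplotlib.pyplot as plt

# Colour code from the article: clusters 0-5 map to
# blue, green, orange, gray, magenta and black respectively.
COLOURS = ["blue", "green", "orange", "gray", "magenta", "black"]

def plot_clusters(x, y, labels, xlabel, ylabel, title):
    """Scatter plot of two columns, coloured by cluster label."""
    fig, ax = plt.subplots()
    for cluster in sorted(set(labels)):
        mask = labels == cluster
        ax.scatter(x[mask], y[mask], color=COLOURS[cluster], s=10,
                   label=f"cluster {cluster}")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend()
    return fig
```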

Results

In this section, the results from kmeans and agglomerative clustering are compared. To perform the analysis the clustering process is divided into two areas of interest which are days since prior order and order hour of day. The order number is kept constant in both cases. The idea is to find patterns in the number of orders made in the two instances.

Days since prior order

The distribution of data points is the same on all the plots; the difference only arises in the clusters represented by the colour coding.

K = 3

Below is a plot of the kmeans with 3 centroids,

Figure 1: Kmeans with 3 centroids.

There are 3 distinct clusters on the kmeans plot, the dominant one being green, with an associated value of 1.

Below is a plot of the agglomerative clustering with 3 centroids,

Figure 2: Agglomerative clustering with 3 centroids.

On the agglomerative clustering with 3 clusters, the dominant cluster is blue, with an associated value of 0.

K = 4

Below is a plot of the kmeans with 4 centroids,

Figure 3: kmeans with 4 centroids.

In this result it is not clear whether green or gray is more dominant; the two seem to be equal in density. However, looking at the clusters, they are both skewed to the left.

Below is a plot of the agglomerative clustering with 4 centroids,

Figure 4: agglomerative clustering with 4 centroids.

Similarly, on agglomerative clustering the clusters in green and gray seem to be equal, though their positions appear inverted.

K = 5

Below is a plot of the kmeans with 5 centroids,

Figure 5: kmeans with 5 centroids.

On the kmeans with 5 centroids, magenta is the most dominant, followed by orange.

Below is a plot of the agglomerative clustering with 5 centroids,

Figure 6: agglomerative clustering with 5 centroids.

Blue is the most dominant, followed by green.

Order hour of day

The distribution of data points is the same on all the plots.

K = 3

Below is a plot of the kmeans with 3 centroids,

Figure 7: kmeans with 3 centroids.

In the above plot, green is clearly the most dominant.

Below is a plot of the agglomerative clustering with 3 centroids,

Figure 8: agglomerative clustering with 3 centroids.

Blue is the most prominent.

K = 4

Below is a plot of the kmeans with 4 centroids,

Figure 9: kmeans with 4 centroids.

On the plot above, gray has the most data points in its cluster.

Below is a plot of the agglomerative clustering with 4 centroids,

Figure 10: agglomerative clustering with 4 centroids.

However, on agglomerative clustering the green is more prominent.

K = 5

Below is a plot of the kmeans with 5 centroids,

Figure 11: kmeans with 5 centroids.

Magenta is more prominent.

Below is a plot of the agglomerative clustering with 5 centroids,

Figure 12: agglomerative clustering with 5 centroids.

Green is more prominent.

Due to the nature of the dataset, the clustering results on the plots resemble sedimentary rock layers. The results from both techniques are sandwiched together; the centroids' radii and boundaries are not well defined. This is caused by the non-convex nature of the data.

Observation:

It is clear that kmeans and agglomerative clustering yield different results; this is observed in the colour-code dominance and is due to their different approaches and algorithm architectures. Grocery orders with a prior purchase are made between 5 in the morning and 9 in the evening. For days since prior order, most repeat purchases are made within the first 15 days after the previous purchase. Depending on the clustering algorithm used, the results differ and paint a unique picture.

The notebook for the work above can be found at the following link,


Blessing Magabane

A full stack Data Scientist with experience in data engineering and business intelligence.