In today’s world of big data and the internet of things, it is common for a business to find itself sitting atop a mountain of data. Possessing it is one thing, but leveraging it for data-driven decision making is a very different ball game. Gut feelings and institutionalized heuristics have traditionally guided the development of protocols and decision making, but artificial intelligence and large, disparate data sets are changing that.
Everyone is trying to make sense of, and extract value from, their data. Those that are not will be left behind. This challenge (and opportunity) is not limited to certain industries. For instance, most companies are exploring how they can use data to make better marketing decisions, most retailers are using data to optimize their supply chains, and most manufacturers are using data for quality control of final products.
Almost all business problems (with surrounding data) can be broken down into two categories: supervised and unsupervised learning. Take facial recognition software as an example. One method of recognizing faces is to train a program on a data set of pictures and associated tags. Tags may include “Face”, “Face, male”, or anything else. These tags allow algorithms to learn what a face looks like and to differentiate between male and female faces, or more granular subtleties if desired. This is supervised learning. The same task can be reformulated as an unsupervised learning problem, the difference being that the tags are no longer present. Instead, the algorithm has to learn how to identify faces on its own. Technically speaking, the algorithm will not be able to identify faces as faces, but rather as sets of objects/images distinct from other objects/images. It is up to the user to tell the computer that what it has identified are faces. Google offers an interesting example: it used unsupervised learning to identify cats in YouTube videos (look here, or for a more technical treatment here).
Clustering (or segmentation) is a commonly encountered form of unsupervised learning in business. It involves grouping data points (customers, products, movies, etc.) into clusters. Ideally, each element in a cluster is similar to every other element in that cluster while being as different as possible from elements in other clusters. In other words, the goal of clustering is to minimize the differences between items within a cluster and maximize the differences between separate clusters.
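One way to make "similar within, different between" concrete is to measure it. The sketch below is a minimal illustration, not a definitive method: it assumes Euclidean distance as the measure of difference, and the function names and the customer data (age, purchases per year) are hypothetical. It scores a grouping by its within-cluster and between-cluster sums of squared distances.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(clusters):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(p, c) ** 2 for p in points)
    return total


def between_cluster_ss(clusters):
    """Size-weighted squared distance from each cluster centroid to the overall centroid."""
    all_points = [p for points in clusters for p in points]
    overall = centroid(all_points)
    return sum(len(points) * dist(centroid(points), overall) ** 2 for points in clusters)


# Hypothetical customers described by (age, purchases per year).
good = [[(22, 3), (25, 4), (24, 2)], [(58, 20), (61, 22), (60, 18)]]
bad = [[(22, 3), (61, 22), (24, 2)], [(58, 20), (25, 4), (60, 18)]]

for name, grouping in (("good", good), ("bad", bad)):
    print(name, within_cluster_ss(grouping), between_cluster_ss(grouping))
```

A lower within-cluster value together with a higher between-cluster value indicates a tighter, better separated grouping. For a fixed data set the two values always add up to the same total, so improving one improves the other.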
Why is the ability to cluster well so important?
Clustering gives businesses the ability to achieve better results from their initiatives and to understand customers and processes at a much deeper level than a human can achieve alone. If you are a marketer, you may be interested in developing targeted marketing strategies. Before you do this, you must know who to market to. This can be accomplished by grouping existing customers according to similar attributes. In this problem, clusters are determined by the attributes used to define a customer: age, payment history, purchase history, etc.
Suppose you are a publishing firm and want to decide how to sell new books, or how to reprice or market old ones. Books can be grouped together using a clustering scheme based on their attributes. These may include length, subject matter, recurring groups of words, etc. Clustering even pops up in insurance, city planning, and land-use identification. Examples include identifying groups of insurance policy holders that have a higher than average claim cost, grouping houses by location, type, and value, or grouping parcels of land by usage.
It is important that the original purpose of clustering is met in all of these examples: minimize differences between elements in a cluster while maximizing differences between clusters. Data complexity and algorithms in use today can make this a nontrivial problem. Basic algorithms often do not achieve desired results, so something more is needed. Below we will walk you through some common methods used for clustering and articulate the power that Soothsayer brings to the table.
Hierarchical clustering and k-means clustering are the two most basic and widely used methods of clustering.
Hierarchical clustering, in its common bottom-up form, starts by treating each data point as its own cluster, then repeatedly merges the most similar clusters until a single cluster is left. In essence, this algorithm assigns a hierarchy to data points. The benefit of this method is that users can choose the number of clusters they want after the fact, by cutting the hierarchy at the appropriate level, and can see how the clusters relate to one another.
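The merging procedure can be sketched in a few lines. This is a minimal, unoptimized illustration under stated assumptions: it uses Euclidean distance and single linkage (the distance between two clusters is the distance between their closest points), neither of which is specified above, and the names `single_linkage` and `agglomerate` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Distance between two clusters: the closest pair of points across them."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges). merges lists every merge in order as
    (cluster_a, cluster_b, distance), which is the hierarchy itself.
    """
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # combinations() guarantees i < j, so delete j first to keep i valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters, merges


points = [(1.0, 1.0), (1.5, 1.2), (8.0, 8.0), (8.3, 7.9), (4.5, 4.6)]
clusters, merges = agglomerate(points, n_clusters=2)
for cluster in clusters:
    print(cluster)
```

Stopping at `n_clusters=2` is equivalent to running the algorithm to completion and cutting the hierarchy one merge before the end. The recorded merge distances show how far apart clusters were when they were joined, which is the relationship a dendrogram displays.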
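One major drawback of this method is the time it takes to run. Because the algorithm has to compare every data point with every other data point, and then groups of data points with other groups, the number of comparisons grows at least with the square of the number of data points. Run time therefore increases dramatically as the data set grows.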
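To get a feel for that growth, consider only the first step, in which every data point is compared with every other. The short sketch below counts those pairwise comparisons, n(n-1)/2 for n points. The function name is hypothetical, and the count covers only the initial comparisons, not the repeated group comparisons that follow.

```python
from math import comb  # binomial coefficient, Python 3.8+


def pairwise_comparisons(n_points):
    """Number of distinct pairs among n_points items: n * (n - 1) / 2."""
    return comb(n_points, 2)


for n in (100, 1_000, 10_000):
    print(f"{n:>6} points -> {pairwise_comparisons(n):>10} pairwise comparisons")
```

Going from 100 to 10,000 points multiplies the data by 100 but the initial comparisons by roughly 10,000 (from 4,950 to 49,995,000).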