Unsupervised Learning Defined
In unsupervised learning, the model works on its own to discover patterns and information that were not previously defined. This learning technique works predominantly with unlabeled data, meaning there is no defined relationship between inputs and outputs. Because it does not depend on labeled examples, it can address more complex processing tasks than supervised learning.
Unsupervised Learning Techniques
An important unsupervised learning technique is clustering. Clustering algorithms find groups within the input data. Many clustering algorithms let the user define the number of clusters to identify. The number of clusters determines the specificity of each cluster (e.g., the more clusters, the more specific the data within each cluster). Unsupervised learning cluster types include:
- Exclusive – each data point belongs to only one cluster
- Agglomerative – every data point starts as its own cluster, and the nearest clusters are joined step by step, reducing the number of clusters
- Overlapping – a data point can belong to multiple clusters, each with an associated membership value
- Probabilistic – clusters are created using probability distributions
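The difference between exclusive and overlapping membership can be sketched in a few lines. The example below is a minimal illustration, not a full clustering algorithm: it assumes one-dimensional data and cluster centroids that are already known, and the function names `hard_assignment` and `soft_membership` are hypothetical. The overlapping case uses the membership formula from fuzzy c-means, which is one common way to compute membership values.

```python
def hard_assignment(point, centroids):
    """Exclusive clustering: return the index of the single nearest centroid."""
    distances = [abs(point - c) for c in centroids]
    return distances.index(min(distances))


def soft_membership(point, centroids, fuzziness=2.0):
    """Overlapping clustering: return one membership value per centroid.

    Uses the fuzzy c-means membership formula. `fuzziness` must be > 1;
    values closer to 1 make the memberships more exclusive.
    """
    distances = [abs(point - c) for c in centroids]
    # A point sitting exactly on a centroid belongs fully to that cluster
    # (shared equally if several centroids coincide with the point).
    zero_count = distances.count(0.0)
    if zero_count:
        return [1.0 / zero_count if d == 0.0 else 0.0 for d in distances]
    exponent = 2.0 / (fuzziness - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical centroids for two clusters on a number line.
example_centroids = [0.0, 10.0]
print(hard_assignment(2.0, example_centroids))   # index of the nearest centroid
print(soft_membership(2.0, example_centroids))   # membership values summing to 1
```

With exclusive assignment the point is placed in exactly one cluster, while the overlapping version reports how strongly it belongs to each cluster.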
Clustering techniques include:
- Hierarchical – each data point starts as its own cluster; related clusters are combined until there is only one cluster
- K-means – iteratively partitions the data into the specified number of clusters by assigning each data point to its nearest cluster centroid and recomputing the centroids, so that points lie close to the centroid of their assigned cluster
- K-Nearest Neighbor – stores all known cases and classifies new instances based on a similarity measure; because it needs labeled cases, it is strictly a supervised classification method rather than a clustering technique
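The K-means iteration described above can be sketched as follows. This is a minimal version under stated assumptions: points are tuples of numbers, the initial centroids are picked at random from the data, and the function name `k_means` and its parameters are illustrative rather than taken from any particular library.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` (tuples of numbers) into `k` groups.

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to points[i].
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so the result is stable.
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two visibly separate groups of 2-D points.
example_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
                  (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
example_centroids, example_labels = k_means(example_points, k=2)
print(example_centroids)
print(example_labels)
```

The result depends on the starting centroids, so practical implementations usually run the procedure several times from different starting points and keep the best outcome.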
Another unsupervised learning technique is association. In this technique, rules are used to establish associations among objects in large databases. An application of this technique experienced every day is the grouping of products shown to shoppers based on eCommerce searches and purchases.
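Association rules are usually scored with two measures: support (how often the items appear together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both for rules between pairs of items. The shopping baskets, the thresholds, and the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = count_containing(set(antecedent) | set(consequent), transactions)
    return both / count_containing(antecedent, transactions)


def pair_rules(transactions, min_support, min_confidence):
    """List rules 'lhs -> rhs' between single items that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
for lhs, rhs, s, c in pair_rules(baskets, min_support=0.3, min_confidence=0.7):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

Production systems typically use algorithms such as Apriori to avoid checking every combination, but the scoring idea is the same.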
Unsupervised Learning Applications
- Customer Segmentation – understanding customer groups in order to build business strategies and marketing campaigns
- Genetics – grouping DNA patterns to study evolutionary biology
- Predictive Maintenance – detecting defective mechanical parts
- Dimensionality Reduction – simplifying a problem by reducing the number of variables (features), which also makes the data easier to visualize
- Ecology – comparing audio recordings from different regions to compare species populations and assess biodiversity
- Delivery Routes – optimizing delivery efficiency by determining the optimal number of regional locations and efficient truck routes
- Crime Zones – analyzing crime data by specific location, including area and category, to define where crime is concentrated within a city
- System Alert Management – prioritizing operations alert messages from IT system components based on mean time to repair, downstream impact, and failure predictions
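Of the uses listed above, dimensionality reduction is the easiest to show in code. The sketch below is a simplified form of principal component analysis (PCA), one common dimensionality reduction method: it finds the single direction along which the data varies most and projects every row onto it, turning several variables into one. It assumes small numeric data, uses power iteration instead of a full eigendecomposition, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, projected) for the dominant principal component.

    `direction` is a unit vector; `projected` holds one coordinate per row.
    Assumption: the data is not constant and the starting vector is not
    orthogonal to the dominant direction.
    """
    n = len(rows)
    d = len(rows[0])
    # Center each variable on its mean.
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the
    # eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the direction found.
    projected = [sum(c * vi for c, vi in zip(row, v)) for row in centered]
    return v, projected


# Hypothetical 2-D measurements that lie close to a straight line.
example_rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, reduced = first_principal_component(example_rows)
print(direction)
print(reduced)
```

Each two-variable row is replaced by a single number, which is what makes the reduced data easier to plot and inspect.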
Unsupervised Learning Advantages
- No prior knowledge of the data is needed
- Reduces the human error that manual labelling can introduce
- Identifies relationships in the data that are not obvious through normal inspection
- Excels when there is insufficient labelled data, unknown patterns, or evolving learning patterns
- Simplifies human labelling by grouping similar data and separating it from the remaining data
Unsupervised Learning Disadvantages
- Less specific outcomes, because data relationships are not known or named before the model is built
- Clusters or groupings may not match the information areas of interest
- Little control over how clusters or groupings are formed
- Patterns are identified, but the next steps to take may remain uncertain
- Less appropriate for resolving a well-defined problem