## Clustering

**Clustering** is a statistical process used to group “similar” data points together within a dataset. It resembles classification, where data points are assigned to groups that carry a specific label (e.g., dogs or cats). Clustering differs in that the groups are not given meaningful labels; they are simply referred to as *Cluster 1, Cluster 2, Cluster 3, etc*. Note that clustering ignores any dependent variable, because the goal is only to see which **inputs** are similar to one another.

Clustering methods discussed on this page:

- K-Means Clustering
- Hierarchical Clustering

### K-Means Clustering

**Form:** No fixed functional form. Can be applied to any dataset with numeric independent variables

**When to use it:** Used when you want to specify the number of clusters in advance (e.g., you specifically want to see only 5 different clusters). If you do not have a number in mind, you can use the **Elbow Method** to estimate a suitable number of clusters for your dataset.

**Library used:** sklearn.cluster.KMeans

**General workflow:**

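- Import the **KMeans** class from **sklearn.cluster**
- Create an instance of the **KMeans()** class
- Apply the **.fit_predict** method to your independent variables
- (Optional) To choose the number of clusters with the Elbow Method, fit **KMeans** over a range of cluster counts, plot the within-cluster sum of squares (the **inertia_** attribute) against the number of clusters, and pick the point where the curve bends and additional clusters yield little improvement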
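The Elbow Method mentioned under "When to use it" can be sketched roughly as follows. This is only an illustration: the synthetic dataset from `make_blobs`, the range of 1 to 10 clusters, and the variable names `X`, `k_values`, and `wcss` are assumptions made for the example and are not part of the original page.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data for illustration only (assumption: 4 true groups)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit K-Means for k = 1..10 and record the within-cluster sum of squares
k_values = list(range(1, 11))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    wcss.append(model.inertia_)  # inertia_ holds the WCSS of the fitted model

# Print the curve; the "elbow" is where the decrease starts to flatten out
for k, value in zip(k_values, wcss):
    print(k, round(value, 1))
```

Plotting `wcss` against `k_values` (for example with matplotlib) usually makes the bend easier to spot than reading the printed numbers.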

**Sample code and output:**
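A minimal sketch of the K-Means workflow described above, assuming a small synthetic dataset generated with `make_blobs` and a choice of 3 clusters. The data, the variable names, and the parameter values are placeholders chosen for illustration, not the original sample from this page.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic independent variables (2 features) for illustration only
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

# Create the model with the desired number of clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# fit_predict fits the model and returns one cluster index per data point
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster index of the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the 3 cluster centers
```

The cluster indices are arbitrary: index 0 is not "first" in any meaningful sense, which matches the idea that clusters carry no real labels.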

*Note that we can easily visualize our results by using a 2D plot. This type of plot is only possible when we have no more than 2 independent variables.*

### Hierarchical Clustering

**Form:** No fixed functional form. Builds a hierarchy (tree) of nested clusters, which can be drawn as a dendrogram

**When to use it:** Used when you do not want to commit to a number of clusters up front, or when you want to see how clusters nest inside one another at different levels of similarity

**Library used:** sklearn.cluster.AgglomerativeClustering

**General workflow:**

- Import the **AgglomerativeClustering** class from **sklearn.cluster**
- Create an instance of the **AgglomerativeClustering()** class
- Apply the **.fit_predict** method to your independent variables

**Sample code and output:**

*Note that we can easily visualize our results by using a 2D plot. This type of plot is only possible when we have no more than 2 independent variables.*
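A minimal sketch of hierarchical clustering, assuming scikit-learn's `AgglomerativeClustering` with Ward linkage and a small synthetic dataset from `make_blobs`. The data, the choice of 3 clusters, and the variable names are assumptions made for illustration and do not come from the original page.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic independent variables (2 features) for illustration only
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

# Agglomerative (bottom-up) clustering: start with every point as its own
# cluster and repeatedly merge the closest pair until 3 clusters remain
hc = AgglomerativeClustering(n_clusters=3, linkage="ward")

# fit_predict returns one cluster index per data point
labels = hc.fit_predict(X)

print(labels[:10])  # cluster index of the first 10 points
```

Unlike K-Means, this model has no cluster centers to inspect; the result is the cluster index assigned to each point.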

# Authors

**Contributing authors:**

Created by *jclaudio* on 2020/12/02 01:24.