forecasting:clustering

Clustering is a statistical process used to group “similar” data points together within a dataset. It resembles classification, where data points are also grouped and each group is assigned a specific label (e.g., dogs or cats). The difference is that clusters are not given meaningful labels; they are simply referred to as Cluster 1, Cluster 2, Cluster 3, etc. Clustering also does not use a dependent variable: there is no target to predict, and the goal is only to see which inputs are similar to one another.

Clustering methods discussed on this page:

  • K-Means Clustering
  • Hierarchical Clustering

K-Means Clustering

Form: Universal. Can be used with any dataset whose independent variables are numeric
When to use it: Used when you want to define a specific number of clusters in the dataset (e.g., you specifically want to see only 5 different clusters). If you do not know how many clusters to ask for, you can use the Elbow Method to estimate a reasonable number of clusters for your dataset.
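Library used: sklearn.cluster.KMeans
General workflow:

  1. Import the KMeans class from sklearn.cluster
  2. Create an instance of the KMeans() class, passing the desired number of clusters as n_clusters
  3. Apply the .fit_predict method to your independent variables to get a cluster number for each data point
  4. (Optional) Use the Elbow Method to choose n_clusters: fit KMeans for a range of cluster counts, plot the within-cluster sum of squares (the fitted model's inertia_ attribute) against the number of clusters, and pick the point where the curve bends and further clusters give little improvement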

Sample code and output:
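The K-Means workflow can be sketched roughly as follows. This is a minimal illustration, not the original sample from this page: the toy data, the variable names, and the choice of 3 clusters are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 2 independent variables, 3 well-separated groups of 30 points.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in true_centers])

# Create the model with the number of clusters we want to see.
# random_state is fixed only so that repeated runs give the same result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# fit_predict learns the cluster centers and returns one cluster number per row of X.
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster numbers of the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the 3 learned centers
```

The cluster numbers themselves are arbitrary (Cluster 0, 1, 2 carry no meaning); only the grouping matters. With two independent variables, labels and cluster_centers_ could be passed to a scatter plot to color each point by its cluster.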

Note that we can easily visualize our results by using a 2D plot. This type of plot is only possible when we have no more than 2 independent variables.
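The Elbow Method mentioned above can be sketched in a similar way. The example below is an assumption-laden illustration: it reuses hypothetical toy data, tries cluster counts from 1 to 8 (an arbitrary range), and records the within-cluster sum of squares (WCSS) that scikit-learn exposes as the inertia_ attribute of a fitted KMeans model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 2 independent variables, 3 well-separated groups.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in true_centers])

# Fit one model per candidate cluster count and record its WCSS.
k_values = list(range(1, 9))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# Print the curve; the "elbow" is where the drop in WCSS flattens out.
for k, value in zip(k_values, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting wcss against k_values gives the usual elbow curve. For data like this, the WCSS should fall sharply up to the true number of groups and only slowly afterwards, but on real datasets the bend is often less clear and the choice remains a judgment call.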

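Hierarchical Clustering

Form: Universal. Can be used with any dataset whose independent variables are numeric
When to use it: Used when you do not want to fix the number of clusters in advance. The agglomerative (bottom-up) approach starts with every data point as its own cluster and repeatedly merges the two closest clusters. A dendrogram of those merges can then be used to decide how many clusters to keep.
Library used: sklearn.cluster.AgglomerativeClustering
General workflow:

  1. Import the AgglomerativeClustering class from sklearn.cluster
  2. Create an instance of the AgglomerativeClustering() class
  3. Apply the .fit_predict method to your independent variables to get a cluster number for each data point
  4. (Optional) Plot a dendrogram with scipy.cluster.hierarchy to help choose the number of clusters

Sample code and output:

Note that we can easily visualize our results by using a 2D plot. This type of plot is only possible when we have no more than 2 independent variables.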
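For the Hierarchical Clustering method listed at the top of this page, a minimal sketch might look like the following. It is not the original sample from this page: the toy data, the choice of 3 clusters, and the use of Ward linkage are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: 2 independent variables, 3 well-separated groups of 30 points.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in true_centers])

# Agglomerative (bottom-up) clustering: keep merging the closest clusters
# until only n_clusters remain. Ward linkage merges the pair of clusters
# that increases the within-cluster variance the least.
hc = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = hc.fit_predict(X)
print(labels[:10])  # cluster numbers of the first 10 points

# The full merge history, one row per merge, as computed by SciPy.
# This matrix is what a dendrogram plot is drawn from.
Z = linkage(X, method="ward")
print(Z.shape)
```

Passing Z to scipy.cluster.hierarchy.dendrogram draws the merge tree; long vertical gaps between merges suggest a natural place to cut the tree and therefore a reasonable number of clusters.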

Authors

Contributing authors:

jclaudio

Created by jclaudio on 2020/12/02 01:24.

  • Last modified: 2021/09/19 21:59