**Fig. 10.** Plot showing some lines that would separate the two groups.
  
Although there are an infinite number of ways to separate the groups and create a margin, how do we create the //optimum margin//? The answer is through SVM. The steps for SVM go as follows:
  - Group the categories and form a convex hull around them (shown in Fig. 11)
  - Find the shortest line segment that connects the two hulls
**Fig. 12.** Drawing the optimal hyperplane for SVM.
  
The implementation for SVM in Python is as follows (this does not include importing the dataset, splitting the dataset, or feature scaling):
  
{{ :forecasting:svmcode.jpg?600 |}}
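The exact code is in the figure above; as a rough, minimal sketch of the same idea with scikit-learn (assuming variables named X_train, X_test, y_train, y_test that were already produced by the splitting and feature scaling steps mentioned above), a linear SVM classifier might look like this:

<code python>
# Minimal sketch of a linear SVM classifier (not the exact code from the figure).
# Assumes X_train, X_test, y_train, y_test already exist (split and feature-scaled).
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Train an SVM with a linear kernel, i.e. a straight-line decision boundary
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)

# Predict the test set and inspect the results
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
</code>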
**Fig. 14.** Test set results for classification using Support Vector Machines.
  
====== Kernel SVM ======
==== Mapping to a Higher Dimension ====

What happens when sets of data aren't linearly separable? Consider the data in Fig. 15.

{{ :forecasting:kernelsvm1.jpg?600 |}}
**Fig. 15.** Single-dimension dataset with two categories.

Obviously we can't separate this data with a single line, so what do we do? What we can do is map the data onto a higher dimension. Since we're in a single dimension, we map the data into two dimensions, which will allow us to separate it. If we mapped the data onto a straight line it still wouldn't be linearly separable, so let's try mapping the data onto a parabola:

{{ :forecasting:kernelsvm2.jpg?600 |}}
**Fig. 16.** Mapping the data onto a parabola.

Now we can see that we can separate the data with a line:

{{ :forecasting:kernelsvm3.jpg?600 |}}
**Fig. 17.** Separating the data with a line.
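As a small, hypothetical illustration of this mapping (the numbers below are made up and are not the data from the figures), suppose one class sits between the two halves of the other class on the number line. Mapping each point x to the pair (x, x^2) places the data on a parabola, where a simple threshold on the second coordinate separates the classes:

<code python>
# Hypothetical 1D data: class 0 sits in the middle, class 1 on both ends,
# so no single cut on x alone separates them.
import numpy as np

x = np.array([-4.0, -3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0, 4.0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])

# Map each point onto a parabola: x -> (x, x**2).
# In the new 2D space a horizontal line (a threshold on x**2) separates the classes.
X_mapped = np.column_stack([x, x ** 2])

print(X_mapped)
print("Separable with x**2 > 2:", np.array_equal((X_mapped[:, 1] > 2).astype(int), y))
</code>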

At this point we map the data and our optimal margin back into the original space. What does this look like in a higher dimension? Consider the data in Fig. 18.

{{ :forecasting:kernelsvm4.jpg?600 |}}
**Fig. 18.** 2-dimensional dataset with two categories.

Projecting the data into a higher dimension brings us into a third dimension. In the third dimension the data can't be separated by a line but rather by a hyperplane. Mapping the data onto a higher space and separating it with a plane is shown in Fig. 19.

{{ :forecasting:kernelsvm5.jpg?600 |}}
**Fig. 19.** Mapping the data into the third dimension and separating the data with a hyperplane.
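One common way to build such a projection (just one possibility, and not necessarily the exact mapping used to draw the figures) is to add a third coordinate that grows with the distance from the centre of the data, for example z = x1^2 + x2^2. Points near the centre then sit low in the new dimension and points far away sit high, so a flat plane can slide between the two groups:

<code python>
# Hypothetical 2D -> 3D mapping: add a third coordinate z = x1**2 + x2**2.
import numpy as np

def lift_to_3d(X):
    """Map each 2D point (x1, x2) to (x1, x2, x1**2 + x2**2)."""
    z = X[:, 0] ** 2 + X[:, 1] ** 2
    return np.column_stack([X, z])

# An inner cluster and an outer ring are not linearly separable in 2D,
# but after lifting, a plane at a constant z separates them.
inner = np.array([[0.2, 0.1], [-0.3, 0.2], [0.1, -0.2]])   # one category
outer = np.array([[2.0, 0.0], [0.0, 2.0], [-1.5, 1.5]])    # other category
print(lift_to_3d(inner))   # third column stays small
print(lift_to_3d(outer))   # third column is large
</code>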

Finally, after separating the data with the optimal margin hyperplane, we map everything back into the second dimension and get the following result:

{{ :forecasting:kernelsvm6.jpg?600 |}}
**Fig. 20.** Mapping the data and hyperplane back onto the second dimension.

=== The Problem With Mapping to a Higher Dimension ===

Mapping to a higher dimension is extremely computationally intensive. Typically the data is in a 2D space, so most of the time the machine will have to map it into the third dimension. Not only is the computer mapping onto a third dimension, it also needs to calculate the optimal margin for a hyperplane in 3D space. With large datasets and complex problems this can take a lot of processing power. A solution to this is called //the Kernel Trick//.

==== The Kernel Trick ====

The Kernel Trick essentially achieves the same goal that mapping to a higher dimension does. There are many kernels to choose from, but let's consider the Gaussian RBF Kernel. The equation, along with a plot of the kernel visualized in 3D as a function of **x** and **l**, is shown below.

{{ :forecasting:kernelsvm7.jpg?600 |}}
**Fig. 21.** Gaussian RBF Kernel with K graphed against x and y.

**x** is the location of some data point and **l** is the //landmark//, or center, of the kernel. Note that this kernel would work with 1-dimensional data as well; K would still be a function of **x** and **l**, but the data would sit on a 1D line and the kernel would be visualized as K vs. x.
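
For reference, the Gaussian RBF kernel plotted above is commonly written as K(**x**, **l**) = exp( -||**x** - **l**||^2 / (2 * sigma^2) ), where ||**x** - **l**|| is the distance between the data point and the landmark. Some implementations use a single parameter gamma in place of 1 / (2 * sigma^2); the figure may use either parameterization, but both describe the same bell-shaped surface centered on the landmark.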

What's happening with the equation is that we calculate the distance between the landmark and the points in our dataset, and then adjust sigma to fit an optimal margin. Sigma controls the base of the RBF function, and that base gets mapped onto the dataset, forming the margin that separates the data. Let's consider the dataset in Fig. 18 again. To do the kernel trick we first set the position of the landmark and then apply the kernel. A visual of this is shown in Fig. 22; we can see that the base of the kernel is mapped onto the 2D space, and this is the separation between the categories.

{{ :forecasting:kernelsvm8.jpg?600 |}}
**Fig. 22.** Applying the Gaussian RBF Kernel onto the data.

As mentioned before, sigma controls the size of the base, which in turn controls how large our margin is. Just like in SVM, we adjust the margin (through sigma) so that it is the maximum margin, i.e. the maximum distance between the two categories. For some intuition: if we increased sigma we'd have a larger base, and thus a larger circle in this example; if we decreased sigma we'd have a smaller base, and thus a smaller circle. Larger and smaller sigma are shown in Fig. 23 and Fig. 24, respectively.

{{ :forecasting:kernelsvm9.jpg?600 |}}
**Fig. 23.** Graph that provides intuition for a larger sigma.

{{ :forecasting:kernelsvm10.jpg?600 |}}
**Fig. 24.** Graph that provides intuition for a smaller sigma.

After the machine has trained the model on the training set and found the optimal margin, we have effectively separated the categories. Therefore, when predicting or categorizing new data we simply apply the kernel to the data point and calculate K. If the point falls outside the margin then K is effectively 0, and if it falls inside the margin then K > 0.
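As a tiny sketch of that intuition only (the landmark, sigma and cutoff below are made-up values, and this is not how a full SVM library actually evaluates its decision function), classifying a new point by its kernel value could look like this:

<code python>
# Illustration of the "K > 0 inside the margin, K roughly 0 outside" intuition above.
# The landmark, sigma and cutoff are hypothetical values, not learned ones.
import numpy as np

def rbf_kernel(x, landmark, sigma):
    """Gaussian RBF kernel value between a point and a landmark."""
    dist_sq = np.sum((np.asarray(x) - np.asarray(landmark)) ** 2)
    return np.exp(-dist_sq / (2 * sigma ** 2))

landmark = np.array([0.0, 0.0])   # assumed centre of the inner category
sigma = 1.0                       # controls the size of the kernel's base
cutoff = 0.01                     # K below this is treated as effectively 0

def predict(point):
    k = rbf_kernel(point, landmark, sigma)
    return "inner category (K > 0)" if k > cutoff else "outer category (K ~ 0)"

print(predict([0.2, -0.1]))   # near the landmark  -> inner category
print(predict([4.0, 3.0]))    # far from landmark  -> outer category
</code>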

=== Computation Restrictions ===
Recall that the problem with mapping to a higher dimension was that it is computationally intensive: we needed to map each 2D data point into 3D space, find the optimal hyperplane in that 3D space, and finally map everything back to the 2D space. With the kernel trick, none of the data is mapped onto a higher dimension. Recall the Gaussian RBF Kernel shown in Fig. 21; the kernel is drawn in 3D only to visualize its value within the margin. Everything outside the margin is considered to have K = 0, which means every point with K = 0 belongs to one category. For the other category we simply check whether K > 0; if it is, the point belongs to that category. The kernel is updated as the margin is adjusted to match the training data. There is no mapping into a third dimension and no finding of a 3D hyperplane.

The Python implementation for Kernel SVM is as follows (this does not include importing the dataset, splitting the dataset, or feature scaling):

{{ :forecasting:kernelsvmcode.jpg?600 |}}
**Fig. 25.** Kernel SVM Python implementation.
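The code actually used for these results is the one in the figure; as a rough, minimal sketch of the same approach with scikit-learn (again assuming X_train, X_test, y_train, y_test already exist from the splitting and feature scaling steps), a kernel SVM might look like this:

<code python>
# Minimal sketch of a kernel SVM with the Gaussian RBF kernel (not the exact code from the figure).
# Assumes X_train, X_test, y_train, y_test already exist (split and feature-scaled).
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# 'rbf' selects the Gaussian RBF kernel discussed above; gamma plays the role
# of 1 / (2 * sigma**2) and so controls the size of the kernel's base.
classifier = SVC(kernel='rbf', gamma='scale', random_state=0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
</code>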

This method yielded 93% accuracy, and a visualization of the results is shown below:

{{ :forecasting:kernelsvmresults.png?600 |}}
**Fig. 26.** Kernel SVM Test Set Results.
  