forecasting:data_preprocessing (last modified 2021/09/19 21:59)
Since the level corresponds to the position, we don't care about the first column. Let's say that Salary is the dependent variable and Level is the independent variable that Salary depends on.
  
==== Step 1: Import the Data ====
Importing the data is done using the pandas library:
  
|dataset = pd.read_csv('nameOfDatasheet.csv')|
==== Step 2: Select the Values ====
//iloc// is a function in the pandas library that locates values in a dataset based on the indexes given in the arguments. Since we want the independent variable, x, to contain all the rows and just the second column, we can find the values like this:
  
|x = dataset.iloc[:, 1:-1].values|
  
Similarly, since we want all the rows for the dependent variable and just the last column, we can find the values like this:
  
|y = dataset.iloc[:, -1].values|
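Steps 1 and 2 can be sketched end to end like this. This is a minimal illustration, not the page's actual datasheet: the small DataFrame stands in for the CSV that pd.read_csv would load, and its column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the CSV loaded by pd.read_csv in Step 1
dataset = pd.DataFrame({
    "Position": ["Analyst", "Manager", "Director"],
    "Level": [1, 2, 3],
    "Salary": [45000, 80000, 150000],
})

# All rows, every column between the first and the last (here, just Level)
x = dataset.iloc[:, 1:-1].values
# All rows, only the last column (Salary)
y = dataset.iloc[:, -1].values

print(x.shape)  # (3, 1) -- x stays 2-D because a slice was used
print(y.shape)  # (3,)   -- y is 1-D because a single index was used
```

Note that slicing with 1:-1 keeps x two-dimensional, which most scikit-learn estimators expect, while the single index -1 makes y one-dimensional.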
** One Hot Encoding **
  
One hot encoding simply takes a set of categories and creates a vector with n dimensions equal to the number of categories in the set, then assigns each category to one unique space in that vector. Using the same dataset, we will take the three countries, which means we will have a vector with n = 3 that looks like this:
  
|[(France) (Spain) (Germany)]|
  
When we refer to one country, we set the corresponding space equal to 1 and all other spaces equal to 0:
  
|France:  [1 0 0]|
|Spain:   [0 1 0]|
|Germany: [0 0 1]|
  
To use one hot encoding in Python we can use the OneHotEncoder function from the scikit-learn library:
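As a minimal sketch of OneHotEncoder on a toy country column (the data here is hypothetical, and note that the encoder orders the output columns alphabetically by category, which may differ from the order written above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Country column of the dataset
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # ['France' 'Germany' 'Spain'] -- sorted alphabetically
print(encoded[0])              # France -> [1. 0. 0.]
```

Each row of the encoded array has exactly one 1, in the position matching its category.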
In this case we can name the variables anything, but this naming convention is simple and easy to understand. We can also change the test size: if we wanted to train the model on 90% of the data, we would set test_size equal to 0.1. random_state simply controls the shuffling applied to the data before the split.
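The split described above can be sketched like this (the toy arrays are hypothetical; test_size=0.2 reserves 20% of the rows for testing):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the feature matrix and target vector
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 80/20 split; fixing random_state makes the shuffle (and the split) reproducible
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Running the split again with the same random_state returns exactly the same partition, which is what makes experiments repeatable.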
===== Feature Scaling =====
Feature scaling is a task that we need to perform when we have a wide range of values in our data. For example, take the dataset with the different countries: we need to use feature scaling because the values in the salary column are significantly greater than those in the age column. To feature scale we will simply use the StandardScaler function from the sklearn library. What StandardScaler does is standardize each column by subtracting its mean and dividing by its standard deviation, so the scaled values are centered on 0 with unit variance.
  
|from sklearn.preprocessing import StandardScaler|
|sc = StandardScaler()|
|X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])|
|X_test[:, 3:] = sc.transform(X_test[:, 3:])|
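A runnable sketch of the same idea, with hypothetical age and salary numbers. The key point is that the scaler is fit on the training data only, and the test data is transformed with those same training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age and salary columns; salary dwarfs age before scaling
X_train = np.array([[25.0, 40000.0],
                    [35.0, 60000.0],
                    [45.0, 80000.0],
                    [55.0, 100000.0]])
X_test = np.array([[30.0, 50000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = sc.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))  # ~[0. 0.]
print(X_train_scaled.std(axis=0))   # ~[1. 1.]
```

Calling transform (not fit_transform) on the test set avoids leaking test-set statistics into preprocessing.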
  