__**example**__
  
|import numpy as np|
  
===== Frequently Used Libraries =====
Since the level corresponds to the position, we don't care about the first column. Let's say that Salary is the dependent variable and Level is the independent variable that Salary depends on.
  
==== Step 1: Import the Data ====
Importing the data is done using the Pandas library:
  
|dataset = pd.read_csv('nameOfDatasheet.csv')|

==== Step 2: Select the values ====
//iloc// is a function in the pandas library that locates values in a dataset based on the indexes given in its arguments. Since we want the independent variable, x, to take all the rows and just the second column, we can find the values like this:
  
|x = dataset.iloc[:, 1:-1].values|
  
Similarly, since we want all the rows for the dependent variable and just the last column, we can find the values like this:
  
|y = dataset.iloc[:, -1].values|
  
So within the "[]" of the iloc function, the first argument corresponds to the values in the rows and the second argument corresponds to the values in the columns.
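As a quick illustration of those two arguments, here is a minimal runnable sketch; the DataFrame below is made-up stand-in data, not the actual datasheet:

```python
import pandas as pd

# Made-up stand-in for the Position/Level/Salary datasheet described above
dataset = pd.DataFrame({
    'Position': ['Analyst', 'Manager', 'Director'],
    'Level': [1, 2, 3],
    'Salary': [45000, 80000, 150000],
})

# First argument inside [] picks rows, second picks columns
x = dataset.iloc[:, 1:-1].values  # all rows, columns between first and last -> Level
y = dataset.iloc[:, -1].values    # all rows, last column -> Salary

print(x.shape)  # (3, 1)
print(y)        # [ 45000  80000 150000]
```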
**Here's how to implement the SimpleImputer:**
  
|from sklearn.impute import SimpleImputer|
|imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')|
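A minimal runnable sketch of the imputer on made-up numbers; the step after constructing it is calling fit_transform on the columns that contain NaNs:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up feature matrix with one NaN per column
x = np.array([[25.0, 50000.0],
              [np.nan, 60000.0],
              [35.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x = imputer.fit_transform(x)  # each NaN is replaced by its column's mean

print(x)  # row 1 now holds 30.0 (mean of 25 and 35); row 2 holds 55000.0
```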
===== Encoding Categorical Data =====
When you have data that is not a numerical value but is a category we cannot ignore, we need to turn it into numerical data that is also not in an ordered format (e.g. 0, 1, 2). Let's take the following dataset:
** One Hot Encoding **
  
One hot encoding simply takes a set of variables/strings and creates a vector with n dimensions equal to the number of variables in the set, then assigns each variable/string to one unique space in that vector. Using the same dataset we will take the three countries, which means we will have a vector with n = 3 and it will look like this:
  
|[(France) (Spain) (Germany)]|
  
When we are calling one country we will be setting the corresponding space equal to 1 and all other spaces equal to 0:
  
|   France: [1 0 0]  |
|    Spain: [0 1 0]  |
|  Germany: [0 0 1]  |
  
To use the OneHotEncoder function in Python we can use it from the scikit-learn library:
  
|from sklearn.compose import ColumnTransformer|
|from sklearn.preprocessing import OneHotEncoder|
|ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')|
|x = np.array(ct.fit_transform(x))|
  
In this block of code we need to use the ColumnTransformer function because we are changing a specific column in the dataset and not the whole dataset. So the ColumnTransformer function is encoding using a one hot encoder on the 0th column (the first column).
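Putting this together on a few made-up rows (country, age, salary; the values are illustrative only). Note that OneHotEncoder orders its one-hot columns alphabetically by category:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows shaped like the country dataset above: [Country, Age, Salary]
x = np.array([['France', 44, 72000],
              ['Spain', 27, 48000],
              ['Germany', 30, 54000]], dtype=object)

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')
x = np.array(ct.fit_transform(x))

# Column 0 is replaced by three one-hot columns (France, Germany, Spain);
# the remaining columns pass through untouched
print(x.shape)  # (3, 5)
```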
Label encoding simply encodes variables in text format into binary 0 and 1 values. In the data above we still have "No" and "Yes" in the final column, which is our dependent variable (the y vector). So we need to encode these values so that they are 0s and 1s. We could simply go into the Excel sheet and change "No" and "Yes" to 0 and 1 respectively, but with large datasets that would take very long, so the LabelEncoder function is very useful for this task. To implement the LabelEncoder we use the sklearn library again:
  
|from sklearn.preprocessing import LabelEncoder|
|le = LabelEncoder()|
|y = le.fit_transform(y)|
  
Again, "fit_transform" fits the LabelEncoder (le) onto y, then performs the encoding on y, and the output is assigned back to y.
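A minimal runnable sketch with a made-up y vector:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up dependent variable column like the "No"/"Yes" one above
y = ['No', 'Yes', 'No', 'Yes']

le = LabelEncoder()
y = le.fit_transform(y)  # classes are sorted, so 'No' -> 0 and 'Yes' -> 1

print(y)  # [0 1 0 1]
```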
===== Splitting Datasets Into Training and Testing Data =====
| from sklearn.model_selection import train_test_split                                          |
| X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)  |
In this case we can name the variables anything, but this naming convention is simple and easy to understand. We can also change the test size: if we wanted to train the model on 90% of the data we would set test_size equal to 0.1. random_state simply controls the shuffling applied to the data before the split.
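To make the 80/20 behaviour concrete, here is a sketch on ten made-up samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size = 0.2 -> 8 rows for training, 2 rows for testing;
# a fixed random_state makes the shuffle (and thus the split) reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```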
===== Feature Scaling =====
Feature scaling is a task we need to perform when our features have widely different ranges of values. For example, take the dataset with the different countries: we need to use feature scaling because the salary column is significantly greater than the age column. To feature scale we will simply use the StandardScaler function from the sklearn library. What StandardScaler does is standardize each column by subtracting its mean and dividing by its standard deviation, so the scaled values have a mean of 0 and a standard deviation of 1.
  
|from sklearn.preprocessing import StandardScaler|
|sc = StandardScaler()|
|X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])|
|X_test[:, 3:] = sc.transform(X_test[:, 3:])|
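To see that StandardScaler centers and rescales rather than clamping to a fixed range, here is a sketch on a single made-up salary column; note the scaler is fitted on the training data only and then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up salary column with a wide range of values
X_train = np.array([[48000.0], [54000.0], [61000.0], [72000.0]])
X_test = np.array([[58000.0]])

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn the mean and std from the training data
X_test = sc.transform(X_test)        # apply the *same* mean and std to the test data

# The scaled training column now has mean 0 and standard deviation 1
print(X_train.mean(), X_train.std())
```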
  
Last modified: 2021/09/19 21:59