Differences

This shows you the differences between two versions of the page.

forecasting:data_preprocessing [2020/11/11 19:21]
kmacloves [Feature Scaling]
forecasting:data_preprocessing [2021/09/19 21:59] (current)
Line 19: Line 19:
 Since the level corresponds to the position, we don't care about the first column. Let's say that the Salary is the dependent variable and Level is the independent variable that Salary depends on.
  
-==== Step 1: Import the Data ====
+Step 1: Import the Data
 Importing the data is done using the Pandas library:
  
 |dataset = pd.read_csv('nameOfDatasheet.csv')|
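
A minimal sketch of the import step, assuming a three-column layout (Position, Level, Salary) as the surrounding text suggests. The column names and rows here are hypothetical stand-ins, and an in-memory string replaces the CSV file so the snippet runs on its own:

```python
import io

import pandas as pd

# Hypothetical stand-in for 'nameOfDatasheet.csv'; names and values are illustrative only.
csv_text = """Position,Level,Salary
Analyst,1,45000
Consultant,2,50000
Manager,3,60000
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

# Quick check that the rows and columns loaded as expected.
print(dataset.shape)
print(dataset.head())
```

With a real file, passing the path string to `pd.read_csv` as in the line above works the same way.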
-==== Step 2: Select the values ====
 
+Step 2: Select the values
 //iloc// is an indexer in the pandas library that selects values from a DataFrame by integer position, using the row and column indexes given in the brackets. Since we want the independent variable, x, to hold all the rows and just the second column, we can select the values like this:
  
 |x = dataset.iloc[:, 1:-1].values|
  
-similarly since we want all the rows for the dependent variable and just the last column we can find the values like this:
+Similarly since we want all the rows for the dependent variable and just the last column we can find the values like this:
  
 |y = dataset.iloc[:, -1].values|
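
To make the two selections concrete, here is a small sketch on a hypothetical three-column table (the column names and values are assumptions, not taken from the page's dataset). It shows that `1:-1` keeps every column from the second up to, but not including, the last, so x stays two-dimensional while y is a flat array:

```python
import pandas as pd

# Hypothetical data: Position (ignored), Level (independent), Salary (dependent).
dataset = pd.DataFrame({
    "Position": ["Analyst", "Consultant", "Manager"],
    "Level": [1, 2, 3],
    "Salary": [45000, 50000, 60000],
})

# All rows, columns from index 1 up to (not including) the last one.
x = dataset.iloc[:, 1:-1].values

# All rows, last column only.
y = dataset.iloc[:, -1].values

print(x.shape)  # 2-D: one row per sample, one column per feature
print(y.shape)  # 1-D: one value per sample
```

Keeping x two-dimensional matters because many scikit-learn estimators expect the features as a matrix even when there is only one feature.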
Line 59: Line 59:
 ** One Hot Encoding **
  
-One hot encoding simple takes a set of variables/strings and creates a vector with n dimensions equal to the amount of variables in the set and then assigns each variable/string to one space in that vector. Using the same dataset we will take the three countries, which means we will have a vector with n = 3 and it will look like this:
+One hot encoding simply takes a set of variables/strings and creates a vector with n dimensions equal to the amount of variables in the set and then assigns each variable/string to one unique space in that vector. Using the same dataset we will take the three countries, which means we will have a vector with n = 3 and it will look like this:
  
-|[(France) (Spain) (Germany)].|
+|[(France) (Spain) (Germany)]|
  
 When we are calling one country we will be setting the corresponding space equal to 1 and all other spaces equal to 0:
  
-|France: [1 0 0] |
-|Spain: [0 1 0]  |
-|Germany: [0 0 1]|
+  France: [1 0 0]
+   Spain: [0 1 0]  |
+ Germany: [0 0 1]  |
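
The mapping above can be reproduced by hand in a few lines. This is only an illustrative sketch; the helper name `one_hot` is hypothetical and not part of any library:

```python
# Category order fixes which slot of the vector each country owns.
countries = ["France", "Spain", "Germany"]


def one_hot(country, categories):
    """Return a list with 1 in the slot for `country` and 0 everywhere else."""
    if country not in categories:
        raise ValueError(f"unknown category: {country}")
    return [1 if category == country else 0 for category in categories]


encoded = {country: one_hot(country, countries) for country in countries}

for country, vector in encoded.items():
    print(country, vector)
```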
  
 To apply one hot encoding in Python we can use the OneHotEncoder class from the scikit-learn library:
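
The page's own snippet falls outside this hunk. As a rough sketch only, assuming scikit-learn is installed and using a hypothetical one-column array of countries, `OneHotEncoder` can be applied like this:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical input: one categorical column, one row per sample.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(countries).toarray()

# The encoder sorts categories, so column order follows categories_,
# not the order in which the countries first appear.
print(encoder.categories_[0])
print(encoded)
```

Because the categories are sorted alphabetically, the columns come out as France, Germany, Spain, which differs from the France, Spain, Germany order used in the hand-written vectors above.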
  • forecasting/data_preprocessing.1605122509.txt.gz
  • Last modified: 2021/09/19 21:59
  • (external edit)