Differences
This shows you the differences between two versions of the page.
forecasting:data_preprocessing [2020/11/11 19:21] kmacloves [Feature Scaling] |
forecasting:data_preprocessing [2021/09/19 21:59] (current) |
||
---|---|---|---|
Line 19: | Line 19: | ||
Since the level corresponds to the position, we don't care about the first column. Let's say that the Salary is the dependent variable and Level is the independent variable that Salary depends on. | Since the level corresponds to the position, we don't care about the first column. Let's say that the Salary is the dependent variable and Level is the independent variable that Salary depends on. | ||
- | ==== Step 1: Import the Data ==== | + | Step 1: Import the Data |
Importing the data is done using the Pandas library: | Importing the data is done using the Pandas library: | ||
|dataset = pd.read_csv('nameOfDatasheet.csv')| | |dataset = pd.read_csv('nameOfDatasheet.csv')| | ||
- | ==== Step 2: Select the values ==== | ||
+ | Step 2: Select the values | ||
//iloc// is a pandas indexer that selects rows and columns of a DataFrame by integer position. Since we want the independent variable, x, to contain all the rows and just the second column, we can select the values like this: | //iloc// is a pandas indexer that selects rows and columns of a DataFrame by integer position. Since we want the independent variable, x, to contain all the rows and just the second column, we can select the values like this: | ||
|x = dataset.iloc[:, 1:-1].values| | |x = dataset.iloc[:, 1:-1].values| | ||
- | similarly since we want all the rows for the dependent variable and just the last column we can find the values like this: | + | Similarly, since we want all the rows for the dependent variable and just the last column, we can find the values like this: |
|y = dataset.iloc[:, -1].values| | |y = dataset.iloc[:, -1].values| | ||
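The two selections above can be tried end to end with a minimal sketch. The three-row table and its column names (Position, Level, Salary) are made-up stand-ins for the real CSV file, and the data is read from an in-memory string so that no file is needed:

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: Position, Level, Salary columns.
csv_text = """Position,Level,Salary
Analyst,1,45000
Consultant,2,50000
Manager,3,60000
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

# All rows, every column except the first and the last.
# With three columns this is just the second one, kept as a 2D array.
x = dataset.iloc[:, 1:-1].values

# All rows, last column only, returned as a 1D array.
y = dataset.iloc[:, -1].values

print(x.shape, y.shape)  # (3, 1) (3,)
```

Note that the slice `1:-1` keeps x two-dimensional, while the single index `-1` makes y one-dimensional.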
Line 59: | Line 59: | ||
** One Hot Encoding ** | ** One Hot Encoding ** | ||
- | One hot encoding simple takes a set of variables/strings and creates a vector with n dimensions equal to the amount of variables in the set and then assigns each variable/string to one space in that vector. Using the same dataset we will take the three countries, which means we will have a vector with n = 3 and it will look like this: | + | One hot encoding simply takes a set of variables/strings and creates a vector with n dimensions equal to the amount of variables in the set and then assigns each variable/string to one unique space in that vector. Using the same dataset we will take the three countries, which means we will have a vector with n = 3 and it will look like this: |
- | |[(France) (Spain) (Germany)].| | + | |[(France) (Spain) (Germany)]| |
To encode a given country, we set its corresponding space to 1 and all other spaces to 0: | To encode a given country, we set its corresponding space to 1 and all other spaces to 0: | ||
- | |France: [1 0 0] | | + | | France: | [1 0 0] | |
- | |Spain: [0 1 0] | | + | | Spain: | [0 1 0] | |
- | |Germany: [0 0 1]| | + | | Germany: | [0 0 1] | |
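The mapping in the table above can be reproduced by hand with a few lines of NumPy. This is only an illustrative sketch: the `categories` list and the `one_hot` helper are hypothetical names, and the category order is fixed manually to match the table.

```python
import numpy as np

# Category order chosen by hand to match the table: France, Spain, Germany.
categories = ["France", "Spain", "Germany"]


def one_hot(country):
    """Return a vector with 1 in the country's slot and 0 everywhere else."""
    vector = np.zeros(len(categories), dtype=int)
    vector[categories.index(country)] = 1
    return vector


for country in categories:
    print(country, one_hot(country))
```

A library encoder may order the categories differently (scikit-learn, for example, sorts them when it infers them from the data), so the position of each 1 can differ from the table even though the idea is the same.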
To apply one hot encoding in Python we can use the OneHotEncoder class from the scikit-learn library: | To apply one hot encoding in Python we can use the OneHotEncoder class from the scikit-learn library: | ||