Machine Learning

Machine Learning Course – Data Preparation

I’m taking an ambitious 40+ hour video training course on Machine Learning in Python. I’ll share more details as I get further into it. All of this work is done off hours after long work days, time permitting, in evenings and weekends. I want to get better versed in it so that I can be a better programmer/analyst if I am ever placed on projects where predictions based on models are needed. I think I have been putting off learning it long enough 🙂

This course is using Python.
We use the pandas, matplotlib and numpy libraries.

First Module – Data Preprocessing.
1. This is all about taking your incoming data and getting it ready to use as input so that you can create your training and test data sets.
The “getting it ready” means:
– Replacing the missing data in columns (as will happen in the real world)
It is recommended to fill in those values with either the “mean” of the column or the “most frequent” value, etc. You do not want to be deleting the row.
You can use the imputer class from sklearn (SimpleImputer in current versions) with the “mean” strategy to do this, as sketched below.
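A minimal sketch of what that looks like, assuming a toy NumPy array standing in for real data (current sklearn versions use SimpleImputer; the course may show the older Imputer class):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: [age, salary] rows with some missing values (np.nan)
X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [30.0, np.nan],       # missing salary
              [np.nan, 61000.0]])   # missing age

imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
X = imputer.fit_transform(X)
print(X)  # the NaNs are now the means of their columns
```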

2. Encode categorical data if necessary.
Suppose you have a country column with values “Spain, Italy, France, England”.
You cannot really have a column with values of: 0 = Spain, 1 = Italy, 2 = France, etc. The reason is that Italy is not really > Spain; they are categories with no mathematical ordering that can be used in an equation.
So we take that 1 column and create multiple columns with a 0/1 for Spain, Italy, France, etc.

We can use the LabelEncoder and OneHotEncoder classes in Python to do this. You end up with a column for each country with a mutually exclusive 0 or 1 value in each row. If the country is Spain, the Spain column is 1 and the Italy, France and England columns are 0. See the sketch below.
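A hedged sketch of that encoding, using a made-up country column (newer sklearn versions let OneHotEncoder handle strings directly, so no LabelEncoder step is needed; older versions use sparse=False instead of sparse_output=False):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["Spain"], ["Italy"], ["France"], ["Spain"]])

encoder = OneHotEncoder(sparse_output=False)  # one 0/1 column per country
encoded = encoder.fit_transform(countries)

print(encoder.categories_)  # [array(['France', 'Italy', 'Spain'], ...)]
print(encoded)              # e.g. the Spain rows come out as [0. 0. 1.]
```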

Locate our independent variables (features) and our dependent variable.
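A common pattern for this, assuming a hypothetical Data.csv where the last column is the dependent variable:

```python
import pandas as pd

dataset = pd.read_csv("Data.csv")  # hypothetical file name
X = dataset.iloc[:, :-1].values    # independent variables (features)
y = dataset.iloc[:, -1].values     # dependent variable
```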

3. Split our dataset (output from 1) into our training and test set.
We want to create 2 datasets.
Training set (usually 70-80% of the size of the original dataset)
This is the dataset that we train the model on.
Test set (the other 20-30%)
This is the “sample set” that we ‘throw at our model’ to see if it performs as well as it did on the training set. See the sketch below.
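A minimal sketch using scikit-learn’s train_test_split with toy arrays and the 80/20 ratio mentioned above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 rows, 2 features)
y = np.arange(10)                 # toy dependent variable

# 80% training / 20% test; random_state fixed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```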

4. Scale features if necessary. Many times our features will be on different scales, which keeps the model from properly taking all of our features into consideration. For example, consider age and salary features. If you use the Euclidean distance calculation on both age and salary, the salary differences will be orders of magnitude larger than the age differences and will hide the age in the calculations. We don’t want that.
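A quick made-up illustration of why this matters:

```python
import numpy as np

a = np.array([25, 40000.0])  # [age, salary]
b = np.array([47, 83000.0])

# ~43000: the salary difference swamps the age difference of 22
print(np.linalg.norm(a - b))
```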

So we can use feature scaling, which puts our variables (independent/feature) into the same range/scale so that no variable’s range is dominated by another’s.

We use the StandardScaler class from the sklearn.preprocessing lib to do this.
We scale both the X training set and the X test set, as sketched below.
Sometimes though we don’t need to scale our y. In the case of “classification” (0 or 1, or maybe 0, 1, 2) we don’t need to. With regression, we do.
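A sketch of the usual pattern with made-up age/salary data; note the scaler is fit on the training set only and then reused on the test set so no test information leaks in:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[25, 40000.0], [38, 61000.0], [47, 83000.0]])
X_test = np.array([[30, 52000.0]])

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test = scaler.transform(X_test)        # apply the same mean/std to test data
```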

Measuring performance of our model
A good model performs just as well on the “test set” as on the “training set” that we used to train it. This is how we measure performance. If our “test set” performance results stink, then we know we perhaps “overfitted” our model to conform just to the training set. We need to go back and fix that.
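A hedged sketch of that comparison, using a stand-in dataset and classifier (neither is from the course); a big gap between the two scores suggests overfitting:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```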
