I’m taking an ambitious 40+ hour video training course on Machine Learning in Python. I’ll post more details on it as I get further into it. All of this work is done off hours, after long work days, time permitting in evenings and weekends. I want to get better versed in it so that I can be a better programmer/analyst if I’m ever placed on a project where predictions based on models are needed. I think I have been putting off learning it long enough 🙂
This course is using Python
We use the pandas, matplotlib and numpy libraries.
First Module – Data Preprocessing.
1. This is all about taking your incoming data and getting it ready to use as input so that you can create your training and test data sets.
The “getting it ready” means:
– Replacing the missing data in columns (as will happen in the real world)
It is recommended to fill in those missing values with, for example, the “mean” of the column or the “most frequent” value. You do not want to be deleting the whole row.
You can use the Imputer class from sklearn (SimpleImputer in newer versions) with the “mean” strategy to do this.
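Here’s a minimal sketch of what that imputation step might look like with scikit-learn (the column names and numbers below are made up; newer sklearn versions use SimpleImputer instead of the older Imputer class):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with a missing Salary value
df = pd.DataFrame({
    "Country": ["Spain", "France", "Italy"],
    "Age": [44, 27, 30],
    "Salary": [72000, np.nan, 54000],
})

# Replace missing numeric values with the mean of each column
imputer = SimpleImputer(strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])
print(df)  # the missing Salary is now 63000 (the mean of 72000 and 54000)
```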
2. Encode categorical data if necessary.
Suppose you have a column which is a country “Spain, Italy, France, England”.
You cannot really have one column with values like 0 = Spain, 1 = Italy, 2 = France, etc. The reason is that Italy is not really “greater than” Spain; they are categories that cannot really be treated as numbers in an equation.
So we take that 1 column and create multiple columns with a 0/1 for Spain, Italy, France, etc..
We can use the LabelEncoder and OneHotEncoder classes in Python to do this. You end up with a column for each country, each with a mutually exclusive 0 or 1 value. If the country is Spain, the Spain column is 1 and the Italy, France and England columns are 0.
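A rough sketch of that one-hot encoding step (this uses ColumnTransformer with OneHotEncoder from newer scikit-learn versions rather than the course’s exact LabelEncoder/OneHotEncoder code, and the data is made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical Country column plus a numeric feature
df = pd.DataFrame({
    "Country": ["Spain", "Italy", "France", "Spain"],
    "Age": [44, 27, 30, 38],
})

# Turn the single Country column into one 0/1 column per country
ct = ColumnTransformer(
    [("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",  # keep the Age column untouched
)
X = ct.fit_transform(df)
print(X)
# A Spain row comes out as [0, 0, 1, age]: the columns are France, Italy, Spain, then Age
```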
Locate our independent variables (features) and our dependent variable.
3. Split our dataset (output from 1) into our training and test set.
We want to create 2 datasets.
Training set (usually 70-80% the size of the original dataset)
This is the dataset that we train the model on.
Test set (the other 20-30%)
This is the “sample set” that we ‘throw at our model’ to see if it performs well on data it has never seen.
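A quick sketch of that split using sklearn’s train_test_split (the data here is random filler, just to show the shapes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (100 rows, 3 features) and labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the rows as the test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```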
4. Scale features if necessary. Often our features will be on very different scales, which prevents the model from properly taking all of them into consideration. For example, age and salary features: if you use the Euclidean distance calculation on both, the salary values will be orders of magnitude larger than the age values and will drown out the age in the calculations. We don’t want that.
So we can use feature scaling, which puts our (independent/feature) variables into the same range/scale so that no variable’s range is dominated by another.
We use the StandardScaler class from the sklearn.preprocessing library to do this.
We scale both the X training and test sets to do this.
Sometimes, though, we don’t need to scale our Y. In the case of “classification” (0 or 1, or maybe 0, 1, 2) we don’t need to. With regression, we do.
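A minimal sketch of the scaling step (the age/salary numbers are made up); note that the scaler is fit on the training set only and then reused on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age/salary features on very different scales
X_train = np.array([[25, 40000], [35, 60000], [45, 120000]], dtype=float)
X_test = np.array([[30, 52000]], dtype=float)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean/std from the training set only
X_test = sc.transform(X_test)        # apply the same scaling to the test set
print(X_train)  # both columns are now roughly in the same small range
```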
Measuring performance of our model
We want the model to perform on the test set just about as well as it did on the training set we used to train it. That is how we measure performance. If our “test set” performance results stink, then we know we have probably “overfitted” our model to the training set, and we need to go back and fix that.
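A small sketch of what that check looks like in practice; the classifier and data here are just stand-ins (an unconstrained decision tree is an easy way to produce obvious overfitting):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree will happily memorize the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # typically ~1.0
print("test accuracy:", model.score(X_test, y_test))     # noticeably lower => overfitting
```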
This is still a work in progress as I’m taking some online courses in Machine Learning theory right now. This is “not” to make me a data scientist (I only took my college calculus, diffeq, business statistics, linear algebra and set theory/binary mathematics courses years ago and nothing beyond that) but rather to at least familiarize myself with a higher-level overview of what ML is from a theory standpoint. So in no way, shape or form is this my attempt to switch careers and become a true “data scientist” 🙂 I want to understand the things that a data scientist would do so that I can help translate those into actual programming models that display and compute things in order to produce predictions.
In no particular order:
Overview
Machine Learning – Building computational artifacts that learn over time based on experience. It’s the math/science/engineering and computing behind that, and it has to learn over time. You have data, you do analysis on the data, try to glean things from it, and use various kinds of computational structures to do that.
Supervised Learning – You have predefined labels and can determine whether or not something is what you’re targeting as the input events occur. It is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.[1] It infers a function from labeled training data consisting of a set of training examples.
Classification Learning:
The process of taking some kind of input and mapping it to some discrete label (ie, true/false). For example, image recognition of a dog vs. a cat.
The Pieces
Instances
Input data. Vectors of attributes that define your input space (ie, pictures/images, credit scores, whatever you use to describe your values).
Concept
Function. The function that maps inputs (the instances above) to some kind of output, mainly a binary value (true/false) or one of multiple known values.
An idea that describes a set of things, eg, the concept of “tallness” (if you give me something, I can tell you whether I think it’s tall or not). It’s a mapping between objects in a world and membership in a set (which is what makes it a function).
Target Concept
The thing we’re trying to find: the actual answer. The function that determines whether the thing is a car or a dog is the target function (or target concept). We have a notion in our heads of what makes something a dog or a car, but unless we have it written down somewhere, we don’t know whether an answer is right or wrong. You have to convey/teach that to the system.
Hypothesis Class
It’s the set of all “concepts” that we’re willing to entertain (all the functions we care about, or all the functions I’m willing to think about). Warning: don’t make it all “possible” functions, as that makes it hard to figure out which function is the right one given finite data.
Sample (training set)
We want to find a particular concept within the set of functions we have, and determining the right answer is done by using this training set. It’s a set of all our inputs paired with a label which is the correct output (ie, “It is a dog”, “Person is tall”, “Image is a 5”, etc.). A bunch of examples of input/output pairs is a training set. You give me a lot of examples of “this thing is a dog”, etc. It’s like someone walking with you on the street pointing out which things are cars and which are not, rather than giving you a dictionary definition of what a car is. That is “inductive learning”: lots of examples and labels.
Candidate
A concept that you think might be the “target concept”. The “animal you have” can be the candidate concept, and to find out whether or not it’s a dog, you compare it against the “target concept” (dog). So the candidate is the “animal you have”.
Testing Set
Given that you have a bunch of examples and a particular candidate (ie, a concept), how do you know whether you’re right or wrong? The testing set looks like a training set: examples (images, animals, etc.) paired with their correct labels. I take your candidate and determine whether it does a good job or not by looking at the testing set, because (in our case) to decide whether that candidate concept really identifies dogs, we check it against the “dog” entries in the testing set.
So you can go thru all of the examples in the testing set, apply the candidate to each one to determine whether it says true or false, and then compare that to what the testing set says the answer actually is.
Important: the training set and the testing set should not be the same. If you learn from your training set and you are tested only on your training set, that’s considered “cheating in ML” because you have not shown the ability to generalize. You want the testing set to include lots of examples that you don’t see in your training set. That is proof that you’re able to generalize. “Generalization” is the whole point of ML.
Regression:
This is more about continuous-valued functions. Given a bunch of points, we want to fit an imaginary line through them and use it to come up with a new value. There is no discrete “mapping” but really a calculation: a mapping from some real input space to a number (ie, predicting someone’s age based on some inputs).
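A tiny sketch of that idea with sklearn’s LinearRegression (the experience/salary numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points: years of experience -> salary
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([40000, 45000, 52000, 58000, 63000])

# Fit the "imaginary line" through the points, then predict a new continuous value
model = LinearRegression().fit(X, y)
print(model.predict([[6]]))  # predicted salary for 6 years of experience
```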
Unsupervised learning – a term used for Hebbian learning, associated with learning without a teacher, also known as self-organization; a method of modelling the probability density of inputs. Clustering is a form of unsupervised learning, since we don’t know what the groupings will be ahead of time.
Reinforcement learning (RL) – an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Reinforcement learning is considered one of three machine learning paradigms, alongside supervised learning and unsupervised learning. A good example of this is a chess, checkers or tic-tac-toe game which learns how to get better and better as you play it more times. It remembers the past and learns from its mistakes.
BTW, you’ll see this example using the “mnist.pkl.gz” dataset which is the globally known MNIST dataset. Info about that can be found here: https://en.wikipedia.org/wiki/MNIST_database
Part 1 – Setup/Train and Deploy
Part 2 – Tear down/deleting endpoints, model and S3 artifacts.
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.[1][2] The database is also widely used for training and testing in the field of machine learning.[3][4] It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments.[5] Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.[5]
AWS recommends using SageMaker instead of their original ML service (which is not available for new accounts).
– A simplified ML service which allows you to build and deploy your ML models (using many different out-of-the-box algorithms) to AWS. The built-in algorithms are not pre-trained, so we need to format the training data to fit the model input specifications. SageMaker will save the model parameters (artifacts) to S3 once training is completed. You can set up HTTPS endpoints.
– Linear Learner and Factorization Machines algorithms are supported for “classification and regression”, and Seq2Seq for sequence-to-sequence tasks such as text summarization and machine translation. K-Means for clustering (logically grouping data) and Principal Component Analysis for dimensionality reduction. Also XGBoost, DeepAR (time-series forecasting), etc.
Regression – Output prediction is a continuous real value
Classification – Output prediction is a discrete categorical value (vegetable or mineral, for example)
– Uses services like AWS Glue:
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. You can use that to move data around from Redshift, Aurora and such as input to your ML models.
– Has many built-in algorithms, so you as the developer don’t have to write the algorithm code. Each of the out-of-the-box “models” is hosted in a Docker container on AWS.
– Uses open source Jupyter (Python) notebooks, which are used by many data scientists, to load/train the models with input data.
– Once you train your models, you can create “endpoints” where your deployed model can be accessed programmatically by your software, etc..
– To use a built-in algorithm (a rough code sketch of these steps follows the list below):
1) Retrieve the training data (Explore and clean the data)
2) Format and serialize the data (put it in the format that the algorithm wants to see) and then upload it to S3
3) Train with the built in algorithm (stored in containers), set up the estimators and train with the input data.
4) Deploy the model which creates an endpoint configuration and endpoint for the prediction responses.
5) Use the endpoint for inference.
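Here’s a rough sketch of those steps using the SageMaker Python SDK. This is just how I understand it so far, not something from the course verbatim; the role ARN, bucket names and instance types are placeholders, and it assumes the training data was already serialized and uploaded to S3 in a format the algorithm accepts:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# 3) Point at the built-in algorithm's Docker image (Linear Learner here)
container = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-ml-bucket/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(predictor_type="binary_classifier")

# Train on the data that was formatted and uploaded in step 2
estimator.fit({"train": "s3://my-ml-bucket/train/"})

# 4) Deploy the trained model, which creates an endpoint config and an endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.t2.medium")

# 5) Use the endpoint for inference, e.g. predictor.predict(payload)

# Part 2 style tear-down: delete the endpoint when done to stop the charges
predictor.delete_endpoint()
```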
– boto3 python sdk offers access to other AWS services such as S3 and EC2, etc..
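For example, a quick boto3 sketch of the “upload to S3” part of step 2 (the bucket name and file paths are made up):

```python
import boto3

# Upload locally serialized training data to a (hypothetical) S3 bucket
s3 = boto3.client("s3")
s3.upload_file("train.csv", "my-ml-bucket", "sagemaker/train/train.csv")
```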
So I’m taking a bit of a detour (it’s easy to get positively distracted into that) from my container/cloud learning in my off time. I’ve been hearing some buzz about “Machine Learning” from a former employer who says there’s more and more opportunities in it. One of my favorite instructors, Stephen Grider, has just released a brand new course called “Machine Learning in Javascript”. On the job, I’m doing a lot of professional programming with React/Redux and much of my foundational learning was from Grider’s “React” courses. When I saw that he just released a Javascript Machine Learning course, that’s when I said “Ok, it’s detour time, I don’t know much about what it is and since I know Javascript and really like Grider’s courses, I’m going to buy this course and see what it’s about”.
So far, I’m about 20-25% of the way through it; I’ve taken the introductory pieces and followed along with his examples.
If you’re a Javascript programmer who wants to learn about Machine Learning but does not want to have to deal with learning Python (due to time constraints, etc..), then this course is for you!
I almost want to say that this is more like AI but it’s really a lot like data mining.
ML helps to solve common everyday problems such as:
“Is this email SPAM? If so, place in your spam box”.
“You just took a photo of a sign; I’ll examine the shape of the item in the photo and tell you what I think it is”.
“You just ordered a few items; based on what you have ordered in the past, here are some other items/products we think you would be interested in”.
“You just viewed these videos on youtube, based on what you saw, we think the following videos would be of interest to you”.
There are so many more applications of it. Even the health industry is getting into it.
Basically any system that needs to make a prediction about “what’s next” as accurately as possible. The way to get there is with “previous data”, known as training sets. With the advent of the cloud and “infinite space”, there is practically no limit to the amount of data you can collect. There are several algorithms out there, based on the “features you want to feed as input”, that will help reveal “the final result” (which is known as a label). But ML is not 100% accurate. Parameters, algorithms and sample sizes need to be refined, revisited and tweaked in order to get a good prediction.
I watched a useful video in which they say efficiently tweaking your ML parameters is more of an art than a science, but it’s getting better and better, and companies are finding it useful in many of their applications. Unlike hard-coded, rule-based algorithms (which don’t improve as data is collected over time), ML tends to get more and more accurate with more “samples”.
Again, I don’t claim to be an expert in this. I’m only about 20% into Grider’s course, but my main goal is to understand the basics of it.
I do know that Python has been the “de facto” language for ML apps. Javascript, however, is starting to take off for it. There are more and more projects/libraries coming out, such as Tensorflow.js, etc. They’re even working on native bindings in Node.js which may allow your algorithms and models (which can now be imported from Python models) to work within your Javascript apps.
See 19:29 in this video to learn about what the Node.js bindings will do.
The video has a really cool example of ML where they use the camera in the browser to capture photos of your face movements, and based on that they play a Pac-Man game: the guy just moves his head up/down and turns it left/right (while the application is capturing his face movements), and that controls where the “Pac-Man” character moves on the game board. It uses ML to learn the “face movements” and then predict where Pac-Man should go (whether it be up, down, left or right).
I have a “gut” feeling that ML will be a part of the next generation of Javascript apps. I also have a feeling that Javascript will continue to gain ground rapidly in ML as more and more libraries are developed and may even surpass Python as the language for ML apps. Don’t underestimate the “power of numbers” (no pun intended)
I do not object to Python. I have not really worked with it much. Back in my Unix days in the 1990’s and early 2000’s, I did quite a bit with shell scripting, regular expressions and perl. When I moved more into .NET apps, Python was sort of coming into its own. I missed that. I’m more than happy to learn Python and I’d like to, but I have so much more to learn (that is higher priority for me right now) so Python is lower on that list. Having said that, I was thrilled when Grider released this ML course in Javascript, as it meant that I could really focus more on ML and its applications/concepts rather than being bogged down by learning the “python” language at the same time. Again, it’s not that I object to Python or that I don’t want to learn it. I just don’t want the ML concepts “muddied” by Python when I am very comfortable with Javascript 🙂
I don’t really know what I expect to get out of this course other than learning quite a bit (hopefully). A Stephen Grider course is a really good investment. If you’re trying to learn React, React Native or Javascript, etc., Stephen Grider’s courses are known for being among the best to take.
I do think after the end of this course (17 hours of video instruction), I will at least be able to know at a high level what ML can do and whether or not it can be utilized in any of the applications we’re doing at work. I don’t expect to be “an expert” or to be able to walk into a place and say “I took that course on it, therefore I’m an expert”. I’m not an expert (I’m a lifelong learner aspiring to get better) in anything anyways so I’d never call myself that to begin with. I think of this course more as a “first drive thru Yellowstone park where I’m not stopping off everywhere to hike for miles off the beaten paths, etc.. This is more of a drive thru the whole park to see the high level main attractions, maybe park the car and take pictures and leave with a *I know what’s at Yellowstone park*”. Someday if I want/need to go back there to hike for miles off the beaten path, I’ll do more and more research (ie, more deep learning on machine learning). So basically I am counting on this course being like a first drive thru Yellowstone to see all of the attractions but not to really go off and spend the winter there in a remote cabin with snow shoes where I can hike thru remote passageways and mountainous backdrops, etc..
Terms
features (Inputs)
label (known answer)
test data – known sample data
Supervised Learning (its test data contains labels)
Unsupervised Learning (its test data does not contain labels)
Classification – known discrete answers
Regression – computed real-valued numbers
K Nearest Neighbors
Linear Regression – allows for multiple features
Mean Squared Error – a metric for measuring prediction accuracy (see the short example after this list)
Python, Javascript, R – Languages that you can do ML with
TensorFlow – powerful multi-featured open source library
MNIST – free dataset for image recognition to learn ML
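Since Mean Squared Error showed up in the terms list above, here’s a quick worked example (the actual/predicted numbers are just made up):

```python
import numpy as np

# Hypothetical actual vs. predicted values
actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

# Mean Squared Error: the average of the squared differences
mse = np.mean((actual - predicted) ** 2)
print(mse)  # (0.25 + 0 + 2.25 + 1) / 4 = 0.875
```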