Machine Learning

Machine Learning – K-Nearest Neighbor from a “non data scientist programmer” point of view

This is all from my own “beginner’s perspective”. I am NOT claiming to be an expert and welcome any constructive criticism and corrections to anything I may have said that might not be completely accurate 🙂 There are no perfect models. The key is to find the best algorithm for the specific job/problem that needs to be solved.

Algorithm Type: Classification, though it can sometimes be used for regression.

The idea: store the previously labeled data points. When a new point comes in, measure its distance (usually Euclidean, but Hamming or Minkowski also work) to the stored points, take the “K” closest ones, and let those neighbors vote on the new point’s class.

The trick is choosing the value of K (there are a few ways to do this).

Source: this is from the recommended Udemy course in Machine Learning that I took.

[Figure: knn1 – classifying a new data point with K=5 nearest neighbors]

Above, we chose K=5 initially.

For the new data point, calculate its distance (e.g. using Euclidean distance) to the K closest training points. The category holding the majority of those K points becomes the class of your new data point.

Python, R, and JavaScript have libraries that implement this algorithm and make it relatively easy.
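To make the voting concrete, here is a from-scratch sketch in plain Python. The toy data, labels, and K=5 are all made up for illustration; in practice you would reach for a library instead:

```python
import math
from collections import Counter

def knn_predict(train, labels, point, k=5):
    """Majority vote among the k training points closest to `point`."""
    by_distance = sorted(
        (math.dist(p, point), label) for p, label in zip(train, labels)
    )
    votes = [label for _, label in by_distance[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy data: two obvious groups.
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]

# 3 of the 5 nearest neighbors of (2, 2) are "A", so "A" wins the vote.
print(knn_predict(train, labels, (2, 2), k=5))  # -> A
```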


Machine Learning – Logistic Regression from a “non data scientist programmer” point of view


Algorithm Type: Classification

Given a set of independent variables, this method estimates the probability of a discrete (known) outcome such as 0/1, true/false, or yes/no. It is also known as “logit regression” since it is built on the logistic function (see https://en.wikipedia.org/wiki/Logistic_function).
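The logistic function itself is simple: it squashes any real number into the range (0, 1), which is why its output can be read as a probability. A quick sketch:

```python
import math

def logistic(z):
    """The logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(0))   # -> 0.5
print(logistic(4))   # -> ~0.982 (strongly "yes")
print(logistic(-4))  # -> ~0.018 (strongly "no")
```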

– A life insurance company may use this to predict the probability that a policy holder will pass away before the term expires based on “features” such as age, smoker/non-smoker, gender, salary.

– An employer may use this to predict the probability that a candidate will be successful on the job based on “features” such as “degree earned”, “experience in years”, “major”, “minor”, “gender”, etc.

Note that this does not always perform well on large, complex datasets.

Choose your independent variables wisely. This is another place a data scientist shows her/his value, since domain knowledge and experience with the problem tell you which independent variables matter. Selecting them is an iterative process of forward selection and backward elimination, checking error rates, etc. The smaller the set of variables, the faster the prediction.

Graphing?
Although the Python course in ML that I took showed a resulting graph for logistic regression, that example had only 2 independent variables. In many cases, we will have more.

This site suggests the following:

Graphs aren’t very useful for showing the results of multiple logistic regression; instead, people usually just show a table of the independent variables, with their P values and perhaps the regression coefficients.

Python, R, and JavaScript have libraries that implement this algorithm and make it relatively easy.
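As a sketch of what such a library does under the hood, here is a tiny logistic regression fit by gradient descent. The single feature, labels, learning rate, and iteration count are all invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One made-up feature (say, a scaled age) and a binary outcome.
X = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
y = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(5000):
    grad_w = grad_b = 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(w * xi + b) - yi  # prediction error for this row
        grad_w += err * xi
        grad_b += err
    w -= lr * grad_w / len(X)  # gradient descent step on the weight
    b -= lr * grad_b / len(X)  # ... and on the intercept

# Estimated probability of the positive class for new observations:
print(sigmoid(w * 0.2 + b))   # low probability
print(sigmoid(w * 0.85 + b))  # high probability
```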


Machine Learning – Random Forest Regression from a “non data scientist programmer” point of view


Algorithm Type: Classification, and occasionally regression for continuous dependent variables.

Random Forest is also known as “an *Ensemble* of Decision Trees”. What does that mean? It means the model is made up of many decision trees, and the average of their results makes up the final prediction. In layman’s terms, it “uses the power of the crowd” so that your model isn’t influenced by just one decision tree. One tree may not work great, but when you combine it with many others and average them, you get a better result. Hence “the power of the crowd”, and hence the term “forest”, implying many trees together.

This is an iterative process: you decide how many trees you want, evaluate the results, and figure out where to stop. The more trees, the slower your model becomes, so there’s a fine line here.

Measures such as “entropy” (a measure of the amount of disorder in a system) and “mean absolute error” (in statistics, MAE is a measure of the difference between two continuous variables) are used to evaluate the trees and help limit the influence of outliers.

See the chart below and compare it to the single decision tree chart.

Random Forest (average of multiple decision trees):

[Figure: rf1 – Random Forest regression fit (average of multiple decision trees)]

Single Decision Tree technique:

[Figure: dt2 – single decision tree regression fit]

Python, R, and JavaScript have libraries that implement this algorithm and make it relatively easy.
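To see the averaging idea without library internals, here is a toy sketch: many one-split “stump” trees, each fit on a bootstrap sample, with their predictions averaged. Everything here (the data, the stump learner, the tree count) is invented for illustration; real code would use a proper library implementation:

```python
import random

def fit_stump(xs, ys):
    """One-split 'tree': pick the threshold minimizing squared error."""
    best = None
    for t in xs:
        left = [yv for xv, yv in zip(xs, ys) if xv <= t]
        right = [yv for xv, yv in zip(xs, ys) if xv > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((yv - (lm if xv <= t else rm)) ** 2
                  for xv, yv in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # degenerate bootstrap sample: no valid split
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample; predict by averaging."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
forest = fit_forest(xs, ys)
print(forest(1.5))  # near the low group's average
print(forest(5.5))  # near the high group's average
```

The averaging is the whole point: any one stump can split badly on its bootstrap sample, but the crowd of stumps smooths that out.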


Machine Learning – Decision Tree Regression from a “non data scientist programmer” point of view


Algorithm Type: Classification, and occasionally regression for continuous dependent variables.

Use Cases:
– Loan approval
– General business decision-making
– Financial systems that forecast future outcomes and assign probabilities to those outcomes
– Knowledge management platforms for customer service that improve first-call resolution, average handling time, and customer satisfaction rates
– Medical systems that determine the best prescriptions based on patient features, etc.

The idea here is that of a nested binary tree where each branch (0 or 1) takes you on another journey to another internal (non-leaf) node with its own 0/1 decision. Eventually you end up at leaf nodes.

Giving full credit for the image below to this link (I don’t want to get into copyright violations), your model will be built from a bunch of binary (0/1) rules similar to this:

[Figure: dt1chart – example decision tree built from binary rules]

One example of this kind of space-partitioning shows up in the Microsoft game JezzBall, and in computer games in general.

Note the difference in the resulting chart. This is not a smooth curve: the horizontal lines represent the “leaf” nodes, and the predicted points fall on them. The model is good when points tend to fall at the center of the lines (i.e., the center of the X min/max range for each leaf’s predicted Y, the dependent variable).

[Figure: dt2 – decision tree regression fit showing step-like leaf predictions]

Python, R, and JavaScript have libraries that implement this algorithm and make it relatively easy.
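At prediction time, a fitted regression tree really is just nested if/else rules ending in leaf values, as described above. A sketch with invented thresholds and leaf means:

```python
def predict(x1, x2):
    """Hand-written stand-in for a fitted tree; all numbers are invented."""
    if x1 <= 3.0:          # first split
        if x2 <= 1.5:      # second split on the left branch
            return 10.0    # leaf: mean of the training points that land here
        return 20.0
    if x2 <= 4.0:          # second split on the right branch
        return 35.0
    return 50.0

print(predict(2.0, 1.0))  # falls in the first leaf -> 10.0
print(predict(5.0, 6.0))  # falls in the last leaf -> 50.0
```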


Machine Learning – Support Vector Regression from a “non data scientist programmer” point of view


Algorithm Type: Supervised regression.

Support Vector Regression performs regression in a higher dimensional space to help accomplish its main goal:

Make sure that errors do not exceed the threshold

You are concerned about outliers and don’t want them to have an effect on the outcome of your model. Those outliers are the ones you would exclude from your prediction, as they would exceed a certain defined error threshold. Without needing to know the details of the technique (unless you want to), the Gaussian (RBF) kernel can help achieve this.

Notice here that when the model is trained with the training data, the outlier at the top right is not included in the model’s fit. You don’t want those dramatic outliers to detrimentally influence the model’s accuracy when making predictions.

[Figure: svr1 – SVR fit that is not pulled toward the top-right outlier]

Python, R, and JavaScript have libraries that implement this algorithm and make it relatively easy.
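The “threshold” idea can be seen in SVR’s epsilon-insensitive loss: errors smaller than a chosen epsilon cost nothing, so points inside that tolerance do not pull on the fit, while points far outside it are the ones the model pushes back against. The epsilon value and numbers below are invented for illustration:

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.5):
    """Error inside the epsilon 'tube' costs nothing; beyond it, the excess."""
    return max(0.0, abs(y_true - y_pred) - eps)

print(eps_insensitive_loss(3.0, 3.2))  # inside the tube -> 0.0
print(eps_insensitive_loss(3.0, 5.0))  # outside the tube -> 1.5
```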