Machine Learning

Machine Learning – Apriori Association from a “non data scientist programmer” point of view

This is all from my own "beginner's perspective". I am NOT claiming to be an expert and welcome any constructive criticism and corrections to anything I may have said that might not be completely accurate 🙂 There are no perfect models. The key is to find the best algorithm for the specific job/problem that needs to be solved.

Algorithm Type: Association

Use Cases:
Online shop recommendations (people who purchased this item also tended to purchase...).
People who enjoyed this album may also enjoy the following albums …
Market Basket Analysis to increase sales

(ALL images below from this recommended course on Udemy by Kirill Eremenko, Hadelin de Ponteves)

Apriori is all about "identifying associations" between items (features) that may seem completely unrelated. It's an algorithm used for mining frequent item sets and their association rules.

An example from the course is the following:

A retailer determined, through the use of historical data of purchases per transaction, that people who purchased diapers tend to also buy beer. Diapers and beer are completely unrelated, and you would not think there would be an association between them.

So here is an example of the dataset used from the course

apri1

Think of this as an example data set of "transactions" at a store. Each transaction contains a list of what was purchased. From that list (which could be thousands of transactions), we'd like to find associations between those products so that we can arrange the shelves containing associated items close by to increase sales.

Your dataset would be translated (via API calls) to something like this (images from another good course I took: https://www.udemy.com/choosing-the-right-machine-learning-algorithm):

apri2.JPG

Some key important concepts are:
– Support
apri3

I know I'm jumping between examples, but here's a visual from the course. Here, out of our dataset, how many people have purchased eggs? That fraction (the number of transactions containing eggs divided by the total number of transactions) is the support.

– Confidence
apri5.JPG

You can interpret it as P(Y|X): the probability of buying Y given that X was bought, i.e. support(X and Y) divided by support(X).

So in this image, of all of the people that bought eggs, these are the ones that also bought cheese.

apri6

Which brings us to our next important concept…

– Lift

apri7

So the people in green here are the ones who bought cheese with their eggs. Lift compares that rate (the confidence) with how often cheese is bought overall: lift = confidence / support(cheese). A lift above 1 means egg buyers are more likely than average to buy cheese.

apri8.JPG

As such, a store may want to consider having the cheese aisle very close to the egg shelf.

The idea is to run all of these calculations for each of the features (products, movies, etc.) in your data set; the ones with the highest "lifts" are the ones that tend to be associated with each other the most.
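To make support, confidence and lift concrete, here's a tiny worked example of my own (not from the course) using plain Python on four made-up transactions:

transactions = [
    {'eggs', 'cheese', 'milk'},
    {'eggs', 'bread'},
    {'eggs', 'cheese'},
    {'bread', 'milk'},
]

n = len(transactions)
support_eggs = sum('eggs' in t for t in transactions) / n              # 3/4 = 0.75
support_cheese = sum('cheese' in t for t in transactions) / n          # 2/4 = 0.50
support_both = sum({'eggs', 'cheese'} <= t for t in transactions) / n  # 2/4 = 0.50

confidence = support_both / support_eggs  # P(cheese | eggs) = 0.50 / 0.75, about 0.67
lift = confidence / support_cheese        # about 1.33: egg buyers buy cheese more often than average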

Going/sifting through the output of our calculations (using the relatively painless APIs in Python), our output would look like this (you may need to zoom in on this image by hitting the "Ctrl" and "+" keys several times in your browser; to unzoom, hit "Ctrl" and "0" once).

apri9

and you can see in this one example here that we have a 24% confidence (chance) that people who buy fromage blanc will also buy honey. With a lift over 3, that's considered really good. The lift is the "relevance of the rule".

An example of the Python code that will do that for you:

# Training Apriori on the dataset

# min_support: items in your rules will have a higher support than this minimum.
# How do you choose it? Here we care about products bought at least 3-4 times a day.
# Support = (number of transactions containing item "i") / (total number of transactions)
# A product purchased 3 times a day is purchased 21 times a week, and with 7501
# transactions in this dataset that works out to 21/7501, roughly 0.003.
# By associating such products / placing them together, customers are more likely to purchase them.
# min_confidence = 0.2 means rules need to be correct 20% of the time.
# min_lift: you can try different values, but you want a lift of at least 3.
# min_length: minimum number of products in a rule (here, at least 2).
from apyori import apriori
myrules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
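Once you have myrules, you can loop over it to inspect each rule (apyori returns a generator, so it's converted to a list first). A minimal sketch of pulling out the items, confidence and lift of each rule:

results = list(myrules)
for rule in results:
    # each rule carries an overall support plus one or more "ordered statistics"
    for stat in rule.ordered_statistics:
        print(list(stat.items_base), "->", list(stat.items_add),
              "confidence:", round(stat.confidence, 3),
              "lift:", round(stat.lift, 3))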

Machine Learning

Machine Learning – K-Means Clustering from a “non data scientist programmer” point of view

This is all from my own "beginner's perspective". I am NOT claiming to be an expert and welcome any constructive criticism and corrections to anything I may have said that might not be completely accurate 🙂 There are no perfect models. The key is to find the best algorithm for the specific job/problem that needs to be solved.

Algorithm Type: Cluster

Use Cases:
Inventory categorization
Detecting bots
Behavioral segmentation (determining which types of users, based on their interests, are visiting your website, etc.)

With clustering, you don't know in advance what you're looking for; you are trying to identify "segments" (clusters) in your data that you can then use for classification.

(ALL images below from this recommended course on Udemy by Kirill Eremenko, Hadelin de Ponteves)

With this clustering technique, you wish to group your data into K clusters, where K is a number that you choose.
1. Choose K points at random; these are the "centroids", the centers of your clusters.
2. Group your data by assigning each point to the closest centroid (via Euclidean distance, etc.).
3. Then recompute each cluster's centroid as the center of the points now assigned to it.

kmeans2

4. Reassign each data point to the new closest centroid.
If any reassignments took place (i.e., any point changed cluster), go back to #3; otherwise your model is ready.

Before reassignment

kmeans3

After reassignment, where a point changed from one cluster to another (which means we are still not done, so we go back to #3):

kmeans4

You keep doing #3 and #4 until no more points get reassigned, which means you're done with your model.
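To make steps 1-4 concrete, here's a rough from-scratch sketch of my own using NumPy (it assumes X is an (n_samples, n_features) array; the scikit-learn KMeans used further down does all of this for you, with a smarter "k-means++" initialization):

import numpy as np

def kmeans_naive(X, k, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random initial centroids
    for _ in range(max_iter):
        # step 2: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: recompute each centroid as the center of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: if no centroid moved, no points were reassigned and we're done
        # (empty-cluster handling omitted for brevity)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids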

Pros: works well on small or large datasets. Fast and efficient.

Here is a quick visual overview of K-Means where we chose K=3:

kmeans1

Choosing the right number of clusters

There is a way to determine this, but it involves "math": specifically, a quantity called WCSS (within-cluster sum of squares). You don't have to know the ins and outs of it, but just know that it exists. Here's the formula:

kmeans5

For each of those clusters:
Take every point in cluster N (P with index "i" inside of cluster N), measure its distance to the centroid of cluster N, and square it. Summing those squared distances for every point in every cluster gives the WCSS.
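Written out, my own restatement of the formula in the image above (where $C_k$ is the centroid of cluster $k$):

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{P_i \in \text{cluster } k} \operatorname{dist}(P_i, C_k)^2$$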

The interesting thing about WCSS is that you can visualize it by running the calculation for a range of cluster counts (the course shows you how) and then use the "elbow" method (look for the elbow shape in the graph) to find the ideal number of clusters:

kmeans6

In the example above, the elbow is at approximately 5. We started by computing the WCSS for each of the 10 cluster counts we guessed we might need, and the elbow method narrowed that down to 5 as the ideal number.
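Computing the WCSS for, say, 1 through 10 clusters and plotting it is only a few lines. A minimal sketch, assuming X is your feature matrix as a NumPy array (scikit-learn exposes the WCSS of a fitted model as inertia_):

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # WCSS for this number of clusters

plt.plot(range(1, 11), wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()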

Once you use the elbow method to find your optimal number of clusters, you can then set up your classifier (algorithm) like so:

kmeans = KMeans(n_clusters=5, init='k-means++', max_iter = 300, n_init = 10, random_state=0)

and then fit it to your data (the matrix of independent columns, X) to get the cluster assigned to each observation:

y_kmeans = kmeans.fit_predict(X)

In the lab for the course I took, we ended up with this against our test set

kmeans8

Once the model has been trained and we are satisfied with it, new data can come in, and its proximity to the closest cluster (or the cluster it lands within) will be the chosen "classification" for it.

Machine Learning

Machine Learning – Naive Bayes from a “non data scientist programmer” point of view

This is all from my own "beginner's perspective". I am NOT claiming to be an expert and welcome any constructive criticism and corrections to anything I may have said that might not be completely accurate 🙂 There are no perfect models. The key is to find the best algorithm for the specific job/problem that needs to be solved.

Algorithm Type: Classification

This one is based on Bayes' Theorem.
The big assumption is that there is complete independence among the features; it's called "Naive" because of that independence assumption. In other words, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Along with its simplicity, it's known to outperform even highly sophisticated classification methods.

An example of the Naive Bayes algorithm (it's nice to try to understand it, but for this post it's enough to know that this method is based on it and that the APIs will encapsulate it for you):

nb1
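For reference, my own restatement of the theorem the image shows:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$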

In this example (images below from this recommended course on Udemy by Kirill Eremenko, Hadelin de Ponteves)

nb2

you want to apply the NB formula to the new data point, which will attempt to predict the probability that the person either walks or drives to work based on their "independent" features. Please feel free to dig deeper into the algorithm if you like. The course I mentioned discusses it well. Other courses will cover it well too 🙂

nb3

Applying the algorithm and making the "Naive" assumption that each feature is independent of (not at all related to or influenced by) the others, it determines (in this case) that the new data point (a new observation/person whose features we use to predict the dependent variable Y) has a higher probability of walking to work.
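In Python, this kind of classifier is only a few lines with scikit-learn. A minimal sketch, assuming X_train, y_train and X_test come from the usual train/test split (and feature scaling) used in the course labs:

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)     # learn the per-class feature distributions from the training set
y_pred = classifier.predict(X_test)  # pick the class with the highest posterior probability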

It’s useful for large datasets.

Machine Learning

Machine Learning – Kernel Support Vector Machine from a "non data scientist programmer" point of view

This is all from my own "beginner's perspective". I am NOT claiming to be an expert and welcome any constructive criticism and corrections to anything I may have said that might not be completely accurate 🙂 There are no perfect models. The key is to find the best algorithm for the specific job/problem that needs to be solved.

Algorithm Type: Classification

In the SVM post, we had an example of how SVM can be applied in the case where your data points (i.e., training set data) can be split by a hyperplane with the "max margin".

Now what if we have the case where your training set data points look like this (image from this recommended course on Udemy by Kirill Eremenko, Hadelin de Ponteves)?

ksvm1

As you can see, we cannot have a hyperplane (which, in this 2D case, is a straight line) that splits the two classes. The data is not linearly separable.

When you have a non-linearly separable data set and wish to apply SVM, you can use Kernel SVM.

How it works is that you map the data points to a higher dimension to achieve that separation: (images below from this recommended course on Udemy by Kirill Eremenko, Hadelin de Ponteves)

ksvm2

After applying a mapping function (in this example, squaring the points), it becomes:

ksvm3

Now you are able to linearly separate it:

ksvm4

What you end up with is this:

ksvm5
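As a toy illustration of my own (not from the course) of that "lift to a higher dimension" idea, one-dimensional points that can't be separated by a single threshold become separable after adding a squared term as a second dimension:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])  # class 0 sits in the middle of class 1

# map each x to (x, (x - 5)**2); the second coordinate is the "extra dimension"
mapped = np.column_stack([x, (x - 5) ** 2])

# in the mapped space, class 0 has second coordinates <= 1 and class 1 has >= 4,
# so a horizontal line (e.g., at height 2) now separates the two classes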

Using Python, you would have the following code snippet:

from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)

It uses the kernel trick with 'rbf', the Gaussian (radial basis function) kernel, which gives you that higher-dimensional (e.g., 3D) separation without explicitly computing the mapping.
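Making predictions and checking them looks the same as for any other scikit-learn classifier. A minimal sketch, assuming X_test and y_test come from the same train/test split and feature scaling as X_train:

from sklearn.metrics import confusion_matrix

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows = actual classes, columns = predicted classes
print(cm)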

Machine Learning

Machine Learning – Support Vector Machine (SVM) from a “non data scientist programmer” point of view

This is all from my own "beginner's perspective". I am NOT claiming to be an expert and welcome any constructive criticism and corrections to anything I may have said that might not be completely accurate 🙂 There are no perfect models. The key is to find the best algorithm for the specific job/problem that needs to be solved.

Algorithm Type: Classification

With SVM, we plot each data item as a point in n-dimensional space, where "n" is the number of features. The value of each feature is the value of a particular coordinate.

This algorithm was developed in the 1960s and refined a bit in the 1990s, and it's still very useful. It ends up finding the "optimal" line by determining the "maximum margin". See this chart (courtesy of this recommended course on Udemy by Kirill Eremenko, Hadelin de Ponteves):

svm1

Recall that the support vectors here are the points on each side that lie closest to the divider. The "max margin" hyperplane is positioned to be equidistant from those closest points, maximizing the margin on both sides. Each side (green and red) is a classification (i.e., "Yes/No" or "Buy/Sell", etc.).

Note: This particular example is in 2D space (i.e., X-Y coordinates), where you can easily split up the clusters. Kernel SVM handles the case where the data points cannot be easily split by a straight hyperplane. You will see that in the "Kernel SVM" post.

The line in the middle is the "Max Margin Hyperplane". You aim to find the "split point" between the two "clusters". Then everything on one side of the hyperplane is one classification, and everything on the other side is the other classification. In other words, whichever side of the line a new (test) data point lands on is the class we classify it as.
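In Python, the linear version looks just like the Kernel SVM snippet earlier, only with a linear kernel. A minimal sketch, assuming X_train, y_train and X_test come from the usual train/test split and feature scaling:

from sklearn.svm import SVC

classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)     # finds the max margin hyperplane
y_pred = classifier.predict(X_test)  # the side of the hyperplane a point lands on is its predicted class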