This is all from my own “beginner’s perspective”. I am NOT claiming to be an expert, and I welcome any constructive criticism and corrections to anything I may have said that isn’t completely accurate 🙂 There are no perfect models; the key is to find the best algorithm for the specific job/problem that needs to be solved.
Algorithm Type: Reinforcement Learning
Use Cases:
Training robot dogs to walk
Games where you give rewards or penalties, etc.
Reinforcement learning involves the model training itself and making adjustments “on the fly”. As more data comes in, it keeps learning by adjusting its behavior based on the feedback it receives. Consider a chess program that is designed to learn from its past mistakes with each game it plays in order to get better results.
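To make that “adjusting on the fly” idea concrete, here is a tiny sketch of my own (not from the course) of a program that tries two moves, collects rewards and penalties, and updates its estimate of each move after every game. The win rates are made up, and the explore/exploit rule used here is epsilon-greedy, a simpler cousin of the algorithms discussed below.

```python
import random

# Made-up two-move "game": move B secretly wins more often than move A.
true_win_rate = {"A": 0.3, "B": 0.7}

estimate = {"A": 0.0, "B": 0.0}   # the program's current belief about each move
plays = {"A": 0, "B": 0}

for _ in range(1000):
    # Mostly exploit the move that looks best, but sometimes explore the other one.
    if random.random() < 0.1:
        move = random.choice(["A", "B"])
    else:
        move = max(estimate, key=estimate.get)

    reward = 1 if random.random() < true_win_rate[move] else 0   # win = 1, loss = 0

    # Adjust the estimate "on the fly" with a running average of the rewards seen so far.
    plays[move] += 1
    estimate[move] += (reward - estimate[move]) / plays[move]

print(estimate)   # after enough games, B's estimate should settle near 0.7
```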
There are two widely used algorithms for this:
UCB (Upper Confidence Bound)
Thompson Sampling
This post will focus on Thompson Sampling (which, in the end, seemed to outperform UCB).
The goal is to combine exploration and exploitation so you can make good predictions without needing “a whole ton of data and history from the get-go”. There may be times when you aren’t provided with that ton of data, or you don’t have that history at your fingertips, and it may not be economically feasible to collect it either.
(All images below are from this recommended course on Udemy by Kirill Eremenko and Hadelin de Ponteves.)
There’s a famous dilemma called “The Multi-Armed Bandit Problem”. In a nutshell, you are presented with a bunch of slot machines

and you don’t know which one is going to give you the best payout. You could try them all a bunch of times (and lose a lot of money in the process) to figure out which one pays out best.
Once you figure out which one is best, you can exploit it (advertising is a key business case).
You want to minimize costs and time and get to the optimal result as soon as you can.
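To make the setup concrete, here is a minimal made-up simulation of the problem (mine, not the course’s): a few slot machines with hidden payout probabilities, where the only way to learn about a machine is to pay to pull its arm.

```python
import random

# Hypothetical machines: the player cannot see these payout probabilities.
payout_probabilities = [0.15, 0.05, 0.30]

def pull(machine):
    """Pull one arm; returns 1 for a payout, 0 otherwise (each pull costs a play)."""
    return 1 if random.random() < payout_probabilities[machine] else 0

# The naive "try them all a bunch of times" approach: informative but expensive.
trials_per_machine = 200
for machine in range(len(payout_probabilities)):
    wins = sum(pull(machine) for _ in range(trials_per_machine))
    print(f"Machine {machine}: {wins}/{trials_per_machine} payouts")
```

The catch with this brute-force approach is that it spends 600 plays purely on exploration, most of them on machines that were never going to pay out well. Thompson Sampling tries to reach the same conclusion while wasting far fewer pulls.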
The image below doesn’t really explain it in detail, but it shows the high-level difference between this approach and UCB. You basically create your own imaginary “ideal point” for each machine

then with each round you play the machine whose sampled point is highest and get the actual result back (which is different from the imaginary point), and that result is used to adjust that machine’s distribution

If you want proof that there’s some complex math going on between each round, the formula below should convince you (only dig into it if you really want to know how it works; the APIs will do most of this for you).

You go through many, many rounds of this, adjusting your points based on the newly calculated distributions, and eventually you end up with points like these (the example below shows 3 machines, and we want to find the one with the best chance of a payout).
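Put into code, those rounds look roughly like this. For simple win/lose payouts, the standard version keeps two counters per machine (rewards of 1 and rewards of 0 seen so far) and, each round, draws one random point per machine from a Beta distribution built from those counters; the machine with the highest draw gets played, and its counters are updated with the actual result. This is a hedged sketch of that standard Bernoulli Thompson Sampling loop, reusing the made-up three-machine setup from above rather than the course’s lab code.

```python
import random

payout_probabilities = [0.15, 0.05, 0.30]   # hidden from the player (made up)

def pull(machine):
    return 1 if random.random() < payout_probabilities[machine] else 0

machines = len(payout_probabilities)
wins = [0] * machines     # rounds where the machine paid out (reward = 1)
losses = [0] * machines   # rounds where it did not (reward = 0)

for _ in range(1000):
    # One Beta draw per machine: this is the "imaginary point" for this round.
    draws = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(machines)]

    chosen = draws.index(max(draws))   # play the machine with the highest draw
    reward = pull(chosen)

    # Update that machine's distribution with the actual result.
    if reward == 1:
        wins[chosen] += 1
    else:
        losses[chosen] += 1

plays = [wins[i] + losses[i] for i in range(machines)]
print("Plays per machine:", plays)
print("Estimated payout rates:",
      [round(wins[i] / plays[i], 2) if plays[i] else 0.0 for i in range(machines)])
```

Machines that keep losing end up with distributions concentrated near low values and get picked less and less, while the genuinely best machine collects most of the plays, which is exactly the exploration/exploitation trade-off described earlier.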

Here’s a really good site that explains a lot about Thompson Sampling. It’s worth a read:
https://www.quora.com/What-is-Thompson-sampling-in-laymans-terms

From the lab in this recommended course on Udemy by Kirill Eremenko and Hadelin de Ponteves, the results here show that Thompson Sampling is more accurate than the UCB approach for the same problem. 4 is the predominant value in our test data set.
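The two result histograms from the lab are shown below. If you want to produce a similar comparison yourself, the usual approach is to record which option gets selected each round and plot a histogram of those selections; the better algorithm piles most of its picks onto the best option. Here is a hedged sketch using the made-up three-machine loop from above (not the course’s dataset, where the best option happened to be index 4):

```python
import random
import matplotlib.pyplot as plt

payout_probabilities = [0.15, 0.05, 0.30]   # made-up machines, as above
wins, losses = [0, 0, 0], [0, 0, 0]
selections = []                              # which machine was chosen each round

for _ in range(1000):
    draws = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(3)]
    chosen = draws.index(max(draws))
    selections.append(chosen)
    if random.random() < payout_probabilities[chosen]:
        wins[chosen] += 1
    else:
        losses[chosen] += 1

# The histogram should be dominated by the best machine (index 2 here).
plt.hist(selections, bins=range(4), align="left", rwidth=0.8)
plt.title("Thompson Sampling - selections per machine")
plt.xlabel("Machine index")
plt.ylabel("Times selected")
plt.show()
```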
Thompson Sampling:

and the UCB on the same problem:
