In-progress
Classification Algorithms
– K-Nearest Neighbor
a. It is supervised (it knows the dependent variable ahead of time)
b. Looks at the k labelled points closest to the “unknown point to be classified”;
that point is “voted into” the majority label class of those neighbours.
c. Con: computationally expensive, since prediction means measuring the distance to every training point (see the sketch below).
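A minimal k-NN sketch using scikit-learn (my choice of library; the toy points and k=3 are illustrative only):

    from sklearn.neighbors import KNeighborsClassifier

    # Labelled training points in 2-D feature space.
    X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
    y_train = ["red", "red", "red", "blue", "blue", "blue"]

    # The 3 closest labelled points vote on the unknown point's class.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print(knn.predict([[2, 2]]))  # -> ['red']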
– SVM (Support Vector Machines)
a. In 2-D space, the two points (on opposite sides) closest to the “equidistant line splitting the classes” are known as “support vectors”.
b. Use the SVM kernel trick in the case where the classes cannot be
linearly split (think a circle within a circle). You create a new dimension (Z) and elevate one of the classes so that a split can be made thanks to the “added dimension” (see the sketch below).
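A sketch of the kernel trick with scikit-learn: make_circles generates exactly the circle-within-a-circle case, which no straight line separates; an RBF-kernel SVM handles it, which is the “added dimension” idea above. The dataset parameters are illustrative:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # A circle within a circle: not linearly separable in 2-D.
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    linear_svm = SVC(kernel="linear").fit(X, y)
    rbf_svm = SVC(kernel="rbf").fit(X, y)

    print("linear accuracy:", linear_svm.score(X, y))  # poor, near 0.5
    print("rbf accuracy:", rbf_svm.score(X, y))        # near 1.0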
– Logistic Regression
a. Used when the dependent variable is categorical.
b. An example use case is classifying email as either spam or not spam (sketched below).
c. Another example: a credit card company can use it to determine whether an applicant will have good credit or not.
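A hedged sketch of the spam use case with scikit-learn; the two features (link count, count of the word “free”) and the tiny dataset are invented for illustration:

    from sklearn.linear_model import LogisticRegression

    # Features per email: [number of links, count of the word "free"]
    X = [[0, 0], [1, 0], [0, 1], [5, 3], [7, 2], [6, 4]]
    y = [0, 0, 0, 1, 1, 1]  # 0 = not spam, 1 = spam

    model = LogisticRegression().fit(X, y)
    print(model.predict([[6, 3]]))        # -> [1] (spam)
    print(model.predict_proba([[6, 3]]))  # class probabilities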
– Decision Trees
a. A single decision tree has the potential for overfitting (results tied too closely to the training set); see the sketch below.
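A sketch of single-tree overfitting, assuming scikit-learn and a synthetic noisy dataset (flip_y adds label noise, so memorizing the training set hurts on held-out data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=20,
                               flip_y=0.1, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # An unconstrained tree grows until it fits the training set perfectly.
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("train accuracy:", tree.score(X_tr, y_tr))  # typically 1.0
    print("test accuracy:", tree.score(X_te, y_te))   # noticeably lower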
– Random Forest (Ensemble) – many decision trees are trained and their predictions are combined (majority vote) to get the class.
a. Solves the overfitting issue of single decision trees: overfitting is much less likely thanks to “the power of the crowd” (see the comparison sketch below).
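The same noisy setup as above, comparing one tree to a forest of 100 trees (n_estimators is an illustrative choice); the averaged vote usually narrows the single tree's train/test gap:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=20,
                               flip_y=0.1, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    # Majority vote across many trees generalizes better than one tree.
    print("single tree test accuracy:", tree.score(X_te, y_te))
    print("random forest test accuracy:", forest.score(X_te, y_te))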
– Naive Bayes
a. Tends to outperform even very sophisticated methods; useful for large datasets.
b. Assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature, meaning “independence among features” (see the text sketch below).
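A minimal Naive Bayes sketch for text, assuming scikit-learn; word counts are treated as independent given the class, and the four-document corpus is invented for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["win free money now", "free prize claim now",
            "meeting at noon", "project status update"]
    labels = ["spam", "spam", "ham", "ham"]

    # Bag-of-words counts; each word is a feature assumed independent.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    nb = MultinomialNB().fit(X, labels)
    print(nb.predict(vectorizer.transform(["claim your free prize"])))  # -> ['spam']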