I went over basic concepts, methods, and functions in machine learning while reading Machine Learning in Action by Peter Harrington. The book provides executable Python code and is very helpful for anyone who wants to apply machine learning and big-data techniques in real projects.
These are my notes and references from Machine Learning in Action:
- k-Nearest Neighbors
- Decision Tree
- Naive Bayes (probability theory)
- Logistic Regression (optimization, probability estimate)
- Support Vector Machine
- AdaBoost (improving classification)
- Regression (predict numeric values)
- Tree-based regression
- K-means clustering (grouping unlabeled items)
- Apriori algorithm (association analysis)
- FP-growth (efficiently find frequent itemsets)
- Principal component analysis (simplify data)
- Singular value decomposition (simplify data)
- Big data and MapReduce
Steps in developing a machine learning application
- Collect data.
- Prepare the input data.
- Analyze the input data.
- If you’re working with a production system and you know what the data should look like, or you trust its source, you can skip this step.
- Train the algorithm.
- Test the algorithm.
- Use it.
k-Nearest Neighbors
Pros: High accuracy, insensitive to outliers, no assumptions about data
Cons: Computationally expensive, requires a lot of memory
Works with: Numeric values, nominal values
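The pros and cons above are for k-Nearest Neighbors: classify a point by a majority vote among the k closest training examples. A minimal sketch of the idea (my own illustrative code and names, not the book's kNN implementation):

```python
from collections import Counter
import math

def knn_classify(query, dataset, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    dataset: list of numeric feature vectors; labels: parallel class labels.
    """
    # Euclidean distance from the query to every training point
    dists = [math.dist(query, point) for point in dataset]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

data = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.1, 0.1), data, labels, k=3))  # -> B
```

The "computationally expensive" con is visible here: every query computes a distance to every training point.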
A tree that matches the training data too well suffers from overfitting. To reduce overfitting, we can prune the tree: go through and remove some leaves. If a leaf node adds only a little information, it is cut off and merged with another leaf.
ID3 is good but not the best. ID3 can't handle numeric values. We could use continuous values by quantizing them into discrete bins, but ID3 suffers from other problems if we have too many splits.
A decision tree classifier is just like a work-flow diagram with the terminating blocks representing classification decisions. Starting with a dataset, you can measure the inconsistency of a set or the entropy to find a way to split the set until all the data belongs to the same class. The ID3 algorithm can split nominal-valued datasets. Recursion is used in tree-building algorithms to turn a dataset into a decision tree. The tree is easily represented in a Python dictionary rather than a special data structure.
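The entropy measurement used to choose a split can be sketched in a few lines (illustrative code, not the book's exact implementation):

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy: H = -sum(p * log2(p)) over class frequencies.

    Lower entropy means a more consistent (purer) set; a split that
    lowers total entropy is a good split.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# An even 50/50 split is maximally inconsistent: entropy 1.0
print(shannon_entropy(["yes", "no", "yes", "no"]))  # -> 1.0
```

A set where every example has the same class has entropy 0; ID3 recursively picks the feature whose split reduces entropy the most.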
The contact lens data showed that decision trees can try too hard and overfit a dataset. This overfitting can be removed by pruning the decision tree, combining adjacent leaf nodes that don’t provide a large amount of information gain.
There are other decision tree–generating algorithms. The most popular are C4.5 and CART.
That’s Bayesian decision theory in a nutshell: choosing the decision with the highest probability.
Using probabilities can sometimes be more effective than using hard rules for classification. Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values.
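A worked example of Bayes' rule, P(c|x) = P(x|c)·P(c) / P(x), with toy numbers of my own choosing (not from the book):

```python
# Toy spam-filter numbers (illustrative): 30% of mail is spam, and the
# word "free" appears in 60% of spam but only 5% of ham.
p_spam = 0.30
p_free_given_spam = 0.60
p_free_given_ham = 0.05

# Total probability of seeing "free" in any message (law of total probability)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' rule: posterior probability of spam given the message contains "free"
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # -> 0.837
```

This shows the value of the approach: a single known quantity (how often "free" appears in each class) flips into the quantity we actually want (how likely a message is spam).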
Stochastic gradient ascent is an example of an online learning algorithm: we can incrementally update the classifier as new data comes in, rather than processing everything at once. The all-at-once method is known as batch processing.
Logistic regression is finding best-fit parameters to a nonlinear function called the sigmoid. Methods of optimization can be used to find the best-fit parameters. Among the optimization algorithms, one of the most common algorithms is gradient ascent. Gradient ascent can be simplified with stochastic gradient ascent.
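A sketch of the sigmoid and a stochastic gradient ascent update (my own minimal version with made-up names, not the book's `stocGradAscent` code): each step uses one randomly chosen example instead of the whole dataset.

```python
import math
import random

def sigmoid(z):
    """The nonlinear function logistic regression fits: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def stoch_grad_ascent(data, labels, epochs=200, alpha=0.1):
    """One randomly chosen example updates the weights per step (online),
    rather than summing the gradient over the full dataset (batch)."""
    w = [1.0] * len(data[0])
    for _ in range(epochs):
        i = random.randrange(len(data))
        pred = sigmoid(sum(wj * xj for wj, xj in zip(w, data[i])))
        error = labels[i] - pred  # move weights in the direction of the error
        w = [wj + alpha * error * xj for wj, xj in zip(w, data[i])]
    return w

random.seed(0)
# Tiny separable set; the first feature is a constant bias term.
data = [(1.0, -2.0), (1.0, -1.5), (1.0, 1.5), (1.0, 2.0)]
labels = [0, 0, 1, 1]
w = stoch_grad_ascent(data, labels)
print(sigmoid(w[0] + w[1] * 2.0) > 0.5)  # x = 2.0 lands on the class-1 side
```

Because each update touches only one example, the classifier can keep learning as new data streams in.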
If some values are missing, here are some options:
- Use the feature’s mean value from all the available data.
- Fill in the unknown with a special value like -1.
- Ignore the instance.
- Use a mean value from similar items.
- Use another machine learning algorithm to predict the value.
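The first option above, mean imputation, is one line of work in practice (illustrative sketch; `None` marks a missing entry):

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

print(impute_mean([1.0, None, 3.0, None]))  # -> [1.0, 2.0, 3.0, 2.0]
```

It is cheap and keeps the feature's average unchanged, but it also shrinks the feature's variance, which is why the list offers alternatives.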
The points closest to the separating hyperplane are known as support vectors. Now that we know that we’re trying to maximize the distance from the separating line to the support vectors, we need to find a way to optimize this problem.
One thing to note is that SVMs are binary classifiers. You’ll need to write a little more code to use an SVM on a problem with more than two classes.
The radial basis function is a kernel that's often used with support vector machines.
Support vector machines are a type of classifier. They're called machines because they generate a binary decision; they're decision machines.
Support vector machines try to maximize margin by solving a quadratic optimization problem. In the past, complex, slow quadratic solvers were used to train support vector machines.
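A common parameterization of the radial basis function kernel mentioned above, as a sketch (the `sigma` name is the usual convention, not necessarily the book's):

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    Maps the distance between two points to a similarity in (0, 1]:
    identical points score 1.0, distant points approach 0.
    """
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

print(rbf_kernel((0.0, 0.0), (0.0, 0.0)))           # -> 1.0
print(rbf_kernel((0.0, 0.0), (3.0, 4.0)) < 0.001)   # distant -> near 0
```

Replacing plain inner products with a kernel like this is what lets an SVM find a separating hyperplane for data that isn't linearly separable in its original space.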
Meta-algorithms are a way of combining other algorithms. We'll focus on one of the most popular meta-algorithms, AdaBoost. This is a powerful tool to have in your toolbox because AdaBoost is considered by some to be the best supervised learning algorithm.
Pros: Low generalization error, easy to code, works with most classifiers, no parameters to adjust
Cons: Sensitive to outliers
Works with: Numeric values, nominal values
Bootstrap aggregating, also known as bagging, is a technique in which data is sampled from the original dataset S times to make S new datasets. The datasets are the same size as the original. Each dataset is built by randomly selecting examples from the original with replacement. By "with replacement" I mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, while some values from the original won't be present in the new set.
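The bootstrap sampling step can be sketched as follows (my own minimal version, not the book's code):

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) examples with replacement -- one bagged dataset."""
    return [rng.choice(dataset) for _ in range(len(dataset))]

def bag(dataset, s, seed=0):
    """Make S new datasets, each the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_sample(dataset, rng) for _ in range(s)]

original = [1, 2, 3, 4, 5]
for sample in bag(original, s=3):
    print(sample)  # duplicates are likely; some originals will be absent
```

In full bagging you would then train one classifier per sampled dataset and combine their votes; the sampling above is the part that gives each classifier a slightly different view of the data.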
FP-growth works in two steps:
- Build the FP-tree.
- Mine frequent itemsets from the FP-tree.
Reasons for dimensionality reduction:
- Making the dataset easier to use
- Reducing the computational cost of many algorithms
- Removing noise
- Making the results easier to understand
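PCA, the first of the simplification techniques in the list at the top, finds the direction of maximum variance in centered data. The book works in NumPy; as a self-contained illustration, here is a pure-Python 2-D version using the closed-form top eigenvector of the 2x2 covariance matrix (a sketch of the idea, not the book's implementation):

```python
import math

def pca_first_component(points):
    """First principal component of 2-D data: center the data, build the
    covariance matrix [[a, b], [b, c]], and return the unit eigenvector
    of its largest eigenvalue."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, then its eigenvector
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = (b, lam - a) if b else (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

# Points lying near the line y = x: the component comes out near (0.71, 0.71)
pts = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
print(pca_first_component(pts))
```

Projecting the data onto this direction keeps most of the variance in one coordinate, which is exactly the "simplify data" use the list describes.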
To be continued…