I went over basic concepts, methods, and functions in machine learning while reading Machine Learning in Action by Peter Harrington. The book provides executable Python code and is very helpful to anyone who wants to start applying machine learning and big data techniques in real projects.

These are notes and references from Machine Learning in Action:

Classification

- k-Nearest Neighbors
- Decision Tree
- Naive Bayes (probability theory)
- Logistic Regression (optimization, probability estimate)
- Support Vector Machine
- AdaBoost (improving classification)

Forecasting

- Regression (predict numeric values)
- Tree-based regression

Unsupervised Learning

- K-means clustering (grouping unlabeled items)
- Apriori algorithm (association analysis)
- FP-growth (efficiently find frequent itemsets)

Additional Tools

- Principal component analysis (simplify data)
- Singular value decomposition (simplify data)
- Big data and MapReduce

Steps in developing a machine learning application

- Collect data.
- Prepare the input data.
- Analyze the input data. (If you’re working with a production system and you know what the data should look like, or you trust its source, you can skip this step.)
- Train the algorithm.
- Test the algorithm.
- Use it.

**k-Nearest Neighbors**

Pros: High accuracy, insensitive to outliers, no assumptions about data

Cons: Computationally expensive, requires a lot of memory

Works with: Numeric values, nominal values
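
As a rough sketch of the idea (not the book's `kNN.classify0`; the names here are my own), the distance-and-vote approach fits in a few lines of NumPy:

```python
import numpy as np
from collections import Counter

def knn_classify(query, data, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = distances.argsort()[:k]
    # Majority vote among their labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage
data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.2, 0.1]), data, labels, k=3))  # -> 'B'
```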

A decision tree that keeps splitting can match our training data too well. This problem is known as **overfitting**. To reduce overfitting, we can prune the tree, which removes some leaves: if a leaf node adds only a little information, it is cut off and merged with another leaf.

**ID3** is good but not the best. ID3 can’t handle numeric values directly. We could use continuous values by quantizing them into discrete bins, but ID3 suffers from other problems if we have too many splits.

A **decision tree** classifier is just like a work-flow diagram, with the terminating blocks representing classification decisions. Starting with a dataset, you can measure the inconsistency of a set (its entropy) to find a way to split the set until all the data in each subset belongs to the same class. The ID3 algorithm can split nominal-valued datasets. Recursion is used in tree-building algorithms to turn a dataset into a decision tree, and the tree is easily represented in a Python dictionary rather than a special data structure.
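
For example, Shannon entropy, the measure of inconsistency used to choose splits, can be computed from the class labels alone (a small sketch, not the book's `calcShannonEnt`):

```python
from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Entropy H = -sum(p_i * log2(p_i)) over the class probabilities."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

print(shannon_entropy(['yes', 'yes', 'no', 'no', 'no']))  # ~0.971 bits
```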

The contact lens data showed that decision trees can try too hard and overfit a dataset. This overfitting can be removed by pruning the decision tree, combining adjacent leaf nodes that don’t provide a large amount of information gain.

There are other decision tree–generating algorithms. The most popular are C4.5 and CART.

That’s **Bayesian decision theory** in a nutshell: choosing the decision with the highest probability.

Using probabilities can sometimes be more effective than using hard rules for classification. Bayesian probability and Bayes’ rule give us a way to estimate unknown probabilities from known values.
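
In symbols, Bayes’ rule is p(c|x) = p(x|c) p(c) / p(x). A tiny sketch with made-up numbers:

```python
def bayes_rule(p_x_given_c, p_c, p_x):
    """p(c|x) = p(x|c) * p(c) / p(x)"""
    return p_x_given_c * p_c / p_x

# Toy numbers (assumed purely for illustration): 1% of emails are spam,
# 40% of spam contains the word "offer", 5% of all email contains it.
p_spam_given_offer = bayes_rule(p_x_given_c=0.40, p_c=0.01, p_x=0.05)
print(p_spam_given_offer)  # 0.08
```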

**Stochastic gradient ascent** is an example of an online learning algorithm. This is known as online because we can incrementally update the classifier as new data comes in rather than all at once. The all-at-once method is known as batch processing.

**Logistic regression** finds best-fit parameters for a nonlinear function called the sigmoid. Methods of optimization can be used to find these best-fit parameters; one of the most common is gradient ascent, which can in turn be simplified with stochastic gradient ascent.
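
A simplified sketch of both pieces (my own version, not the book's `stocGradAscent0`): the sigmoid turns a weighted sum into a probability, and stochastic gradient ascent nudges the weights one example at a time:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stoc_grad_ascent(data, labels, alpha=0.01, passes=20):
    """Update the weights one example at a time (online learning)."""
    n_samples, n_features = data.shape
    weights = np.ones(n_features)
    for _ in range(passes):
        for i in range(n_samples):
            h = sigmoid(np.dot(data[i], weights))   # predicted probability
            error = labels[i] - h                   # prediction error
            weights += alpha * error * data[i]      # gradient ascent step
    return weights

# Toy usage: the first column of ones acts as the bias term
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
print(stoc_grad_ascent(X, y))
```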

If some values are missing, here are some options (a sketch of the first one follows the list):

- Use the feature’s mean value from all the available data.
- Fill in the unknown with a special value like -1.
- Ignore the instance.
- Use a mean value from similar items.
- Use another machine learning algorithm to predict the value.
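
A sketch of the first option, filling missing entries (represented as NaN here) with each feature's mean over the available data:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [3.0, np.nan]])

# Column means computed only over the non-missing entries
col_means = np.nanmean(X, axis=0)

# Fill each missing entry with its column's mean
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
print(X)
```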

The points closest to the separating hyperplane are known as support vectors. Since we’re trying to maximize the distance from the separating line to the support vectors, we need a way to solve this optimization problem.

One thing to note is that **SVM**s are binary classifiers. You’ll need to write a little more code to use an SVM on a problem with more than two classes.

The **radial basis** function is a kernel that’s often used with support vector machines.
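
One common form of this kernel is K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); a small sketch:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel: similarity decays with squared distance."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([1.5, 2.5])))
```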

**Support vector machines** are a type of classifier. They’re called machines because they generate a binary decision; they’re decision machines.

Support vector machines try to maximize margin by solving a quadratic optimization problem. In the past, complex, slow quadratic solvers were used to train support vector machines.

Meta-algorithms are a way of combining other algorithms. We’ll focus on one of the most popular meta-algorithms, called AdaBoost. This is a powerful tool to have in your toolbox because AdaBoost is considered by some to be the best supervised learning algorithm.

**AdaBoost**

Pros: Low generalization error, easy to code, works with most classifiers, no parameters to adjust

Cons: Sensitive to outliers

Works with: Numeric values, nominal values
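
The core bookkeeping is small. A sketch of the update rules (not the book's full `adaBoostTrainDS`): a weak classifier with weighted error epsilon gets a vote weight alpha = 0.5 * ln((1 - epsilon) / epsilon), and the example weights D are rescaled so that misclassified examples count more in the next round:

```python
import numpy as np

# Suppose a weak classifier has weighted error 0.2 on the current weights D
error = 0.2
alpha = 0.5 * np.log((1.0 - error) / error)   # the weak classifier's vote weight

# Correctly classified examples are scaled by exp(-alpha),
# misclassified ones by exp(alpha), then D is renormalized.
D = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
correct = np.array([True, True, False, True, True])
D = D * np.exp(np.where(correct, -alpha, alpha))
D = D / D.sum()
print(alpha, D)
```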

Bootstrap aggregating, which is known as **bagging**, is a technique where the data is taken from the original dataset S times to make S new datasets. The datasets are the same size as the original. Each dataset is built by randomly selecting an example from the original with replacement. By “with replacement” I mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, and some values from the original won’t be present in the new set.
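
A sketch of that resampling step (illustration only):

```python
import random

def bootstrap_samples(dataset, S):
    """Draw S new datasets, each the size of the original, sampled with replacement."""
    n = len(dataset)
    return [[random.choice(dataset) for _ in range(n)] for _ in range(S)]

original = [1, 2, 3, 4, 5]
for sample in bootstrap_samples(original, S=3):
    print(sample)   # duplicates are allowed; some originals may be missing
```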

In **regression**, our target variable is numeric and continuous.

**Linear regression**

**Regression** is the process of predicting a target value, similar to classification.

**Shrinkage methods** can also be viewed as adding bias to a model and reducing the variance.
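
Ridge regression, one shrinkage method the book covers, does this by adding a lambda term to the diagonal before solving the normal equations; a minimal sketch:

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Solve w = (X^T X + lam*I)^-1 X^T y; lam shrinks the weights toward zero."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Toy usage: y = 1 + 2*x, so the weights should come out close to [1, 2]
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column = intercept
y = np.array([1.0, 3.0, 5.0, 7.0])
print(ridge_regression(X, y, lam=0.01))
```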

**CART** is an acronym for **Classification And Regression Trees**. It can be applied to regression or classification.

**Tree-based regression**

**Clustering** is a type of unsupervised learning that automatically forms clusters of similar things. It’s like automatic classification.

The algorithm is called **k-means** because it finds k unique clusters, and the center of each cluster is the mean of the values in that cluster.
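
A compact sketch of the loop (my own simplified version, not the book's `kMeans`): assign each point to its nearest centroid, then move each centroid to the mean of its points:

```python
import numpy as np

def k_means(data, k, iters=10):
    """Alternate between assigning points to centroids and recomputing the means."""
    rng = np.random.default_rng(0)
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = data[assignments == j].mean(axis=0)
    return centroids, assignments

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centroids, labels = k_means(data, k=2)
print(centroids, labels)
```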

Clustering is sometimes called **unsupervised classification** because it produces the same result as classification but without having predefined classes.

**k-means clustering**

Looking for hidden relationships in large datasets is known as **association analysis** or **association rule learning**. The problem is that finding different combinations of items can be a time-consuming task and prohibitively expensive in terms of computing power.

**Apriori**

Association analysis looks for **interesting relationships** in a large set of data. These relationships can take two forms: a **frequent itemset**, which shows items that commonly appear together in the data, or **association rules**, which suggest that a strong relationship exists between two items.
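
The basic quantity behind both is support, the fraction of transactions that contain an itemset; a small sketch (not the book's `scanD`):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

transactions = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
print(support({2, 5}, transactions))   # 0.75
print(support({1, 5}, transactions))   # 0.25
```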

The **FP-growth** algorithm is faster than Apriori because it requires only two scans of the dataset. The basic approach has two steps:

- Build the FP-tree.
- Mine frequent itemsets from the FP-tree.

**FP-growth**

The **FP-growth** algorithm stores data in a compact data structure called an **FP-tree**. The FP stands for “frequent pattern.” An FP-tree looks like other trees in computer science, but it has links connecting similar items. The linked items can be thought of as a linked list.
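
A sketch of what one node of such a tree might hold (modeled loosely on the book's `treeNode` class; the field names here are my own):

```python
class TreeNode:
    """One node of an FP-tree."""
    def __init__(self, name, count, parent):
        self.name = name          # the item this node represents
        self.count = count        # how many transactions pass through this node
        self.parent = parent      # link back up the tree, used when mining
        self.children = {}        # item name -> child TreeNode
        self.node_link = None     # link to the next node holding the same item

    def inc(self, count):
        """Add to the count when another transaction follows this path."""
        self.count += count
```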

**Dimensionality reduction** is the task of reducing the number of inputs you have; this can reduce noise and improve the performance of machine learning algorithms. Its benefits include:

- Making the dataset easier to use
- Reducing computational cost of many algorithms
- Removing noise
- Making the results easier to understand

The first dimensionality-reduction technique is **principal component analysis (PCA)**. In PCA, the dataset is transformed from its original coordinate system to a new coordinate system chosen by the data itself. The first new axis is chosen in the direction of the most variance in the data. The second axis is orthogonal to the first and points in the direction of the largest remaining variance. This procedure is repeated for as many features as we had in the original data. We’ll find that the majority of the variance is contained in the first few axes, so we can ignore the rest of the axes and reduce the dimensionality of our data.
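
A compact NumPy sketch of that procedure (not the book's `pca` function):

```python
import numpy as np

def pca(data, top_n=1):
    """Project data onto the top_n directions of greatest variance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)          # feature covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)      # eigh: symmetric matrix
    order = np.argsort(eig_vals)[::-1][:top_n]    # largest-variance axes first
    components = eig_vecs[:, order]
    return centered @ components, components      # reduced data, new axes

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
reduced, axes = pca(data, top_n=1)
print(reduced.shape)   # (5, 1)
```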

**Factor analysis**is another method for dimensionality reduction. In factor analysis, we assume that some unobservable latent variables are generating the data we observe. The data we observe is assumed to be a linear combination of the latent variables and some noise. The number of latent variables is possibly lower than the amount of observed data, which gives us the dimensionality reduction. Factor analysis is used in social sciences, finance, and other areas.

The last dimensionality-reduction technique is **independent component analysis (ICA)**. ICA assumes that the data is generated by N sources, which is similar to factor analysis.

**Principal component analysis**

**Singular value decomposition (SVD)** is a powerful tool used to distill information in a number of applications, from bioinformatics to finance.
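
NumPy computes the decomposition directly; a minimal sketch that also shows how much “energy” each singular value captures:

```python
import numpy as np

data = np.array([[1.0, 1.0, 0.0],
                 [2.0, 2.0, 0.0],
                 [0.0, 0.0, 5.0],
                 [0.0, 0.0, 3.0]])

U, sigma, Vt = np.linalg.svd(data)
# sigma holds the singular values, largest first; squaring them shows
# how much of the total "energy" each latent dimension captures
energy = sigma ** 2 / (sigma ** 2).sum()
print(sigma)
print(energy)
```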

**The singular value decomposition (SVD)**

**MapReduce**

To be continued…