Review basic concepts in Machine Learning

I went over basic concepts, methods, and functions in Machine Learning while reading Machine Learning in Action by Peter Harrington. The book provides executable Python code and is very helpful to anyone who wants to start applying Machine Learning and Big Data techniques in real projects.

These are notes and references from Machine Learning in Action:

Classification

  • k-Nearest Neighbors
  • Decision Tree
  • Naive Bayes (probability theory)
  • Logistic Regression (optimization, probability estimate)
  • Support Vector Machine
  • AdaBoost (improving classification with boosting)

Forecasting

  • Regression (predict numeric values)
  • Tree-based regression

Unsupervised Learning

  • K-means clustering (grouping unlabeled items)
  • Apriori algorithm (association analysis)
  • FP-growth (efficiently find frequent itemsets)

Additional Tools

  • Principal component analysis (simplify data)
  • Singular value decomposition (simplify data)
  • Big data and MapReduce

 

Steps in developing a machine learning application

  1. Collect data.
  2. Prepare the input data.
  3. Analyze the input data. (If you're working with a production system and you know what the data should look like, or you trust its source, you can skip this step.)
  4. Train the algorithm.
  5. Test the algorithm.
  6. Use it.
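
As a minimal sketch of this workflow, the outline below assumes hypothetical train(data, labels) and classify(model, sample) functions supplied by the caller; it is an illustration of the train/test/use loop, not code from the book:

    import random

    def run_pipeline(dataset, labels, train, classify, holdout=0.2):
        """Train on part of the data, test on the held-out rest, return the error rate."""
        indices = list(range(len(dataset)))
        random.shuffle(indices)
        n_test = max(1, int(len(indices) * holdout))
        test_idx, train_idx = indices[:n_test], indices[n_test:]

        # Step 4: train the algorithm on the training portion
        model = train([dataset[i] for i in train_idx], [labels[i] for i in train_idx])

        # Step 5: test the algorithm on data it has not seen
        errors = sum(classify(model, dataset[i]) != labels[i] for i in test_idx)
        return errors / float(n_test)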

 

k-Nearest Neighbors

Pros: High accuracy, insensitive to outliers, no assumptions about data

Cons: Computationally expensive, requires a lot of memory

Works with: Numeric values, nominal values
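
As a concrete illustration, here is a minimal kNN sketch with NumPy; the two-feature dataset and labels are made up purely for illustration:

    import numpy as np

    def knn_classify(in_x, data_set, labels, k=3):
        """Return the majority label among the k closest training points."""
        diff = data_set - in_x                   # broadcast subtraction against every row
        distances = np.sqrt((diff ** 2).sum(axis=1))
        nearest = distances.argsort()[:k]        # indices of the k smallest distances
        votes = {}
        for i in nearest:
            votes[labels[i]] = votes.get(labels[i], 0) + 1
        return max(votes, key=votes.get)

    group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    print(knn_classify(np.array([0.1, 0.2]), group, labels, k=3))  # -> 'B'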

 

Decision Tree

A fully grown decision tree probably matches our training data too well. This problem is known as overfitting. In order to reduce overfitting, we can prune the tree, which removes some leaves: if a leaf node adds only a little information, it is cut off and merged with another leaf.

ID3 is good but not the best. ID3 can't handle numeric values. We could use continuous values by quantizing them into discrete bins, but ID3 suffers from other problems if we have too many splits.

A decision tree classifier is just like a work-flow diagram with the terminating blocks representing classification decisions. Starting with a dataset, you can measure the inconsistency of a set or the entropy to find a way to split the set until all the data belongs to the same class. The ID3 algorithm can split nominal-valued datasets. Recursion is used in tree-building algorithms to turn a dataset into a decision tree. The tree is easily represented in a Python dictionary rather than a special data structure.
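
As a concrete illustration of the entropy measure used to choose splits, here is a minimal sketch along the lines of the book's calcShannonEnt; the toy dataset is made up, and each row ends with its class label:

    from math import log

    def shannon_entropy(data_set):
        """H = -sum(p_i * log2(p_i)) over the class labels in the last column."""
        counts = {}
        for row in data_set:
            label = row[-1]
            counts[label] = counts.get(label, 0) + 1
        entropy = 0.0
        total = float(len(data_set))
        for count in counts.values():
            p = count / total
            entropy -= p * log(p, 2)
        return entropy

    # A perfectly mixed two-class set has entropy 1.0
    print(shannon_entropy([[1, 'yes'], [1, 'yes'], [0, 'no'], [0, 'no']]))  # -> 1.0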

The contact lens data showed that decision trees can try too hard and overfit a dataset. This overfitting can be removed by pruning the decision tree, combining adjacent leaf nodes that don’t provide a large amount of information gain.

There are other decision tree–generating algorithms. The most popular are C4.5 and CART.

That’s Bayesian decision theory in a nutshell: choosing the decision with the highest probability.

Using probabilities can sometimes be more effective than using hard rules for classification. Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values.
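
A tiny worked example of Bayes' rule, P(c|x) = P(x|c) * P(c) / P(x); the numbers are invented purely for illustration:

    p_c = 0.3          # prior P(class = spam)
    p_x_given_c = 0.8  # likelihood P(word "offer" | spam)
    p_x = 0.8 * 0.3 + 0.1 * 0.7  # evidence P(word "offer"), summed over both classes

    p_c_given_x = p_x_given_c * p_c / p_x
    print(round(p_c_given_x, 3))  # -> 0.774, the posterior we could not observe directly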

Stochastic gradient ascent is an example of an online learning algorithm. This is known as online because we can incrementally update the classifier as new data comes in rather than all at once. The all-at-once method is known as batch processing.

 

Logistic regression is finding best-fit parameters to a nonlinear function called the sigmoid. Methods of optimization can be used to find the best-fit parameters. Among the optimization algorithms, one of the most common algorithms is gradient ascent. Gradient ascent can be simplified with stochastic gradient ascent.
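
A minimal sketch of this with NumPy is below; the learning rate and iteration count are chosen only for illustration, and a stochastic version would update the weights one sample at a time as new data arrives:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grad_ascent(data, labels, alpha=0.01, iters=500):
        """Batch gradient ascent on the log-likelihood: w += alpha * X^T (y - sigmoid(Xw))."""
        X = np.asarray(data, dtype=float)
        y = np.asarray(labels, dtype=float)
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            error = y - sigmoid(X @ w)   # gap between labels and current predictions
            w += alpha * (X.T @ error)   # move the weights up the gradient
        return w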

 

If some values are missing, here are some options:

  •  Use the feature’s mean value from all the available data.
  •  Fill in the unknown with a special value like -1.
  •  Ignore the instance.
  •  Use a mean value from similar items.
  •  Use another machine learning algorithm to predict the value.
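
For example, the first option (filling with the feature's mean) might look like this minimal NumPy sketch, where missing entries are represented as NaN:

    import numpy as np

    def fill_with_mean(data):
        """Column-wise mean imputation for a 2-D float array containing NaNs."""
        data = np.array(data, dtype=float)
        for j in range(data.shape[1]):
            col = data[:, j]
            mean = np.nanmean(col)       # mean over the non-missing values
            col[np.isnan(col)] = mean    # in-place fill of the missing entries
        return data

    print(fill_with_mean([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]]))
    # -> [[1. 6.], [3. 4.], [2. 8.]]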

 

The points closest to the separating hyperplane are known as support vectors. Now that we know that we’re trying to maximize the distance from the separating line to the support vectors, we need to find a way to optimize this problem.

One thing to note is that SVMs are binary classifiers. You’ll need to write a little more code to use an SVM on a problem with more than two classes.

The radial basis function is a kernel that’s often used with support vector machines.
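
A minimal sketch of a radial basis function kernel; the exact form varies, but here K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), with sigma a user-chosen width:

    import numpy as np

    def rbf_kernel(x, y, sigma=1.0):
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return np.exp(-diff.dot(diff) / (2.0 * sigma ** 2))

    print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # -> ~0.368 (squared distance 2 -> e^-1)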

Support vector machines are a type of classifier. They’re called machines because they generate a binary decision; they’re decision machines.

Support vector machines try to maximize margin by solving a quadratic optimization problem. In the past, complex, slow quadratic solvers were used to train support vector machines.

 

Meta-algorithms are a way of combining other algorithms. We’ll focus on one of the most popular meta-algorithms, called AdaBoost. This is a powerful tool to have in your toolbox because AdaBoost is considered by some to be the best supervised learning algorithm.

AdaBoost

Pros: Low generalization error, easy to code, works with most classifiers, no parameters to adjust

Cons: Sensitive to outliers

Works with: Numeric values, nominal values

Bootstrap aggregating, which is known as bagging, is a technique where the data is taken from the original dataset S times to make S new datasets. The datasets are the same size as the original. Each dataset is built by randomly selecting an example from the original with replacement. By “with replacement” I mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, and some values from the original won’t be present in the new set.
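
To make “with replacement” concrete, here is a short sketch of drawing one bootstrap sample:

    import random

    def bootstrap_sample(dataset):
        """One bagging sample: same size as the original, drawn with replacement."""
        n = len(dataset)
        return [random.choice(dataset) for _ in range(n)]  # duplicates allowed,
                                                           # some originals left out

    original = [1, 2, 3, 4, 5]
    print(bootstrap_sample(original))  # e.g. [3, 1, 3, 5, 2]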

 

The difference between regression and classification is that in regression our target variable is numeric and continuous. Regression is the process of predicting a target value, similar to classification.

Linear regression

Pros: Easy to interpret results, computationally inexpensive

Cons: Poorly models nonlinear data

Works with: Numeric values, nominal values

Shrinkage methods can also be viewed as adding bias to a model and reducing the variance. The bias/variance tradeoff is a powerful concept in understanding how altering a model impacts its success.
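
As a sketch of what a shrinkage method looks like next to plain least squares, here is ridge regression in NumPy; the penalty lambda is the knob that trades added bias for reduced variance, and the values used are illustrative:

    import numpy as np

    def ols_weights(X, y):
        return np.linalg.solve(X.T @ X, X.T @ y)           # w = (X^T X)^-1 X^T y

    def ridge_weights(X, y, lam=0.1):
        n = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([1.0, 2.0, 3.1])
    print(ols_weights(X, y), ridge_weights(X, y, lam=1.0))  # shrunk weights for ridge
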
CART is an acronym for Classification And Regression Trees. It can be applied to regression or classification.

Tree-based regression

Pros: Fits complex, nonlinear data

Cons: Difficult to interpret results

Works with: Numeric values, nominal values
Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It’s like automatic classification. It’s called k-means because it finds k unique clusters, and the center of each cluster is the mean of the values in that cluster. Clustering is sometimes called unsupervised classification because it produces the same result as classification but without having predefined classes.

k-means clustering

Pros: Easy to implement

Cons: Can converge at local minima; slow on very large datasets

Works with: Numeric values
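
A minimal k-means sketch in NumPy; random initialization and a fixed iteration count are simplifying assumptions, and a fuller version would iterate until the assignments stop changing:

    import numpy as np

    def k_means(data, k, iters=10):
        data = np.asarray(data, dtype=float)
        centroids = data[np.random.choice(len(data), k, replace=False)]
        for _ in range(iters):
            # distance of every point to every centroid, shape (n_points, k)
            dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)          # nearest centroid for each point
            for j in range(k):
                if np.any(assign == j):
                    centroids[j] = data[assign == j].mean(axis=0)  # move centroid to mean
        return centroids, assign
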
Looking for hidden relationships in large datasets is known as association analysis or association rule learning. The problem is, finding different combinations of items can be a time-consuming task and prohibitively expensive in terms of computing power.

Apriori

Pros: Easy to code up

Cons: May be slow on large datasets

Works with: Numeric values, nominal values

Association analysis is a set of tools used to find interesting relationships in a large set of data. There are two ways you can quantify the interesting relationships. The first way is a frequent itemset, which shows items that commonly appear in the data together. The second way of measuring interesting relationships is association rules, which imply an if..then relationship between items.
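
A tiny worked example of the two measures, support of an itemset and confidence of an if..then rule, on made-up transactions:

    transactions = [{'bread', 'milk'},
                    {'bread', 'diapers', 'beer'},
                    {'milk', 'diapers', 'beer'},
                    {'bread', 'milk', 'diapers', 'beer'}]

    def support(itemset):
        """Fraction of transactions that contain every item in the itemset."""
        return sum(itemset <= t for t in transactions) / float(len(transactions))

    # confidence(A -> B) = support(A union B) / support(A)
    print(support({'diapers', 'beer'}))                         # 0.75
    print(support({'diapers', 'beer'}) / support({'diapers'}))  # confidence(diapers -> beer) = 1.0
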
The FP-growth algorithm is faster than Apriori because it requires only two scans of the database, whereas Apriori scans the dataset once for every potential frequent item to check whether a given pattern is frequent. On small datasets this isn’t a problem, but on larger datasets it is. The basic approach to finding frequent itemsets using the FP-growth algorithm is as follows:

  1.  Build the FP-tree.
  2.  Mine frequent itemsets from the FP-tree.

FP-growth

Pros: Usually faster than Apriori.

Cons: Difficult to implement; certain datasets degrade the performance.

Works with: Nominal values.

The FP-growth algorithm stores data in a compact data structure called an FP-tree. The FP stands for “frequent pattern.” An FP-tree looks like other trees in computer science, but it has links connecting similar items. The linked items can be thought of as a linked list.
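
A minimal sketch of what an FP-tree node might hold; the names are illustrative, not the book's exact class:

    class FPNode:
        """One node of an FP-tree: an item, its count, and the usual tree links."""
        def __init__(self, item, count, parent):
            self.item = item
            self.count = count
            self.parent = parent
            self.children = {}      # item -> FPNode
            self.node_link = None   # next node in the tree holding the same item

        def increment(self, count):
            self.count += count
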
Dimensionality reduction is the task of reducing the number of inputs you have; this can reduce noise and improve the performance of machine learning algorithms. A short list of other reasons we want to simplify our data includes the following:

  •  Making the dataset easier to use
  •  Reducing the computational cost of many algorithms
  •  Removing noise
  •  Making the results easier to understand

The first method for dimensionality reduction is called principal component analysis (PCA). In PCA, the dataset is transformed from its original coordinate system to a new coordinate system chosen by the data itself. The first new axis is chosen in the direction of the most variance in the data. The second axis is orthogonal to the first and points in the orthogonal direction with the largest remaining variance. This procedure is repeated for as many features as we had in the original data. We’ll find that the majority of the variance is contained in the first few axes, so we can ignore the rest of the axes and reduce the dimensionality of our data.
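
A minimal PCA sketch in NumPy following that recipe; the number of components to keep is left to the caller:

    import numpy as np

    def pca(data, top_n=1):
        """Center the data, eigendecompose its covariance, project onto the top axes."""
        data = np.asarray(data, dtype=float)
        centered = data - data.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eig_vals, eig_vecs = np.linalg.eigh(cov)     # eigh: covariance is symmetric
        order = np.argsort(eig_vals)[::-1][:top_n]   # largest-variance directions first
        components = eig_vecs[:, order]
        return centered @ components                 # data in the new coordinate system
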
Factor analysis is another method for dimensionality reduction. In factor analysis, we assume that some unobservable latent variables are generating the data we observe. The data we observe is assumed to be a linear combination of the latent variables and some noise. The number of latent variables is possibly lower than the amount of observed data, which gives us the dimensionality reduction. Factor analysis is used in social sciences, finance, and other areas.

Another common method for dimensionality reduction is independent component analysis (ICA). ICA assumes that the data is generated by N sources, which is similar to factor analysis. The data is assumed to be a mixture of observations of the sources. The sources are assumed to be statistically independent, unlike PCA, which assumes the data is uncorrelated. As with factor analysis, if there are fewer sources than the amount of observed data, we’ll get a dimensionality reduction.
Principal component analysis

Pros: Reduces complexity of data, identifies most important features

Cons: May not be needed, could throw away useful information

Works with: Numerical values
The method for distilling the important information in a dataset is known as the singular value decomposition (SVD). It’s a powerful tool used to distill information in a number of applications, from bioinformatics to finance.

The singular value decomposition (SVD)

Pros: Simplifies data, removes noise, may improve algorithm results.

Cons: Transformed data may be difficult to understand.

Works with: Numeric values.
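
A minimal NumPy sketch of taking an SVD and checking how much of the data's energy the leading singular values capture; the matrix is made up for illustration:

    import numpy as np

    data = np.array([[1.0, 1.0, 0.0],
                     [2.0, 2.0, 0.0],
                     [0.0, 0.0, 5.0]])
    U, sigma, Vt = np.linalg.svd(data)
    energy = sigma ** 2 / (sigma ** 2).sum()
    print(sigma)    # singular values, largest first
    print(energy)   # fraction of the total energy each singular value accounts for
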
MapReduce

Pros: Processes a massive job in a short period of time.

Cons: Algorithms must be rewritten; requires understanding of systems engineering.

Works with: Numeric values, nominal values.
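
As a toy illustration of the programming model only (a real job runs distributed on a framework such as Hadoop), here is a word count expressed as a map step followed by a reduce step:

    from itertools import groupby

    def mapper(line):
        """Map step: emit (word, 1) for every word in a line."""
        for word in line.split():
            yield (word, 1)

    def reducer(word, counts):
        """Reduce step: sum the counts for one word."""
        return (word, sum(counts))

    lines = ["machine learning in action", "machine learning"]
    pairs = sorted(kv for line in lines for kv in mapper(line))      # map, then shuffle/sort
    result = [reducer(word, [c for _, c in group])
              for word, group in groupby(pairs, key=lambda kv: kv[0])]
    print(result)  # [('action', 1), ('in', 1), ('learning', 2), ('machine', 2)]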

To be continued…

 
