Clustering

--Originally published at Enro Blog

 
Fair Isle flickr photo by neil1877 shared under a Creative Commons (BY-NC-ND) license

Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances.

The groups that are identified may be exclusive so that any instance belongs in only one group. Or they may be overlapping so that an instance may fall into several groups. Or they may be probabilistic, whereby an instance belongs to each group with a certain probability.

Iterative distance-based clustering

The classic clustering technique is called k-means. First, you specify in advance how many clusters are being sought: this is the parameter k. Then k points are chosen at random as cluster centers. All instances are assigned to their closest cluster center according to the ordinary Euclidean distance metric. Next the centroid, or mean, of the instances in each cluster is calculated—this is the “means” part. These centroids are taken to be new center values for their respective clusters. Finally, the whole process is repeated with the new cluster centers. Iteration continues until the same points are assigned to each cluster in consecutive rounds, at which stage the cluster centers have stabilized and will remain the same forever.
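The loop is easy to write down. Here is a minimal sketch in Python/NumPy, assuming a numeric data matrix X with one row per instance; the function name and the convergence test on unchanged assignments follow the description above, but the details are illustrative rather than a reference implementation:

```python
import numpy as np

def k_means(X, k, rng=None):
    """Plain k-means: random initial centers, Euclidean assignment,
    centroid update; stop when the assignments no longer change."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = np.full(len(X), -1)
    while True:
        # assign every instance to its closest center (ordinary Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            return centers, assignment          # centers have stabilized
        assignment = new_assignment
        # recompute each center as the centroid (mean) of its cluster
        for j in range(k):
            members = X[assignment == j]
            if len(members):
                centers[j] = members.mean(axis=0)

# usage: look for three clusters in some two-dimensional data
centers, labels = k_means(np.random.default_rng(1).random((100, 2)), k=3)
```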

Faster distance calculations

One way to speed up the distance calculations is to project the dataset and make cuts along selected axes, instead of using the arbitrary hyperplane divisions that are implied by choosing the nearest cluster center. But this inevitably compromises the quality of the resulting clusters.

Here’s a better way of speeding things up. Finding the closest cluster center is not as expensive as it first appears: a tree structure such as a kD-tree built over the points lets many instances be assigned to their nearest center without computing every distance explicitly.

Instance-based learning

--Originally published at Enro Blog

 
house flickr photo by barnyz shared under a Creative Commons (BY-NC-ND) license

In instance-based learning the training examples are stored verbatim, and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.

Distance function

Although there are other possible choices, most instance-based learners use Euclidean distance.

$$D = \sqrt{(a_1^{(1)} - a_1^{(2)})^2 + (a_2^{(1)} - a_2^{(2)})^2 + \cdots + (a_k^{(1)} - a_k^{(2)})^2}$$

where $a_1^{(1)}, a_2^{(1)}, \ldots, a_k^{(1)}$ are the attribute values of one instance and $a_1^{(2)}, a_2^{(2)}, \ldots, a_k^{(2)}$ those of the other.

When comparing distances it is not necessary to perform the square root operation; the sums of squares can be compared directly.

Different attributes are measured on different scales, so if the Euclidean distance formula were used directly, the effects of some attributes might be completely dwarfed by others that had larger scales of measurement. Consequently, it is usual to normalize all attribute values to lie between 0 and 1, by calculating

$$a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}$$

where $v_i$ is the actual value of attribute $i$, and the maximum and minimum are taken over all instances in the training set.

Nearest-neighbor instance-based learning is simple and often works very well. One drawback of the method described previously is that each attribute has exactly the same influence on the decision, just as it does in the Naïve Bayes method. Another problem is that the database can easily become corrupted by noisy exemplars.
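To make the preceding points concrete, here is a rough 1-nearest-neighbour sketch in Python/NumPy that first normalizes every attribute to the 0–1 range and then compares squared distances, skipping the square root; the small dataset and names are made up for illustration:

```python
import numpy as np

def normalize(X, lo=None, hi=None):
    """Rescale each attribute to [0, 1]: a_i = (v_i - min v_i) / (max v_i - min v_i)."""
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    return (X - lo) / np.where(hi > lo, hi - lo, 1), lo, hi

def nearest_neighbor_predict(X_train, y_train, x):
    # squared Euclidean distance is enough for comparisons: no square root needed
    d2 = ((X_train - x) ** 2).sum(axis=1)
    return y_train[d2.argmin()]

# usage: normalize the training data, apply the same scaling to the test instance
X_train = np.array([[6.0, 180.0], [5.5, 120.0], [5.9, 160.0]])
y_train = np.array(["tall", "short", "tall"])
Xn, lo, hi = normalize(X_train)
xn, *_ = normalize(np.array([[5.6, 130.0]]), lo, hi)
print(nearest_neighbor_predict(Xn, y_train, xn[0]))   # -> "short"
```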

 

Bibliography

Witten, I. H., & Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier.

Linear models

--Originally published at Enro Blog


bricks. flickr photo by eaortmann shared under a Creative Commons (BY-NC-ND) license

When the outcome, or class, is numeric, and all the attributes are numeric, linear regression is a natural technique to consider. This is a staple method in statistics. The idea is to express the class as a linear combination of the attributes, with predetermined weights:

$$x = w_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k$$

where $x$ is the class, $a_1, a_2, \ldots, a_k$ are the attribute values, and $w_0, w_1, \ldots, w_k$ are the weights.

Linear regression is an excellent, simple method for numeric prediction, and it has been widely used in statistical applications for decades. Of course, linear models suffer from the disadvantage of, well, linearity. If the data exhibits a nonlinear dependency, the best-fitting straight line will be found, where “best” is interpreted as the least mean-squared difference.
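As a quick illustrative sketch (not taken from the book), the least-squares weights can be computed directly with NumPy; the tiny dataset here is invented:

```python
import numpy as np

# made-up training data: two numeric attributes and a numeric class value
A = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

# prepend a column of ones so that w0 acts as the intercept weight
A1 = np.column_stack([np.ones(len(A)), A])

# least-squares weights: minimize the mean-squared difference on the training data
w, *_ = np.linalg.lstsq(A1, y, rcond=None)

# predict the class of a new instance as w0 + w1*a1 + w2*a2
x_new = np.array([1.0, 2.5, 2.5])      # leading 1 for the intercept
print(w, x_new @ w)                     # weights ≈ [0, 1, 2], prediction ≈ 7.5
```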

Linear classification: Logistic regression

We can use any regression technique, whether linear or nonlinear, for classification. The trick is to perform a regression for each class, setting the output equal to one for training instances that belong to the class and zero for those that do not. The result is a linear expression for each class. Then, given a test example of unknown class, calculate the value of each linear expression and choose the one that is largest. This method is sometimes called multiresponse linear regression.

One way of looking at multiresponse linear regression is to imagine that it approximates a numeric membership function for each class. The membership function is 1 for instances that belong to that class and 0 for other instances. Given a new instance we calculate its membership for each class and select the biggest.
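A minimal sketch of that idea, assuming numeric attributes: fit one least-squares regression per class against a 0/1 membership target and predict the class whose expression evaluates largest. The dataset and function names are illustrative:

```python
import numpy as np

def fit_multiresponse(A, labels):
    """Fit one linear membership function per class (target 1 for the class, 0 otherwise)."""
    A1 = np.column_stack([np.ones(len(A)), A])          # intercept column
    weights = {}
    for c in sorted(set(labels)):
        target = np.array([1.0 if l == c else 0.0 for l in labels])
        weights[c], *_ = np.linalg.lstsq(A1, target, rcond=None)
    return weights

def predict(weights, a):
    a1 = np.concatenate([[1.0], a])
    # choose the class with the largest approximate membership value
    return max(weights, key=lambda c: a1 @ weights[c])

A = np.array([[0.2, 0.1], [0.3, 0.4], [0.9, 0.8], [0.8, 0.7]])
labels = ["no", "no", "yes", "yes"]
w = fit_multiresponse(A, labels)
print(predict(w, np.array([0.85, 0.75])))   # -> "yes"
```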

Drawbacks

  • First, the membership values it produces are not proper probabilities, because they can fall outside the range 0 to 1.
  • Second, least-squares regression assumes that the errors are not only statistically independent but also normally distributed with the same standard deviation, an assumption that is blatantly violated when the method is applied to classification problems, because the observations only ever take on the values 0 and 1.

Mining association rules

--Originally published at Enro Blog


I flickr photo by joannapoe shared under a Creative Commons (BY-SA) license

To find such rules, you would have to execute the rule-induction procedure once for every possible combination of attributes, with every possible combination of values on the right-hand side. That would result in an enormous number of association rules, which would then have to be pruned down on the basis of their coverage (the number of instances that they predict correctly) and their accuracy (the same number expressed as a proportion of the number of instances to which the rule applies). This approach is quite infeasible. Instead, we capitalize on the fact that we are only interested in association rules with high coverage. We ignore, for the moment, the distinction between the left- and right-hand sides of a rule and seek combinations of attribute–value pairs that have a prespecified minimum coverage. These are called item sets: an attribute–value pair is an item.

Association rules

Once all item sets with the required coverage have been generated, the next step is to turn each into a rule, or set of rules, with at least the specified minimum accuracy. Some item sets will produce more than one rule; others will produce none.

Generating rules efficiently

The first stage proceeds by generating all one-item sets with the given minimum coverage and then using these to generate the two-item sets, three-item sets, and so on. Each operation involves a pass through the dataset to count the items in each set, and after the pass the surviving item sets are stored in a hash table. From the one-item sets, candidate two-item sets are generated, and then a pass is made through the dataset, counting the coverage of each two-item set; at the end, the candidate sets with less than minimum coverage are removed from the hash table.
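A rough sketch of this level-wise pass-and-prune loop, with transactions represented as Python sets and a plain dictionary standing in for the hash table (names and data are illustrative):

```python
from itertools import combinations

def frequent_item_sets(transactions, min_coverage):
    """Level-wise generation: count k-item sets, keep the ones with enough coverage,
    and use the survivors to build the (k+1)-item candidates."""
    items = {item for t in transactions for item in t}
    frequent = {}                                    # item set (frozenset) -> coverage
    current = [frozenset([i]) for i in items]
    while current:
        # one pass through the dataset to count each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_coverage}
        frequent.update(survivors)
        # candidate (k+1)-item sets are unions of surviving k-item sets
        keys = list(survivors)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1})
    return frequent

transactions = [{"milk", "bread"}, {"milk", "bread", "butter"},
                {"bread", "butter"}, {"milk", "bread"}]
print(frequent_item_sets(transactions, min_coverage=2))
```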

Covering algorithms: Constructing rules

--Originally published at Enro Blog


In “God’s Own Country” flickr photo by www.davidbaxendale.com shared under a Creative Commons (BY-ND) license

An alternative approach to a decision tree is to take each class in turn and seek a way of covering all instances in it, at the same time excluding instances not in the class. This is called a covering approach because at each stage you identify a rule that “covers” some of the instances. By its very nature, this covering approach leads to a set of rules rather than to a decision tree.

A difference between covering algorithms and recursive divide-and-conquer decision trees is that, in the multiclass case, a decision tree split takes all classes into account, trying to maximize the purity of the split, whereas the rule-generating method concentrates on one class at a time, disregarding what happens to the other classes.

A simple covering algorithm

Covering algorithms operate by adding tests to the rule that is under construction, always striving to create a rule with maximum accuracy. In contrast, divide-and-conquer algorithms operate by adding tests to the tree that is under construction, always striving to maximize the separation among the classes. Each of these involves finding an attribute to split on. But the criterion for the best attribute is different in each case. Whereas divide-and-conquer algorithms such as ID3 choose an attribute to maximize the information gain, the covering algorithm we will describe chooses an attribute–value pair to maximize the probability of the desired classification.


The PRISM method constructs rules that are correct or “perfect.” It measures the success of a rule by the accuracy formula p/t, where t is the total number of instances covered by the rule and p is the number of these that belong to the target class. Any rule with accuracy less than 100% is “incorrect” in that it assigns cases to the class in question that actually do not belong to it.
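The following is a simplified sketch of that idea rather than full PRISM (it omits tie-breaking on coverage and the outer loop that removes covered instances and generates further rules): for one class, keep adding the attribute–value test that maximizes p/t until the rule is perfect.

```python
def build_rule(instances, target_class):
    """Grow one rule for target_class by greedily adding the test that maximizes p/t."""
    rule = {}                                     # attribute -> required value
    covered = list(instances)
    while covered and any(inst["class"] != target_class for inst in covered):
        best, best_ratio = None, -1.0
        for attr in covered[0]:
            if attr == "class" or attr in rule:
                continue
            for value in {inst[attr] for inst in covered}:
                subset = [inst for inst in covered if inst[attr] == value]
                t = len(subset)
                p = sum(1 for inst in subset if inst["class"] == target_class)
                if t and p / t > best_ratio:
                    best, best_ratio = (attr, value), p / t
        if best is None:
            break
        rule[best[0]] = best[1]                   # add the winning test to the rule
        covered = [inst for inst in covered if inst[best[0]] == best[1]]
    return rule

weather = [
    {"outlook": "sunny",    "windy": False, "class": "no"},
    {"outlook": "sunny",    "windy": True,  "class": "no"},
    {"outlook": "overcast", "windy": False, "class": "yes"},
    {"outlook": "rainy",    "windy": False, "class": "yes"},
    {"outlook": "rainy",    "windy": True,  "class": "no"},
]
print(build_rule(weather, "yes"))   # -> {'outlook': 'overcast'}
```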

Divide-and-conquer: Constructing decision trees

--Originally published at Enro Blog

 
A Pollinator in Pink……..HFDF! flickr photo by The Manic Macrographer shared under a Creative Commons (BY) license

The problem of constructing a decision tree can be expressed recursively. First, select an attribute to place at the root node and make one branch for each possible value. This splits up the example set into subsets, one for every value of the attribute. Now the process can be repeated recursively for each branch, using only those instances that actually reach the branch. If at any time all instances at a node have the same classification, stop developing that part of the tree.

The only thing left to decide is how to determine which attribute to split on, given a set of examples with different classes. If we had a measure of the purity of each node, we could choose the attribute that produces the purest daughter nodes. The measure of purity that we will use is called the information and is measured in units called bits. Associated with a node of the tree, it represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node.

For the tree stump that splits on outlook, the numbers of yes and no instances at the leaf nodes are [2,3], [4,0], and [3,2], so the information values of these nodes are:

info([2,3]) = 0.971 bits
info([4,0]) = 0.0 bits
info([3,2]) = 0.971 bits

For outlook, we can calculate the average information value of these, taking into account the number of instances that go down each branch—five down the first and third and four down the second:

info([2,3],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

This average represents the amount of information that we expect would be necessary to specify the class of a new instance, given the tree structure in Figure 4.2(a). Before we created any of the nascent tree structures in Figure 4.2, the training examples at the root comprised nine yes and five no nodes, corresponding to an information value of

info([9,5]) = 0.940 bits

Thus the information gained by splitting on outlook is

gain(outlook) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits

The gains for the other attributes of the weather data work out to

gain(temperature) = 0.029 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits

so outlook, which gains the most information, is chosen as the splitting attribute at the root of the tree.
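Those numbers can be checked with a few lines of Python; the yes/no counts per outlook branch are taken from the standard weather data used above:

```python
from math import log2

def info(counts):
    """Information (entropy) in bits for a list of class counts, e.g. [2, 3]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# yes/no counts for each value of outlook in the 14-instance weather data
branches = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}
total = sum(sum(c) for c in branches.values())                         # 14 instances

root_info = info([9, 5])                                               # 0.940 bits
avg_info = sum(sum(c) / total * info(c) for c in branches.values())   # 0.693 bits
print(f"gain(outlook) = {root_info - avg_info:.3f} bits")              # 0.247
```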

Algorithms: The Basic Methods

--Originally published at Enro Blog

In the following two posts I’m going to go over the nine most common data mining algorithms.

Inferring rudimentary rules

Here’s an easy way to find very simple classification rules from a set of instances. Called 1R for 1-rule, it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute. The idea is this: we make rules that test a single attribute and branch accordingly. Each branch corresponds to a different value of the attribute. It is obvious which classification to give each branch: use the class that occurs most often in the training data. Then the error rate of the rules can easily be determined. Just count the errors that occur on the training data, that is, the number of instances that do not have the majority class.

The pseudocode for 1R:

    For each attribute,
        For each value of that attribute, make a rule as follows:
            count how often each class appears
            find the most frequent class
            make the rule assign that class to this attribute–value
        Calculate the error rate of the rules
    Choose the rules with the smallest error rate
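A compact Python sketch of the same procedure, assuming nominal attributes and instances stored as dictionaries (the representation is an assumption, not from the original post):

```python
from collections import Counter

def one_r(instances, class_attr="class"):
    """For each attribute, build one rule per value (predict the majority class) and
    keep the attribute whose rule set makes the fewest errors on the training data."""
    best = None                                       # (errors, attribute, rules)
    for attr in (a for a in instances[0] if a != class_attr):
        rules, errors = {}, 0
        for value in {inst[attr] for inst in instances}:
            classes = Counter(inst[class_attr] for inst in instances if inst[attr] == value)
            majority, count = classes.most_common(1)[0]
            rules[value] = majority
            errors += sum(classes.values()) - count   # instances outside the majority class
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best[1], best[2]                           # e.g. ('outlook', {'sunny': 'no', ...})
```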

Missing values and numeric attributes

The good! Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes. It deals with these in simple but effective ways. Missing is treated as just another attribute value so that, for example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would specify four possible class values, one each for sunny, overcast, and rainy and a fourth for missing.

The bad!

The procedure 1R uses to discretize numeric attributes tends to form a large number of categories. The 1R method will naturally gravitate toward choosing an attribute that splits into many categories, because this will partition the dataset into many classes, making it more likely that instances will have the same class as the majority in their partition. This phenomenon is known as overfitting. For 1R, overfitting is likely to occur whenever an attribute has a large number of possible values.


Output: Knowledge Representation

--Originally published at Enro Blog

Decision trees

A “divide-and-conquer” approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree.

If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible values of the attribute. In this case, because there is one branch for each possible value, the same attribute will not be retested further down the tree.

If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a predetermined constant, giving a two-way split.

The bad!

Missing values pose an obvious problem. It is not clear which branch should be taken when a node tests an attribute whose value is missing. Sometimes, as described in Section 2.4, a missing value is treated as an attribute value in its own right. If this is not the case, missing values should be treated in a special way rather than being considered as just another possible value that the attribute might take. A simple solution is to record the number of elements in the training set that go down each branch and to use the most popular branch if the value for a test instance is missing.
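A small sketch of that simple solution, under an assumed dictionary-based tree representation (none of these field names come from the book): each node records how many training instances went down each branch, and a missing value follows the most popular one.

```python
def route(node, instance):
    """Follow a decision tree down to a leaf; when the tested attribute is missing
    (or has an unseen value), take the branch most training instances went down."""
    while not node["is_leaf"]:
        value = instance.get(node["attribute"])               # None when the value is missing
        if value not in node["branches"]:
            value = max(node["counts"], key=node["counts"].get)   # most popular branch
        node = node["branches"][value]
    return node["prediction"]

# hypothetical two-branch tree: test "outlook"; 8 training instances went down "sunny"
tree = {
    "is_leaf": False, "attribute": "outlook",
    "counts": {"sunny": 8, "rainy": 6},
    "branches": {
        "sunny": {"is_leaf": True, "prediction": "yes"},
        "rainy": {"is_leaf": True, "prediction": "no"},
    },
}
print(route(tree, {}))   # outlook missing -> most popular branch ("sunny") -> "yes"
```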

 

Classification rules

Generally, the preconditions are logically ANDed together, and all the tests must succeed if the rule is to fire. However, in some rule formulations the preconditions are general logical expressions rather than simple conjunctions. Transforming a rule set into a tree is more difficult, because a tree cannot easily express the disjunction implied between different rules. For example, when rules test different attributes the symmetry must be broken, which leads to the replicated subtree problem.

One reason why rules are popular is that each rule seems to represent an independent “nugget” of knowledge. New rules can be added to an existing rule set without disturbing ones already there, whereas adding to a tree structure may require reshaping the whole tree.


Input: Concepts, Instances, and Attributes

--Originally published at Enro Blog

The input takes the form of concepts, instances, and attributes. We call the thing that is to be learned a concept description.

  • The information that the learner is given takes the form of a set of instances

Four basically different styles of learning appear in data mining applications.

  • In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
  • In association learning, any association among features is sought, not just ones that predict a particular class value.
  • In clustering, groups of examples that belong together are sought.
  • In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity.

Regardless of the type of learning involved, we call the thing to be learned the concept and the output produced by a learning scheme the concept description.

Classification learning is sometimes called supervised because, in a sense, the method operates under supervision by being provided with the actual outcome for each of the training examples

This outcome is called the class of the example.

When there is no specified class, clustering is used to group items that seem to fall naturally together.

The success of clustering is often measured subjectively in terms of how useful the result appears to be to a human user. It may be followed by a second step of classification learning in which rules are learned that give an intelligible description of how new instances should be placed into the clusters.

What’s in an example?

These instances are the things that are to be classified, associated, or clustered. Although until now we have called them examples, henceforth we will use the more specific term instances to refer to the input. Each instance is an individual, independent example of the concept to be learned.

Introduction to data mining

--Originally published at Enro Blog

What is data mining?

In data mining, the data is stored electronically and the search is automated— or at least augmented—by computer.

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern:

  • as a black box whose innards are effectively incomprehensible, or
  • as a transparent box whose construction reveals the structure of the pattern.

Such patterns we call structural because they capture the decision structure in an explicit way.

Machine learning

Things learn when they change their behavior in a way that makes them perform better in the future.

domain knowledge

Market basket analysis is the use of association techniques to find groups of items that tend to occur together in transactions, typically supermarket checkout data.

What’s the difference between machine learning and statistics? Cynics, looking wryly at the explosion of commercial interest (and hype) in this area, equate data mining to statistics plus marketing. In truth, you should not look for a dividing line between machine learning and statistics because there is a continuum—and a multidimensional one at that—of data analysis techniques. Some derive from the skills taught in standard statistics courses, and others are more closely associated with the kind of machine learning that has arisen out of computer science. Historically, the two sides have had rather different traditions. If forced to point to a single difference of emphasis, it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses.