Algorithms: The Basic Methods

--Originally published at Enro Blog

In the next two posts I’m going to go through the nine most common data mining algorithms.

Inferring rudimentary rules

Here’s an easy way to find very simple classification rules from a set of instances. Called 1R for 1-rule, it generates a one-level decision tree expressed in the form of a set of rules that all test one particular attribute. The idea is this: we make rules that test a single attribute and branch accordingly. Each branch corresponds to a different value of the attribute. It is obvious what is the best classification to give each branch: use the class that occurs most often in the training data. Then the error rate of the rules can easily be determined. Just count the errors that occur on the training data, that is, the number of instances that do not have the majority class.
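The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function name `one_r`, the dictionary-based instance format, and the tiny `toy_weather` table are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, class_key):
    """Return (attribute, rules, errors) for the attribute with fewest training errors."""
    best = None
    for attr in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for row in instances:
            counts[row[attr]][row[class_key]] += 1
        # Each branch (attribute value) predicts its majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not have the majority class of their branch.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical toy data, invented for illustration only.
toy_weather = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]

best_attr, best_rules, best_errors = one_r(toy_weather, ["outlook", "windy"], "play")
```

On this toy table, rules on outlook misclassify one training instance and rules on windy misclassify two, so outlook is chosen.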


Missing values and numeric attributes

The good! Although a very rudimentary learning method, 1R does accommodate both missing values and numeric attributes. It deals with these in simple but effective ways. Missing is treated as just another attribute value so that, for example, if the weather data had contained missing values for the outlook attribute, a rule set formed on outlook would specify four possible class values, one each for sunny, overcast, and rainy and a fourth for missing.
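As a rough sketch of the "missing is just another value" idea, the snippet below builds the rule set for one attribute and maps absent values to an extra branch. The `MISSING` label, the function name `rules_for`, and the toy rows are assumptions for illustration.

```python
from collections import Counter, defaultdict

MISSING = "missing"  # hypothetical label used for absent values

def rules_for(instances, attr, class_key):
    """Majority-class rule per attribute value, with missing as its own value."""
    counts = defaultdict(Counter)
    for row in instances:
        value = row.get(attr)
        if value is None:
            value = MISSING  # treat a missing value as one more branch
        counts[value][row[class_key]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

# Hypothetical toy rows; the last two have no usable outlook value.
toy = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
    {"outlook": None, "play": "no"},
    {"play": "no"},
]

outlook_rules = rules_for(toy, "outlook", "play")
```

The resulting rule set has four branches: sunny, overcast, rainy, and missing.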

The bad!

This procedure tends to form a large number of categories. The 1R method will naturally gravitate toward choosing an attribute that splits into many categories, because this will partition the dataset into many classes, making it more likely that instances will have the same class as the majority in their partition. This phenomenon is known as overfitting. For 1R, overfitting is likely to occur whenever an attribute has a large number of possible values.
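A small sketch makes the overfitting effect concrete. Here a made-up `id` attribute takes a different value for every instance, so each partition contains a single instance and the training error is zero, even though such a rule says nothing about new data. The function name `training_errors` and the toy rows are assumptions for illustration.

```python
from collections import Counter, defaultdict

def training_errors(instances, attr, class_key):
    """Number of training instances that differ from their branch's majority class."""
    counts = defaultdict(Counter)
    for row in instances:
        counts[row[attr]][row[class_key]] += 1
    return sum(sum(c.values()) - max(c.values()) for c in counts.values())

# Hypothetical toy rows; "id" is unique per instance.
toy = [
    {"id": "a", "windy": "false", "play": "yes"},
    {"id": "b", "windy": "false", "play": "yes"},
    {"id": "c", "windy": "true", "play": "no"},
    {"id": "d", "windy": "true", "play": "yes"},
]

id_errors = training_errors(toy, "id", "play")        # every partition has one instance
windy_errors = training_errors(toy, "windy", "play")  # the "true" partition is mixed
```

Judged only by training error, 1R would prefer `id` (no errors) over `windy` (one error), which is exactly the bias toward many-valued attributes described above.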

Continue reading "Algorithms: The Basic Methods"

Output: Knowledge Representation

--Originally published at Enro Blog

Decision trees

A “divide-and-conquer” approach to the problem of learning from a set of independent instances leads naturally to a style of representation called a decision tree.

If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible values of the attribute. In this case, because there is one branch for each possible value, the same attribute will not be retested further down the tree.

If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a predetermined constant, giving a two-way split.

The bad!

Missing values pose an obvious problem. It is not clear which branch should be taken when a node tests an attribute whose value is missing. Sometimes, as described in Section 2.4, a missing value is treated as an attribute value in its own right. If this is not the case, missing values should be treated in a special way rather than being considered as just another possible value that the attribute might take. A simple solution is to record the number of elements in the training set that go down each branch and to use the most popular branch if the value for a test instance is missing.
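The node tests and the "most popular branch" fallback can be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class, the `"<="` / `">"` branch keys, the threshold of 75, and all branch counts are hypothetical and not taken from any particular library or dataset.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: Optional[str] = None        # set on leaves only
    attribute: Optional[str] = None    # attribute tested at an internal node
    threshold: Optional[float] = None  # set for numeric tests (two-way split)
    children: dict = field(default_factory=dict)       # branch key -> Node
    branch_counts: dict = field(default_factory=dict)  # branch key -> training count

def classify_with_tree(node, instance):
    """Walk from the root to a leaf and return the leaf's class label."""
    while node.label is None:
        value = instance.get(node.attribute)
        if value is None:
            # Missing value: follow the branch that most training instances took.
            key = max(node.branch_counts, key=node.branch_counts.get)
        elif node.threshold is not None:
            # Numeric attribute: compare against a constant, giving a two-way split.
            key = "<=" if value <= node.threshold else ">"
        else:
            # Nominal attribute: one branch per possible value.
            key = value
        node = node.children[key]
    return node.label

# Hypothetical tree with made-up counts.
humidity = Node(
    attribute="humidity",
    threshold=75.0,
    children={"<=": Node(label="yes"), ">": Node(label="no")},
    branch_counts={"<=": 2, ">": 3},
)
root = Node(
    attribute="outlook",
    children={"sunny": humidity, "overcast": Node(label="yes"), "rainy": Node(label="no")},
    branch_counts={"sunny": 5, "overcast": 4, "rainy": 3},
)
```

With this tree, an instance without an outlook value is sent down the sunny branch, because that branch has the highest recorded training count. A nominal value that was never seen in training would raise a `KeyError` in this sketch.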


Classification rules

Generally, the preconditions are logically ANDed together, and all the tests must succeed if the rule is to fire. However, in some rule formulations the preconditions are general logical expressions rather than simple conjunctions.

Transforming a rule set into a tree is more difficult:

  • A tree cannot easily express disjunction between rules.
  • Example: rules that test different attributes.
  • Symmetry needs to be broken.
  • The result is the replicated subtree problem.
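The basic case, where preconditions are ANDed together and every test must succeed for a rule to fire, can be sketched like this. The rule format (a dictionary with `"if"` and `"then"` keys), the function names, and the example rules are assumptions for illustration. The sketch also assumes the rules are tried in order and the first one that fires wins, which is only one possible way to interpret a rule set.

```python
def fires(rule, instance):
    """A rule fires only if every one of its preconditions matches (logical AND)."""
    return all(instance.get(attr) == value for attr, value in rule["if"].items())

def first_match(rules, instance, default=None):
    """Return the conclusion of the first rule that fires, or a default."""
    for rule in rules:
        if fires(rule, instance):
            return rule["then"]
    return default

# Hypothetical rule set, invented for illustration only.
weather_rules = [
    {"if": {"outlook": "sunny", "humidity": "high"}, "then": "no"},
    {"if": {"outlook": "overcast"}, "then": "yes"},
]
```

For example, a sunny instance with normal humidity satisfies only one of the two preconditions of the first rule, so that rule does not fire.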

One reason why rules are popular is that each rule seems to represent an independent “nugget” of knowledge. New rules can be added to an existing rule set without disturbing ones already there, whereas to add to a tree structure

Continue reading "Output: Knowledge Representation"

Input: Concepts, Instances, and Attributes

--Originally published at Enro Blog

The input takes the form of concepts, instances, and attributes. We call the thing that is to be learned a concept description.

  • The information that the learner is given takes the form of a set of instances

Four basically different styles of learning appear in data mining applications.

  • In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
  • In association learning, any association among features is sought, not just ones that predict a particular class value.
  • In clustering, groups of examples that belong together are sought.
  • In numeric prediction, the outcome to be predicted is not a discrete class but a numeric quantity.

Regardless of the type of learning involved, we call the thing to be learned the concept and the output produced by a learning scheme the concept description.

Classification learning is sometimes called supervised because, in a sense, the method operates under supervision by being provided with the actual outcome for each of the training examples.

This outcome is called the class of the example.

When there is no specified class, clustering is used to group items that seem to fall naturally together.

The success of clustering is often measured subjectively in terms of how useful the result appears to be to a human user. It may be followed by a second step of classification learning in which rules are learned that give an intelligible description of how new instances should be placed into the clusters.

What’s in an example?

These instances are the things that are to be classified, associated, or clustered. Although until now we have called them examples, henceforth we will use the more specific term instances to refer to the input. Each instance is an individual, independent example of

Continue reading "Input: Concepts, Instances, and Attributes"

Introduction to data mining

--Originally published at Enro Blog

What is data mining?

In data mining, the data is stored electronically and the search is automated, or at least augmented, by computer.

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.

How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern:

  • as a black box whose innards are effectively incomprehensible, and
  • as a transparent box whose construction reveals the structure of the pattern.

Such patterns we call structural because they capture the decision structure in an explicit way.

Machine learning

Things learn when they change their behavior in a way that makes them perform better in the future.

domain knowledge

Market basket analysis is the use of association techniques to find groups of items that tend to occur together in transactions, typically supermarket checkout data.
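As a very small sketch of the idea, the snippet below counts how often each pair of items appears in the same transaction and keeps the pairs that occur at least a given number of times. Real association-rule learners do much more than this; the function name `frequent_pairs`, the `min_support` parameter, and the baskets are assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort and deduplicate so each pair is counted once per basket, in a stable order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical checkout data, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]

pairs = frequent_pairs(baskets, min_support=3)
```

In these four baskets, beer and diapers are the only pair that appears together three times.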

What’s the difference between machine learning and statistics? Cynics, looking wryly at the explosion of commercial interest (and hype) in this area, equate data mining to statistics plus marketing. In truth, you should not look for a dividing line between machine learning and statistics because there is a continuum (and a multidimensional one at that) of data analysis techniques. Some derive from the skills taught in standard statistics courses, and others are more closely associated with the kind of machine learning that has arisen out of computer science. Historically, the two sides have had rather different traditions. If forced to point to a single difference of emphasis, it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been

Continue reading "Introduction to data mining"

A new machine learning challenge for the upcoming semester.

--Originally published at Enro Blog

As the semester starts, a new challenge has appeared. I was assigned the task of creating software capable of filtering the information in the abstracts of research papers, in order to classify them and build a network of people who are working on more or less the same topics. Universities seem to lack funding for every single researcher, since research has grown a lot in the last couple of decades, so software capable of bringing together professionals with the same interests could potentially reduce research costs. In the upcoming posts I’ll summarize my machine learning studies, findings, and understandings.

It was a year ago

--Originally published at Swedish House Troko

They say the best way not to forget something is to practice a lot. To be honest, after a while everything you know just turns to dust in your head.

I bike to school every day and I wake up early for the first class. This is something I remember from Swedish class (for some reason); everything else is something Google can do. I guess I usually don't know when I'm translating something in the app. I talk about everything except what I want to talk about.

Finally, I just want to tell the world (which is a few people, since nobody here speaks Swedish) that I love my country, but there are so many problems that I just want to get away.


It Means Danger

--Originally published at Internet: LA comodidad por delante

We have reached the end of this series of posts, and I would like to recap the most important point I made in each of them. Security depends more on yourself than on others. This is because of the second point I made: people care more about their convenience than about their security.

That is why it is super important to know both why protecting yourself matters and how to do it. The two should not be treated as mutually exclusive: what good is knowing one without the other? In case you don't remember, here are some of the things I would have liked to stick in your minds:

  • If it is not encrypted, it means danger

  • Protect yourself by using VPNs
  • Find out whether a provider protects your password correctly
  • Don't reuse passwords
  • Trust no one; always stay alert

If any of these things ended up tattooed in your minds, I'd say I did my job. Remember that although convenience is very tempting, it is deceptive and not always your friend. Security, on the other hand, will keep you a little calmer and more at ease.

Better safe than sorry.

97 Things Every Software Architect Should Know

--Originally published at Rudy's Corner

This is my list of the 30 out of the 97 that I identified with the most. If you want to read all 97, here is the link to the book.

So, let’s begin

1) Don’t put your resume ahead of the requirements

This first one is kinda obvious (most of them are, tbh), but not really. The important thing to underline here is that it matters more to use the technology that best fits the user, even if it is not the most challenging one. It is better to have something that works perfectly, built with an “easy” technology that doesn’t look that good on your resume, than something complicated that looks “pretty” on it.

At the end of the day, it is way better for your career to have happy customers than an “impressive” CV. I mean, if you have happy customers, they might tell their friends about you, and before you know it, you have more work and a better reputation.

2) Chances are, your biggest problem isn’t technical

It is very easy to blame the technical side when a project fails, buuuuuut in most cases that is not why it failed; it failed because of the people involved in the project. People are what make the project, and if you can’t communicate with the ones who are not performing as well as the others, then your project will probably fail. Conversations are key, and that’s what I’ll talk about next.

3) Communication is king; clarity and leadership its humble servants

To have good communication between the Software Architect and the Developers, the first thing to do is to be there with them; don’t sit at the top of the tower as if you were above them. Next, you will need to use clarity

Continue reading "97 Things Every Software Architect Should Know"

The end

--Originally published at Rudy's Corner

The semester is done, my 8th semester studying ISC came to an end and I don’t know how to feel about it, but that is not what this post will be about. I will talk about what I learned throughout the Security course. This won’t be a huge blog post, I’ll just point out a few things that I am taking from this course.

  • Update: keep all my devices updated so that I have the latest security.
  • Backup: keep a lot of different backups, so that if something fails with the computer you won’t lose the data.
  • Layers: it is of great importance to have as many layers of security as possible, so it becomes harder and harder to break into your system.
  • Trust no one (or almost no one): check who you’re giving permission to see your social media, who sees what, and why they see it.
  • Stay up to date: keep checking the new things that come up, the new technologies, the new threats.

And those are the things that I learned the most from this course. Thanks for reading, it has been a great ride!


YouTube’s way of flagging videos

--Originally published at Rudy's Corner

In the last few months, right after the series of “Adpocalypses” that YouTube faced, YouTube decided to create a new algorithm to flag videos that weren’t “advertiser friendly”. This new system worked, well, kind of, depending on who you ask. According to the Google CEO, this new system removed over 8 million videos from YouTube; of that number, 6.8 million were first flagged by the computers, which is about 85% of the total if we take it as 8 million.

Now the question is, does it really work? Does it actually flag only bad content, or does it also take down content that follows YouTube’s policies? That is the core of the problem: the system is not perfect (nothing really is), yet YouTube is leaning heavily on it to catch the “bad” videos. I am sure that a lot of the videos that get flagged really are not advertiser friendly, but a fair share of them are okay.

Here is where we get to machine learning and whether it will be good or not. No matter how much we try, we won’t be able to program emotions or common sense into a computer (at least not any time soon). So, how much should we depend on a computer to do work where a lot of common sense is involved? Yes, the systems do the work for us, but at what cost?

In my opinion, we should keep a check on these systems and monitor their behavior, with us humans making sure that the computer actually does what it is supposed to do without harming others, just as YouTube is trying to do. I am a little scared of what a system like that could do unchecked, but I hope that never happens.