Disclaimer: Be aware that this topic is more theoretical than most of our other blog posts. After all this is NOT an easy subject to comprehend. But hang in there, because this will be an important building block before we can make our own AI agent.

## Machine Learning

In the last AI blog post we looked at Bayes Networks, where we set up graph-like structures that we derived from known probability values. In this post we will look at Machine Learning, which learns the probability values and graphs from the data. Learning models from data is a major area of AI. Many of the models in Machine Learning are fitted based on data from **learning**. [Intro to Artificial Intelligence – https://classroom.udacity.com/courses/cs271]. We develop something called an **agent**, which is an entity (such as a software). The agent perceives the world surrounding it through input data (such as sensors) and/or previous data. It then thinks intelligently and acts upon that world with the aim of achieving goals that are set by us, its creator [https://en.wikipedia.org/wiki/Intelligent_agent].

## What Do We Mean by Learning?

First things first: Why would we want an agent to learn? Why can’t we just tell it what to do? Well, when we develop an agent to perform a certain task, such as predicting the weather tomorrow, it’s not certain that we can anticipate every single situation that our agent might find itself in. The design might not take into account all the possible changes over time, and maybe the most important factor, we, the human programmer, might not have a clue of how to program an agent to solve the problem for us, ourselves.

Any components of an agent can be improved by **learning** from **data**. The improvements, and the techniques used to make them, depend on four major factors:

- Which component is to be improved
- What prior knowledge does the agent already have
- What representation is used for the data and the component
- What feedback is available to learn from

#### Components That Can Be Learned:

- Learning relevant properties about the world, from input
- Learning information about which actions are more “desirable” than others
- information about how the agent’s surroundings evolve, and the results of possible actions the agent can take
- and many more

#### Representation and Prior Knowledge:

Representation and prior knowledge is often seen in terms of input and output pairs. The input values are often a vector of attributed values, and the output value can be either a continuous numerical value or a discrete value.

#### Feedback

And at last, feedback. Every Machine Learning method is driven from some type of feedback information that you care about. There are three types of feedback that determine the three main types of learning:

- Supervised learning – where the agent observes an input-output pair and learns functions that maps from input to output.
- Unsupervised learning – where the agent learns patterns in the input, even though no explicit output is supplied.
- Reinforcement learning – the agent learns from feedback from a physical environment, through rewards and punishments.

In this blog post we will focus on supervised learning and come back to unsupervised and reinforcement learning later.

## Supervised Learning

Before we delve into the topic of supervised learning, what is it that we want to learn? In the last blog post we looked into Bayes Networks, and saw there was parameter values for the probabilities of certain events occurring that we defined as “known”. These parameters can be learned from Machine Learning, and the structure of the network, such as the arrows pointing from parent to child (see last post for further explanation). By learning the values of the parameters and the structure of a network, you could even discover hidden concepts, such as a correlation between nodes that you did not see before.

So, the main task of supervised learning is this:

In each task you try to solve, you have data, or what we call a training set, which we see as a **feature vector of different parameters**, x1, x2, x3,..xN, and **a target label, y**.

Let’s clarify this with an example:

Say you are trying to predict if a sports team will win a game. x1 could be the outcome of the last five times the two teams met, x2 if there are any injuries in your team, x3 the league position of your team relative to the opponent, and so on. Supervised learning looks at all the previous data (x1, x2,..) and see how they correlate with the outcome y.

x1 x2 x3 … xN -> y

In other words, y is the predictor, which in this example is if your team will win or not.

Machine Learning is typically used on already collected input data and compared with the result of the target label, meaning, if your team won or not. So, by learning from this passed data, we try to obtain a function that could help us predict future games. In supervised learning, you’re given many such examples.

Where the total of all the parameters is the data, and each input vector is xM. Our goal is to find a function so that with any xM vector, we can predict, as close as possible, the target label yM.

f(x_{M}) = y_{M}

In conclusion, the whole point of Machine Learning is to identify the function f(), so that you can predict future cases.

When the hypothesis of our function coincides well with our input data and predicted targets, we say that the hypothesis generalizes well.

When the output y has a finite set of values, such as win, draw, loose, the learning problem Is called **classification, **and the output is binary, and if the question is what is the y temperature tomorrow, the problem is called **regression**, which means it’s continuous.

#### Example

Take a look at the graphs below.

Which function f() is correct? Clearly the first example is wrong, but both the middle and the right image can be correct. This is a typical of fitting a function of single variables to some data points. The difference can be from a quite simple 2^{nd} degree polynomial in the middle to a very high degree polynomial on the right. This leads to a fundamental problem, how do we choose the degree of complexity? According to the English philosopher William of Ockham: *an answer to prefer is the simplest hypothesis consistent with data*. This famous principle is called Ockham’s Razor. So what does this mean? Well, it means that there is a tradeoff between complex hypotheses that fit the data, and simpler hypothesis that may generalize better. We’re not going to go close into this, but for the reader truly interested in Machine Learning, we recommend to read http://www.cs.utah.edu/~zhe/pdf/lec-14-pac-occam.pdf.

## Spam filter

The example so far has been a bit theoretical, so let’s look at one we can all relate to – a spam filter.

Look at these emails:

We’ve all received loads of these emails before, so it’s pretty obvious for us today to distinguish between SPAM from HAM. And why is that? Because we have received so much spam over the years. In other words, we have so much data on the subject, and that’s the key. Our spam filters today are based on previous data, and use Machine Learning to **discriminate** between spam and ham – **using Bayes Networks**. This **discrimination** is our function that the machine learns from all previous data we have on spam email. It can be key words such as FREE, all caps letters, contain adds, come from the same email address as thousands of others have received from. Meaning, we have our input vectors, and the target label (spam/ham) can be decided by us, where we help and mark email as spam, or mark the spam as ham if the filter got it wrong.

#### Example – Counting Method

Let’s look at some examples of SPAM and HAM messages:

Our database of spam and ham emails has a dictionary size of 16 (you can count yourself), meaning that there are 16 different words in our database of 7 messages. So, if we draw a message randomly, we’ll have the probability of P(S) = 3/7.

#### Example – Mathematical Method

Say you build up your database full of emails that are spam and ham, and you end up with 3 spam emails and 4 ham, 7 in total:

SSSHHHH

And we want to find out what is the probability of getting a spam at random, P(S), that fits best with the probability distribution of the data. This is called the maximum likelihood. [https://metacademy.org/graphs/concepts/maximum_likelihood].

We’ll say that the probability of getting a spam is: P(S) = x

The probability of your i’th data item is:

So lets find the maximum likelihood where we define spam as 1 and ham as 0

1110000

And plug it into our probability:

P(y_{i}) = x^{yi}*(1-x)^{1-yi}

(If you’re unsure of this step, try plugging in 1 and 0 for yi, and you’ll see that it is equivalent to the probability of I’th data above)

Lets see what we get for the entire data set of all 7 emails. We see that the joint probability of all data items is the product.

P(data) = ∏p(y_{i}) = x^{3}*(1-x)^{4}

We now want to find the x that maximizes the expression, where we start by taking the logarithm:

Log P(data) = 3log(x) + 4log(1-x)

We take the derivative and set it to zero, to find its maximum:

Log P(data) = 3log(x) + 4log(1-x)

0 =3/x – 4/(1-x)

Which gives us

X = 3/7

#### Example – Maximum Likelihood for Words

We’ve now seen what is the probability for getting a spam message on random, but what is the probability for getting a spam email for a specific keyword? Let’s look at the word “loan”, which appears in both our spam and ham folder, and see what the probability is for it being spam and ham:

P(“Loan”|S): 3/10 – 3 words with loan out of 10 words total

P(“Loan”|H): 1/12 – 1 word with loan out of 12 words total

## Maximum Likelihood in Bayes Network

So, what we have done now is really that we’ve been building up a Bayes Network. The parameters of the Bayes Network are estimated using supervised learning by a maximum likelihood estimate, based on the training data.

The root of the Bayes Network is spam, which is an unobservable variable, and can only have binary result – spam or no spam. The root has as a child node for every word in a message, where the children have an identical conditional distribution of the words occurrence, given spam or not spam. In this Bayes Network we assume that all the words are independent of each other, and just dependent on the root: spam or ham. This type of assumption is called **Naïve Bayes**. You can learn more about Naïve Bayes at https://web.stanford.edu/class/cs124/lec/naivebayes.pdf.

Our dictionary had 16 different words, and if we look at the word “loan”, we have “loan” given that it is a spam, would be:

P(“Loan”|S) = 3/10 and

P(“Loan”|H) = 1/12

So, this means that for any dictionary size, we need

- 1 parameter for P(Spam)/P(Spam’)
- 1 parameter per P(wi|S)
- 1 parameter per P(wi|H)

If you have a dictionary of 16 words you would need 1 parameter for P(S) or P(H) and 15+15 parameters for P(wi|S) and P(wi|H). You only need 15, since the 16^{th} parameter can be figured out through 1-∑P (the 15 parameters) = the 16^{th} parameter. In total 31.

We wrote earlier in the example above that the probability for the word “loan” if it was spam, but a more realistic example would be the probability of the word “loan” being spam:

P(S|”loan”)

From last blog post we learned that Bayes Rule was defined as:

So, through Bayes rule we get:

This was just one word, so let’s try to up the complexity, and look at an entire phrase:

“Remember get loan”

We try again using Bayes Rule:

0But this results in zero, since «remember” isn’t a word that occurs in the spam database… We obviously can’t have a filter that becomes zero just because of introducing one new word. So, what’s wrong? We’re **overfitting**! As you saw in the graphs above, we want to find a smooth function that best fits the most data points approximately, and not an overfitted function that exactly describes every single point. Luckily, there are several techniques to deal with this.

## Laplace Smoothing:

The easiest way to describe Laplace Smoothing is this: try to pretend you saw every outcome k more times than you actually did: P_{MAX}

In the maximum likelihood we used the formula:

where N is the total number of occurrences of x. In Laplace Smoothing, we add a value, k, to the count, and normalize by adding k to every class (spam/ham) we estimate something over:

Where |X| is the number of classes of x, which is in our case two for spam and not spam. For proof of this equation, see https://people.eecs.berkeley.edu/~stephentu/writeups/dirichlet-conjugate-prior.pdf.

Let’s try the previous example where k=1:

P(Spam) =

Our dictionary has a total of 16 words, thereby |X| = 16.

And so on…

Then our result will be:

Which equals 0.40. A much more realistic probability than 0.

#### But which k-value should I pick?

We can now take any vector of either emails, phrases or words and see how they correlate to our target label y – if they are spam or not spam, and set up a Bayes Network. We can then use maximum likelihood and a Laplace Smoother with Bayes Rule to figure out the probability of getting the target label y, given the information of word, phrase, etc. This is called a **generative** model.

We’re done! We can now use a dataset to make a Bayes Network, find the probability and correct for overfitting. Except one thing – what value should you set to k? The whole point of k was to prevent overfitting, so how do find the optimal k for each problem?

You find your k through **cross validation**! This is done by taking all of your data and splitting it into 3 parts:

- 80% in training data
- 10% in cross validation
- 10% for test data

You first use your training data to find all the parameters you need, and the cross validation data to find the optimal k. This is done by using your training data for different values of k, and observe how well the training model performs on the cross validation data, and maximize over all of the k’s to get the best performance on the cross validation data. You iterate over the entire training set, for example 10 times, to find the best k. when you’ve found the best k, you train with that k and test it with the test data to verify its performance. It is important to not train on the test data, because if you do, the k value will be affected by the test data, and you might overfit the your function. This model is used by pretty much anyone in machine learning. Enjoy!

## Bonus round: What about continuous functions?

We’re nearing the end of this post, and we would just like to clarify an important aspect of supervised learning. So far our example of has been mostly on discrete binary values, meaning spam or not spam, while a in many cases you want a continuous result, such as the weather. In a continuous result, f(x) = y, might be:

- y = ω
_{1}x+ ω_{0}– linear result - y = ω
_{2}^{2}x+ ω_{ 1}x+ ω_{0}– 2nd degree polynomial - And so on

We’re not going to derive those equations, this post has been long enough, but check out http://www.cs.cmu.edu/afs/andrew/course/15/381-f08/www/lectures/regression.pdf to truly understand the power of supervised learning. This is how we predict the unknown.

(Examples in this post are either taken or influenced from https://classroom.udacity.com/courses/cs271 and UC Berkeley CS188 Intro to AI)