This post is the first in a series on Artificial Intelligence (AI), where we will investigate the theory behind AI and work through some practical examples. The first, and perhaps most important, part of this series is on probability, where we will look at the fundamentals underlying AI. One of the most important uses of probability in AI is structuring probability distributions using something called Bayes Networks. We’ll investigate this through a medical example.
Case: Liver Disorder
Imagine that you’re a researcher investigating whether a patient has a liver disorder. Now what could be the cause of this liver disorder? Well, if you open a medical journal you will see that gallstones could be a cause, a history of hepatitis could be another, it could be alcoholism, or one of many others. These causes may initially be unobservable. You can’t check every single patient who comes into the hospital for gallstones, hepatitis and a habit of enjoying a few too many at home, just on the off chance that they may have a liver disorder.
Well, the medical journal also says that a symptom of gallstones is upper abdominal pain, hepatitis can be seen in a blood test and alcoholism can show up as iron deficiency. Finally, some observable symptoms. So now, if a patient shows the symptoms mentioned above, we can say with a higher probability that there are problems with gallstones, hepatitis or alcoholism. In turn, a higher probability of gallstones, hepatitis or alcoholism raises the probability of a liver disorder.
And what does a liver disorder cause? It can cause fatigue, body hair loss, an enlarged spleen, etc. What we end up with is a network, a Bayes Network, of cause and effect based on probability, which explains a specific case given a set of known probabilities. In other words, a Bayesian Network is a network that can explain quite complicated structures, like the causes of a liver disorder in our example.
A Bayesian Network is composed of nodes, where the nodes correspond to events that you might or might not know. They’re typically called random variables, which may be discrete or continuous. These nodes are connected by arrows, and if there is an arrow from X to Y, X is said to be a parent of Y. Each node Xi has a conditional probability distribution P(Xi|Parents(Xi)). Together, the graph and these conditional distributions define a probability distribution over all the random variables in the network.
The Bayesian Network for the liver disorder has a total of 94 variables! What the graph structure and the associated probabilities specify is a huge probability distribution in the space of 94 variables. If the variables are binary, which we will assume throughout this example, an unstructured representation of that distribution needs 2^94-1 probability values. That’s a lot! This is where the Bayesian Network is key. Its advantage is how compactly it represents a probability distribution, such as this very large Joint Probability Distribution (JPD), compared to an unstructured, non-graph representation. Just to clarify, the JPD gives the probability of every possible event, where an event is one combination of values of all the variables. It’s getting a bit more complicated now, so let’s freshen up on some probability theory from high school/college.
Theory: Quick intro to probability
We assume the reader has a decent background in probability and statistics, but let’s repeat some of the notation anyway:
- P(A) – probability of event A
- P(A’) = 1 – P(A) – Complementary probability of P(A)
- P(A ∩ B) – Probability of events A and B
- P(A ∪ B) – Probability of events A or B
- P(A|B) – Probability of event A given event B occurred.
- A⟂B – A and B are independent of each other. If A⟂B, then P(A,B) = P(A)*P(B), where P(A,B) is another way of writing P(A ∩ B).
Perhaps the most important rule in AI is Bayes Rule, named after Thomas Bayes, a British mathematician. Bayes Rule is stated as follows:
P(A|B) = P(B|A)*P(A)/P(B)
Often we have a pretty good understanding of the probability of B given A, but not of the probability of A given B. Bayes Rule lets us calculate “backwards”, so to speak: in our liver disorder example, from an observed symptom back to the probability of its cause.
Let’s try out Bayes Rule with another example:
Say that you want to find out if you are allergic to gluten, but you can’t observe it with your eyes (unobservable), so you have to perform a test.
The probability of you being allergic to gluten is only 2%. If you are allergic, the probability that the test detects it is 0.8 (a positive result is denoted as ‘+’). If you are not allergic, the probability that the test still says that you are is 0.1. A negative result is denoted as ‘-’.
– To the reader: these are typical values that can be determined from data of previous testing.
So, we have the values on the left, which give us the complements on the right:
P(G) = 0.02 -> P(G’) = 0.98
P(+|G) = 0.8 -> P(-|G) = 0.2
P(+|G’) = 0.1 -> P(-|G’) = 0.9
Now, what is the probability of you being allergic to gluten, given that the test comes back positive?
P(G|+) = P(+|G)*P(G)/P(+) = 0.8*0.02/P(+).
Now, P(+) is the total probability of a positive result: the probability that you are allergic and test positive, plus the probability that you are not allergic but still test positive:
P(+) = P(+,G) + P(+,G’) = P(+|G)*P(G) + P(+|G’)*P(G’) = 0.8*0.02 + 0.1*0.98 = 0.114
Giving us the result of P(G|+) = 0.016/0.114 ≈ 0.14
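The gluten calculation can be reproduced in a few lines of Python. This is only a sketch: the function name `posterior` and its parameter names are our own choices for illustration, and the numbers are the ones from the example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(G|+) via Bayes Rule for a binary hypothesis G and a positive test.

    prior               = P(G)
    sensitivity         = P(+|G)
    false_positive_rate = P(+|G')
    """
    # Total probability of a positive result: P(+) = P(+|G)P(G) + P(+|G')P(G')
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


p_g_given_pos = posterior(prior=0.02, sensitivity=0.8, false_positive_rate=0.1)
print(round(p_g_given_pos, 2))  # 0.14
```

Even with a positive test, the probability of an allergy stays low, because the 2% prior is small compared to the 10% false-positive rate.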
Types of Bayesian networks
There are many different types of Bayes networks (see below)
We’ll take a closer look at that last one, in which A and B both point to C, and C points to D and E:
In the example above we have five random variables, where the Bayes Network defines the distribution over those five random variables – P(A,B,C,D,E). So instead of calculating all possible combinations of those five random variables, the Bayes Network is defined by probability distributions that are inherent to each individual node.
A and B are only dependent on their own variable, so their distribution is P(A) and P(B), since there are no arrows (connection) coming into them. C, on the other hand, is conditioned on A and B, so we have P(C|A,B). D and E are conditioned on C, so we have P(D|C), P(E|C).
This gives us the joint probability, represented by a Bayes Network. The joint probability is the product of the probabilities defined over the individual nodes, where each node’s probability is conditioned only on its parents, i.e. the nodes with arrows pointing into it.
P(A,B,C,D,E) = P(A)*P(B)*P(C|A,B)*P(D|C)*P(E|C)
- A and B have no incoming arrows, so they have probability distributions P(A) and P(B).
- C has two incoming arrows, so its probability is conditioned on A and B, giving us P(C|A,B).
- D and E are both conditioned on C, giving us P(D|C) and P(E|C).
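To make the factorization concrete, here is a minimal Python sketch of the five-node network. The probability values are hypothetical, made up purely for illustration, and the helper names `bernoulli` and `joint` are our own. The check at the end confirms that the product of the local distributions is a valid joint distribution, i.e. that it sums to 1 over all 32 combinations.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration (not from the post).
p_a = 0.3                                   # P(A)
p_b = 0.6                                   # P(B)
p_c_given_ab = {                            # P(C | A, B)
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.4,
    (False, False): 0.1,
}
p_d_given_c = {True: 0.7, False: 0.2}       # P(D | C)
p_e_given_c = {True: 0.8, False: 0.05}      # P(E | C)


def bernoulli(p_true, value):
    """Probability that a binary variable equals `value`, given P(variable is True)."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A)*P(B)*P(C|A,B)*P(D|C)*P(E|C)."""
    return (
        bernoulli(p_a, a)
        * bernoulli(p_b, b)
        * bernoulli(p_c_given_ab[(a, b)], c)
        * bernoulli(p_d_given_c[c], d)
        * bernoulli(p_e_given_c[c], e)
    )


# Sum over all 2^5 = 32 assignments of the five binary variables.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(round(total, 10))  # 1.0
```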
So, the definition of this setup for the joint distribution, P(A,B,C,D,E), is based on the factors above, and gives us one really BIG advantage. We know that the joint distribution over any five binary random variables requires 2^5-1=31 probability values, while our Bayes network only requires 10 probability values.
P(A) is one value, since we can derive P(A’) from it (and the same is true for B). P(C|A,B) is a distribution over C for each of the four combinations of A and B, which gives four values. P(D|C) and P(E|C) are conditioned on C and C’, which gives two values each. If we add these up we get:
1+1+4+2+2 = 10 parameters in total
- 1: P(A) (P(A’) follows from it)
- 1: P(B) (P(B’) follows from it)
- 4: P(C|A,B), P(C|A,B’), P(C|A’,B), P(C|A’,B’)
- 2: P(D|C), P(D|C’)
- 2: P(E|C), P(E|C’)
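The same count can be done mechanically. In the sketch below, the network is written as a dictionary from each node to its list of parents (a representation we chose for illustration, as is the name `count_parameters`). It assumes every variable is binary, so a node with k parents needs 2^k values.

```python
# Each node maps to the list of its parents (the nodes with arrows into it).
parents = {
    "A": [],
    "B": [],
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C"],
}


def count_parameters(parents):
    """Number of probability values needed when every variable is binary.

    A node with k binary parents needs one value per parent combination: 2**k.
    """
    return sum(2 ** len(node_parents) for node_parents in parents.values())


network_params = count_parameters(parents)   # 1 + 1 + 4 + 2 + 2
full_joint_params = 2 ** len(parents) - 1    # unstructured joint over 5 binary variables

print(network_params, full_joint_params)  # 10 31
```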
If the calculation of the 10 parameters above seemed a bit abstract, take a look at the example below, taken from UT Austin – CS 343:
Here we are looking at the JPD of an alarm going off, where the causes may be a burglary and/or an earthquake, together with the probabilities that John or Mary calls to check in.
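Here is a rough Python sketch of the alarm network. One loud caveat: the probability values below are the ones commonly quoted for this example in Russell and Norvig’s textbook. We assume they match the example, but they are an assumption rather than numbers taken from this post. The function names are our own. The sketch computes one entry of the JPD, and then calculates “backwards” to the probability of a burglary given that both John and Mary call.

```python
from itertools import product

# Assumed values (commonly quoted textbook numbers, not taken from this post).
P_BURGLARY = 0.001
P_EARTHQUAKE = 0.002
P_ALARM = {  # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95,
    (True, False): 0.94,
    (False, True): 0.29,
    (False, False): 0.001,
}
P_JOHN = {True: 0.90, False: 0.05}  # P(JohnCalls | Alarm)
P_MARY = {True: 0.70, False: 0.01}  # P(MaryCalls | Alarm)


def prob(p_true, value):
    """Probability that a binary variable equals `value`, given P(variable is True)."""
    return p_true if value else 1.0 - p_true


def joint(burglary, earthquake, alarm, john, mary):
    """One entry of the JPD, as a product of the five local distributions."""
    return (
        prob(P_BURGLARY, burglary)
        * prob(P_EARTHQUAKE, earthquake)
        * prob(P_ALARM[(burglary, earthquake)], alarm)
        * prob(P_JOHN[alarm], john)
        * prob(P_MARY[alarm], mary)
    )


def burglary_given_both_call():
    """P(Burglary | JohnCalls, MaryCalls), summing out Earthquake and Alarm."""
    weights = {
        b: sum(joint(b, e, a, True, True)
               for e, a in product([True, False], repeat=2))
        for b in (True, False)
    }
    return weights[True] / (weights[True] + weights[False])


# The alarm rings and both call, with neither a burglary nor an earthquake.
p_false_alarm_calls = joint(False, False, True, True, True)

print(f"{p_false_alarm_calls:.6f}")         # 0.000628
print(f"{burglary_given_both_call():.3f}")  # 0.284
```

Note that the network has the same shape as the A, B, C, D, E example, so it also needs only 10 probability values.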
Advantage of Bayesian networks
In the previous section, we saw that we only need 10 probability values, compared to 31 for an unstructured, non-graph representation. That might not seem like much of a difference, but the gap grows quickly with the number of variables, so the compact representation scales significantly better to larger and more complex problems. This is a key reason why Bayes Networks are used so extensively.
So, what does this look like when it gets really complicated? Let’s look back at the first example on the liver disorder.
There is no way we can work with 2^94-1 probability values, but we now have the tools to specify the joint probability through a Bayes Network instead. After counting through the different nodes, we find that we need only 231 probability values to specify the joint probability of the liver disorder network. At the risk of repeating ourselves, this is one of the main advantages of using a compact Bayes Network representation instead of an unstructured joint representation.
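To get a feel for the size of that gap, here is a quick back-of-the-envelope check in Python. The figure of 231 is simply the count quoted above, and the variable names are our own.

```python
num_variables = 94

# Unstructured joint distribution over 94 binary variables.
full_joint_values = 2 ** num_variables - 1

# Count quoted in the text for the liver disorder network.
network_values = 231

ratio = full_joint_values / network_values

print(f"{full_joint_values:.3e}")  # 1.981e+28
print(f"{ratio:.2e}")
```

Python integers have arbitrary precision, so `2 ** 94 - 1` is computed exactly before it is formatted.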
This was a very short and simple introduction to Bayes Networks, where part of the material comes from Artificial Intelligence – A Modern Approach (Russell, Norvig), http://www.ee.columbia.edu/~vittorio/lecture12.pdf , https://www.youtube.com/channel/UCshmLD2MsyqAKBx8ctivb5Q/feed and https://classroom.udacity.com/courses/cs271 , which we strongly recommend for everyone interested in pursuing this subject.
Before signing off, we want to introduce one of many practical applications for Bayes Networks in AI, namely, a Bayesian Neural Network (BNN). While the Bayesian Networks above have a fixed probability value for each event, from which one can derive the probability of a desired end state, BNNs learn the probability values from data, optimizing the mapping from the input data to the output (or action). This is a bit tricky, so let’s scale it back a bit. We often have to make decisions based on our best guess, which can be based on imperfect and/or incomplete information. So, it’s not that unreasonable to think that the best way to make decisions based on such uncertain data is to keep good track of the uncertainty, i.e. the probability distribution. What this means is that we incorporate our prior information into the model.
Our data is used as input to the neural network, which consists of layers of nodes set up to give an optimal output value based on the input. But that’s a big new topic, too big for this post, so it will be covered later in our AI series. Stay tuned.