Friday, October 23, 2009

On Bayesian Belief Propagation

Towards a Mathematical Theory of Cortical Micro-circuits [Dileep George & Jeff Hawkins, PLoS Computational Biology, October 2009, Volume 5, Issue 10] assumes some knowledge of Bayesian Belief Propagation. I was hoping the article would explain this topic as it went along, but apparently, to AI specialists, this is a well-known area. I found that the Wikipedia entry, belief propagation, jumps into mathematical formalism rather quickly. Some words of explanation without math symbols would probably help many people who are interested in machine understanding; that is my goal here. If some of you Bayesian belief propagation experts think I have it wrong, or want to add anything, feel free to add comments. I have ordered a copy of the classic work on the subject, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, and so hope to be more fluent in this area at a later date.

Bayesian Belief Propagation (BBP) was formulated by Judea Pearl, who later authored Probabilistic Reasoning in Intelligent Systems. It appears to be a system for determining unknown probabilities within certain types of complex systems. Let's start with a review of a very simple probability model.

Much of early probability work was formulated in terms of a pair of ordinary gambling dice, six-sided cubes with faces numbered one through six. Thrown many times, they make a nice distribution pattern of numbers two through twelve. It can be shown with a little math and logic that there is an expected percentage for each of these numbers. If you make a large number of dice throws, say more than 100, and you get results that differ very much from the expected distribution, you may be looking at one of two things. It is possible that you have a distribution skewed purely by chance. On the other hand, you might want to investigate whether your dice and throwing method are physically biased rather than truly random.
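To make the expected distribution concrete, here is a quick simulation sketch in Python (my own illustrative code; the variable names and the throw count are arbitrary). For two dice there are 36 equally likely pairs, and 6 − |s − 7| of them sum to s, so a large batch of throws should track those fractions:

```python
import random
from collections import Counter

random.seed(1)

# Simulate many throws of two six-sided dice and tally the sums.
throws = 10_000
counts = Counter(random.randint(1, 6) + random.randint(1, 6)
                 for _ in range(throws))

# Theoretical probabilities: 36 equally likely (die1, die2) pairs,
# and the number of pairs summing to s is 6 - |s - 7|.
for s in range(2, 13):
    expected = (6 - abs(s - 7)) / 36
    observed = counts[s] / throws
    print(f"sum {s:2d}: expected {expected:.3f}, observed {observed:.3f}")
```

With 10,000 throws the observed fractions land close to the theoretical ones; a box of dice that consistently strays far from them is the "investigate your dice" case described above.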

Now imagine a black box, say the size of a shoe box. If you shake the box and then stop shaking, it gives you a number. In fact, imagine a room with a variety of such boxes. You shake them, keeping track of the resulting numbers for each. If one box consistently gives numbers between 2 and 12 inclusive, and the frequencies for each number match the pattern that we believe two dice would give, then we can conclude that the internal mechanism of the box, whatever it is, has a probability distribution that is created by the same logic that creates the dice probability distribution. There might even be a tiny pair of nano-dice inside.

In computer programming we tend to think in purely causal mechanics. If something does not happen as expected, we have a bug in our program. But in the case of living groups of biological neurons we cannot (so far) look at the inner mechanics of each neuron while watching how the whole group operates. However, if we can control the inputs and read the outputs of a system, then, if we can do the math, we might figure out the probabilities of individual neurons firing given known inputs from the other neurons in the system.

Formally, BBP allows us to calculate the probability characteristics for nodes we can't observe directly, if we can observe enough of the rest of the system. A node does not need to be a neuron. In an AI system, a node is a processing unit with inputs, outputs, or both, that connect it to the system of nodes in question. In our case the nodes are part of a Bayesian network, also known as a belief network. Communication (or messaging) within the network is in terms of probabilities. (See also Bayesian probability)

[Image: simple Bayesian network]

Think of a network of three people, where each person represents a node. Let there be a piece of paper we will call the message (it is blank; the paper itself is the message, in this case). Let each person have a single die and the rule: if you have the message, throw the die. If it comes up a one, a two, or a three, pass the message to the left. Otherwise, pass it to the right. If we watch the game progress, the message moves around or back and forth in a random pattern. However, this is not a Bayesian network, because it is cyclic; Bayesian networks are acyclic, and messages move only in one direction.

Now start a series of games in which the message is always handed to person 1 (node 1). Person 1 throws the die and passes the message to person 2 if the result is 3 or less, and to person 3 if the result is 4 or more. Person 2 also throws a die and simply holds the message if the result is 3 or less, but passes it to person 3 if the result is 4 or more. When person 3 gets the message, the game is over.

It is a simple game and a tad boring, but if we focus on the propagation time, the number of die throws from the beginning of a game until the end of a game, we can see a set of variations. Over the long run they should form a probability distribution of the times.

Now change the rules slightly. Say person 2 forwards the message if the die comes out 2 or higher. The range of possible results does not change, but the probability distribution will.
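The two rule variants can be sketched in Python to show how the propagation-time distribution shifts (my own illustrative code; `pass_threshold` is a name I made up for person 2's forwarding rule):

```python
import random
from collections import Counter

random.seed(2)

def play(pass_threshold):
    """One game: return the number of die throws until person 3 holds
    the message. Person 2 forwards on a roll >= pass_threshold."""
    holder, throws = 1, 0
    while holder != 3:
        roll = random.randint(1, 6)
        throws += 1
        if holder == 1:
            holder = 2 if roll <= 3 else 3   # person 1 always passes
        elif roll >= pass_threshold:         # person 2 holds or forwards
            holder = 3
    return throws

for threshold in (4, 2):  # original rule, then the modified rule
    times = Counter(play(threshold) for _ in range(10_000))
    dist = {t: round(n / 10_000, 3) for t, n in sorted(times.items())[:6]}
    print(f"person 2 forwards on {threshold}+:", dist)
```

In both variants the shortest game is one throw (person 1 passes straight to person 3, probability 1/2), but when person 2 forwards on 2 or higher, the long tail of games where the message sits with person 2 shrinks sharply.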

Now suppose you know about this type of game, but you can't see the game because it is in a tent. You see a message handed in, and can time how long it is before the message is handed back out by person 3.

By keeping track of the times, you can build up the probability distribution. That should let you infer the rules of the game being played, including the probability that person 1 will give the message directly to person 3, and how long person 2 is likely to keep the message before passing it.

In more complex networks, you need to pin down the probabilities at more of the nodes, if you want to be able to characterize the probability rules for an unknown node.

That is exactly what the BBP algorithm is designed to do: calculate the probabilities for an unobservable node based on the observed nodes.
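On a small enough network, the belief-propagation calculation collapses to Bayes' rule. The sketch below is my own toy two-node example, not the algorithm from the paper: a hidden node holds a belief over which rule set is in play, an observed node reports evidence, and the message passed back is a likelihood that gets multiplied into the prior and normalized. All the numbers are made up for illustration:

```python
# Hidden node C: which rule set the players in the tent are using.
# Observed node E: whether the message came back out "fast" or "slow".
p_c = {"rules_A": 0.5, "rules_B": 0.5}          # prior belief about C
p_e_given_c = {                                  # likelihood of evidence
    "rules_A": {"fast": 0.7, "slow": 0.3},
    "rules_B": {"fast": 0.2, "slow": 0.8},
}

def posterior(evidence):
    """One belief-propagation step on a two-node network: the observed
    node sends a likelihood message to the hidden node, which combines
    it with its prior and normalizes (this is just Bayes' rule)."""
    unnormalized = {c: p_c[c] * p_e_given_c[c][evidence] for c in p_c}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

print(posterior("fast"))  # belief shifts toward rules_A
```

In a larger network the same multiply-and-normalize step happens at every node, with messages flowing along the edges; that is the part Pearl's algorithm organizes.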

Saturday, October 17, 2009

Bottom Up Machine Understanding

The first paragraph of "Towards a Mathematical Theory of Cortical Micro-circuits" [Dileep George & Jeff Hawkins, PLoS Computational Biology, October 2009, Volume 5, Issue 10] states:

Understanding the computational and information processing roles of cortical circuitry is one of the outstanding problems in neuroscience. ... the data are not sufficient to derive a computational theory in a purely bottom-up fashion

My own cortex, probably not wanting to give itself a headache by proceeding too rapidly into what looks like a dense and difficult paper, immediately drifted off into thoughts on deriving a computational theory in a purely bottom-up fashion.

The closer we get to the physical bottom, the easier it seems to be to understand that the project might work. Suppose we could model an entire human brain at the molecular level. We imagine a scanner that can tell us, for a living person who we admit is intelligent and conscious, where each molecule is to a sufficient degree of exactitude. We also would have a computational system for the rules of molecular physics, and appropriate inputs and outputs.

Unless you believe that the human mind is not material (a dualist or idealist philosophic view), such a molecular-detail model should run like a human brain. At first it should think (and talk and act, to the extent the outputs allow) exactly like the person whose brain was scanned.

However, that does not mean scientists would understand how the brain works, or how a computational machine could exhibit understanding. Reproducing a phenomenon and understanding a phenomenon are not the same thing. The advantage of such a molecular computational brain model would be that we could run experiments on it in a way that could not be done on human beings or even on other mammals. We could start inputting and tracing data flows. We could interrupt and view the model in detail at any time. We could change parameters and isolate subsystems. Perhaps, further in the future, such a model could even be constructed without having to scan a human brain for the initial state.

At present, for a bottom-up approach that might actually be workable in less than a decade, we would probably want to do a neuron-by-neuron model (probably with all the non-neural supporting cells in the brain as well). However, a lot of new issues arise even at this seemingly low level, even if we presume we have some way to scan into the model all of the often-complicated axon-to-dendrite paths and synapses. If learning is indeed based on synapse strength (the Hebbian hypothesis), we would need both a good model for synapse dynamics and a detailed original state of the synapses. This would require modeling the synapses themselves at the molecular level, or perhaps one level up, at some molecular-aggregate level. In effect, it would not be possible to model an adult brain that has exhibited intelligence and understanding. We would need to start with a baby brain and hope that it would go through a pattern of neural information development similar to that of a human child.

A complete neural-level model would be much easier to test intelligence hypotheses on than a molecular-level model. It would not in itself indicate that we understand how humans understand the world. By running the model in two parallel instances (with identical starting states), but with selected malfunctions, we could probably isolate most of the sub-systems required for intelligence. This should help us build a comprehensible picture of how intelligence operates once it is established, and of how it can be constructed by neural circuits from whatever the raw ingredients of intelligence turn out to be.

Despite our lack of such complete bottom-up models, I don't think it is too early to try to reconstruct how the brain works, or how to make machines intelligent. The paper outlines the HTM approach to this subject. HTM was based on a great deal of prior work in neuroscience and in modeling neural aggregates with computers. Often in science success has come from the combination of bottom-up and top-down approaches. Biological species, and fossil species, were long catalogued and studied before Darwin's theory of evolution revealed the connections between species. Darwin did not invent the concept of evolution of species, or of inherited traits, which many scientists already believed were shown by the fossil record and living organisms. He added the concept of natural selection, and suddenly evolution made sense. The whole picture popped into view in a way that any intelligent human being could see.

Wednesday, October 14, 2009

Markov Chains for Brain Models

I got an email from Numenta today, telling me that Dileep George and Jeff Hawkins had a paper, "Towards a Mathematical Theory of Cortical Micro-circuits", published in the PLoS Computational Biology journal. Part of the abstract states:
“we describe how Bayesian belief propagation in a spatio-temporal hierarchical model, called
Hierarchical Temporal Memory (HTM), can lead to a mathematical model for cortical circuits. An HTM node is abstracted using a coincidence detector and a mixture of Markov chains.”

HTM is the model being used by Numenta and others to crack the machine understanding problem by making an abstract, computable model of the human cortex (brain). I figured the article would explain “Bayesian belief propagation”, which parses into three words I understand. I knew I had run into the term “Markov chains” before, in probability theory and elsewhere. I thought I would just refresh my memory about good old Andrey Markov with Wikipedia. But the Markov chains article at Wikipedia was tough going, jumping into the math without clearly explaining it in English. The definition there was comprehensible: “a random process where all information about the future is contained in the present state.” That implies there is a set of rules, expressed as probabilities, so that if you know the current state of a system you can predict its future state probabilities. It neatly combines determinism with indeterminism. For a first impression, I found the Examples of Markov Chains article more helpful.

I thought I might find a better explanation on my book shelf. There was nothing in my Schaum’s Statistics, but my next guess (thank you, cortex), Schaum's Outline of Operations Research by Richard Bronson and Govindasami Naadimuthu, has an entire chapter, “Finite Markov Chains,” starting on page 369.

Markov chains are a subset of Markov processes. A Markov process consists of a set of objects with a set of states for the objects. At each point in time each object must be in one of the states. As time progresses, “the probability that an object moves from one state to another state in one time period depends only on the two states” — on the current state and the target state, not on any earlier history.

For example, in the board game Monopoly, each “square” or piece of real estate is one of the possible states, and the player's token is the object. The two dice create a probability structure, the standard one for two dice. The probability that, at any point, a player will transition to the same square, to the forward-adjacent square, or to any square beyond the 12th square forward is zero, because the dice can only roll numbers between 2 and 12. More typically, though, there would be a set of states and a probability for the transition to each of the states, including the transition of remaining in the current state.

If there are N possible states, and you know all the probabilities for transitions, you can represent and manipulate them mathematically with a matrix P, the stochastic or transition matrix. P has N rows and N columns. As in all probability theory, the probabilities in any row add up to 1.

N=2 is the simplest meaningful number of states. Let’s call them black and white. It is really only interesting if the transitions are different. Say 90% of the time black transitions to white, but only 20% of the time white transitions to black. Then the matrix P would be

.1 .9
.2 .8

There is some ability to predict the future in a statistical way if you have a Markov chain like this. Without doing any math, you would expect over time that if you have only one object, it would probably be white. If it is black, probably the next state of the sequence flips to white, but if it is white, probably the next state will be white. If you have a large number of objects, you can know the likelihood that a given percentage of them will be white or black. Yet you can’t know for sure. Even the least likely state, all objects black, can occur at any time if each black stays black but each white transitions to black.
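Here is a sketch of that two-state chain in Python (my own code, using the 90%/20% transition numbers above, with states ordered black then white). Iterating the transition matrix from an all-black start shows the distribution settling toward its long-run fixed point:

```python
# Transition matrix for the two-state example (rows sum to 1).
# Row i gives the probabilities of the next state when the current
# state is i; state order is (black, white).
P = [[0.1, 0.9],   # black: stays black 10%, goes white 90%
     [0.2, 0.8]]   # white: goes black 20%, stays white 80%

def step(dist, P):
    """One time step: multiply the state distribution by P."""
    return [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

# Start with every object black and iterate.
dist = [1.0, 0.0]
for _ in range(50):
    dist = step(dist, P)
print(dist)  # approaches roughly [0.18, 0.82]
```

The fixed point works out to 2/11 black and 9/11 white (set pi_black × 0.9 = pi_white × 0.2 and normalize), which matches the intuition above that over time a single object would probably be white, though any particular run can still deviate.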

You can see that Markov chains might be used to represent the states of neurons or of collections of neurons. In my simple examples above, each object influences only its own future state. But in real systems often future state probabilities are determined by the states of a collection of objects. It could be a collection of transistors, or a collection of nerve cells, or HTM nodes.

Tuesday, October 6, 2009

Synaptic Transmission Puzzle

"Over the past twenty years, a wide variety of modes of synaptic transmission have been discovered, in addition to simple, chemically mediated increases in permeability leading to excitation and inhibition. Such modes of transmission include electrical excitation and inhibition, combined chemical-electrical synapses, chemical synaptic changes produced by reductions in membrane permeability, and prolonged synaptic potentials mediated by chemical reactions in the postsynaptic cell."
-- page 208, From Neuron to Brain, Second Edition, by Stephen W. Kuffler, John G. Nicholls, and A. Robert Martin; 1984, Sinauer Associates, Inc., Sunderland, Massachusetts

In On Intelligence, Jeff Hawkins presents a predictive-memory model for human intelligence which he believes can serve as a basis for intelligent machines [which I call machine understanding to emphasize that the machines would not merely be mimicking surface features of human intelligence, as is the case with, for example, expert systems]. I agree with him that while neuroscience and computing science have made great progress in understanding many details of the human brain, and in writing software, we need to have an overview that allows us to make progress on the central issues that interest us. Thus, in Hawkins' model, he does not worry about the exact nature of neural synapses.

But at the very least, we should be aware of how complicated human synapses are. This should allow us a greater freedom of thought when modeling the mechanics of machine understanding than if we simply assumed all synapses work alike. I suspect the Hebbian learning model for neurons would benefit from considering that the real world may be complex, and in that complexity there may be keys to progress that we leave out with overly simple models.

What struck me most about the above paragraph about synapses, however, is the role played by evolution. Charles Darwin wrote about the process of species creation, but we have long grown used to the idea that evolution takes place on a molecular level. Each nerve cell, presumably, contains the full set of genes of the organism, but many different types of synapses are manifested. There must be controlling, blueprint genes that tell the cells which types of synapses they are to form as they develop. In turn, we can expect that many different blueprints have been tried over the last four million years or so. The most successful gave their human organisms survival advantages.

It would be interesting to know how much synaptic variation exists in the current human population. Is this variation responsible, or partly responsible, for variations in basic intelligence capabilities of human beings?

This brings us back to the hard-wiring versus plasticity debate. Human beings are very adaptable, as is shown by their ability to learn any of thousands of human languages as a child. We tend to think that we are very plastic and programmable creatures. But nerve transmission speeds and synaptic types are hard wired, as is the basic design of the brain. One might say we have the hard-wired capability to be flexible.

And that is what we aim to build into the new machines: hard wiring (a stable system with a design we understand and can reproduce) that is capable of showing the flexibility required to exhibit intelligence and understanding.

Saturday, October 3, 2009

The Amazing Genetics of Nerves

I'll start with a quote from Stephen W. Kuffler, John G. Nicholls and A. Robert Martin's From Neuron to Brain (second edition, page 177; 1984 Sinauer Associates Inc.):

"... conduction velocity plays a significant role in the scheme of organization of the nervous system. It varies by a factor of more than 100 in nerve fibers that transmit different information content. In general, nerves that conduct most rapidly (more than 100 m/sec) are involved in mediating rapid reflexes, such as those used for regulating posture. Slower conduction velocities are associated with less urgent tasks such as regulating the distribution of blood flow to various parts of the body, controlling the secretion of glands, or regulating the tone of visceral organs."

Presumably much of this fine-tuning was achieved in mammals long before humans evolved. It is a great example of what can be achieved with evolution through natural selection. Apparently there is a cost to speedy nerve connections. There must be genes that can be turned on to produce extra structural elements (proteins, fats, etc) that speed up transmission velocity. There must be controlling genes that turn the structural genes on and off during nerve development, as appropriate.

Given these possibilities, there must be an overall genetic blueprint for nerve transmission velocity types. This blueprint would have been fine-tuned over time through the survival of the fittest. Having high cost, fast transmission where needed, and low-cost slower transmission where that will do, would give a slight evolutionary advantage.

The issue of transmission speed would be relatively simple compared to creating a genetic blueprint for the orders-of-magnitude more complex human brain. Yet the construction process would be similar. A variety of brain cell types were already developed by mammals and primates. Probably (please comment if you know!) more cell types, including synapse types, evolved (from the usual mutation + selection process) for the human brain. The overall structure of the brain could be one blueprint, but it is an extremely complicated one. We know it involves axons and dendrites often both connecting to neighboring neurons and running long distances to connect to neurons after bypassing thousands, millions, or billions of closer neighbors.

This could only happen through evolution. Again, we have Charles Darwin to thank for setting us on the right path to understanding both nature and ourselves. As to machine understanding, the human brain is our best blueprint.