Wednesday, March 31, 2010

Understanding the Bitworm NuPIC HTM Example Program, Part 2: Network Creation Overview

Now that Bitworm is running (see Bitworm Part 1), there are a variety of options. In the Getting Started document the next steps are running Bitworm with "temporally incoherent data" and then with noisy data. We could go to the data generation functions and play with them, then see how Bitworm reacts. I am more interested in how the network is created, and how it functions internally. An overview of this is covered in "Creating the Untrained HTM Network File" (starting on page 21 of Getting Started).

One thing I found helpful is looking at the set of programs in \Numenta\nupic-1.71\share\projects\bitworm\runtimeNetwork\. These include what appears to be an older version of RunOnce.py that uses CreateNetwork.py for network creation. In the "plain" version of RunOnce the network creation segment has just four lines of code:

bitNet = Network()
AddSensor(bitNet, featureVectorLength = inputSize)
AddZeta1Level(bitNet, numNodes = 1)
AddClassifierNode(bitNet, numCategories = 2)

AddSensor(), AddZeta1Level(), and AddClassifierNode() are functions imported from nupic.network.helpers. They don't seem to be used other than for Bitworm, so they are worth discussing only in the context of understanding the node structure of Bitworm. This network appears to have 4 nodes in the Getting Started (page 22) illustration, but in CreateNetwork.py we find five listed: the sensor node, the category sensor node, an unsupervised node, a supervised node, and an effector node. Getting Started uses the same names for 3 of the nodes, but instead of supervised and unsupervised, refers to bottom-level and top-level nodes.

Jumping ahead in Getting Started, we find that bitNet = Network() does indeed create an HTM instance that nodes can be added to and arranged in.

The runtime version replaces these with a single command (but a lot more parameters):

createNetwork(untrainedNetwork = untrainedNetwork,
              inputSize = inputSize,
              maxDistance = maxDistance,
              topNeighbors = topNeighbors,
              maxGroups = maxGroups)

CreateNetwork.py can also be found in the runtime directory. Open it and the first thing you see is that CreateNetwork starts by importing nupic.network. So there is a set of one or more functions or classes we can use to get an overview; we'll look inside them later, if necessary. The following line of code gives us our function parameters, some of which have defaults set specifically for Bitworm. So CreateNetwork.py is not a general-purpose HTM creation function.

def createNetwork(untrainedNetwork,
                  inputSize = 16,
                  maxDistance = 0.0,
                  topNeighbors = 3,
                  maxGroups = 8):

Next we have some agreement with the plain RunOnce.py:

net = Network()

Network() is an imported function that creates the overall data structure for the HTM.

Nodes are created with the CreateNode() function. The type of node - sensor, category sensor, unsupervised (Zeta1Nodes), supervised (Zeta1TopNodes), and effectors - is chosen with the first parameter of CreateNode(). Among the other parameters of CreateNode you can see spatialPoolerAlgorithm and temporalPoolerAlgorithm. I don't think I have used the term "pooling" yet. Remember I wrote about quantization points? [See How do HTMs Learn?] There are a number of available points both for spatial and temporal patterns in the unsupervised nodes. They need to be populated, and they may change during the learning phase. Pooling appears to be NuSpeak for this process; a pooler algorithm is the code that matches up incoming data to quantization points.

I did not get as far as I would have liked today, but I am beginning to see some structure, and dinner is calling. Instead of calling this entry HTM Creation Classes and Functions, I'll call it an Overview.

Monday, March 29, 2010

Understanding the Bitworm NuPIC HTM Example Program, Part 1

Now for my least favorite part of intellectual projects, figuring out someone else's computer code.
When I installed the NuPIC package, a program called Bitworm was run to show that NuPIC installed correctly. Bitworm's main program, RunOnce.py, is written in Python and might be characterized as a simplest meaningful example program, which makes it considerably more complicated than your typical Hello World one-liner.

The explanation of, and instructions for running and playing with Bitworm can be found in Getting Started With NuPIC (see pages 14-23). If you open RunOnce.py (mine conveniently opened in IDLE, "Python's Integrated Development Environment") there is a good outline of the process too.

The point is to test an HTM (Hierarchical Temporal Memory) with a simple data set. If you got here without knowing about HTMs, see www.numenta.com or my glosses starting with Evaluating HTMs, Part 1.

Bitworm, or RunOnce, starts by creating a minimal HTM. It does this by importing nodes and components using functions that are part of the NuPIC package. It also sets some parameters which have already been built elsewhere. Then the HTM is trained using another already-created data set of bitworms, which are essentially short binary strings easily visualized if 1's are interpreted as black and 0's as white (or whatever colors you like). Later I'll want to look inside the nodes, and at how nodes are interconnected, in order to understand why this works, but for now I'll keep to the top-level view.
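A small sketch of what such binary strings look like, assuming two kinds of worm that I am labelling "solid" and "textured" here. This is not the Bitworm data generator; the function names, the labels, and the 16-bit default width are my own choices for illustration.

```python
def make_bitworm(kind, length, offset, size=16):
    """Build a `size`-bit vector holding one worm of `length` bits at `offset`."""
    if offset + length > size:
        raise ValueError("worm does not fit in the vector")
    if kind == "solid":
        body = [1] * length
    elif kind == "textured":
        # Alternating bits, starting with 1: 1 0 1 0 ...
        body = [(i + 1) % 2 for i in range(length)]
    else:
        raise ValueError("kind must be 'solid' or 'textured'")
    return [0] * offset + body + [0] * (size - offset - length)


def show(vector):
    """Render 1's as '#' (black) and 0's as '.' (white)."""
    return "".join("#" if bit else "." for bit in vector)


solid = make_bitworm("solid", length=10, offset=1)
textured = make_bitworm("textured", length=11, offset=0)
print(show(solid))
print(show(textured))
```

Sliding the offset one position at a time gives a sequence of vectors in which the same worm appears to move across the field.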

To test if the NuPIC HTM network learned to distinguish 2 types of bitworms, the training data set is again presented to see what outputs the HTM gives. This is also known as pattern recognition, but in temporal memory talk we prefer the term inference. The bitworms are examples of causes (objects in most other systems), and the HTM infers, from the data, which causes are being presented to it.

That seems like too easy a trick, inferring causes based on the training set, so RunOnce also sees how the trained network does trying to infer causes from a somewhat different set of data.

As output RunOnce gives us the percentages of correct inferences for the training set and second data set, plus some information about the network itself.
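The percentage itself is simple arithmetic: count the vectors whose inferred category equals the known category and divide by the total. Here is a sketch of that comparison, not the code RunOnce.py uses. The function name percent_correct is mine and the label lists are made up; only the counts (411 correct out of 420) are borrowed from the report further down.

```python
def percent_correct(inferred, expected):
    """Return (number correct, percentage correct) for two equal-length label lists."""
    if not expected or len(inferred) != len(expected):
        raise ValueError("label lists must be non-empty and the same length")
    correct = sum(1 for got, want in zip(inferred, expected) if got == want)
    return correct, 100.0 * correct / len(expected)


# Made-up labels: 420 vectors in two categories, with 9 deliberately wrong answers.
expected = [0, 1] * 210
inferred = list(expected)
for i in range(9):
    inferred[i] = 1 - inferred[i]

correct, percent = percent_correct(inferred, expected)
print("%.2f%%, %d correct out of %d vectors" % (percent, correct, len(expected)))
```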

Presuming that you are using Windows and have downloaded and set up the NuPIC package (see prior blog entry), to run Bitworm with RunOnce.py, open a command prompt (press Start, in the search box type Command. This should show Command Prompt at the top of the program list. Click it once. Since you will need Command Prompt often, you might also return to Start, right-click on Command Prompt, and Pin to Start Menu. Then it is always in your Start Menu. Or create a shortcut).

Type:

cd %NTA%\share\projects\bitworm

and hit Enter. That will get you in the right directory.

Then run RunOnce by typing the following and hitting Enter:

python RunOnce.py

If you get errors, you need to run the Command Prompt as an Administrator. Close the window, then right click on Command Prompt and choose Run As Administrator. Click through security warnings.

The output says there were two sets of 420 data vectors written. Inference with the training set as input data was 100% accurate. Inference with the 2nd data set was 97.85...% accurate.

As it says, you can also open report.txt. Here's what mine says:

General network statistics:
Network has 5 nodes.
Node names are:
category
fileWriter
level1
sensor
topNode

Node Level1 has 40 coincidences and 7 groups.
Node Level2 has 8 coincidences.
------------------------------
Performance statistics:

Comparing: training_results.txt with training_categories.txt
Performance on training set: 100.00%, 420 correct out of 420 vectors
Comparing: test_results.txt with test_categories.txt
Performance on test set: 97.86%, 411 correct out of 420 vectors
------------------------------
Getting groups and coincidences from the node Level1 in network ' trained_bitworm.xml

====> Group = 0
1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0

0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0

0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0

0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0

0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0

0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1

====> Group = 1

0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0

0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0

0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0

0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0

0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0

0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

====> Group = 2

0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0

0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0

0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0

0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0

0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0

0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0

1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0

====> Group = 3

0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0

0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0

0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0

0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1

====> Group = 4

0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0

0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0

0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0

0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0

0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0

0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0

0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1

====> Group = 5

0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0

0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0

0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0

0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0

0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1

====> Group = 6

0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1

Full set of Level 2 coincidences:

0 -> [ 0. 0. 1. 0. 0. 0. 0. 0.]

1 -> [ 1. 0. 0. 0. 0. 0. 0. 0.]

2 -> [ 0. 0. 0. 1. 0. 0. 0. 0.]

3 -> [ 0. 0. 0. 0. 0. 1. 0. 0.]

4 -> [ 0. 1. 0. 0. 0. 0. 0. 0.]

5 -> [ 0. 0. 0. 0. 0. 0. 1. 0.]

6 -> [ 0. 0. 0. 0. 0. 0. 0. 1.]

7 -> [ 0. 0. 0. 0. 1. 0. 0. 0.]

Monday, March 22, 2010

Downloading and Installing NuPIC on a Windows computer

At last it is time for me to start playing with NuPIC (Numenta Platform for Intelligent Computing). First I need to get a working copy onto my computer, which is running Windows Vista 32-bit Home Premium on an AMD Athlon dual-core processor.

The main Numenta page is http://www.numenta.com/. From there proceed to the NuPIC downloads page. You need to log in, so register if you haven't already done so. The Windows version is 32 bit; there are also Mac and Linux (both 32 and 64 bit) versions available. The Windows version file size is 112 MB, which took my satellite Internet over 20 minutes to download. Then you need NuPIC installation instructions. If you are like me, go straight to Windows NuPIC installation instructions. You also need your license file, which is sent to your email address when you register and download NuPIC.

Oh boy, it comes with a Python installer. Another programming language to learn (I hope not). Add it, in my case, to APL, Cobol, Fortran, PL1, Pascal, Basic, C, C++, PHP, Javascript ... I hope I have not forgotten anyone important.

After downloading and running the installation file, I did run into a hitch in the installation wizard. After the Python installation I got the old "not responding" error in the wizard window. Eventually, after closing some other application windows, I saw that a secondary Python window had popped up and needed to have its Continue buttons pressed. Once that was done the "not responding" error in the main install window went away and I completed the install successfully.

That leaves Python on my system at C:/Python25/

and NuPIC on my system at C:/Program Files/Numenta/nupic-1.7.1/

It also means the first example, BitWorm, ran successfully, although I did not learn anything from it yet.

Next up: the BitWorm example in detail

Thursday, March 18, 2010

Evaluating HTMs: CPT details; specific memories

This is the last essay on "Hierarchical Temporal Memory, Concepts, Theory, and Terminology" by Hawkins and George. Here I review two issues raised in Section 6, Questions: details on how conditional probability tables (CPTs) work with HTMs, and why humans can have specific memories of events, but HTMs as currently described do not. The first is very technical, the second has more interesting implications.

CPTs are used in Bayesian networks to allow the belief (a set of probabilities about causes) of one node to modify another node. They can be created from algorithms using probability theory in conjunction with known data, the beliefs already established in the two nodes. In HTMs they are learned. As the quantization points are learned, the CPTs are the same as the learned quantization function that links the points to the temporal variables. There are two separate algorithms, but they run in parallel, creating an output to send up the hierarchy to the next node. This will probably become more transparent when we look at the actual algorithms used by the HTM nodes.

It is claimed that humans can remember specific details and events, as well as model the world, whereas HTMs don't keep specific memories. The authors talk about how the human brain might accomplish this feat, and how the capability might be added to HTMs. I instead wonder whether they are right about humans remembering specific details of specific events.

It certainly is the naive view, and since I subscribe to the common sense school of philosophy (with my own updates), assailing the view is mainly just an exercise at this point. But consider this: numerous studies have shown that eye witnesses are unreliable. I suspect that a visual memory is not like a photograph, nor is the memory of a song like a recording, nor is the memory of an event a sort of whole sensory record of a period of time. I believe humans do remember things, and can train their memories to be more like recorders, and in particular can memorize speeches, poems, sequences of numbers, etc. But I think the HTM model actually is at least approximately the way that the brain works. Different levels of the neurological system remember, or become capable of recognizing, different levels of details about things. There are mechanisms in the brain that allow recall of these memories on different levels. But I would be careful about assuming that because we can recall an event (or picture, etc.) in more or less detail we must be calling up a recording. We seldom learn anything of any length in detail by simply hearing or seeing it. If you have memorized that the first digits of pi are 3.14159, what is that a recording of? The words for the number sequence as sounded out in English, a visual memory of seeing this number in a particular typeface in a particular paragraph on paper of a particular tone, or an abstract memory corresponding to abstract groups of abstract units? Typically we must be exposed to something many times to be able to remember it or recognize it, just like an HTM.

I think we are so good at reconstructing certain types of memories that we think we have photograph or video-like recordings of them. That is why eye witnesses think they are telling the truth, when they often substitute details from other events into a "memory" [notably, a face from a lineup that actually was not present at a crime scene]. That is why our memories are so often mistaken (I could have sworn I turned off that burner!) and why we can recall so much without having a roomful of DVDs in our brains. Our memories are largely indistinguishable from our intelligence, and are both fragmented in detail and yet easily molded into a whole as necessary. This is why recognition is usually much better than recall.

The more I study HTMs, the more curious I get. I don't know what the next step will be in my investigations, but hopefully I'll let you know soon.

Tuesday, March 9, 2010

Why Time is Necessary for HTM's Learning

In "Hierarchical Temporal Memory, Concepts, Theory, and Terminology" by Hawkins and George, Section 5, Why is Time Necessary to Learn?, clarifies the role of temporal sequences and temporal pattern points in both learning and recognition (inference) by HTMs.

The authors use a good example, a cut versus uncut watermelon, to distinguish between pattern matching algorithms and how HTMs learn to recognize patterns that are created by objects (causes, in HTM vocabulary). Any real world animal, when viewed, presents an almost infinite number of different visual representations. If you use a type of animal, say horses instead of a particular horse, the data is even more divergent. Pattern matching does not work well. But allow an HTM to view an animal or set of animals over time, and it will build up the ability to recognize an animal from different viewpoints: front, back, profile, or against most sorts of backgrounds. To do that requires data presented over time. Data that is close sequentially should be similar but not identical. Early data might be of a horse, head on, far away, which gradually resolves to a horse viewed close up. So over time the HTM can capture the totality of the horse.

Combining recognition of causes with names given by an outside source is also considered. No amount of viewing a horse will tell an HTM that humans call the thing "horse." You can do "supervised learning" with an HTM, training it to associate a name with a cause by imposing states on the top level of the HTM hierarchy. But it should be a simple extension to have a vocabulary learning HTM and an object learning HTM in a hierarchy with a learn-to-name-the-object HTM on top.

Once an HTM has learned to recognize images (or other types of data) it can recognize static images (or data). The authors say "The Belief Propagation techniques of the hierarchy will try to resolve the ambiguity of which sequences are active." I am not clear on that. It seems to me that static temporal patterns happen often enough in the real world so that some temporal pattern points will represent static causes. If the horse stands still in the real world, it would generate such temporal patterns. As the data goes up the hierarchy it tends to filter out ambiguity and stabilize causes, so a leaping horse should still be the same as a frozen image of a leaping horse at some point high enough in the hierarchy.

Section 6 is a sort of frequently asked questions part of the paper. I'm not sure if I'll cover all the sections or in what order, and I do want to go back to section 3.3 on belief propagation before closing out this series.

Monday, March 8, 2010

Evaluating HTMs, Part 5: Internal Operations of Nodes

"Hierarchical Temporal Memory, Concepts, Theory, and Terminology" by Hawkins and George, Section 4, How does each node discover and infer causes?, covers questions about the internal operations of nodes that were raised or partly covered in earlier sections.

Basically, each node simultaneously does both learning and recognition of spatial and temporal patterns. The output is information about the patterns that can be sent up and down the hierarchy of nodes.

Spatial patterns do not necessarily mean space as in space-time continuum, although the example used in the essay is of a two dimensional visual space. Space is used in the mathematical sense. The data can be anything quantifiable. The space could be any number of dimensions. For instance the maximum daily surface temperature of the earth would constitute a space within a certain range of degrees Centigrade, to whatever desired level of precision, on a spherical 2D grid representing the surface of the earth to any desired degree of precision. The time sequence in this example would be daily samples. In a digital audio example the time sequence intervals might be something on the order of .0001 seconds.

The node has a significant number of "quantization points" available to categorize the spatial data. Only the most common data patterns, up to the number of quantization points, will be learned. Anything that is not one of the learned patterns will be assigned a probability that it is one of the learned patterns plus some noise.
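One way to picture "only the most common patterns are learned" is a simple frequency count with a cap. This sketch is my own simplification, not the algorithm from the paper or from NuPIC; the function name and the tiny two-bit training patterns are invented for illustration.

```python
from collections import Counter


def learn_quantization_points(patterns, max_points):
    """Keep the `max_points` most frequently seen patterns as quantization points."""
    counts = Counter(tuple(pattern) for pattern in patterns)
    return [pattern for pattern, _count in counts.most_common(max_points)]


# Made-up training data: (1, 0) seen three times, (0, 1) twice, (1, 1) once.
training = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (1, 1)]
points = learn_quantization_points(training, max_points=2)
print(points)
```

With room for only two points, the rare pattern (1, 1) is not stored, so at inference time it would have to be treated as a noisy version of one of the two learned points.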

Having learned the quantization points, the node can start looking for common temporal sequences of them. Again, a limited number of points or memory units are allocated for learned temporal patterns. I can't find where the authors give them a name, so at the risk of being corrected later, I'll call them temporal pattern points. Again, temporal pattern matches don't need to be exact; some noise is tolerated.
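The same counting idea can be applied to sequences. In this sketch the node's input over time is reduced to a stream of quantization point numbers, and the most frequent fixed-length runs are kept. The fixed window length, the function name, and the stream of numbers are assumptions of mine; the paper does not say sequences are found this way.

```python
from collections import Counter


def common_sequences(indices, length, max_sequences):
    """Return the most frequent runs of `length` consecutive point numbers."""
    windows = [
        tuple(indices[i:i + length])
        for i in range(len(indices) - length + 1)
    ]
    return Counter(windows).most_common(max_sequences)


# Made-up stream of quantization point numbers observed over time.
stream = [49, 3, 7, 1, 49, 3, 7, 2, 49, 3, 42]
top = common_sequences(stream, length=3, max_sequences=1)
print(top)
```

Here the run 49, 3, 7 occurs twice and every other run once, so it is the one that would earn a temporal pattern point.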

Once learning has taken place (learning can continue), the node can work to infer causes. The patterns held in the quantization points, as well as the patterns of these in time held in the temporal pattern points, can be causes (or call them objects, which is the more typical if less precise vocabulary). As time passes the data changes and the causes output to the higher level node(s) of the hierarchy change.

Another very important idea to wrap your head around is that the node, and the HTM, need to deal with probabilities. You might get an exact match with 100% probability, you might even be able to design special situations where you don't need to deal with probability. But the whole point of HTMs (from an engineering standpoint) is to be able to deal with complex real world data, in a manner similar to the human brain. So whether dealing with matching a spatial pattern to the quantization points, or a temporal pattern to the temporal pattern points, you need to think in terms of probability. There is a 42% chance that it matches point 7, an 18% chance it matches point 42, and a 40% chance that it matches none of the quantization points. With the temporal pattern it gets more complicated, since you can't assume that just because a spatial pattern is most likely point 7, it is not in a temporal pattern that goes, say 49, 3, 42 instead of 49, 3, 7.

Hey, but that is what computers are handy for, figuring probabilities and keeping track of them.

The output to the higher-level nodes might be thought of as: There is a 12% probability that we are seeing temporal pattern point 7, 35% it is point 16, 48% it is point 34, and 5% it is point 49. It seems to us to be messy, but it is exactly the sort of thing the next level up is looking for. We call it a vector output, but it also can be thought of as a set of data pairing probabilities and points that are arbitrarily assigned to spatial-temporal patterns.
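The two views, a vector and a set of pairs, are easy to convert between. A minimal sketch using the percentages above; the vector size of 50 is an assumption of mine (just large enough to hold point 49), and the function name is invented.

```python
def belief_vector(pairs, size):
    """Turn (point number, probability) pairs into a fixed-length vector."""
    vector = [0.0] * size
    for point, probability in pairs:
        vector[point] = probability
    return vector


# The example from the text: points 7, 16, 34 and 49.
pairs = [(7, 0.12), (16, 0.35), (34, 0.48), (49, 0.05)]
vector = belief_vector(pairs, size=50)

# The position with the largest entry is the most likely temporal pattern point.
most_likely = max(range(len(vector)), key=vector.__getitem__)
print(most_likely, sum(vector))
```

The next level up receives the whole vector, not just the winner, so the 35% chance of point 16 is not thrown away.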

Hey, I think I am beginning to understand this stuff.

Which brings me back to one of those basic science and philosophy questions that made me interested in machine understanding in the first place. If everything is abstract, how do we (the HTM or a living human brain) get the picture of the world that seems all so familiar to us? If all the world does is produce neuronal impulses in our bodies, what makes red different from blue, and the junk on my desk resolve easily into envelopes, pens, gadgets, fake wood patterns and a host of other things?

Next: Why is Time Necessary to Learn?

Saturday, March 6, 2010

Evaluating HTMs, Part 4: The Importance of Hierarchy

See also Part 1, Part 2, Part 3

"Hierarchical Temporal Memory, Concepts, Theory, and Terminology" by Hawkins and George, Section 3, Why is Hierarchy Important?, draws a detailed picture of the relationship between the structuring of the nodes of an HTM and real-world (or even virtual world) data. To stick closely to the subject of hierarchy, I'll cover subsections 1, 2, and 4 here, leaving subsection 3, Belief Propagation, to be treated as a separate topic.

If you don't understand the concept of hierarchy, try Hierarchy at Wikipedia.

I think it is best to start with "3.2 The hierarchy of the HTM matches the spatial and temporal hierarchies of the real world." Hierarchies are not always patterns humans impose upon the sensory data we receive from the external world. Each whole has its parts, as a face has eyes, ears, a mouth and nose as well as other features.

The world itself embodies the principle of locality. Spatial and temporal closeness and distance can be interpreted as a hierarchy, if a more abstract one. One might define "close" as meaning within a nanometer and nanosecond, with hierarchical levels covering distances and times grouped by factors of twos or tens, up to the size of the cosmos. Or whatever is convenient for the data you are learning about. The bottom layer of the HTM hierarchy learns from the smallest divisions, and passes its interpretations (beliefs) up the hierarchy. Thus in music if the data is already in the form of notes, the bottom layer might deal with two-note sequences, the next layer with 4 note sequences, then 8 notes, 16 notes, 32 notes, on up to the number of notes in a symphony.
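The doubling in the music example can be sketched directly. This only illustrates the grouping arithmetic; it is not how an HTM stores sequences, and the function names and the eight-note scale are mine.

```python
def group_pairs(sequence):
    """Pair up neighbours so each item at the next level spans twice as many notes."""
    return [tuple(sequence[i:i + 2]) for i in range(0, len(sequence), 2)]


def count_levels(total_notes):
    """Number of doublings needed before one span covers the whole piece."""
    levels = 0
    span = 1
    while span < total_notes:
        span *= 2
        levels += 1
    return levels


notes = ["C", "D", "E", "F", "G", "A", "B", "C"]
level1 = group_pairs(notes)    # two-note sequences
level2 = group_pairs(level1)   # four notes each
level3 = group_pairs(level2)   # all eight notes

print(len(level1), len(level2), len(level3))
print(count_levels(100000))
```

Because each level doubles the span, even a piece of 100,000 notes needs only 17 levels of this kind.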

Music offers a one-dimensional (or two, if you plot the frequency of the notes) example, but HTMs should be able to deal with higher numbers of dimensions as long as the causes have a hierarchical structure.

Note the design guidance at the end of the section. Our HTM designs should be targeted at problems that have appropriate space-time structure. The designs need to capture local correlations first. And the hierarchical structure of nodes should be designed to efficiently model the problem space.

Now back to 3.1, "Shared representations lead to generalization and storage efficiency." The belief is that HTMs are efficient at learning complex data and causes. In other words, HTMs scale well. Efficiency here can be measured, in computer hardware terms, as memory size and computing power. This is possible because the lower levels of the HTM break the data into what might be called micro-causes. Or cause modules. These modules can be reused by any of the causes found much higher in the HTM. This mimics what we know of the human visual pathway, where at the lower levels nerves appear to respond to small features on the retina like spots, short lines at various angles, simple changes in contrast, etc. Using the human face as an example, the HTM might recognize eyes, lips, proportions, etc., and categories within these features. Almost all six billion human faces presently on earth would be interpretable in terms of these basic components and their spatial relationships. To represent each of the faces you don't need 6 billion 10 megapixel bitmap pictures. You just need 6 billion summaries that could probably be represented with a few bytes of data. Recognition would resolve to summarizing the new picture of a face and then finding the closest summary already held by the HTM.
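A back-of-the-envelope version of that comparison, as a sketch. The 3 bytes per pixel and the 16-byte summary are assumptions I picked to put numbers on "bitmap" and "a few bytes"; only the 6 billion faces and 10 megapixels come from the paragraph above.

```python
FACES = 6000000000        # six billion faces
PIXELS = 10000000         # a 10 megapixel picture
BYTES_PER_PIXEL = 3       # assumption: uncompressed 24-bit colour
SUMMARY_BYTES = 16        # assumption: stands in for "a few bytes"

bitmap_total = FACES * PIXELS * BYTES_PER_PIXEL
summary_total = FACES * SUMMARY_BYTES
ratio = bitmap_total // summary_total

print("bitmaps:   %d bytes" % bitmap_total)
print("summaries: %d bytes" % summary_total)
print("ratio:     %d to 1" % ratio)
```

Under these assumptions the bitmaps need 180 petabytes and the summaries 96 gigabytes, a difference of nearly two million to one.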

The authors point out that "the system cannot easily learn to recognize new objects that are not made up of previously learned sub-objects." Which we see in human behavior from the household chore level right up to big pictures like evolution and relativity staring large groups of scientists in the face for decades before a Darwin or Einstein said "I recognize a new, high-level causation here."

Within the section is a helpful explanation about "quantization points," which I said were left unclear in section 2. It gives the reason for having a much lower number of quantization points than there are possible events in the event space. It points out that in a 10 by 10 square of binary (black or white) pixels, there are 2 to the 100th different patterns. By limiting the number of quantization points you force the node to group every input image into a type of pattern (some examples could be lines with various orientations, spots that move in a particular direction, more black on the right or left, etc.). These would be "the most common patterns seen by the node during training."
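Python's integers are not limited in size, so the count is easy to check. The function name is mine.

```python
def pattern_count(width, height):
    """Number of distinct black-or-white images on a width x height grid."""
    return 2 ** (width * height)


total = pattern_count(10, 10)
print(total)
print(len(str(total)), "digits")
```

That is a 31-digit number, which is why a node could never hold one quantization point per possible image.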

In section 3.4 the authors give an introductory look at the idea that HTMs can pay attention to certain aspects of the data. In other words, just as you might focus on your newspaper while riding public transportation to work, an HTM can pick some level of the hierarchy of data to focus on. Suppose it is a facial recognition HTM and it thinks a face presented to it could be George or Herbert. By focussing on a particular aspect of the face, say the nose-to-lips difference, it might become more certain that the face belongs to George. People can do this both consciously and unconsciously.

If an HTM could do that, it would be really cool.

Next: Evaluating HTMs, Part 5: Belief Propagation

Thursday, March 4, 2010

How Do HTMs Learn?

Evaluating HTMs, Part 3: How do HTMs Learn?

"Hierarchical Temporal Memory, Concepts, Theory, and Terminology" by Hawkins and George, Section 2, "How Do HTMs Discover and Infer Causes", gives an overview of the internal mechanisms of HTMs.

Specifically, it gives an overview of how HTMs learn. This prompted me to think about the difference between learning and discovery. They could be the same thing, but learning for humans often implies a teacher presenting information to be learned to a student. Discovery implies coming upon something new (perhaps a relationship between already known objects) and realizing that it is new and should be remembered.

Each node in an HTM uses the same algorithm. The nodes are arranged hierarchically, with representations typically showing a row of nodes at the bottom that take input. Each row above the bottom has progressively fewer nodes. The top node is a single node. Its output is a vector that represents a cause, or object related to the data in a causal fashion. In fact each node does this, passing its output vector to the next higher row of nodes. So causes are built up hierarchically. All data and discoveries include a time element. In a visual field, for instance, the time element could be no change in a part of the field, or changing color with time, or following a spot of color from one part of the visual field to another over a course of time.

Get used to the technical use of the term "belief" if you want to follow discussions about HTMs. This term is used extensively in probabilistic reasoning theory. "A belief is an internal state of each node," but it does correspond to a probability that there is a causal relationship in the data. "I believe the lion must have escaped from the zoo," is a sentence that conveys to us that a person lives where lions do not live in the wild; it differs from "I know ..." because the speaker is admitting there are other possible causes. In a simple HTM, in a lower node, a belief might be something like "28% probability that this is a horizontal line, 16% that this is two animal eyes, etc." Again, beliefs are represented in software by vectors, but they are not generally identical to the output vectors of the nodes.

In training or learning, the HTM forms new beliefs at the bottom of the hierarchy of nodes first. More complex beliefs can only be created once lower level beliefs exist, but the entire process is flexible. If a lower level node alters its belief, it tends to affect higher level nodes. So learning is not just memorization.

So how does a node do all this? Nodes are given a set number of "quantization points." Here the authors are not very clear. The input pattern is assigned to one of the quantization points. And/or "the node decides how close (spatially) the current input is to each of the quantization points and assigns a probability to each quantization point." How it decides is presumably an algorithm. With enough quantization points, each input data set could be matched exactly to a point. Would that setup cause the node to fail? To do as the authors say, the assumption is there are fewer quantization points than there are possible inputs.
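Since the paper does not say how closeness becomes probability, here is one guess, offered only as a sketch. It counts differing bits (Hamming distance), gives each point a weight that shrinks exponentially with distance, and normalizes the weights so they sum to 1. The exponential weighting, the sharpness parameter, and the function name are all assumptions of mine, not anything from the paper or NuPIC.

```python
import math


def point_probabilities(points, vector, sharpness=1.0):
    """Assign each quantization point a probability based on closeness to `vector`."""
    weights = []
    for point in points:
        # Hamming distance: the number of positions where the bits differ.
        distance = sum(a != b for a, b in zip(point, vector))
        weights.append(math.exp(-sharpness * distance))
    total = sum(weights)
    return [weight / total for weight in weights]


# Made-up quantization points and one input that matches none of them exactly.
points = [(1, 0, 1, 0), (1, 1, 1, 1), (0, 0, 0, 0)]
probs = point_probabilities(points, (1, 0, 1, 1))
print(["%.3f" % p for p in probs])
```

The input is one bit away from each of the first two points and three bits away from the third, so the first two share most of the probability equally and the third gets a small remainder.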

Step two is "the node looks for common sequences of these quantization points" and "represents each sequence with a variable." So you have to ask, why assign probabilities, why not just assign closest fits? In any case the output variable represents a sequence of quantization points based on the sequence of input data.

Admitting that the authors are introducing the topic, and its vocabulary, still I would have liked more than two short paragraphs on the internal operations of HTM nodes.

Interestingly (and copying what is known about the cortex of mammal brains) information can move both up and down the hierarchy of nodes. As just described, data moving up the hierarchy is temporal variables. Data going down the hierarchy represents the "distribution over the quantization points." That would be a probability distribution.

What I suspect the authors mean is that there is a mechanism to alter the quantization points themselves. Points with long-term zero percent probabilities don't help resolve ambiguity. The set of point probabilities being sent down the hierarchy allows the (lower) node to "take a relatively stable pattern from its parent node(s)" and "turn it into a sequence of spatial patterns."

The claim is that over time "patterns move from the bottom of the hierarchy to the top." In effect rather than sending the raw data up the hierarchy, the nodes send "names" of data sequences up the hierarchy.

That would be pretty cool, and I'd like to know exactly how it happens, but this is, after all, only an introduction.

Next: Why is a hierarchy important?

Wednesday, March 3, 2010

Evaluating HTMs, Part 2: What HTMs Do

For the introduction, see Evaluating HTMs, Part 1.

"Hierarchical Temporal Memory, Concepts, Theory, and Terminology" by Hawkins and George, Section 1, What Do HTMs Do?, asserts that HTMs can discover causal relationships in data presented to them. Once the causality is established, they can infer the cause of a new input. They can predict (with some accuracy) future data sets, and they can use the above three abilities to direct (choose) behavior.

It should be pointed out immediately that each of these 4 abilities has been demonstrated by other computational systems. Neural network software that could discover certain types of relationships from raw data was available at least as early as the 1980s. Many mathematical forms of analysis can find causal relationships within data, for instance regression analysis. In any system, once a relationship is established, recognizing it should present no great difficulty. Nor should making predictions, or using the system to control behavior.

What is interesting about the set of claims for HTMs is that they work together holistically (like the human brain) and should be stackable. That is, the HTM systems should be able to deal with external relationships of increasing complexity by stacking HTM subsystems into an appropriate system. In addition, the HTMs can find relationships in time (sequential or temporal relationships).

In fact, if we call some of the data "objects," for an HTM the objects "have a persistent structure; they exist over time." The authors call the objects "causes." In theory an HTM system could deal with multiple forms of data coming in directly from the world, but usually (for now) the HTM deals with a specific subset of data (much as when a human, say, concentrates on music, or on reading). The data could be a computer file, or a stream of data from input devices.

For the HTM to work the causes should be relatively stable, but should generate data that changes over time, as a horse moving across a visual field, or a conversation between two people. Causes are typically multiple.

The discovery of the causal relationships is a learning process. During learning the HTM builds representations of causes in the form of vectors. The relationship is expressed as a set of probabilities for causes; this set is called a "belief." The causes, relationships, and beliefs can be quite complex if the HTM is complex enough. In particular, hierarchies of causes and beliefs can be learned.

The authors say an HTM, once it has gone through learning, can "infer causes of a novel input." This means that if it is presented with new data, it will try to match the data up to one of the causes it knows about. This is basically pattern recognition, and there are other systems, including certain neural networks, that do this well in certain situations. A good point made by the authors is that if a million pixel visual field (of a scene with motion) is used as the input, it would be rare that an exact pattern would be input twice. So inference, matching a set of data to the closest cause, is a necessity. In the older neural networks causes were typically static; adding a time dimension to the data usually makes it easier for an HTM to learn and infer. I should point out, however, that "infer causes of novel input," to me can mean something more than is claimed for an HTM. For humans, it can mean a deduction, or even a deep set of deductions, rather than just recognizing a pattern or its degree of ambiguity. Then again, perhaps a sufficiently complete HTM system could do even that.

The ability to predict is the third leg of what HTMs do. In other words, given a sequence already encountered, an HTM will predict that the sequence is happening again. This may not sound like much, but it is an ability that is crucial to machine understanding. In particular the authors point to priming. Given the latest data, the HTM makes a prediction and notes differences between what is predicted and what happens. If data is ambiguous or noisy, the HTM may fill in with the predicted data. If a prediction is fed back into the HTM as data, this is akin to thinking or imagining. Thus the machine could plan for the future. The authors claim "HTMs can do this well." Imagine a sheep-herding dog application. The better it can predict the behavior of the sheep, the less energy it should need to herd them.
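As a toy illustration of predicting from sequences already encountered, and of filling in missing data with the prediction, here is a sketch that uses a simple first-order transition count. This is a stand-in technique I chose for clarity, not the HTM algorithm; the names learn_transitions, predict_next, and fill_gaps are hypothetical, and None marks a missing observation.

```python
from collections import Counter, defaultdict

def learn_transitions(sequence):
    """Count how often each symbol is followed by each other symbol."""
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    return transitions

def predict_next(transitions, current):
    """Return the most frequent follower of `current`, or None if unseen."""
    if current not in transitions:
        return None
    return transitions[current].most_common(1)[0][0]

def fill_gaps(transitions, observed):
    """Replace missing observations (None) with the predicted symbol."""
    filled = []
    for value in observed:
        if value is None and filled:
            value = predict_next(transitions, filled[-1])
        filled.append(value)
    return filled

# Learn from a repeating sequence, then repair a stream with two gaps.
training = list("abcabcabc")
model = learn_transitions(training)
repaired = fill_gaps(model, ["a", None, "c", "a", None])
```

Feeding each prediction back in as the next input, instead of waiting for real data, would be the crude analogue of the "imagining" the authors describe.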

Finally, HTMs can direct behavior. Of course almost any device, even simple mechanical ones, can direct behavior. A mouse trap, given a certain type of input, will engage in a known, set behavior. Still, mentioning this for HTMs is important because that is exactly what we would expect artificial intelligence or machine understanding to be used for: behavior. An important point is that "From the HTM's perspective, the [output] system it is connected to is just another object in the world." In other words, an HTM can learn about how its own outputs act as causes in the world.

If you want to get an idea of the potential power of HTMs before wading through a lot of other materials, section 1.4 of the paper, "Direct Behavior," is a great starting point.

I'm excited after reading this part.

Possible Acronym: LIPD (learn, infer, predict, direct)

Next: How do HTMs discover and infer causes?

Tuesday, March 2, 2010

Evaluating HTMs, Part 1

I am now going to evaluate the particular model of Hierarchical Temporal Memory (HTM) developed by Jeff Hawkins, Dileep George, and members of their team at Numenta, NuPIC. My key question will be: could HTMs serve as a basis for machine understanding (MU)? Since there are many subgoals on the way to true MU, I will be evaluating HTM capabilities on a number of issues that are usually within the realm of AI (artificial intelligence).

This will be a long project. The current plan is to read and critically summarize "Hierarchical Temporal Memory, Concepts, Theory, and Terminology" by Hawkins and George; read and comment on the "Getting Started with NuPIC Guide;" download and try NuPIC; and then assess whether to continue on the project.

In preparation I have previously read On Intelligence by Jeff Hawkins twice; read and commented on Towards a Mathematical Theory of Cortical Micro-circuits by George and Hawkins [see early entries of this blog]; took a course in neural networks at SDSU long ago; and read From Neuron to Brain to get detailed information on biological neurons. I know how to write software and I am fairly good at math.

Still, my head hurts just thinking about it, but here we go (into "HTM Concepts, Theory, and Terminology"):

In the Introduction the authors remind us that the human mind/brain has capabilities that computers have so far been unable to duplicate. HTMs are a memory system that can learn to solve certain problems. They are organized as hierarchical systems of nodes. HTMs are currently implemented as software on traditional computer hardware. "The learning curve can be steep."

That is all ground I have already covered in this blog. My learning curve is probably going to be steeper than that of most students who would be interested in this topic, but hopefully watching me struggle will be helpful to at least a few people.

Machine Understanding