Saturday, November 27, 2010
Wandering through the multi-dimensional abyss
I found, slogging through Judea Pearl's Probabilistic Reasoning in Intelligent Systems and Terrence Fine's Probability and Probabilistic Reasoning for Electrical Engineering, that my mind kept fuzzing up. Maybe I am slowing down in my old age, but the real problem was that my last formal training in probability was when I was 19, and interpreting clinical trial p values is guesstimate work. I had to regress to a simpler text than what I used in college, and I can recommend Finite Mathematics with Applications by A. W. Goodman and J. S. Ratti for introductions to simple probability, conditional probability, Bayes' Theorem, and even Markov chains, with examples simple enough for me to feel I really understood both them and the concepts themselves.
But my wanderings have been further afield than that. I continue to be fascinated with tensors, and got a lot out of Introduction to Tensor Calculus, Relativity and Cosmology by D. F. Lawden. Again, I never got to tensors in college (I ended up a Political Science major), and thinking I was brighter than I really was (brightness is mainly a function of preparation, I now know), I started off with mathematical treatments that were too abstract for me to do more than pretend to follow.
I have even gotten stuck on Maxwell's equations for electromagnetism. We should all admit that when we read broadly in math and science we don't take the time to really understand everything; we trust our fellows to have done their homework before a set of facts or an equation is presented in a paper or textbook. We may like to feel we agree with quantum physics, but who except professional physicists has the time to really look at the data and the math in detail? I have always assumed that Maxwell's equations are correct, and that if I needed to I could look up the definitions of curl, etc., and do the math. But that is not the same thing as the deep understanding one gets from working in electromagnetics on a regular basis.
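For the record, here are the four equations in question, in differential form (SI units), with the curls and divergences I keep promising myself I will work through:

$$\nabla \cdot \mathbf{E} = \frac{\rho}{\varepsilon_0}, \qquad \nabla \cdot \mathbf{B} = 0, \qquad \nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}, \qquad \nabla \times \mathbf{B} = \mu_0 \mathbf{J} + \mu_0 \varepsilon_0 \frac{\partial \mathbf{E}}{\partial t}$$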
I have wandered farther afield than that, to Lie groups and Galois theory, which may have nothing to do with machine understanding. Nevertheless, I wander. And I keep coming back to what is known about the structure of the cortex, of the actual tangles of nerve cells themselves, and in particular to the way pyramidal cells span multiple layers of the cortex with their intricate axons and dendrites. How do you create a math that represents such a tangle? Skipping that, you can build functional units as Numenta does, or you can follow the AI tradition, which tries to get the end results without understanding the details of how neurons actually get things done.
Right now I have little paid work going on, so I may be writing in the blog more often. If paid work becomes available, there will be more delays.
Thursday, August 26, 2010
Seeing Predictive Memory Everywhere
Yet all the while I've been watching how my mind works in light of the predictive memory theory. What good would memories be if they did not allow animals to make predictions that help with survival? I have watched my mind make mistakes, in reading for example: wait, that doesn't make sense, I read "farming" for "framing". I watch my dog Hugo make decisions (mostly to not obey me). I watch other people make decisions.
I also continue to ponder how the system works. There are computer models like Numenta's, and biological models. When I study math, part of me is assessing its utility for modeling machine understanding. Even reading the Excel book, which I really liked, got me to thinking about how advanced Excel tools might be used to model neurons or probabilistic reasoning.
But I can't say I have any breakthroughs to report. I can't even say I am going to be writing this blog on a regular basis. There are fires that need to be put out, and fires that need to be lit.
I started on a simple demo program, I mean a really simple demo program, just to get going on flowing data through nodes. I started it in Visual Basic, with the intent of also doing it in Python and at least one other language, maybe C++. I might try to restart the MU project there, or I might go back to working through the Numenta examples. No promises. But if I do manage to get anything done, I'll post it.
Thursday, June 10, 2010
Understanding Probability and Probabilistic Reasoning
Months ago I ground to a halt in my reading of Judea Pearl's Probabilistic Reasoning in Intelligent Systems, which provides much of the background to the Numenta discussion. Yesterday I decided to tackle it again and commenced reading at page 143. I noticed that some notation was ambiguous, which is typical of expert writers who assume their readers are right up with them. So I decided to go back and make sure that P(A,B) really does mean the probability that both A and B are true. I thought I'd make sure I understood the Bayes interpretation of probability as well.
I ended up reading starting at page 29, Chapter 2, Bayesian Inference, 2.1, Basic Concepts, 2.1.1 Probabilistic Formulation and Bayesian Inversion. Note that I took two semesters of logic and one semester of probability in college, and as part of my profession deal with biostatistics, the kind reported from clinical trials, on a regular basis. Note also that I have studied philosophic issues of quantum physics and even the math involved.
Yet when I read this simple introduction this time, the scales fell from my eyes, or from my cortical networks.
With probabilistic reasoning, it is fair to say that we are not talking about rolling dice (even though Pearl uses the familiar probabilities of two-die rolls to illustrate some points).
We are talking about the math of probability theory. For most practical purposes, that is the math of fractions. Third or fourth grade stuff. (I had a fifth grade teacher I hated, Mrs. Lopez, who was all about memorizing things. We memorized the decimal equivalents of about 50 common fractions. I knew I could always get the decimal equivalent by dividing, so I considered this a stupid exercise.)
When thinking about human memory, you can safely substitute "percentage of like situations" for probability.
Updating the "percentage of like situations" based on experience makes sense. Since we can test for novel situations, like "both A and B" or "A and not C, given B", by multiplying, adding, or subtracting fractions, these updates may affect a chain of knowledge or deductions across the brain (or mind, if you prefer).
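To keep myself honest, here is the fraction machinery in symbols. P(A,B) is the joint probability that both A and B are true; reading probabilities as "percentage of like situations," P(A|B) is the percentage of B-situations that are also A-situations; and Bayes' Theorem is just the algebraic inversion of the conditional:

$$P(A \mid B) = \frac{P(A,B)}{P(B)}, \qquad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$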
Calling all background information and assumptions a person has K (I don't know why K, maybe it stands for Knowledge), I quote Pearl page 30: "However, when the background information undergoes changes, we need to identify specifically the assumptions that account for our beliefs and articulate explicitly K or some of its elements."
Many philosophers, notably Ludwig Wittgenstein, have shown how reasoning goes awry when we use one word to mean multiple things, or to mean one thing that is vague or complex. We think we are being clear using logic symbols or math equations or tech speak. But when something is amiss, it may not be a problem with our reasoning. It may be that we need to update our background assumptions.
See also Bayes' theorem
Monday, May 10, 2010
New Algorithms from Numenta
I'll quote the key passage from Jeff:
"Last fall I took a fresh look at the problems we faced. I started by
returning to biology and asking what the anatomy of the neocortex
suggests about how the brain solves these problems. Over the course
of three weeks we sketched out a new set of node learning algorithms
that are much more biologically grounded than our previous algorithms
and have the promise of dramatically improving the robustness and
performance of our HTM networks. We have been implementing these new
algorithms for the past six months and they continue to look good."
Sure. Even my own limited reading of mostly-outdated neurology texts seemed to indicate that the early versions of HTM are simplistic (compared to systems of human brain neurons). The new version, styled FDR (Fixed-sparsity Distributed Representation), is somewhat more complicated, but Jeff believes it is more capable. In particular, it deals better with noise and variable-length sequences.
On the other hand, we are certainly hoping to get machines to actually understand the world without having to duplicate (in software) a human brain molecule by molecule.
Jeff gave a lecture on the new algorithm at the University of British Columbia, which will have to do for the rest of us until details are posted at the Numenta web site:
http://www.youtube.com/watch?v=TDzr0_fbnVk
See also my Machine Understanding main web page.
In the meantime I intend, in addition to doing my own thinking & tinkering, to resume my program of going through the already-posted, earlier version examples of HTM.
Tuesday, April 6, 2010
Songbirds, Genes, and Neurons
The article, by Nicholas Wade in The New York Times, gives a minimum of information on how genes actually affect the ability of a bird to learn and sing a song. The key revelations are that the zebra finch (Taeniopygia guttata) has had its genome decoded and that about 800 genes change their activity levels in neurons when the finch sings. The article implies that defects in these genes might interrupt singing ability, just as mutated FOXP2 genes in humans cause speech defects. In particular the bird version of FOXP2, if defective, prevents songbirds from singing.
This would seem to go against my basic understanding of how systems of neurons work, which I like to think is up with the current scientific consensus. I thought that once a basically functioning neural network is in place, genetic activity becomes background activity. Of course the genes would function just as they do in any cell, releasing instructions for making proteins that regulate cell activity. And maybe some of the 800 genes mentioned in Wade's article are ones that would up-regulate or down-regulate any neural activity, not just songs or learning. But according to David F. Clayton, "these transcripts don't result in the cells producing proteins in the usual way. Instead they seem to modulate the activity of other genes involved in listening."
My (learned from textbooks) model is: genes have blueprints for several types of neurons with varying synapses, neurotransmitters, and receptors. Signals are conducted by reasonably well understood mechanisms involving membrane potentials along the neurons and either chemical or electrical transmission at synapses. Genes in the neuron are just caretakers once a system is set up. Learning results from a strengthening or weakening of synaptic thresholds. This is called Hebbian learning, and while there are some theories about how Hebbian learning works at the molecular level, at this point I don't take them as proven.
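As an aside, the textbook cartoon of Hebbian learning ("cells that fire together wire together") fits in a few lines of Python. This is only the cartoon, not a claim about any molecular mechanism:

import numpy as np

def hebbian_update(w, pre, post, eta=0.01):
    # Strengthen each synapse in proportion to the product of its
    # presynaptic activity and the postsynaptic activity.
    return w + eta * post * pre

w = np.zeros(3)
pre = np.array([1.0, 0.0, 0.2])          # synapse 0 is consistently active
for _ in range(100):
    post = float(np.dot(w, pre)) + 1.0   # output, with an assumed external drive
    w = hebbian_update(w, pre, post)
print(w)  # the consistently active synapse has grown the most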
If the article is true as presented, then individual neurons are more complex than I thought. It is implied that many neurons can function just fine with a mutated FOXP2 gene (every gene would be in every neuron, in fact in every cell), but not the neurons that are involved in learning songs. Yet other neurons learn just fine.
What would distinguish a song-learning neuron from a muscle-coordination learning neuron? I don't know.
As is typical with the New York Times, they want to keep you in their ad ghetto, so they provide no link to the research report, but they say it is in the current issue of Nature. Here is the link: The genome of a songbird
Monday, April 5, 2010
Bitworm HTM Example Program, Part 3: Spatial and Temporal Pool Overview
To learn about the pooling algorithms I went to the Numenta Node Algorithm Guide, which is not at the Numenta web site, but installs with NuPIC under \Program Files\Numenta\nupic-1.7.1\share\doc\NodeAlgorithmsGuide.pdf.
There are two node types that implement the NuPIC learning algorithms:
SpatialPoolerNode
TemporalPoolerNode
Some confusion might exist because in more general Numenta discussions a node is treated as a single entity, but both the spatial and the temporal node are needed to create a functioning general node. When the unsupervised node in Bitworm is created with CreateNode(Zeta1Node,...), in effect both a SpatialPoolerNode and a TemporalPoolerNode are created to get full functionality. They refer to both node types being in the same level of the HTM hierarchy. But you can design more complicated patterns by arranging SpatialPoolerNodes and TemporalPoolerNodes in an HTM as needed, rather than always pairing them on a level.
"Spatial pooling can be thought of as a quantization process that maps a potentially infinite number of input patters to a finite number of quantization centers." Which in other lit Numenta calls quantization points. Data, in our HTM world, has a spatial aspect. This might not be change along a spatial dimension; space has a more general sense. For instance, the space might be a range of voltages, or sets of voltages from an EKG, for instance. Spatial data usually varies so complexly that we are only interested in the data that is created by objects, or causes. Spatial pooling groups the data into a limited number of causes (or guesses about causes).
Temporal pooling does the same thing with the patterns (objects) identified by the spatial pooler over time sequences. "If pattern A is frequently followed by pattern B, the temporal pooler can assign them to the same group."
A group of nodes forming an HTM level may be able to form invariant representations of objects by combining spatial and temporal pooling. If it can, it passes these representations up the hierarchy.
Once learning is achieved the nodes can be used for inference: they can identify new data as containing patterns that have already been learned.
For now I will focus on the learning phase, since the inference phase is relatively easy to understand if you understand how learning takes place.
SpatialPoolerNode
I just realized the paper I am reading does not actually give the algorithms used. However, the key algorithm is probably related to the maxDistance parameter. Distance here could be ordinary distance, but it is more likely to be distance within a generalized, possibly many-dimensional, heterogeneous pattern space. All kinds of problems leap to mind for writing such a generalized algorithm. I would bet that space/data specific algorithms would really help here (sound vs. sight vs. spatial orientation of human fingers), but perhaps if the quantification is always done before the data is fed in, it is just a matter of matching numbers. Anyway, if you have a distance function, you can group the spatial patterns as falling around a set of centers. These centers are your quantization points.

As discussed elsewhere, these points are flexible; if a lot of patterns fall close to each other, you might want to tighten up the distance parameter, because otherwise you don't use your whole allocation of quantization points. That should happen automatically, but either it doesn't, so you need to set the maxDistance parameter, or it does but you still have the option of disagreeing with the automatic or default settings. Your number of quantization points is set by maxCoincidenceCount. "Storing too few coincidence patterns can result in loss of accuracy due to loss of information. Storing too many coincidence patterns can result in lower generalization and longer training times."
You can also set the sigma parameter. Here's another insight into the algorithm: "each input pattern is compared to the stored patterns assuming that the stored patterns are centers of radial basis functions with Gaussian tuning. The sigma parameter specifies the standard deviation of the Gaussian [distribution]." So this would work, along with maxDistance, in matching incoming data patterns to existing quantization points.
The clonedNodes parameter allows a set of spatial nodes to use the same coincidence patterns. This allows all the nodes in a level to detect the same causes. In vision that could be moving lines, spots, etc.
The spatial pooler nodes take inputs with the bottomUpIn parameter. The spatial pattern outputs in inference mode are in bottomUpOut; outputs in learning mode go to a temporal pooler.
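Since the Guide doesn't give the algorithms, here is my guess at the shape of the thing, in plain Python. The parameter names (maxDistance, maxCoincidenceCount, sigma) come from the Guide, but the logic is my own reconstruction, not Numenta's code:

import numpy as np

class ToySpatialPooler:
    # A sketch of what a spatial pooler might do, not NuPIC's algorithm.
    def __init__(self, maxDistance=0.5, maxCoincidenceCount=32, sigma=1.0):
        self.maxDistance = maxDistance
        self.maxCoincidenceCount = maxCoincidenceCount
        self.sigma = sigma
        self.centers = []  # learned quantization centers

    def learn(self, pattern):
        # Treat anything within maxDistance of an existing center as a
        # repeat; otherwise store a new center, while room remains.
        pattern = np.asarray(pattern, dtype=float)
        for c in self.centers:
            if np.linalg.norm(pattern - c) <= self.maxDistance:
                return
        if len(self.centers) < self.maxCoincidenceCount:
            self.centers.append(pattern)

    def infer(self, pattern):
        # Radial basis matching: each stored center is the mean of a
        # Gaussian with standard deviation sigma; return a normalized
        # belief vector over the centers.
        d = np.array([np.linalg.norm(pattern - c) for c in self.centers])
        scores = np.exp(-d ** 2 / (2 * self.sigma ** 2))
        return scores / scores.sum()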
TemporalPoolerNode
Temporal pooling has more options than spatial pooling, in particular offering parameters for both first-order and higher-order learning. Your number of temporal groups, or time quantization points, is set by requestedGroupCount.
You can select from a variety of algorithms to compute output probabilities with the temporalPoolerAlgorithm parameter, but it has no impact on the learning algorithm.
There are a number of sequencer parameters that allow control of the algorithm. sequencerWindowCount allows for multiple stages of discovery (the default is 10). sequencerWindowLength allows segmentation of the input sequence to look for patterns. sequencerModelComplexity apparently allows you to adjust for how the recognizable patterns are balanced between the spatial and temporal dimensions. Some objects produce mainly spatial patterns, others mainly temporal, and most combine the two to varying degrees.
As with SpatialPoolerNode, you can clone the nodes if you desire. bottomUpIn takes the data in from one or more spatial pooler nodes. bottomUpOut is the resulting vector of real numbers representing "the likelihood that the input belongs to each of the temporal groups of this node."
In addition to parameters, TemporalPoolerNode takes a command: predict, but it works only in inference mode.
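In the same toy spirit, here is one way to implement the rule quoted earlier, "if pattern A is frequently followed by pattern B, the temporal pooler can assign them to the same group." Only the parameter name requestedGroupCount comes from the Guide; the grouping logic is my own sketch:

import numpy as np

def temporal_groups(sequence, numPoints, requestedGroupCount, threshold=0.15):
    # Count how often each quantization point follows each other point.
    T = np.zeros((numPoints, numPoints))
    for a, b in zip(sequence, sequence[1:]):
        T[a, b] += 1
    T /= max(T.sum(), 1)
    # Greedily merge points linked by frequent transitions in either order.
    groups, assigned = [], set()
    for i in range(numPoints):
        if i in assigned:
            continue
        group = {i}
        for j in range(numPoints):
            if j not in assigned and j != i and T[i, j] + T[j, i] >= threshold:
                group.add(j)
        assigned |= group
        groups.append(sorted(group))
        if len(groups) == requestedGroupCount:
            break
    return groups

print(temporal_groups([0, 1, 0, 1, 2, 3, 2, 3, 0, 1], 4, 2))  # [[0, 1], [2, 3]]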
Conclusion
Despite not revealing the details of the algorithms, the Guide, plus the previous materials I read, gave me a good overview of what the algorithms need to achieve. I am pretty sure that I could write algorithms that do approximately what the Numenta pooling algorithms do, but since they have been playing with this for years, I would rather catch up by examining the code inside the Numenta classes.
See also: More on Internal Operations of Nodes
Wednesday, March 31, 2010
Understanding the Bitworm NuPIC HTM Example Program, Part 2: Network Creation Overview
One thing I found helpful is looking at the set of programs in \Numenta\nupic-1.7.1\share\projects\bitworm\runtimeNetwork\. These include what appears to be an older version of RunOnce.py that uses CreateNetwork.py for network creation. In the "plain" version of RunOnce the network creation segment has just four lines of code:
bitNet = Network()
AddSensor(bitNet, featureVectorLength = inputSize)
AddZeta1Level(bitNet, numNodes = 1)
AddClassifierNode(bitNet, numCategories = 2)
AddSensor(), AddZeta1Level(), and AddClassifierNode() are imported functions from nupic.network.helpers. They don't seem to be used other than for Bitworm, so they are worth discussing only in the context of understanding the node structure of Bitworm. This network appears to have 4 nodes in the Getting Started (page 22) illustration, but in CreateNetwork.py we find five listed: the sensor node, the category sensor node, an unsupervised node, a supervised node, and an effector node. Getting Started gives 3 of the nodes the same names, but instead of supervised and unsupervised, refers to bottom-level and top-level nodes.
Jumping ahead in Getting Started, we find that bitNet = Network() does indeed create an HTM instance that nodes can be added to and arranged in.
The runtime version replaces these with a single command (but a lot more parameters):
createNetwork(untrainedNetwork = untrainedNetwork,
inputSize = inputSize,
maxDistance = maxDistance,
topNeighbors = topNeighbors,
maxGroups = maxGroups)
CreateNetwork.py can also be found in the runtime directory. Open it and the first thing you see is that createNetwork starts by importing nupic.network. So there is a set of one or more functions or classes we can use to get an overview; we'll look inside them later, if necessary. The following lines of code give us our function parameters, some of which are set specifically for Bitworm. So CreateNetwork.py is not a general-purpose HTM creation function.
def createNetwork(untrainedNetwork,
inputSize = 16,
maxDistance = 0.0,
topNeighbors = 3,
maxGroups = 8):
Next we have some agreement with the plain RunOnce.py:
net = Network()
Network() is an imported function that creates the overall data structure for the HTM.
Nodes are created with the CreateNode() function. The type of node - sensor, category sensor, unsupervised (Zeta1Nodes), supervised (Zeta1TopNodes), and effectors - is chosen with the first parameter of CreateNode(). Among the other parameters of CreateNode you can see spatialPoolerAlgorithm and temporalPoolerAlgorithm. I don't think I have used "pooling" yet. Remember I wrote about quantization points? [See How do HTMs Learn?] There are a number of available points, both for spatial and for temporal patterns, in the unsupervised nodes. They need to be populated, and they may change during the learning phase. Pooling appears to be NuSpeak for this process; a pooler algorithm is the code that matches up incoming data to quantization points.
I did not get as far as I would have liked today, but I am beginning to see some structure, and dinner is calling. Instead of calling this entry HTM Creation Classes and Functions, I'll call it an Overview.
Monday, March 29, 2010
Understanding the Bitworm NuPIC HTM Example Program , Part 1
When I installed the NuPIC package, a program called Bitworm was run to show that NuPIC installed correctly. Bitworm's main program, RunOnce.py, is written in Python and might be characterized as a simplest meaningful example program, which makes it considerably more complicated than your typical Hello World one-liner.
The explanation of, and instructions for running and playing with, Bitworm can be found in Getting Started With NuPIC (see pages 14-23). If you open RunOnce.py (mine conveniently opened in IDLE, "Python's Integrated Development Environment") there is a good outline of the process too.
The point is to test an HTM (Hierarchical Temporal Memory) with a simple data set. If you got here without knowing about HTMs, see www.numenta.com or my gloss starting with Evaluating HTMs, Part 1.
Bitworm, or RunOnce, starts by creating a minimal HTM. It does this by importing nodes and components using functions that are part of the NuPIC package. It also sets some parameters which have already been built elsewhere. Then the HTM is trained using another already-created data set of bitworms, which are essentially short binary strings easily visualized if 1's are interpreted as black and 0's as white (or whatever colors you like). Later I'll want to look inside the nodes, and at how nodes are interconnected, in order to understand why this works, but for now I'll keep to the top-level view.
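To make "bitworm" concrete, here is roughly what one input vector looks like. This is my reconstruction from the report output shown further down (solid worms are unbroken runs of 1's; textured worms alternate 1's and 0's), not Numenta's actual data generator:

def make_bitworm(kind, start, length, width=16):
    # Build one input vector containing a worm of the given kind.
    v = [0] * width
    for i in range(length):
        if kind == "solid" or i % 2 == 0:
            v[start + i] = 1
    return v

print(make_bitworm("solid", 3, 9))     # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(make_bitworm("textured", 3, 9))  # [0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]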
To test if the NuPIC HTM network learned to distinguish 2 types of bitworms, the training data set is again presented to see what outputs the HTM gives. This is also known as pattern recognition, but in temporal memory talk we prefer the term inference. The bitworms are examples of causes (objects in most other systems), and the HTM infers, from the data, which causes are being presented to it.
That seems like too easy a trick, inferring causes based on the training set, so RunOnce also sees how the trained network does trying to infer causes from a somewhat different set of data.
As output RunOnce gives us the percentages of correct inferences for the training set and second data set, plus some information about the network itself.
Presuming that you are using Windows and downloaded and setup the NuPIC package (see prior blog entry), to run Bitworm with RunOnce.py, open a command prompt (press Start, in the search box type Command. This should show Command Prompt at the top of the program list. Click it once. Since you will need Command Prompt often, you might also return to Start, right-click on Command Prompt, and Pin to Start Menu. Then it is always in your Start Menu. Or create a shortcut).
Type:
cd %NTA%\share\projects\bitworm
and hit Enter. That will get you in the right directory.
Then run RunOnce by typing the following and hitting Enter:
python RunOnce.py
If you get errors, you need to run the Command Prompt as an Administrator. Close the window, then right click on Command Prompt and choose Run As Administrator. Click through security warnings.
The output says there were two sets of 420 data vectors written. Inference with the training set as input data was 100% accurate. Inference with the 2nd data set was 97.85...% accurate.
As it says, you can also open report.txt. Here's what mine says:
General network statistics:
Network has 5 nodes.
Node names are:
category
fileWriter
level1
sensor
topNode
Node Level1 has 40 coincidences and 7 groups.
Node Level2 has 8 coincidences.
------------------------------
Performance statistics:
Comparing: training_results.txt with training_categories.txt
Performance on training set: 100.00%, 420 correct out of 420 vectors
Comparing: test_results.txt with test_categories.txt
Performance on test set: 97.86%, 411 correct out of 420 vectors
------------------------------
Getting groups and coincidences from the node Level1 in network ' trained_bitworm.xml
====> Group = 0
1 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0
0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 0
0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0
0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0
0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0
0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1
====> Group = 1
0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0
0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0
0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
====> Group = 2
0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0
0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0
0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0
0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0
0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0
0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0
1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0
====> Group = 3
0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0
0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
====> Group = 4
0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
====> Group = 5
0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1
====> Group = 6
0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1
Full set of Level 2 coincidences:
0 -> [ 0. 0. 1. 0. 0. 0. 0. 0.]
1 -> [ 1. 0. 0. 0. 0. 0. 0. 0.]
2 -> [ 0. 0. 0. 1. 0. 0. 0. 0.]
3 -> [ 0. 0. 0. 0. 0. 1. 0. 0.]
4 -> [ 0. 1. 0. 0. 0. 0. 0. 0.]
5 -> [ 0. 0. 0. 0. 0. 0. 1. 0.]
6 -> [ 0. 0. 0. 0. 0. 0. 0. 1.]
7 -> [ 0. 0. 0. 0. 1. 0. 0. 0.]
Monday, March 22, 2010
Downloading and Installing NuPIC on a Windows computer
The main Numenta page is http://www.numenta.com/. From there proceed to the NuPIC downloads page. You need to log in, so register if you haven't already done so. The Windows version is 32 bit; there are also Mac and Linux (both 32 and 64 bit) versions available. The Windows version file size is 112 MB, which took my satellite Internet over 20 minutes to download. Then you need the NuPIC installation instructions. If you are like me, go straight to the Windows NuPIC installation instructions. You also need your license file, which is sent to your email address when you register and download NuPIC.
Oh boy, it comes with a Python installer. Another programming language to learn (I hope not). Add it, in my case, to APL, Cobol, Fortran, PL1, Pascal, Basic, C, C++, PHP, Javascript ... I hope I have not forgotten anyone important.
After downloading and running the installation file, I did run into a hitch in the installation wizard. After the Python installation I got the old "not responding" error in the wizard window. Eventually, after closing some other application windows, I saw that a secondary Python window had popped up and needed to have its Continue buttons pressed. Once that was done the "not responding" error in the main install window went away and I completed the install successfully.
That leaves Python on my system at C:/Python25/
and NuPIC on my system at C:/Program Files/Numenta/nupic-1.7.1/
It also means the first example, BitWorm, ran successfully, although I did not learn anything from it yet.
Next up: the BitWorm example in detail
Thursday, March 18, 2010
Evaluating HTMs: CPT details; specific memories
CPTs are used in Bayesian networks to allow the belief (a set of probabilities about causes) of one node to modify another node. They can be created by algorithms using probability theory in conjunction with known data, the beliefs already established in the two nodes. In HTMs they are learned. As the quantization points are learned, the CPTs are the same as the learned quantization function that links the points to the temporal variables. There are two separate algorithms, but they run in parallel, creating an output to send up the hierarchy to the next node. This will probably become more transparent when we look at the actual algorithms used by the HTM nodes.
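For readers new to the acronym: a CPT is a conditional probability table, one row of conditional probabilities per parent state. A toy example with invented numbers, showing how a belief over parent states gets pushed through a CPT to predict the child's states:

# A toy CPT: P(child observation | parent cause). Numbers are invented.
cpt = {
    "horizontal_line": {"edge_0deg": 0.80, "edge_45deg": 0.15, "edge_90deg": 0.05},
    "vertical_line":   {"edge_0deg": 0.05, "edge_45deg": 0.15, "edge_90deg": 0.80},
}

belief = {"horizontal_line": 0.7, "vertical_line": 0.3}  # belief over causes
prediction = {
    child: sum(belief[p] * cpt[p][child] for p in belief)
    for child in cpt["horizontal_line"]
}
print(prediction)  # {'edge_0deg': 0.575, 'edge_45deg': 0.15, 'edge_90deg': 0.275}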
It is claimed that humans can remember specific details and events, as well as model the world, whereas HTMs don't keep specific memories. The authors talk about how the human brain might accomplish this feat, and how the capability might be added to HTMs. I instead wonder whether they are right about humans remembering specific details of specific events.
It certainly is the naive view, and since I subscribe to the common sense school of philosophy (with my own updates), assailing the view is mainly just an exercise at this point. But consider this: numerous studies have shown that eye witnesses are unreliable. I suspect that a visual memory is not like a photograph, nor is the memory of a song like a recording, nor is the memory of an event a sort of whole sensory record of a period of time. I believe humans do remember things, and can train their memories to be more like recorders, and in particular can memorize speeches, poems, sequences of numbers, etc. But I think the HTM model actually is at least approximately the way that the brain works. Different levels of the neurological system remember, or become capable of recognizing, different levels of details about things. There are mechanisms in the brain that allow recall of these memories on different levels. But I would be careful about assuming that because we can recall an event (or picture, etc.) in more or less detail we must be calling up a recording. We seldom learn anything of any length in detail by simply hearing or seeing it. If you have memorized that the first digits of pi are 3.14159, what is that a recording of? The words for the number sequence as sounded out in English, a visual memory of seeing this number in a particular typeface in a particular paragraph on paper of a particular tone, or an abstract memory corresponding to abstract groups of abstract units? Typically we must be exposed to something many times to be able to remember it or recognize it, just like an HTM.
I think we are so good at reconstructing certain types of memories that we think we have photograph or video-like recordings of them. That is why eye witnesses think they are telling the truth, when they often substitute details from other events into a "memory" [notably, a face from a lineup that actually was not present at a crime scene]. That is why our memories are so often mistaken (I could have sworn I turned off that burner!) and why we can recall so much without having a roomful of DVDs in our brains. Our memories are largely indistinguishable from our intelligence, and are both fragmented in detail and yet easily molded into a whole as necessary. This is why recognition is usually much better than recall.
The more I study HTMs, the more curious I get. I don't know what the next step will be in my investigations, but hopefully I'll let you know soon.
Tuesday, March 9, 2010
Why Time is Necessary for HTM's Learning
The authors use a good example, a cut versus uncut watermelon, to distinguish between pattern matching algorithms and the way HTMs learn to recognize patterns that are created by objects (causes, in HTM vocabulary). Any real world animal, when viewed, presents an almost infinite number of different visual representations. If you use a type of animal, say horses instead of a particular horse, the data is even more divergent. Pattern matching does not work well. But allow an HTM to view an animal or set of animals over time, and it will build up the ability to recognize an animal from different viewpoints: front, back, profile, or against most sorts of backgrounds. To do that requires data presented over time. Data that is close sequentially should be similar but not identical. Early data might be of a horse, head on, far away, which gradually resolves to a horse viewed close up. So over time the HTM can capture the totality of the horse.
Combining recognition of causes with names given by an outside source is also considered. No amount of viewing a horse will tell an HTM that humans call the thing "horse." You can do "supervised learning" with an HTM, training it to associate a name with a cause by imposing states on the top level of the HTM hierarchy. But it should be a simple extension to have a vocabulary-learning HTM and an object-learning HTM in a hierarchy with a learn-to-name-the-object HTM on top.
Once an HTM has learned to recognize images (or other types of data) it can recognize static images (or data). The authors say "The Belief Propagation techniques of the hierarchy will try to resolve the ambiguity of which sequences are active." I am not clear on that. It seems to me that static temporal patterns happen often enough in the real world so that some temporal pattern points will represent static causes. If the horse stands still in the real world, it would generate such temporal patterns. As the data goes up the hierarchy it tends to filter out ambiguity and stabilize causes, so a leaping horse should still be the same as a frozen image of a leaping horse at some point high enough in the hierarchy.
Section 6 is a sort of frequently asked questions part of the paper. I'm not sure if I'll cover all the sections or in what order, and I do want to go back to section 3.3 on belief propagation before closing out this series.
Monday, March 8, 2010
Evaluating HTMs, Part 5: Internal Operations of Nodes
Basically, each node simultaneously does both learning and recognition of spatial and temporal patterns. The output is information about the patterns that can be sent up and down the hierarchy of nodes.
Spatial patterns do not necessarily mean space as in the space-time continuum, although the example used in the essay is of a two dimensional visual space. Space is used in the mathematical sense. The data can be anything quantifiable. The space could be any number of dimensions. For instance the maximum daily surface temperature of the earth would constitute a space within a certain range of degrees Centigrade, to whatever desired level of precision, on a spherical 2D grid representing the surface of the earth to any desired degree of precision. The time sequence in this example would be daily samples. In a digital audio example the time sequence intervals might be something on the order of .0001 seconds.
The node has a significant number of "quantization points" available to categorize the spatial data. Only the most common data patterns, up to the number of quantization points, will be learned. Anything that is not one of the learned patterns will be assigned a probability that it is one of the learned patterns plus some noise.
Having learned the quantization points, the node can start looking for common temporal sequences of them. Again, a limited number of points or memory units are allocated for learned temporal patterns. I can't find where the authors give them a name, so at the risk of being corrected later, I'll call them temporal pattern points. Again, temporal pattern matches don't need to be exact; some noise is tolerated.
Once learning has taken place (learning can continue), the node can work to infer causes. The patterns held in the quantization points, as well as the patterns of these in time held in the temporal pattern points, can be causes (or call them objects, which is the more typical if less precise vocabulary). As time passes the data changes and the causes output to the higher level node(s) of the hierarchy change.
Another very important idea to wrap your head around is that the node, and the HTM, need to deal with probabilities. You might get an exact match with 100% probability; you might even be able to design special situations where you don't need to deal with probability. But the whole point of HTMs (from an engineering standpoint) is to be able to deal with complex real world data, in a manner similar to the human brain. So whether dealing with matching a spatial pattern to the quantization points, or a temporal pattern to the temporal pattern points, you need to think in terms of probability. There is a 42% chance that it matches point 7, an 18% chance it matches point 42, and a 40% chance that it matches none of the quantization points. With the temporal pattern it gets more complicated, since you can't assume that just because a spatial pattern is most likely point 7, it is not in a temporal pattern that goes, say, 49, 3, 7 instead of 49, 3, 42.
Hey, but that is what computers are handy for, figuring probabilities and keeping track of them.
The output to the higher-level nodes might be thought of as: there is a 12% probability that we are seeing temporal pattern point 7, 35% that it is point 16, 48% that it is point 34, and 5% that it is point 49. It seems messy to us, but it is exactly the sort of thing the next level up is looking for. We call it a vector output, but it can also be thought of as a set of data pairing probabilities with points that are arbitrarily assigned to spatial-temporal patterns.
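In code the output is nothing exotic, just a normalized vector; the numbers below are the invented ones from the paragraph above:

import numpy as np

# One node's output: probability mass over its temporal pattern points.
# Indices 7, 16, 34, and 49 are arbitrary labels for learned patterns.
bottomUpOut = np.array([0.12, 0.35, 0.48, 0.05])
assert abs(bottomUpOut.sum() - 1.0) < 1e-9
# The parent node treats this vector as a single uncertain input and
# pools such vectors over time, just as this node pooled its raw data.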
Hey, I think I am beginning to understand this stuff.
Which brings me back to one of those basic science and philosophy questions that made me interested in machine understanding in the first place. If everything is abstract, how do we (the HTM or a living human brain) get the picture of the world that seems so familiar to us? If all the world does is produce neuronal impulses in our bodies, what makes red different from blue, and the junk on my desk resolve easily into envelopes, pens, gadgets, fake wood patterns, and a host of other things?
Next: Why is Time Necessary to Learn?
Saturday, March 6, 2010
Evaluating HTMs, Part 4: The Importance of Hierarchy
"Hierarchical Temporary Memory, Concepts, Theory, and Terminology " by Hawkins and George, Section 3, Why is Hierarchy Important?, draws a detailed picture of the relationship between the structuring of the nodes of an HTM and real-world (or even virtual world) data. To stick closely to the subject of hierarchy, I'll cover subsections 1, 2, and 4, here, leaving subsection 3, Belief Propagation, to be treated as a separate topic.
If you don't understand the concept of hierarchy, try Hierarchy at Wikipedia.
I think it is best to start with "3.2 The hierarchy of the HTM matches the spatial and temporal hierarchies of the real world." Hierarchies are not always patterns humans impose upon the sensory data we receive from the external world. Each whole has its parts, as a face has eyes, ears, a mouth and nose as well as other features.
The world itself embodies the principle of locality. Spatial and temporal closeness and distance can be interpreted as a hierarchy, if a more abstract one. One might define "close" as meaning within a nanometer and a nanosecond, with hierarchical levels covering distances and times grouped by factors of twos or tens, up to the size of the cosmos. Or whatever is convenient for the data you are learning about. The bottom layer of the HTM hierarchy learns from the smallest divisions, and passes its interpretations (beliefs) up the hierarchy. Thus in music, if the data is already in the form of notes, the bottom layer might deal with two-note sequences, the next layer with 4 note sequences, then 8 notes, 16 notes, 32 notes, on up to the number of notes in a symphony.
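A sketch of that music example in code, purely illustrative: pair notes at the bottom level, give each pair a name, then pair the names, and so on up the hierarchy:

def pool_pairs(sequence, names):
    # Replace each adjacent pair with a name for that pair, halving
    # the sequence length; one level of a toy hierarchy.
    out = []
    for a, b in zip(sequence[0::2], sequence[1::2]):
        out.append(names.setdefault((a, b), "g%d" % len(names)))
    return out

names = {}
melody = ["C", "E", "C", "E", "G", "A", "G", "A"]
level1 = pool_pairs(melody, names)  # ['g0', 'g0', 'g1', 'g1']
level2 = pool_pairs(level1, names)  # ['g2', 'g3']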
Music offers a one-dimensional (or two, if you plot the frequency of the notes) example, but HTMs should be able to deal with higher numbers of dimensions as long as the causes have a hierarchical structure.
Note the design guidance at the end of the section. Our HTM designs should be targeted at problems that have appropriate space-time structure. The designs need to capture local correlations first. And the hierarchical structure of nodes should be designed to efficiently model the problem space.
Now back to 3.1, "Shared representations lead to generalization and storage efficiency." The belief is that HTMs are efficient at learning complex data and causes. In other words, HTMs scale well, in computer hardware terms of memory size and computing power. This is possible because the lower levels of the HTM interpret the data as what might be called micro-causes, or cause modules. These modules can be reused by any of the causes found much higher in the HTM. This mimics what we know of the human visual pathway, where at the lower levels nerves appear to respond to small features on the retina like spots, short lines at various angles, simple changes in contrast, etc. Using the human face as an example, the HTM might recognize eyes, lips, proportions, etc., and categories within these features. Almost all six billion human faces presently on earth would be interpretable in terms of these basic components and their spatial relationships. To represent each of the faces you don't need 6 billion 10 megapixel bitmap pictures. You just need 6 billion summaries, each of which could probably be represented with a few bytes of data. Recognition would resolve to summarizing the new picture of a face and then finding the closest summary already held by the HTM.
The authors point out that "the system cannot easily learn to recognize new objects that are not made up of previously learned sub-objects." We see this in human behavior, from the household chore level right up to big pictures like evolution and relativity staring large groups of scientists in the face for decades before a Darwin or an Einstein said "I recognize a new, high-level causation here."
Within the section is a helpful explanation about "quantization points," which I said were left unclear in section 2. It gives the reason for having a much lower number of quantization points than there are possible events in the event space. It points out that in a 10 by 10 square of binary (black or white) pixels, there are 2 to the 100th different patterns. By limiting the number of quantization points you force the node to group every input image into a type of pattern (some examples could be lines with various orientations, spots that move in a particular direction, more black on the right or left, etc.). These would be "the most common patterns seen by the node during training."
In section 3.4 the authors give an introductory look at the idea that HTMs can pay attention to certain aspects of the data. In other words, just as you might focus on your newspaper while riding public transportation to work, an HTM can pick some level of the hierarchy of data to focus on. Suppose it is a facial recognition HTM and it thinks a face presented to it could be George or Herbert. By focusing on a particular aspect of the face, say the nose-to-lips difference, it might become more certain that the face belongs to George. People can do this both consciously and unconsciously.
If an HTM could do that, it would be really cool.
Next: Evaluating HTMs, Part 5: Belief Propagation
Thursday, March 4, 2010
How Do HTMs Learn?
Evaluating HTMs, Part 3: How do HTMs Learn?
"Hierarchical Temporary Memory, Concepts, Theory, and Terminology " by Hawkins and George, Section 2, "How Do HTMs Discover and Infer Causes", gives an overview of the internal mechanisms of HTMs.
Specifically, it gives an overview of how HTMs learn. This prompted me to think about the difference between learning and discovery. They could be the same thing, but learning for humans often implies a teacher presenting information to be learned to a student. Discovery implies coming upon something new (perhaps a relationship between already known objects) and realizing that it is new and should be remembered.
Each node in an HTM uses the same algorithm. The nodes are arranged hierarchically, with representations typically showing a row of nodes at the bottom that take input. Each row above the bottom has progressively fewer nodes. The top node is a single node. Its output is a vector that represents a cause, or object related to the data in a causal fashion. In fact each node does this, passing its output vector to the next higher row of nodes. So causes are built up hierarchically. All data and discoveries include a time element. In a visual field, for instance, the time element could be no change in a part of the field, or changing color with time, or following a spot of color from one part of the visual field to another over a course of time.
Get used to the technical use of the term "belief" if you want to follow discussions about HTMs. This term is used extensively in probabilistic reasoning theory. "A belief is an internal state of each node," but it does correspond to a probability that there is a causal relationship in the data. "I believe the lion must have escaped from the zoo," is a sentence that conveys to us that a person lives where lions do not live in the wild; it differs from "I know ..." because the speaker is admitting there are other possible causes. In a simple HTM, in a lower node, a belief might be something like "28% probability that this is a horizontal line, 16% that this is two animal eyes, etc." Again, beliefs are represented in software by vectors, but they are not generally identical to the output vectors of the nodes.
In training or learning, the HTM forms new beliefs at the bottom of the hierarchy of nodes first. More complex beliefs can only be created once lower level beliefs exist, but the entire process is flexible. If a lower level node alters its belief, it tends to affect higher level nodes. So learning is not just memorization.
So how does a node do all this? Nodes are given a set number of "quantization points." Here the authors are not very clear. The input pattern is assigned to one of the quantization points. And/or "the node decides how close (spatially) the current input is to each of the quantization points and assigns a probability to each quantization point." How it decides is presumably an algorithm. With enough quantization points, each input data set could be matched exactly to a point. Would that setup cause the node to fail? To do as the authors say, the assumption is there are fewer quantization points than there are possible inputs.
Step two is "the node looks for common sequences of these quantization points" and "represents each sequence with a variable." So you have to ask, why assign probabilities, why not just assign closest fits? In any case the output variable represents a sequence of quantization points based on the sequence of input data.
Admitting that the authors are introducing the topic, and its vocabulary, still I would have liked more than two short paragraphs on the internal operations of HTM nodes.
Interestingly (and copying what is known about the cortex of mammal brains), information can move both up and down the hierarchy of nodes. As just described, data moving up the hierarchy consists of temporal variables. Data going down the hierarchy represents the "distribution over the quantization points." That would be a probability distribution.
What I suspect the authors mean is that there is a mechanism to alter the quantization points themselves. Points with long-term zero percent probabilities don't help resolve ambiguity. The set of point probabilities being sent down the hierarchy allows the (lower) node to "take a relatively stable pattern from its parent node(s)" and "turn it into a sequence of spatial patterns."
The claim is that over time "patterns move from the bottom of the hierarchy to the top." In effect rather than sending the raw data up the hierarchy, the nodes send "names" of data sequences up the hierarchy.
That would be pretty cool, and I'd like to know exactly how it happens, but this is, after all, only an introduction.
Next: Why is a hierarchy important?
Wednesday, March 3, 2010
Evaluating HTMs, Part 2: What HTMs Do
"Hierarchical Temporary Memory, Concepts, Theory, and Terminology " by Hawkins and George, Section 1, What Do HTMs Do?, asserts that HTMs can discover causal relationships in data presented to them. Once the causality is established, they can infer the cause of a new input. They can predict (with some accuracy) future data sets, and they can use the above three abilities to direct (choose) behavior.
It should be pointed out immediately that each of these 4 abilities has been demonstrated by other computational systems. Neural network software that could discover certain types of relationships from raw data was available at least as early as the 1980's. Many mathematical forms of analysis can find causal relationships within data, for instance regression analysis. In any system, once a relationship is established, recognizing it should present no great difficulty. Nor should making predictions, or using the system to control behavior.
What is interesting about the set of claims for HTMs is that they work together holistically (like the human brain) and should be stackable. That is, the HTM systems should be able to deal with external relationships of increasing complexity by stacking HTM subsystems into an appropriate system. In addition, the HTMs can find relationships in time (sequential or temporal relationships).
In fact, if we call some of the data "objects," for an HTM the objects "have a persistent structure; they exist over time." The authors call the objects "causes." In theory an HTM system could deal with multiple forms of data coming in directly from the world, but usually (for now) the HTM deals with a specific subset of data (much as when a human, say, concentrates on music, or on reading). The data could be a computer file, or a stream of data from input devices.
For the HTM to work the causes should be relatively stable, but should generate data that changes over time, as a horse moving across a visual field, or a conversation between two people. Causes are typically multiple.
The discovery of the causal relationships is a learning process. During learning the HTM builds representations of causes in the form of vectors. The relationship is expressed as a set of probabilities for causes; this set is called a "belief." The causes, relationships, and beliefs can be quite complex if the HTM is complex enough. In particular, hierarchies of causes and beliefs can be learned.
The authors say an HTM, once it has gone through learning, can "infer causes of a novel input." This means that if it is presented with new data, it will try to match the data up to one of the causes it knows about. This is basically pattern recognition, and there are other systems, including certain neural networks, that do this well in certain situations. A good point made by the authors is that if a million pixel visual field (of a scene with motion) is used as the input, it would be rare that an exact pattern would be input twice. So inference, matching a set of data to the closest cause, is a necessity. In the older neural networks causes were typically static; adding a time dimension to the data usually makes it easier for an HTM to learn and infer. I should point out, however, that "infer causes of novel input," to me can mean something more than is claimed for an HTM. For humans, it can mean a deduction, or even a deep set of deductions, rather than just recognizing a pattern or its degree of ambiguity. Then again, perhaps a sufficiently complete HTM system could do even that.
The ability to predict is the third leg of what HTMs do. In other words, given a sequence already encountered, an HTM will predict that the sequence is happening again. This does not sound like much, but it is an ability that is crucial to machine understanding. In particular the authors point to priming. Given the latest data, the HTM makes a prediction and notes differences between what is predicted and what happens. If data is ambiguous or noisy, the HTM may fill in with the predicted data. If a prediction is fed back into the HTM as data, this is akin to thinking or imagining. Thus the machine could plan for the future. The authors claim "HTMs can do this well." Imagine a sheep-herding dog application: the better it can predict the behavior of the sheep, the less energy it should need to herd them.
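A toy version of the priming idea, my illustration rather than Numenta's mechanism: store learned sequences, then, given the most recent inputs, report what followed them during training. Names like dog_flanks are invented for the sheep-herding example:

def predict_next(learned_sequences, recent):
    # Return every element that followed the subsequence 'recent'
    # in any learned sequence; an empty list means no prediction.
    n = len(recent)
    predictions = []
    for seq in learned_sequences:
        for i in range(len(seq) - n):
            if seq[i:i + n] == recent:
                predictions.append(seq[i + n])
    return predictions

learned = [["sheep_left", "dog_flanks", "sheep_regroup"],
           ["sheep_scatter", "dog_circles", "sheep_regroup"]]
print(predict_next(learned, ["dog_flanks"]))  # ['sheep_regroup']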
Finally, HTMs can direct behavior. Of course almost any device, even a simple mechanical one, can direct behavior. A mouse trap, given a certain type of input, will engage in a known, set behavior. Still, mentioning this for HTMs is important because that is exactly what we would expect artificial intelligence or machine understanding to be used for: behavior. An important point is that "From the HTM's perspective, the [output] system it is connected to is just another object in the world." In other words, an HTM can learn about how its own outputs act as causes in the world.
If you want to get an idea of the potential power of HTMs, before wading through a lot of other materials, section 1.4 of the paper, "Direct Behavior," is a great starting point.
I'm excited after reading this part.
Possible Acronym: LIPD (learn, infer, predict, direct)
Next: How do HTMs discover and infer causes?
Tuesday, March 2, 2010
Evaluating HTMs, Part 1
This will be a long project. The current plan is to read and critically summarize "Hierarchical Temporal Memory: Concepts, Theory, and Terminology" by Hawkins and George; read and comment on the "Getting Started with NuPIC Guide;" download and try NuPIC; and then assess whether to continue the project.
In preparation I have previously read On Intelligence by Jeff Hawkins twice; read and commented on Towards a Mathematical Theory of Cortical Micro-circuits by George and Hawkins [see early entries of this blog]; took a course in neural networks at SDSU long ago; and read From Neuron to Brain to get detailed information on biological neurons. I know how to write software and I am fairly good at math.
Still, my head hurts just thinking about it, but here we go (into "HTM Concepts, Theory, and Terminology"):
In the Introduction the authors remind us that the human mind/brain has capabilities that computers have so far been unable to duplicate. HTMs are a memory system that can learn to solve certain problems. They are organized as hierarchical systems of nodes. HTMs are currently implemented as software on traditional computer hardware. "The learning curve can be steep."
That is all ground I have already covered in this blog. My learning curve is probably going to be steeper than that of most students who would be interested in this topic, but hopefully watching me struggle will be helpful to at least a few people.
Wednesday, February 24, 2010
Neuron to Brain Finished
Except for the early chapter on the visual cortex and the last chapter, "Genetic and Environmental Influences in the Mammalian Visual System," there is not much in the book that is directly helpful with issues of machine understanding.
And yet it does give an appreciation of the neural system, including its biochemistry. It would be a wonder of nature even if it were not capable, in human form, of smashing atoms and writing poetry. I had been working with simplistic ideas about how individual neurons work. That is actually fine for computer models. The fact that there are many modes of operation of neurons shows that evolution can make good use of both true redundancy and the fine tunings that come from slight variations.
I uncovered a small, common salamander today when pulling wood from the pile to bring up to the woodstove. There is no pond near the wood pile, so this creature had to wander some distance to get to this shelter. It is a good shelter too, complete with insects and other arthropods that make life easy for a salamander. I would tend to say that a salamander does not offer much in the way of understanding capabilities. But salamanders have been navigating the world and keeping alive, so even if we want to think of them as not capable of thought, still they have the necessary degree of intelligence to get them through their generations of life.
I am looking at the general issue of putting neurons together in patterns that could be said to be capable of at least the rudiments of understanding. There is nothing worth reporting on yet, so I'll probably go back to reporting on what Numenta is doing. The geniuses there are working on the problem full time, and are claiming some progress.
Tuesday, January 12, 2010
Bayesian Wasteland?
Regarding Probabilistic Reasoning, so far I have seen a lot of interesting work on the problem of combining probability calculations with logic. I just finished the section on Markov Networks and am about to read up on Bayesian Networks. My problem so far is that I don't see any advantages to marrying anything I've seen of probabilistic reasoning to Jeff Hawkins's theory of predictive memory, despite having read "Towards a Mathematical Theory of Cortical Micro-circuits," as reported in previous entries. Then again, sometimes I have trouble taking up novel ideas. But my impression so far is that neural networks are not operating on a probability basis. The closest I can get, so far, to that kind of model is a signal-mixing basis, where analog functions might represent Pearl's probabilities. Nor do I see how the probability network models can cope with invariants. The word invariant is not found in the index to Probabilistic Reasoning. I have higher hopes for a tensor model, even though my own work on that is very preliminary. [I am reminded of the two seemingly totally different mathematical methods used in early quantum mechanics, which were then proven to be equivalent.]
In From Neuron, I have been reading astonishing details about how synapses and single nerve cells work, including how experiments were conducted, in mind-numbing detail. I am just getting to how neurons and sets of neurons have been shown to operate, with the first example being neurons that sense the stretching of muscles. Again, I am pointed to tensors, which can be used to represent how multiple muscles representing various degrees of freedom of motion can lead to a coherent knowledge of where a body part is in three-dimensional, Euclidean-modeled space.
More oddly, Neuron has basically nothing about Hebbian learning. True, the book is dated 1984. But did no one even try to find a physical basis for Hebbian learning as of that date? If you know of a definitive paper that appears to prove a biochemical mechanism for Hebbian learning, let me and my readers know.
I intended to explain what I was reading, in suitable chunks, in this blog. But it is easier to just keep reading at this point, rather than writing about details that I am not even sure are important yet. I keep finding I have to go back to basics. Today, worrying about tensors, muscles, feedback, and a neuron-level learning model, I am revisiting a two-neuron learning model that, oddly, I first worked on back in the 1980's. If anything comes of it, I'll let you know here.