See also Part 1, Part 2, Part 3
"Hierarchical Temporary Memory, Concepts, Theory, and Terminology " by Hawkins and George, Section 3, Why is Hierarchy Important?, draws a detailed picture of the relationship between the structuring of the nodes of an HTM and real-world (or even virtual world) data. To stick closely to the subject of hierarchy, I'll cover subsections 1, 2, and 4, here, leaving subsection 3, Belief Propagation, to be treated as a separate topic.
If you don't understand the concept of hierarchy, try Hierarchy at Wikipedia.
I think it is best to start with "3.2 The hierarchy of the HTM matches the spatial and temporal hierarchies of the real world." Hierarchies are not always patterns humans impose upon the sensory data we receive from the external world. Each whole has its parts, as a face has eyes, ears, a mouth and nose as well as other features.
The world itself embodies the principle of locality. Spatial and temporal closeness and distance can be interpreted as a hierarchy, if a more abstract one. One might define "close" as meaning within a nanometer and nano secord, with hierarchical levels covering distances and times grouped by factors of twos or tens, up to the size of the cosmos. Or whatever is convenient for the data you are learning about. The bottom layer of the HTM hierarchy learns from the smallest divisions, and passes its interpretations (beliefs) up the hierarchy. Thus in music if the data is already in the form of notes, the bottom layer might deal with two-note sequences, the next layer with 4 note sequences, then 8 notes, 16 notes, 32 notes, on up to the number of notes in a symphony.
Music offers a one-dimensional (or two, if you plot the frequency of the notes) example, but HTMs should be able to deal with higher numbers of dimensions as long as the causes have a hierarchical structure.
Note the design guidance at the end of the section. Our HTM designs should be targetted at problems have appropriate space-time structure. The designs need to capture local correlations first. And the hierarchical structure of nodes should be designed to efficiently model the problem space.
Now back to 3.1, "Shared representaions lead to generalization and storage efficiency." The belief is that HTMs are efficient at learning complex data and causes. In other words, HTMs scale well. This can be, in computer hardware terms, memory size and computing power. This is possible because the lower levels of the HTM break interpret the data into what might be called micro-causes. Or cause modules. These modules can be reused by any of the causes found much higher in the HTM. This mimics what we know of the human visual pathway, where at the lower levels nerves appear to respond to small features on the retina like spots, short lines at various angles, simple changes in contrast, etc. Using the human face as an example, the HTM might recognize eyes, lips, proportions, etc., and categories within these features. Almost all six billion human faces presently on earth would be interpretable in terms of these basic components and their spatial relationships. Two represent each of the faces you don't need 6 billion 10 megapixel bitmap pictures. You just need 6 billion summaries that could probably be represented with a few bytes of data. Recognition would resolve to summarizing the new picture of a face and then finding the closest summary already held by the HTM.
The authors point out that "the system cannot easily learn to recognize new objects that are not made up of previously learned sub-objects." Which we see in human behavior from the household chore level right up to big pictures like evolution and relativity staring large groups of scientists in the face for decades before a Darwin or Einstein said "I recognize a new, high-level causation here."
Within the section is a helpful explanation about "quantization points," which I said were left unclear in section 2. It gives the reason for having a much lower number of quantization points than there are possible events in the event space. It points out that in a 10 by 10 square of binary (black or white) pixels, there are 2 to the 100th different patterns. By limiting the number of quantization points you force the node to group every input image into a type of pattern (some examples could be lines with various orientations, spots that move in a particular direction, more black on the right or left, etc.). These would be "the most common patterns seen by the node during training."
In section 3.4 the authors give an introductory look at the idea that HTMs can pay attention to certain aspects of the data. In other words, just as you might focus on your newspaper while riding public transportation to work, an HTM can pick some level of the hierarchy of data to focus on. Suppose it is a facial recognition HTM and it thinks a face present to it could be George or Herbert. By focussing on a particular aspect of the face, say the nose-to-lips difference, it might become more certain that the face belongs to George. People can do this both consciously and unconsicously.
It an HTM could do that, it would be really cool.
Next: Evaluating HTMs, Part 5: Belief Propagation