
Learning Hierarchically-Structured Concepts II: Overlapping Concepts, and Networks with Feedback

  • Conference paper
  • First Online:
Structural Information and Communication Complexity (SIROCCO 2023)

Abstract

We continue our study from Lynch and Mallmann-Trenn (Neural Networks, 2021), of how concepts that have hierarchical structure might be represented in brain-like neural networks, how these representations might be used to recognize the concepts, and how these representations might be learned. In Lynch and Mallmann-Trenn (Neural Networks, 2021), we considered simple tree-structured concepts and feed-forward layered networks. Here we extend the model in two ways: we allow limited overlap between children of different concepts, and we allow networks to include feedback edges. For these more general cases, we describe and analyze algorithms for recognition and algorithms for learning.


Notes

  1. This assumption and the next are just for uniformity, to reduce the number of parameters and simplify the math.

  2. This is a departure from our usual models [4,5,6], in which the network determines all the values of the state components for non-input neurons. We expect that the network could be modeled as a composition of sub-networks. One of the sub-networks—a Winner-Take-All (WTA) sub-network—would be responsible for setting the engaged state components. We will discuss the behavior of the WTA sub-network in Sect. F, but a complete compositional model remains to be developed.

  3. We might also consider interleaved learning of higher-level concepts and their descendants. The idea is that partial learning of a concept c can be enough to make rep(c) fire, which can help in learning parents of c. We mention this as future work, in Sect. H.

  4. The mention of the limited-overlap property is just for emphasis, since all of the concept hierarchies of this paper satisfy this property.

  5. We use O notation here instead of giving actual constants. We omit a technical assumption involving roundoffs.

  6. In any case, we can make the decomposition work by adding our new, not-very-severe, inequality as an assumption.

  7. The extension to noisy learning is the main reason that we used the incremental Oja’s rule. If the concepts were presented in a noise-free way, we could have allowed learning to occur all at once.

  8. For instance, each layer \(\ell -1\) neuron v might have an outgoing edge to a special threshold element that fires exactly if the potential produced by v on the edge (v, u) is at least w, i.e., if v fires and \(weight(v,u) \ge w\). Then another neuron associated with u might collect all this firing information from all layer \(\ell -1\) neurons v and check whether the number of firing neurons reaches the threshold \(\lfloor o \cdot k \rfloor + 1\), which is equivalent to saying that the number of firing neurons is strictly greater than \(o \cdot k\). If this special neuron fires, it excites u to act as an input to the basic WTA; if it does not, u drops out of contention.
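As an illustration, the gating computation described in this footnote can be sketched in a few lines. The encoding (a set of firing neurons and a dictionary of edge weights) and all names are ours, not part of the paper's formal model:

```python
import math

def strong_support_gate(firing, weights, u, w, o, k):
    """Sketch (hypothetical encoding) of the footnote's gating circuit.

    firing:  set of layer ell-1 neurons v that currently fire
    weights: dict mapping (v, u) -> edge weight
    Returns True iff the number of firing v with weight(v, u) >= w
    reaches floor(o*k) + 1, i.e. strictly exceeds o*k, in which case
    u is excited to act as an input to the basic WTA.
    """
    # One "threshold element" per edge (v, u): it fires iff v fires
    # and the potential v produces on the edge is at least w.
    count = sum(1 for v in firing if weights.get((v, u), 0.0) >= w)
    # The collector neuron associated with u counts these firings.
    return count >= math.floor(o * k) + 1
```

With \(k = 10\) and \(o = 0.2\), for example, u stays in contention exactly when more than 2 of its strong-weight neighbors fire.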

  9. We are omitting mention here of some trivial technical assumptions, like small lower bounds on \(\tau \).

  10. This seems reasonable since we are not considering noise during this second pass. Of course, that might be interesting to consider at some point.

  11. The scaled case isn’t actually worked out in Sect. 6 but should follow as a natural extension of the un-scaled results.

References

  1. Felleman, D.J., Van Essen, D.C.: Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1(1), 1–47 (1991)


  2. Hubel, D., Wiesel, T.: Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. J. Physiol. 160, 106–154 (1962)


  3. Hubel, D.H., Wiesel, T.N.: Receptive fields of single neurones in the cat’s striate cortex. J. Physiol. 148(3), 574–591 (1959). https://doi.org/10.1113/jphysiol.1959.sp006308

  4. Lynch, N., Mallmann-Trenn, F.: Learning hierarchically structured concepts. Neural Netw. 143, 798–817 (2021)


  5. Lynch, N., Musco, C.: A basic compositional model for spiking neural networks, April 2021. arXiv:1808.03884v2. Also, submitted for publication

  6. Lynch, N., Musco, C., Parter, M.: Winner-take-all computation in spiking neural networks, April 2019. arXiv:1904.12591

  7. Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)


  8. Valiant, L.G.: Circuits of the Mind. Oxford University Press, Oxford (2000)


  9. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53


  10. Zhou, B., Bau, D., Oliva, A., Torralba, A.: Interpreting deep visual representations via network dissection. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2131–2145 (2019). https://doi.org/10.1109/TPAMI.2018.2858759


Acknowledgement

This work was in part supported by the National Science Foundation awards CCF-1810758 and CCF-2139936, as well as by EPSRC EP/W005573/1.

Author information

Correspondence to Frederik Mallmann-Trenn.

Appendices

A Recognition in Networks with Stochastic Firing

Another type of uncertainty, besides approximate weights, arises when neuron firing is determined stochastically, for example, using a standard sigmoid function. See [5] for an example of a model that uses this strategy. In this case, we cannot make absolute claims about recognition, but we would like to assert that correct recognition occurs with high probability. Here we consider this type of uncertainty in the situation where weights are exactly 1 or 0, as in Sect. 5.1. Extension to allow approximate weights, as well as to networks with feedback, is left for future work.

Following [5], we assume that the potential incoming to a neuron u, \(pot^u\), is adjusted by subtracting a real-valued bias value, and the resulting adjusted potential, x, is fed into a standard sigmoid function with temperature parameter \(\lambda \), in order to determine the firing probability p. Specifically, we have:

$$ p(x) = \frac{1}{1 + e^{-x/\lambda }}, $$

where \(x = pot^u - bias\).

Let \(\delta \) be a small target failure probability. In terms of our usual parameters n, f, k, and \({{\,\mathrm{\ell _{max}}\,}}\), and the new parameters \(\lambda \) and \(\delta \), our goal is to determine values of \(r_1\) and \(r_2\) so that the following holds: Let \(\mathcal C\) be any concept hierarchy satisfying the limited-overlap property. Assume that \(B \subseteq C_0\) is presented at time t. Then:

  1. If \(c \in supp_{r_2}(B)\), then with probability at least \(1 - \delta \), rep(c) fires at time \(t + level(c)\).

  2. If \(c \notin supp_{r_1}(B)\), then with probability at least \(1 - \delta \), rep(c) does not fire at time \(t + level(c)\).

In order to determine appropriate values for \(r_1\) and \(r_2\), we start by considering the given sigmoid function. We determine real values \(b_1\) and \(b_2\), with \(-\infty< b_1 < b_2 < \infty \), that guarantee the following:

  1. If the adjusted potential x is \(\ge b_2\), then the probability p(x) of firing is \(\ge 1 - \delta '\), and

  2. If the adjusted potential x is \(< b_1\), then the probability p(x) of firing is \(\le \delta '\).

Here, we take \(\delta '\) to be a small fraction of the target failure probability \(\delta \), namely, \(\delta ' = \frac{\delta }{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}\). We choose \(b_1\) such that \(p(b_1) = \frac{1}{1+ e^{-b_1/\lambda }} = \delta '\) and \(b_2\) such that \(p(b_2) = \frac{1}{1+ e^{-b_2/\lambda }} = 1 - \delta '\). In other words, \(b_1 = \lambda \ log(\frac{\delta '}{1-\delta '})\), and \(b_2 = \lambda \ log(\frac{1-\delta '}{\delta '})\).

Next, we compute values for \(r_1\) and \(r_2\) based on the values of \(b_1\) and \(b_2\). The values of \(b_1\) and \(b_2\) are adjusted potentials, whereas \(r_1\) and \(r_2\) are fractions of the population of children. To translate, we use \(r_1 = \frac{b_1 + bias}{k}\) and \(r_2 = \frac{b_2 + bias}{k}\). This makes sense because having \(r_2 k\) children firing yields a potential of \(r_2 k\) and an adjusted potential of \(r_2 k - bias = b_2\), while having strictly fewer than \(r_1 k\) children firing means that the potential is strictly less than \(r_1 k\) and the adjusted potential is strictly less than \(r_1 k - bias = b_1\).

Note that the requirements on \(r_1\) and \(r_2\) impose some constraints on the value of bias. Namely, since we require \(r_1 \ge 0\), we must have \(b_1 + bias \ge 0\). And since we require \(r_2 \le 1\), we must have \(b_2+bias \le k\). Within these constraints, different values of bias will yield different values of \(r_1\) and \(r_2\).
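To make the derivation concrete, here is a small numerical sketch of the whole pipeline, solving \(p(b_1) = \delta '\) and \(p(b_2) = 1 - \delta '\) for the adjusted potentials and translating them into \(r_1, r_2\). The function name and the example parameter values are our own, chosen only so the bias constraints are satisfiable:

```python
import math

def sigmoid_rates(lam, bias, delta, k, ell_max):
    """From the temperature lam, the bias, and the target failure
    probability delta, derive the adjusted-potential cutoffs b1, b2
    and translate them into the support fractions r1, r2."""
    delta_p = delta / k ** (ell_max + 1)          # per-neuron budget delta'
    b1 = lam * math.log(delta_p / (1 - delta_p))  # p(b1) = delta'
    b2 = lam * math.log((1 - delta_p) / delta_p)  # p(b2) = 1 - delta'
    # A potential of r*k yields adjusted potential r*k - bias, so we
    # need r1*k - bias = b1 and r2*k - bias = b2.
    r1 = (b1 + bias) / k
    r2 = (b2 + bias) / k
    assert 0 <= r1 <= r2 <= 1, "bias must satisfy b1+bias >= 0 and b2+bias <= k"
    return b1, b2, r1, r2
```

For instance, \(\lambda = 0.3\), \(bias = 4\), \(\delta = 0.01\), \(k = 10\), \({{\,\mathrm{\ell _{max}}\,}}= 2\) give \(\delta ' = 10^{-5}\) and roughly \(r_1 \approx 0.05\), \(r_2 \approx 0.75\).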

With these definitions, we can prove:

Theorem 8

Let \(\mathcal C\) be any concept hierarchy satisfying the limited-overlap property. Let \(\delta \) be a small target failure probability. Let \(r_1\) and \(r_2\) be determined as described above.

Assume that \(B \subseteq C_0\) is presented at time t. Then:

  1. If \(c \in supp_{r_2}(B)\), then with probability at least \(1 - \delta \), rep(c) fires at time \(t + level(c)\).

  2. If \(c \notin supp_{r_1}(B)\), then with probability at least \(1 - \delta \), rep(c) does not fire at time \(t + level(c)\).

Proof

(Sketch:) Fix any concept c in C, the set of concepts in \(\mathcal C\). Note that c has at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1}\) descendants in C.

For Property 1, suppose that \(c \in supp_{r_2}(B)\). Then for every descendant \(c'\) of c with \(level(c')\ge 1\) and \(c' \in supp_{r_2}(B)\), \(rep(c')\) fires at time \(t+level(c')\) with probability at least \(1 - \delta '\), assuming that for each of its children \(c'' \in supp_{r_2}(B)\), \(rep(c'')\) fires at time \(t+level(c'')\). Using a union bound for all such \(c'\), we obtain that, with probability at least \(1 - k^{{{\,\mathrm{\ell _{max}}\,}}+1} \ \delta ' = 1 - \delta \), rep(c) fires at time \(t+level(c)\).

In a bit more detail, for each descendant \(c'\) of c with \(level(c') \ge 1\) and \(c' \in supp_{r_2}(B)\), let \(S_{c'}\) denote the set of executions in which \(rep(c')\) does not fire at time \(t + level(c')\), but for each of its children \(c'' \in supp_{r_2}(B)\), \(rep(c'')\) fires at time \(t+level(c'')\). Then \(\bigcup _{c'} S_{c'}\) includes all of the executions in which rep(c) does not fire at time \(t + level(c)\).

Moreover, we claim that the probability of each \(S_{c'}\) is at most \(\delta '\): The fact that \(c' \in supp_{r_2}(B)\) implies that at least \(r_2 k\) children \(c''\) are in \(supp_{r_2}(B)\). Since we assume that all of these \(rep(c'')\) fire at time \(t + level(c'')\), the potential incoming to \(rep(c')\) for time \(t + level(c')\) is at least \(r_2 k\), and the adjusted potential is at least \(r_2 k - bias = b_2\). Then the first property of \(b_2\) yields probability \(\le \delta '\) of \(rep(c')\) not firing at time \(t + level(c')\).

For Property 2, suppose that \(c \notin supp_{r_1}(B)\). Then for every descendant \(c' \notin supp_{r_1}(B)\), \(rep(c')\) does not fire at time \(t+level(c')\), with probability at least \(1 - \delta '\), assuming that for each of its children \(c'' \notin supp_{r_1}(B)\), \(rep(c'')\) does not fire at time \(t+level(c'')\). Using a union bound for all \(c' \notin supp_{r_1}(B)\), we obtain that, with probability at least \(1 - k^{{{\,\mathrm{\ell _{max}}\,}}+1} \ \delta ' = 1 - \delta \), rep(c) does not fire at time \(t+level(c)\).

In a bit more detail, for each descendant \(c'\) of c with \(level(c') \ge 1\) and \(c' \notin supp_{r_1}(B)\), let \(S_{c'}\) denote the set of executions in which \(rep(c')\) fires at time \(t + level(c')\), but for each of its children \(c'' \notin supp_{r_1}(B)\), \(rep(c'')\) does not fire at time \(t+level(c'')\). Then \(\bigcup _{c'} S_{c'}\) includes all of the executions in which rep(c) fires at time \(t + level(c)\), and the probability of each \(S_{c'}\) is at most \(\delta '\); the argument is similar to that for Property 1.

This has been only a sketch of how to analyze stochastic firing, in the simple case of feed-forward networks and exact weights. We leave the complete details of this case, as well as extensions to include approximate weights and networks with feedback, for future work.

B Missing Statements and Proofs of Section 2

The following monotonicity lemma says that increasing the value of the parameter r can only decrease the supported set.Footnote 4

Lemma 8

Let \(\mathcal C\) be any concept hierarchy satisfying the limited-overlap property, and let \(B \subseteq D_0\). Consider \(r, r'\), where \(0 \le r \le r' \le 1\). Then \(supp_{r'}(B) \subseteq supp_r(B)\).

The following lemma says that any concept c is supported by its entire set of leaves. It can be proven by induction on level(c).

Lemma 9

Let \(\mathcal C\) be any concept hierarchy satisfying the limited-overlap property. If c is any concept in C, then \(c \in supp_{1} (leaves(c))\).

Proof

(of Lemma 4). We proceed by induction on t. The base case is \(t=0\). In this case \(c \in S(0)\), which implies that \(c \in supp_{r,0}(B)\), as needed.

For the inductive step, assume that \(c \in S(t+1)\) and \(parent(c) \notin S(t+1)\). If \(c \in S(t)\), then since \(parent(c) \notin S(t)\), the inductive hypothesis tells us that \(c \in supp_{r,0}(B)\). So assume that \(c \notin S(t)\). Then since \(parent(c) \notin S(t)\), c is included in \(S(t+1)\) based only on its children who are in S(t). Since \(c \notin S(t)\), the parent of each of these children is not in S(t). Then by inductive hypothesis, all of c’s children that are in S(t) are in \(supp_{r,0}(B)\). Since they are enough to support c’s inclusion in \(supp_{r,0}(B)\), we have that \(c \in supp_{r,0}(B)\).
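The recursive definitions behind these lemmas can be phrased as a simple fixpoint computation. The following sketch uses our own encoding of a concept hierarchy (a children map and a parent map); it illustrates the \(supp_{r,f}\) definition and is not code from the paper:

```python
def supp(B, children, parent, r, f, k):
    """Compute supp_{r,f}(B) iteratively: start from S(0) = B and add a
    concept c once |children(c) & S(t)| + f * |parents(c) & S(t)|
    reaches r*k.  Taking f = 0 recovers the bottom-up notion
    supp_{r,0}(B) used in Lemma 4."""
    S = set(B)
    while True:
        added = set()
        for c in children:                      # non-leaf concepts only
            if c in S:
                continue
            support = len(children[c] & S)      # upward support
            if parent.get(c) in S:              # downward support (tree: one parent)
                support += f
            if support >= r * k:
                added.add(c)
        if not added:
            return S
        S |= added
```

On a toy hierarchy with \(k = 2\) (concepts x = {l1, l2}, y = {l3, l4}, root = {x, y}), input B = {l1, l2}, and \(r = 0.5\): with \(f = 0\) the supported set is {l1, l2, x, root}, while with \(f = 1\) the downward support from root additionally pulls in y.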

Proof

(Proof of Theorem 1). This theorem works because, for each \(\ell \), the \(S(\ell ,t)\) sets stabilize after support has had a chance to propagate upwards from level 0 to level \({{\,\mathrm{\ell _{max}}\,}}\), and then propagate downwards from level \({{\,\mathrm{\ell _{max}}\,}}\) to level \(\ell \). Because the concept hierarchy has a simple tree structure, there is no other way for a concept to get added to the \(S(\ell ,t)\) sets.

Formally, we use a backwards induction on \(\ell \), from \(\ell = {{\,\mathrm{\ell _{max}}\,}}\) to \(\ell = 0\), to prove the claim: If \(c \in supp_{r,f}(B)\) and \(level(c) = \ell \), then \(c \in S(2 {{\,\mathrm{\ell _{max}}\,}}- \ell )\). For the base case, consider \(\ell = {{\,\mathrm{\ell _{max}}\,}}\). Since c has no parents, we must have \(c \in supp_{r,0}(B)\), so Proposition 1 implies that \(c \in S({{\,\mathrm{\ell _{max}}\,}}) \subseteq S(2 {{\,\mathrm{\ell _{max}}\,}})\), as needed.

For the inductive step, suppose that \(c \in supp_{r,f}(B)\) and \(level(c) = \ell -1\). If \(c \in supp_{r,0}(B)\) then again Proposition 1 implies that \(c \in S({{\,\mathrm{\ell _{max}}\,}})\), which suffices. So suppose that \(c \in supp_{r,f}(B) - supp_{r,0}(B)\). Then c’s inclusion in the set \(supp_{r,f}(B)\) relies on its (unique) parent first being included, that is, there is some t for which \(c \notin S(t)\) and \(parent(c) \in S(t)\). Since \(parent(c) \in supp_{r,f}(B)\) and \(level(parent(c)) = \ell \), the inductive hypothesis yields that \(parent(c) \in S(2 {{\,\mathrm{\ell _{max}}\,}}- \ell )\).

Moreover, all the children that c relies on for its inclusion in \(supp_{r,f}(B)\) are in \(supp_{r,0}(B)\), by Lemma 4. Therefore, by Proposition 1, they are in \(S(\ell -2) \subseteq S(2 {{\,\mathrm{\ell _{max}}\,}}- \ell )\). Thus, we have enough support for c in \(S(2 {{\,\mathrm{\ell _{max}}\,}}- \ell )\) to ensure that \(c \in S(2 {{\,\mathrm{\ell _{max}}\,}}- \ell + 1)\), as needed.

C Missing Proofs of Section 5

Proof

(Proof of Theorem 2). Let \(r = \frac{\tau }{k}\), where \(\tau \) is the firing threshold for the non-input neurons of \(\mathcal N\). Assume that \(B \subseteq C_0\) is presented at time t. We prove the two parts of Definition 4 separately.

  1. If \(c \in supp_{r_2}(B)\) then rep(c) fires at time \(t + level(c)\). Suppose that \(c \in supp_{r_2}(B)\). By assumption, \(\tau \le r_2 k\), so that \(r = \frac{\tau }{k} \le r_2\). Then Lemma 8 implies that \(c \in supp_{r}(B)\). Then by Lemma 7, rep(c) fires at time \(t + level(c)\).

  2. If \(c \notin supp_{r_1}(B)\) then rep(c) does not fire at time \(t + level(c)\). Suppose that \(c \notin supp_{r_1}(B)\). By assumption, \(\tau \ge r_1 k\), so that \(r = \frac{\tau }{k} \ge r_1\). Then Lemma 8 implies that \(c \notin supp_{r}(B)\). Then by Lemma 7, rep(c) does not fire at time \(t + level(c)\).

Proof

(of Lemma 6). If \(layer(u) = 0\), then u fires at time t exactly if \(u = rep(c)\) for some \(c \in B\), by assumption. So consider u with \(layer(u) \ge 1\). We show the contrapositive. Assume that \(u \notin R\). Then u has no positive weight incoming edges, by definition of the weights. So u cannot receive enough incoming potential for time \(t+layer(u)\) to meet the positive firing threshold \(\tau \).

Proof

(of Lemma 7). We prove the two directions separately.

  1. If \(c \in supp_{r}(B)\) then rep(c) fires at time \(t + level(c)\). We prove this using induction on level(c). For the base case, \(level(c) = 0\), the assumption that \(c \in supp_r(B)\) means that \(c \in B\), which means that rep(c) fires at time t, by the assumption that B is presented at time t.

     For the inductive step, assume that \(level(c) = \ell +1\). Assume that \(c \in supp_{r}(B)\). Then by definition of \(supp_r\), c must have at least rk children that are in \(supp_{r}(B)\). By inductive hypothesis, the reps of all of these children fire at time \(t + \ell \). That means that the total incoming potential to rep(c) for time \(t+\ell +1\), \(pot^{rep(c)}(t+\ell +1)\), reaches the firing threshold \(\tau = r k\), so rep(c) fires at time \(t + \ell +1\).

  2. If \(c \notin supp_r(B)\) then rep(c) does not fire at time \(t+level(c)\). Again, we use induction on level(c). For the base case, \(level(c) = 0\), the assumption that \(c \notin supp_r(B)\) means that \(c \notin B\), which means that rep(c) does not fire at time t by the assumption that B is presented at time t.

     For the inductive step, assume that \(level(c) = \ell +1\). Assume that \(c \notin supp_r(B)\). Then c has strictly fewer than rk children that are in \(supp_{r}(B)\), and therefore, strictly more than \(k - r k\) children that are not in \(supp_{r}(B)\). By inductive hypothesis, none of the reps of the children in this latter set fire at time \(t + \ell \), which means that the reps of strictly fewer than rk children of c fire at time \(t+\ell \). So the total incoming potential to rep(c) from reps of c’s children is strictly less than rk. Since only reps of children of c have positive-weight edges to rep(c), that means that the total incoming potential to rep(c) for time \(t+\ell +1\), \(pot^{rep(c)}(t+\ell +1)\), is strictly less than the threshold \(\tau = r k\) for rep(c) to fire at time \(t + \ell +1\). So rep(c) does not fire at time \(t+\ell +1\).
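Lemma 7's dynamics are easy to check on small examples. A minimal sketch under our own encoding (one dict of children per layer, processed bottom-up; every child edge has weight 1 and the threshold is \(\tau = rk\)):

```python
def feedforward_firing(B, layers, tau):
    """One upward pass through a feed-forward layered network: a
    level-0 rep fires at time t iff its concept is in B, and a rep at
    layer ell fires at time t + ell iff at least tau of its children's
    reps fired at time t + ell - 1 (child edges have weight 1, all
    other incoming weights are 0)."""
    firing = set(B)              # rep(c) in here <=> it fires at time t + level(c)
    for layer in layers:         # layers 1, 2, ..., bottom-up
        firing |= {c for c, kids in layer.items() if len(kids & firing) >= tau}
    return firing
```

With \(k = 3\) and \(\tau = 2\) (so \(r = 2/3\)), a concept whose rep sees only one firing child stays silent, while two firing children suffice.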

Proof

(of Theorem 3). Fix \(c,c'\) as above, and assume that \(c'\) is shown at time t.

Claim: For any descendant d of c that is not also a descendant of \(c'\), rep(d) does not fire at time \(t + level(d)\).

Proof of Claim: By induction on level(d). For the base case, \(level(d) = 0\). We know that rep(d) does not fire at time t because only reps of descendants of \(c'\) fire at time t.

For the inductive step, assume that d is a descendant of c with \(level(d) \ge 1\) that is not also a descendant of \(c'\). By the limited-overlap assumption, d has at most \(o \cdot k < r_1 k\) children that are also descendants of \(c'\). By inductive hypothesis, the reps of all the other children of d do not fire at time \(t + level(d) - 1\). So the number of children of d whose reps fire at time \(t+level(d)-1\) is strictly less than \(r_1 k\). That is not enough to meet the firing threshold \(\tau \ge r_1 k\) for rep(d) to fire at time \(t + level(d)\).

End of proof of Claim.

Applying the Claim with \(d = c\) yields that rep(c) does not fire at time \(t + level(c)\).

Lemma 10

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the feed-forward network defined above, based on \(\mathcal C\). Assume that \(\frac{(r_1 + r_2)k}{2} > \frac{1}{k^{b-1}}\).

Assume that \(B \subseteq C_0\) is presented at time t. If u is a neuron that fires at time \(t+ layer(u)\), then \(u = rep(c)\) for some concept \(c \in C\).

Proof

The proof is slightly more involved than the one for Lemma 6. This time we proceed by induction on layer(u). For the base case, \(layer(u) = 0\): u fires at time t exactly if \(u = rep(c)\) for some \(c \in B\), by assumption.

For the inductive step, consider u with \(layer(u) = \ell +1\). Assume for contradiction that u is not of the form rep(c) for any \(c \in C\). Then the weight of each edge incoming to u is at most \(\frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+b}}\). By inductive hypothesis, the only layer \(\ell \) incoming neighbors that fire at time \(t+\ell \) are reps of concepts in C. There are at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1}\) such concepts, hence at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1}\) layer \(\ell \) incoming neighbors fire at time \(t+\ell \), yielding a total incoming potential for u for time \(t+\ell +1\) of at most \(\frac{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}{k^{{{\,\mathrm{\ell _{max}}\,}}+b}} = \frac{1}{k^{b-1}}\). Since the firing threshold \(\tau = \frac{(r_1 + r_2)k}{2}\) is strictly greater than \(\frac{1}{k^{b-1}}\), u cannot receive enough incoming potential to meet the threshold for time \(t+\ell +1\).

Lemma 11

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property and \(\mathcal N\) is the feed-forward network defined above, based on \(\mathcal C\). Assume that \(\frac{(r_1 + r_2)k}{2} > \frac{1}{k^{b-1}}\). Suppose that \(r_1\) and \(r_2\) satisfy the following inequalities: \(r_2 (2 w_1 - 1) \ge r_1\) and \(r_2 \ge r_1(2w_2 - 1) + \frac{2}{k^b}\). Assume that \(B \subseteq C_0\) is presented at time t. If c is any concept in C, then

  1. If \(c \in supp_{r_2}(B)\) then rep(c) fires at time \(t+level(c)\).

  2. If \(c \notin supp_{r_1}(B)\) then rep(c) does not fire at time \(t+level(c)\).

Proof

  1. If \(c \in supp_{r_2}(B)\) then rep(c) fires at time \(t + level(c)\). We prove this using induction on level(c). For the base case, \(level(c) = 0\), the assumption that \(c \in supp_{r_2}(B)\) means that \(c \in B\), which means that rep(c) fires at time t, by the assumption that B is presented at time t. For the inductive step, assume that \(level(c) = \ell +1\). Assume that \(c \in supp_{r_2}(B)\). Then c must have at least \(r_2 k\) children that are in \(supp_{r_2}(B)\). By inductive hypothesis, the reps of all of these children fire at time \(t + \ell \). We claim that the total incoming potential to rep(c) for time \(t+\ell +1\), \(pot^{rep(c)}(t+\ell +1)\), reaches the firing threshold \(\tau = \frac{(r_1+r_2)k}{2}\), so rep(c) fires at time \(t+\ell +1\). To see this, note that the total potential induced by the firing reps of children of c is at least \(r_2 k w_1\), because the weight of the edge from each firing child to rep(c) is at least \(w_1\). This quantity is \(\ge \frac{(r_1+r_2)k}{2}\) because of the assumption that \(r_2 (2 w_1 - 1) \ge r_1\).

  2. If \(c \notin supp_{r_1}(B)\) then rep(c) does not fire at time \(t + level(c)\). Again we use induction on level(c). For the base case, \(level(c) = 0\), the assumption that \(c \notin supp_{r_1}(B)\) means that \(c \notin B\), which means that rep(c) does not fire at time t, by the assumption that B is presented at time t. For the inductive step, assume that \(level(c) = \ell +1\). Assume that \(c \notin supp_{r_1}(B)\). Then c has strictly fewer than \(r_1 k\) children that are in \(supp_{r_1}(B)\), and therefore, strictly more than \(k - r_1 k\) children that are not in \(supp_{r_1}(B)\). By inductive hypothesis, none of the reps of the children in this latter set fire at time \(t + \ell \), which means that the reps of strictly fewer than \(r_1 k\) children of c fire at time \(t + \ell \). Therefore, the total incoming potential to rep(c) from reps of c’s children is strictly less than \(r_1 k w_2\), since the weight of the edge from each firing child to rep(c) is at most \(w_2\).

     In addition, some potential may be contributed by other neurons at layer \(\ell \) that are not children of c but fire at time \(t+\ell \). By Lemma 10, these must all be reps of concepts in C. There are at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1}\) of these, each contributing potential of at most \(\frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+b}}\), for a total potential of at most \(\frac{1}{k^{b-1}}\) from these neurons.

     Therefore, the total incoming potential to rep(c) for time \(t+\ell +1\), \(pot^{rep(c)}(t+\ell +1)\), is strictly less than \(r_1 k w_2 + \frac{1}{k^{b-1}}\). This quantity is \(\le \frac{(r_1+r_2)k}{2}\), because of the assumption that \(r_2 \ge r_1(2w_2 - 1) + \frac{2}{k^b}\).

     This means that the total incoming potential to rep(c) for time \(t+\ell +1\) is strictly less than the threshold \(\tau = \frac{(r_1+r_2)k}{2}\) for rep(c) to fire at time \(t+\ell +1\). So rep(c) does not fire at time \(t+\ell +1\).
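For completeness, the two threshold comparisons used in this proof reduce to the stated inequalities by dividing through by \(k/2\):

$$ r_2 k w_1 \ge \frac{(r_1+r_2)k}{2} \iff r_2(2w_1 - 1) \ge r_1, \qquad r_1 k w_2 + \frac{1}{k^{b-1}} \le \frac{(r_1+r_2)k}{2} \iff r_2 \ge r_1(2w_2 - 1) + \frac{2}{k^b}. $$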

Proof

(of Theorem 4). The proof is similar to that of Theorem 2, but now using Lemma 11 in place of Lemma 7. Let \(r = \frac{\tau }{k}\), where \(\tau \) is the firing threshold for the non-input neurons of \(\mathcal N\). Assume that \(B \subseteq C_0\) is presented at time t. We prove the two parts of Definition 4 separately.

  1. If \(c \in supp_{r_2}(B)\) then rep(c) fires at time \(t + level(c)\). Suppose that \(c \in supp_{r_2}(B)\). By assumption, \(\tau \le r_2 k\), so that \(r = \frac{\tau }{k} \le r_2\). Then Lemma 8 implies that \(c \in supp_{r}(B)\). Then Lemma 11 implies that rep(c) fires at time \(t + level(c)\).

  2. If \(c \notin supp_{r_1}(B)\) then rep(c) does not fire at time \(t + level(c)\). Suppose that \(c \notin supp_{r_1}(B)\). By assumption, \(\tau \ge r_1 k\), so that \(r = \frac{\tau }{k} \ge r_1\). Then Lemma 8 implies that \(c \notin supp_{r}(B)\). Then Lemma 11 implies that rep(c) does not fire at time \(t + level(c)\).

D Missing Statements and Proofs of Section 6

As before, we have:

Lemma 12

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the network defined above, based on \(\mathcal C\). Assume that \(B \subseteq C_0\) is presented at time t. If u is a neuron that fires at some time after t, then \(u \in R\), that is, \(u = rep(c)\) for some concept \(c \in C\).

The following preliminary lemma says that the firing of rep neurons is persistent, assuming persistent inputs (as we do in the definition of recognition for networks with feedback).

Lemma 13

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the network with feedback defined above, based on \(\mathcal C\). Let \(r = \frac{\tau }{k}\), where \(\tau \) is the firing threshold for the non-input neurons of \(\mathcal N\).

Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). Let c be any concept in C. Then for every \(t' \ge t\), if rep(c) fires at time \(t'\), then it fires at all times \(\ge t'\).

Proof

We prove this by induction on \(t'\). The base case is \(t' = t\). The neurons that fire at time t are exactly the input neurons that are reps for concepts in B. By assumption, these same inputs continue for all times \(\ge t\).

For the inductive step, consider a neuron rep(c) that fires at time \(t'\), where \(t' \ge t+1\). If \(level(c) = 0\) then \(c \in B\) and rep(c) continues firing forever. So assume that \(level(c) \ge 1\). Then rep(c) fires at time \(t'\) because the incoming potential it receives from its children and parents who fire at time \(t'-1\) is sufficient to reach the firing threshold \(\tau \). By inductive hypothesis, all of the neighbors of rep(c) that fire at time \(t'-1\) also fire at all times \(\ge t'-1\). So that means that they provide enough incoming potential to rep(c) to make rep(c) fire at all times \(\ge t'\).
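The persistence argument can be exercised by simulating the synchronous dynamics directly. The sketch below (our own encoding: a children map, a parent map, and a threshold \(\tau \)) keeps the inputs in B firing at every step and records the firing set over time; under Lemma 13 these sets should only grow:

```python
def run_with_feedback(B, children, parent, f, tau, steps):
    """Synchronous dynamics of a network with feedback: input reps
    (concepts in B) fire at every step; any other rep fires iff its
    children's reps that fired at the previous step, plus f for its
    parent's rep if that fired, reach the threshold tau."""
    firing = set(B)
    history = [set(firing)]
    for _ in range(steps):
        nxt = set(B)                         # persistent inputs
        for c in children:
            support = len(children[c] & firing)
            if parent.get(c) in firing:
                support += f                 # downward edge of weight f
            if support >= tau:
                nxt.add(c)
        firing = nxt
        history.append(set(firing))
    return history
```

On a toy two-level hierarchy with \(k = 2\), \(f = 1\), and \(\tau = 1\), firing spreads upward to the root, then back down along the feedback edge, and each step's firing set contains the previous one.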

Next we have a lemma that is analogous to Lemma 7, but now in terms of eventual firing rather than firing at a specific time. Similarly to before, this works because the network’s behavior directly mirrors the \(supp_{r,f}\) definition, where \(r = \frac{\tau }{k}\).

Lemma 14

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the network with feedback defined above, based on \(\mathcal C\). Let \(r = \frac{\tau }{k}\), where \(\tau \) is the firing threshold for the non-input neurons of \(\mathcal N\).

Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). If c is any concept in C, then rep(c) fires at some time \(\ge t\) if and only if \(c \in supp_{r,f}(B)\).

To prove Lemma 14, it is convenient to prove a more precise version that takes time into account. As before, in Sect. 2.3, we use the abbreviation \(S(t) = supp_{r,f}(B,*,t)\). Thus, S(t) is the set of concepts at all levels that are supported by input B by step t of the recursive definition of the \(S(\ell ,t)\) sets.

Lemma 15

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the network with feedback defined above, based on \(\mathcal C\). Let \(r = \frac{\tau }{k}\), where \(\tau \) is the firing threshold for the non-input neurons of \(\mathcal N\).

Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). Let \(t'\) be any time \(\ge t\). If c is any concept in C, then rep(c) fires at time \(t'\) if and only if \(c \in S(t'-t)\).

Proof

As for Lemma 7, we prove the two directions separately. But now we use induction on time rather than on level(c).

  1. 1.

    If \(c \in S(t'-t)\), then rep(c) fires at time \(t'\).

    We prove this using induction on \(t'\), \(t' \ge t\). For the base case, \(t' = t\), the assumption that \(c \in S(0)\) means that c is in the input set B, which means that rep(c) fires at time t.

    For the inductive step, assume that \(t' \ge t\) and \(c \in S((t'+1)-t)\). If \(level(c) = 0\) then again \(c \in B\), so rep(c) fires at time t, and therefore at time \(t'+1\) by Lemma 13. So assume that \(level(c) \ge 1\). If \(c \in S(t'-t)\) then rep(c) fires at time \(t'\) by the inductive hypothesis, and therefore also at time \(t'+1\) by Lemma 13. Otherwise, enough of c’s children and parents must be in \(S(t'-t)\) to include c in \(S((t'-t)+1) = S((t'+1)-t)\); that is, \(|children(c) \ \cap \ S(t'-t)| + f \ |parents(c)\ \cap \ S(t'-t)| \ge rk\).

    Then by inductive hypothesis, all of the reps of the children and parent concepts mentioned in this expression fire at time \(t'\).

    Therefore, the upward potential incoming to rep(c) for time \(t'+1\), \(upot^{rep(c)}(t'+1)\), is at least \(|children(c) \ \cap \ S(t'-t)|\), and the downward potential incoming to rep(c) for time \(t'+1\), \(dpot^{rep(c)}(t'+1)\), is at least \(f \ |parents(c) \ \cap \ S(t'-t)|\) (since the weight of each downward edge is f). So \(pot^{rep(c)}(t'+1)\), which is equal to \(upot^{rep(c)}(t'+1) + dpot^{rep(c)}(t'+1)\), is \(\ge |children(c) \ \cap \ S(t'-t)| + f \ |parents(c) \ \cap \ S(t'-t)| \ge rk\). That reaches the firing threshold \(\tau = rk\) for rep(c) to fire at time \(t'+1\).

  2.

    If rep(c) fires at time \(t'\), then \(c \in S(t'-t)\).

    We again use induction on \(t'\), \(t' \ge t\). For the base case, \(t' = t\), the assumption that rep(c) fires at time t means that c is in the input set B, hence \(c \in S(0)\).

    For the inductive step, suppose that \(t' \ge t\) and rep(c) fires at time \(t' + 1\). Then it must be that enough of the reps of c’s children and parents fire at time \(t'\) to reach the firing threshold \(\tau = rk\) for rep(c) to fire at time \(t'+1\). That is, \(upot^{rep(c)}(t'+1) + dpot^{rep(c)}(t'+1) \ge rk\). In other words, the number of reps of children of c that fire at time \(t'\) plus f times the number of reps of parents of c that fire at time \(t'\) is \(\ge r k\) (since the weight of each downward edge is f).

    By inductive hypothesis, all of these children and parents of c are in \(S(t'-t)\). Therefore, \(|children(c) \ \cap \ S(t'-t)| + f \ |parents(c) \ \cap \ S(t'-t)| \ge rk\). Then by definition of \(supp_{r,f}(B)\), we have that \(c \in S((t'-t)+1) = S((t' + 1) - t)\), as needed.

Lemma 14 follows immediately from Lemma 15. Then, as in Sect. 6.1, the main recognition theorem follows easily.

Proof

(of Theorem 6). Let \(r = \frac{\tau }{k}\), where \(\tau \) is the firing threshold for the non-input neurons of \(\mathcal N\). Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). We prove the two parts of Definition 5 separately.

  1.

    If \(c \in supp_{r_2,f}(B)\) then rep(c) fires at some time \(\ge t\). Suppose that \(c \in supp_{r_2,f}(B)\). By assumption, \(\tau \le r_2k\), so that \(r = \frac{\tau }{k} \le r_2\). Then Lemma 1 implies that \(c \in supp_{r,f}(B)\). Then by Lemma 14, rep(c) fires at some time \(\ge t\).

  2.

    If \(c \notin supp_{r_1,f}(B)\) then rep(c) does not fire at any time \(\ge t\). Suppose that \(c \notin supp_{r_1,f}(B)\). By assumption, \(\tau \ge r_1 k\), so that \(r = \frac{\tau }{k} \ge r_1\). Then Lemma 1 implies that \(c \notin supp_{r,f}(B)\). Then by Lemma 14, rep(c) does not fire at any time \(\ge t\).

Lemma 16

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the network defined above, based on \(\mathcal C\). Assume that \(\frac{(r_1 + r_2)k}{2} > \frac{1}{k^{b-2}}\). Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). If u is a neuron that fires at any time \(t' \ge t\), then \(u = rep(c)\) for some concept \(c \in C\).

Proof

By induction on the time \(t' \ge t\), we show: If u is a neuron that fires at time \(t'\), then \(u = rep(c)\) for some concept \(c \in C\). For the base case, \(t'=t\), if u fires at time t then \(u = rep(c)\) for some \(c \in B\), by assumption.

For the inductive step, consider any neuron u that fires at time \(t'+1\), where \(t' \ge t\). Assume for contradiction that u is not of the form rep(c) for any \(c \in C\). Then the weight of each edge incoming to u is at most \(\frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+b}}\). By inductive hypothesis, the only incoming neighbors that fire at time \(t'\) are reps of concepts in C. There are at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1} + k^{{{\,\mathrm{\ell _{max}}\,}}-1}\) concepts at the two layers below and above layer(u), hence at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1} + k^{{{\,\mathrm{\ell _{max}}\,}}-1}\) neighbors of u that fire at time \(t'\), yielding a total incoming potential for u for time \(t'+1\) of at most \(\frac{k^{{{\,\mathrm{\ell _{max}}\,}}+1} + k^{{{\,\mathrm{\ell _{max}}\,}}-1}}{k^{{{\,\mathrm{\ell _{max}}\,}}+b}} = \frac{1}{k^{b-1}} + \frac{1}{k^{b+1}}\). Since \(k \ge 2\), we have \(\frac{1}{k^{b-1}} + \frac{1}{k^{b+1}} = \frac{k^2+1}{k^{b+1}} \le \frac{k^3}{k^{b+1}} = \frac{1}{k^{b-2}}\), so this bound on potential is at most \(\frac{1}{k^{b-2}}\). Since the threshold \(\tau = \frac{(r_1 + r_2)k}{2}\) is assumed to be strictly greater than \(\frac{1}{k^{b-2}}\), u does not receive enough incoming potential to meet the firing threshold for time \(t'+1\).

Lemma 17

Assume \(\mathcal C\) is any concept hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the network with feedback as defined above, based on \(\mathcal C\). Assume that \(\frac{(r_1+r_2)k}{2} > \frac{1}{k^{b-2}}\). Also suppose that \(r_1\) and \(r_2\) satisfy the following inequalities: \(r_2(2w_1-1) \ge r_1\) and \(r_2 \ge r_1(2w_2 - 1) + \frac{2}{k^{b-1}}\). Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). If c is any concept in C, then:

  1.

    If \(c \in supp_{r_2,f}(B)\) then rep(c) fires at some time \(\ge t\).

  2.

    If rep(c) fires at some time \(\ge t\) then \(c \in supp_{r_1,f}(B)\).

Proof

The proof follows the general outline of the proof of Lemma 14, based on Lemma 15. As in those results, the proof takes into account both the upward potential upot and the downward potential dpot. As before, we split the cases up and use two inductions based on time. However, now the two inductions incorporate the treatment of variable weights used in the proof of Lemma 11.

  1.

    If \(c \in S(t'-t)\) then rep(c) fires at time \(t'\). Here the set \(S(t'-t)\) is defined in terms of \(supp_{r_2,f}(B)\).

    We prove this using induction on \(t'\), \(t' \ge t\). For the base case, \(t' = t\), the assumption that \(c \in S(0)\) means that c is in the input set B, which means that rep(c) fires at time \(t'\).

    For the inductive step, assume that \(t' \ge t\) and \(c \in S((t'+1)-t)\). If \(level(c) = 0\) then again \(c \in B\), so rep(c) fires at time \(t'+1\). So assume that \(level(c) \ge 1\). Since \(c \in S((t'+1)-t)\), we get that \(|children(c) \ \cap \ S(t'-t)| + f \ |parents(c) \ \cap \ S(t'-t)| \ge r_2 k\).

    By the inductive hypothesis, the reps of all of these children and parents fire at time \(t'\). Therefore, the upward potential incoming to rep(c) for time \(t'+1\), \(upot^{rep(c)}(t'+1)\), is at least \(|children(c) \ \cap \ S(t'-t)| \ w_1\), and the downward potential incoming to rep(c) for time \(t'+1\), \(dpot^{rep(c)}(t'+1)\), is at least \(f \ |parents(c) \ \cap \ S(t'-t)| \ w_1\). Adding these two potentials, we get that the total incoming potential to rep(c) for time \(t'+1\), \(pot^{rep(c)}(t'+1)\), is at least \((|children(c) \ \cap \ S(t'-t)| + f \ |parents(c) \ \cap \ S(t'-t)|) \ w_1 \ge r_2 k w_1\). This is at least \(\frac{(r_1+r_2) k}{2}\), because of the assumption that \(r_2 \ (2w_1 - 1) \ge r_1\). So the incoming potential to rep(c) for time \(t'+1\) is enough to reach the firing threshold \(\tau = \frac{(r_1+r_2) k}{2}\), so rep(c) fires at time \(t'+1\).

  2.

    If rep(c) fires at time \(t'\), then \(c \in S(t'-t)\). Here the set \(S(t'-t)\) is defined in terms of \(supp_{r_1,f}(B)\).

    We again use induction on \(t'\), \(t' \ge t\). For the base case, \(t' = t\), the assumption that rep(c) fires at time t means that c is in the input set B, hence \(c \in S(0)\).

    For the inductive step, assume that rep(c) fires at time \(t'+1\). Then it must be that \(pot^{rep(c)}(t'+1) = upot^{rep(c)}(t'+1) + dpot^{rep(c)}(t'+1)\) reaches the firing threshold \(\tau = \frac{(r_1+r_2)k}{2}\) for rep(c) to fire at time \(t'+1\). Arguing as in the proof of Lemma 16, the total incoming potential to rep(c) from neurons at levels \(level(c) - 1\) and \(level(c)+1\) that are not reps of children or parents of c is at most \(\frac{1}{k^{b-2}}\). So the total incoming potential to rep(c) from firing reps of its children and parents must be at least \(\frac{(r_1+r_2)k}{2} - \frac{1}{k^{b-2}}\).

    By inductive hypothesis, all of these children and parents of c are in \(S(t'-t)\). Therefore, \((|children(c) \ \cap \ S(t'-t)| + f \ |parents(c) \ \cap \ S(t'-t)|) \ w_2 \ge \frac{(r_1+r_2)k}{2} - \frac{1}{k^{b-2}}\). By the assumption that \(r_2 \ge r_1(2w_2-1) + \frac{2}{k^{b-1}}\), we get that \(|children(c) \ \cap \ S(t'-t)| + f \ |parents(c) \ \cap \ S(t'-t)| \ge r_1 k\). (In more detail, let \(E = |children(c) \cap S(t'-t)| + f |parents(c) \cap S(t'-t)|\), so that \(E w_2 \ge \frac{(r_1+r_2)k}{2} - \frac{1}{k^{b-2}}\). Assume for contradiction that \(E < r_1 k\). Then \(E w_2 < r_1 k w_2\). But \(r_1 k w_2 \le \frac{(r_1+r_2)k}{2} - \frac{1}{k^{b-2}}\), because of the assumption that \(r_2 \ge r_1(2w_2-1) + \frac{2}{k^{b-1}}\). So that means that \(E w_2 < \frac{(r_1+r_2)k}{2} - \frac{1}{k^{b-2}}\), which is a contradiction.) Then by definition of \(supp_{r_1,f}(B)\), we have that \(c \in S((t'-t)+1) = S((t' + 1) - t)\), as needed.

The results of this section are also extendable to the case of scaled weights and thresholds, as in Sect. 5.4.

E Time Bounds for Networks with Feedback

We first give the time bounds for tree hierarchies, and then for general ones.

1.1 E.1 Time Bounds for Tree Hierarchies in Networks with Feedback

It remains to prove time bounds for recognition for hierarchical concepts in networks with feedback. Now the situation turns out to be quite different for tree hierarchies and hierarchies that allow limited overlap. In this section, we consider the simpler case of tree hierarchies.

For a tree network, one upward pass and one downward pass suffice to recognize all concepts, though that is a simplification of what actually happens, since much of the recognition activity is concurrent. Still, for tree hierarchies, we can prove an upper bound of twice the number of levels:

Theorem 9

Assume \(\mathcal C\) is a tree hierarchy and \(\mathcal N\) is the network with feedback defined above, based on \(\mathcal C\). Let \(r = \frac{\tau }{k}\), where \(\tau \) is the firing threshold for the non-input neurons of \(\mathcal N\).

Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). If \(c \in supp_{r,f}(B)\), then rep(c) fires at some time \(\le t + 2 {{\,\mathrm{\ell _{max}}\,}}\).

Proof

Assume that \(c \in supp_{r,f}(B)\). By Lemma 1, we have that \(c \in S(2 {{\,\mathrm{\ell _{max}}\,}})\). Then Lemma 15 implies that rep(c) fires at time \(t + 2 {{\,\mathrm{\ell _{max}}\,}}\).

And this result extends to larger thresholds:

Corollary 1

Assume \(\mathcal C\) is a tree hierarchy and \(\mathcal N\) is the network with feedback as defined above, based on \(\mathcal C\). Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). If \(c \in supp_{r_2,f}(B)\), then rep(c) fires at some time \(\le t + 2 {{\,\mathrm{\ell _{max}}\,}}\).

Proof

By Theorem 9 and Lemma 1.

1.2 E.2 Time Bounds for General Hierarchies in Networks with Feedback

The situation gets more interesting when the hierarchy allows overlap. We use the same network as before. Each neuron gets inputs from its children and parents at each round, and fires whenever its threshold is met. As noted in Lemma 15, this network behavior follows the definition of \(supp_{r_2,f}(B)\).

In the case of a tree hierarchy, one pass upward followed by one pass downward suffice to recognize all concepts, though the actual execution involves more concurrency, rather than separate passes. But with overlap, more complicated behavior can occur. For example, an initial pass upward can activate some rep neurons, which can then provide feedback on a downward pass to activate some other rep neurons that were not activated in the upward pass. So far, this is as for tree hierarchies. But now because of overlap, these newly-recognized concepts can in turn trigger more recognition on another upward pass, then still more on another downward pass, etc. How long does it take before the network is guaranteed to stabilize?

Here we give a simple upper bound and an example that yields a lower bound. Work is needed to pin the bound down more precisely.

Upper Bound. We give a crude upper bound on the time to recognize all the concepts in a hierarchy.

Theorem 10

Assume \(\mathcal C\) is any hierarchy satisfying the limited-overlap property, and \(\mathcal N\) is the network with feedback defined above, based on \(\mathcal C\). Assume that \(B \subseteq C_0\) is presented at all times \(\ge t\). If \(c \in supp_{r,f}(B)\), then rep(c) fires at some time \(\le t + k^{{{\,\mathrm{\ell _{max}}\,}}+1}\).

Proof

The reps of all the level 0 concepts in B fire at all times \(\ge t\). We consider how long it might take, in the worst case, for the reps of all the concepts in \(supp_{r,f}(B)\) with levels \(\ge 1\) to start firing.

The total number of concepts in C with levels \(\ge 1\) is at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1}\); therefore, the number of concepts in \(supp_{r,f}(B)\) with levels \(\ge 1\) is at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1}\).

By Lemma 12, the rep neurons are the only ones that ever fire. Therefore, the firing set stabilizes at the first time \(t'\) such that the sets of rep neurons that fire at times \(t'\) and \(t'+1\) are the same. Since there are at most \(k^{{{\,\mathrm{\ell _{max}}\,}}+1}\) rep neurons with levels \(\ge 1\), the worst case is that just one new rep starts firing at each time. Even in this case, the firing set stabilizes by time \(t + k^{{{\,\mathrm{\ell _{max}}\,}}+1}\), as claimed.
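The stabilization argument can be phrased as a fixed-point iteration. The following sketch (our own helper with hypothetical names, not from the paper) counts rounds until the firing set stops growing; by monotonicity this is at most the number of non-input reps:

```python
def rounds_to_stabilize(B, children, parents, r, k, f):
    """Iterate the one-step support/firing rule until the firing set
    stops growing; monotonicity bounds the rounds by #non-input reps."""
    cur = frozenset(B)
    concepts = set(children) | set(parents) | set(B)
    rounds = 0
    while True:
        nxt = set(cur)
        for c in concepts:
            if children.get(c):   # only non-input concepts have children
                pot = (len(set(children[c]) & cur)
                       + f * len(set(parents.get(c, ())) & cur))
                if pot >= r * k:
                    nxt.add(c)
        if frozenset(nxt) == cur:
            return rounds
        cur, rounds = frozenset(nxt), rounds + 1

# A toy hierarchy with k = 4, threshold rk = 3, f = 1: parent P with
# children C1..C4; B presents all leaves of C1..C3 and 2 leaves of C4.
children = {"P": ["C1", "C2", "C3", "C4"]}
parents = {c: ["P"] for c in children["P"]}
B = set()
for i, c in enumerate(list(children["P"])):
    children[c] = [f"{c}x{j}" for j in range(4)]
    B.update(children[c] if i < 3 else children[c][:2])

rounds = rounds_to_stabilize(B, children, parents, r=0.75, k=4, f=1)
```

On this example the firing set grows for three rounds (C1..C3, then P, then C4 via feedback) and then stabilizes, well under the five non-input reps.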

The bound in Theorem 10 may seem very pessimistic. However, the example in the next subsection shows that it is not too far off; in particular, it shows that the time until all the reps fire can be exponential in \({{\,\mathrm{\ell _{max}}\,}}\).

Lower Bound. Here we present an example of a concept hierarchy \(\mathcal C\) and an input set B for which the time for the rep neurons for all the supported concepts to fire is exponential in \({{\,\mathrm{\ell _{max}}\,}}\). This yields a lower bound, in Theorem 11.

The concept hierarchy \(\mathcal C\) has levels \(0,\ldots ,{{\,\mathrm{\ell _{max}}\,}}\) as usual. We assume here that \(r_1 = r_2 = r\). We assume that the overlap bound o satisfies \(o \cdot k \ge 2\), that is, the allowed overlap is at least 2. We take \(f = 1\).

The network \(\mathcal N\) embeds \(\mathcal C\), as described earlier in this section. As before, we assume that \(\ell '_{max} = {{\,\mathrm{\ell _{max}}\,}}\), and the threshold \(\tau \) for the non-input nodes in the network is rk. Now we assume that the weights are 1 for both upward and downward edges between reps of concepts in C, which is consistent with our choice of \(f=1\) in the concept hierarchy.

We assume that hierarchy \(\mathcal C\) has overlap only at one level—in the sets of children of level 2 concepts. The upper portion of \(\mathcal C\), consisting of levels \(2,\ldots {{\,\mathrm{\ell _{max}}\,}}\), is a tree, with no overlap among the sets children(c), \(3 \le level(c) \le {{\,\mathrm{\ell _{max}}\,}}\). There is also no overlap among the sets of children of level 1 concepts.

We order the children of each concept with \(level \ge 3\) in some arbitrary order, left-to-right. This orients the upper portion of the concept hierarchy, down to the level 2 concepts. Let \(C'\) be the set of all the level 2 concepts that are leftmost children of their parents. Since there are \(k^{{{\,\mathrm{\ell _{max}}\,}}-2}\) level 3 concepts, it follows that \(|C'| = k^{{{\,\mathrm{\ell _{max}}\,}}-2}\). Number the elements of \(C'\) in order left-to-right as \(c_1,\ldots ,c_{k^{{{\,\mathrm{\ell _{max}}\,}}-2}}\). Also, for every concept \(c_i\) in \(C'\), order its k children in some arbitrary order, left-to-right, and number them 1 through k.

Now we describe the overlap between the sets of children of the level 2 concepts \(c_i\), \(1 \le i \le k^{{{\,\mathrm{\ell _{max}}\,}}-2}\). The first \(k-1\) children of \(c_1\) are unique to \(c_1\), whereas its \(k^{th}\) child is shared with \(c_2\). For \(i = k^{{{\,\mathrm{\ell _{max}}\,}}-2}\), the last \(k-1\) children of \(c_i\) are unique to \(c_i\), whereas its first child is shared with \(c_{i-1}\). For each other index i, the middle \(k-2\) children of \(c_i\) are unique to \(c_i\), whereas its first child is shared with \(c_{i-1}\), and its \(k^{th}\) child is shared with \(c_{i+1}\). There is no other sharing in \(\mathcal C\).

Next, we define the set B of level 0 concepts to be presented to the network. B consists of the following grandchildren of the level 2 concepts in \(C'\):

  1.

    Grandchildren of \(c_1\):

    (a)

      All the (level 0) children of the children of \(c_1\) numbered \(1,\ldots , \lceil r k \rceil \), and

    (b)

      \(\lceil r k \rceil - 1\) of the (level 0) children of the \(k^{th}\) child of \(c_1\), which is also the first child of \(c_2\).

  2.

    Grandchildren of each \(c_i\), \(2 \le i \le k^{{{\,\mathrm{\ell _{max}}\,}}-2} - 1\):

    (a)

      \(\lceil r k \rceil - 1\) of the (level 0) children of the first child of \(c_i\), which is also the \(k^{th}\) child of \(c_{i-1}\) (this has already been specified, just above),

    (b)

      All the (level 0) children of the children of \(c_i\) numbered \(2,\ldots , \lceil r k \rceil \), and

    (c)

      \(\lceil r k \rceil - 1\) of the (level 0) children of the \(k^{th}\) child of \(c_i\), which is also the first child of \(c_{i+1}\).

  3.

    Grandchildren of \(c_i\), \(i = k^{{{\,\mathrm{\ell _{max}}\,}}-2}\):

    (a)

      \(\lceil r k \rceil - 1\) of the (level 0) children of the first child of \(c_i\), which is also the \(k^{th}\) child of \(c_{i-1}\) (this has already been specified, just above), and

    (b)

      All the (level 0) children of the children of \(c_i\) numbered \(2,\ldots , \lceil r k \rceil \).

Figure 2 illustrates a sample overlap pattern, for level 2 concepts \(c_1, c_2, c_3, \ldots , c_m\), where \(m = k^{{{\,\mathrm{\ell _{max}}\,}}- 2}\). Here we use \(k = 4\), \(r = 3/4\), and \(o = 1/2\).

Fig. 2.

Concept hierarchy with overlap, and input set.

Theorem 11

Assume \(\mathcal C\) is the concept hierarchy defined above, and \(\mathcal N\) is the network with feedback defined above, based on \(\mathcal C\). Let B be the input set just defined, and assume that B is presented at all times \(\ge t\). Then the time required for the rep neurons for all concepts in \(supp_{r,f}(B)\) to fire is at least \(2 k^{{{\,\mathrm{\ell _{max}}\,}}-2}\).

Proof

The network behaves as follows, with all times measured relative to the presentation time t:

  • Time 0: Exactly the reps of concepts in B fire.

  • Time 1: The reps of the (level 1) children of \(c_1\) numbered \(1,\ldots ,\lceil rk \rceil \) begin firing. Also, for every \(c_i\), \(2 \le i \le k^{{{\,\mathrm{\ell _{max}}\,}}-2}\), the reps of the (level 1) children numbered \(2,\ldots ,\lceil rk \rceil \) begin firing. This is because all of these neurons receive enough potential from the reps of their (level 0) children that fired at time 0, to trigger firing at time 1. No other neuron receives enough potential to begin firing at time 1.

  • Time 2: Neuron \(rep(c_1)\) begins firing, since it receives enough potential from the reps of its first \(\lceil r k \rceil \) children. No other neuron receives enough potential to begin firing at time 2.

  • Time 3: Now that \(rep(c_1)\) has begun firing, it begins contributing potential to the reps of its children, via feedback edges. This potential is enough to trigger firing of the rep of the (level 1) \(k^{th}\) child of \(c_1\), when it is added to the potential from the reps of that child’s own level 0 children. So, at time 3, the rep of the \(k^{th}\) child of \(c_1\) begins firing. No other neuron receives enough potential to begin firing at time 3.

  • Time 4: The \(k^{th}\) child of \(c_1\) is also the first child of \(c_2\). So its rep contributes potential to \(rep(c_2)\). This is enough to trigger firing of \(rep(c_2)\), when added to the potential from the reps of \(c_2\)’s already-firing children. So, at time 4, \(rep(c_2)\) begins firing. No other neuron receives enough potential to begin firing at time 4.

  • Time 5: Now that \(rep(c_2)\) has begun firing, it contributes potential to the reps of its children, via feedback edges. This is enough to trigger firing of the rep of the (level 1) \(k^{th}\) child of \(c_2\), when added to the potential from the reps of that child’s own level 0 children. So, at time 5, the rep of the \(k^{th}\) child of \(c_2\) begins firing. No other neuron begins firing at time 5.

  • Time 6: In analogy with what happens at time 4, neuron \(rep(c_3)\) begins firing at time 6, and no other neuron begins firing.

  • \(\ldots \)

  • Time \(2 k^{{{\,\mathrm{\ell _{max}}\,}}-2}\): Continuing in the same pattern, neuron \(rep(c_{k^{{{\,\mathrm{\ell _{max}}\,}}-2}})\) begins firing at time \(2 k^{{{\,\mathrm{\ell _{max}}\,}}-2}\).

Thus, the time to recognize concept \(c_{k^{{{\,\mathrm{\ell _{max}}\,}}-2}}\) is exactly \(2 k^{{{\,\mathrm{\ell _{max}}\,}}-2}\), as claimed.
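The chain dynamics above can be reproduced in a small simulation (our own executable rendering with hypothetical names; we instantiate \(k = 4\), \(r = 3/4\), \(f = 1\) as in Fig. 2, and omit the tree levels above level 2, which never fire in this example):

```python
def first_fire_times(B, children, parents, tau, f, T):
    """Simulate the feedback network (unit upward weights, weight f on
    feedback edges) for T steps; record when each rep first fires."""
    fire = frozenset(B)
    first = {c: 0 for c in B}
    for t in range(1, T + 1):
        nxt = set(B)   # the input B is presented at every time
        for c in children:
            pot = (len(set(children[c]) & fire)
                   + f * len(set(parents.get(c, ())) & fire))
            if pot >= tau:
                nxt.add(c)
        fire = frozenset(nxt)
        for c in fire:
            first.setdefault(c, t)
    return first

def build_chain(m, k=4, full=3):
    """Chain c1..cm of level-2 concepts; si is the k-th child of ci and
    the first child of c(i+1). B presents `full` = ceil(rk) complete
    level-1 children per ci but only full-1 = 2 leaves of each shared
    child si."""
    children, parents, B = {}, {}, set()
    def link(p, c):
        children.setdefault(p, []).append(c)
        parents.setdefault(c, []).append(p)
    def add_leaves(node, n_presented):
        for j in range(k):
            leaf = f"{node}x{j}"
            link(node, leaf)
            if j < n_presented:
                B.add(leaf)
    for i in range(1, m):                  # shared level-1 children
        link(f"c{i}", f"s{i}")
        link(f"c{i+1}", f"s{i}")
        add_leaves(f"s{i}", full - 1)
    for i in range(1, m + 1):              # unique level-1 children
        n_uniq = k - (1 if i in (1, m) else 2)
        n_full = full if i == 1 else full - 1
        for j in range(n_uniq):
            u = f"u{i}_{j}"
            link(f"c{i}", u)
            add_leaves(u, k if j < n_full else 0)
    return B, children, parents

m = 5
B, children, parents = build_chain(m)
first = first_fire_times(B, children, parents, tau=3, f=1, T=2 * m + 2)
```

In this run, \(rep(c_i)\) first fires at time \(2i\) and the shared child \(s_i\) at time \(2i+1\), so with \(m = k^{{{\,\mathrm{\ell _{max}}\,}}-2}\) chained concepts the last rep fires at time \(2 k^{{{\,\mathrm{\ell _{max}}\,}}-2}\), matching the theorem.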

F Learning Algorithms for Feed-Forward Networks

Now we address the question of how concept hierarchies (with and without overlap) can be learned in layered networks. In this section, we consider learning in feed-forward networks, and in Sect. G we consider networks with feedback.

For feed-forward networks, we describe noise-free learning algorithms, which produce edge weights for the upward edges that suffice to support robust recognition. These learning algorithms are adapted from the noise-free learning algorithm in [4], and work for both tree hierarchies and general concept hierarchies. We show that our new learning algorithms can be viewed as producing approximate, scaled weights as described in Sect. 5, which serves to decompose the correctness proof for the learning algorithms. We also discuss extensions to noisy learning.

1.1 F.1 Tree Hierarchies

We begin with the case studied in [4], tree hierarchies in feed-forward networks.

Overview of Previous Noise-free Learning Results. In [4], we set the threshold \(\tau \) for every neuron in layers \(\ge 1\) to be \(\tau = \frac{(r_1+r_2) \sqrt{k}}{2}\). We assumed that the network starts in a state in which no neuron in layer \(\ge 1\) is firing, and the weights on the incoming edges of all such neurons are \(\frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}\). We also assumed a Winner-Take-All sub-network satisfying Assumption 13 below, which is responsible for engaging neurons at layers \(\ge 1\) for learning. These assumptions, together with the general model conventions for activation and learning using Oja’s rule, determine how the network behaves when it is presented with a training schedule as in Definition 7. Our main result, for noise-free learning, is (paraphrased slightly)Footnote 5:

Theorem 12

(\((r_1,r_2)\)-Learning Tree concepts). Let \(\mathcal N\) be the network described above, with maximum layer \(\ell '_{max}\), and with learning rate \(\eta = \frac{1}{4k}\). Let \(r_1, r_2\) be reals in [0, 1] with \(r_1 < r_2\). Let \(\epsilon = \frac{r_2-r_1}{r_1+r_2}\). Let \(\mathcal C\) be any concept hierarchy, with maximum level \({{\,\mathrm{\ell _{max}}\,}}\le \ell '_{max}\). Assume that the concepts in \(\mathcal C\) are presented according to a \(\sigma \)-bottom-up training schedule as defined in Sect. 4.2, where \(\sigma \) is \(O\left( {{\,\mathrm{\ell _{max}}\,}}\log (k) + \frac{1}{\epsilon } \right) \). Then \(\mathcal N\) \((r_1,r_2)\)-learns \(\mathcal C\).

Specifically, we show that the weights for the edges from children to parents approach \(\frac{1}{\sqrt{k}}\) in the limit, and the weights for the other edges approach 0.
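This limiting behavior can be seen in a minimal simulation of Oja's rule on a single repeated binary pattern (our own sketch; the learning rate and step count here are illustrative choices, not the \(\eta = \frac{1}{4k}\) of the theorem):

```python
def oja_weights(x, eta=0.05, steps=4000, w0=1e-3):
    """Oja's rule on one repeated pattern x: w converges to x / ||x||,
    i.e. 1/sqrt(k) on the k firing inputs and 0 elsewhere."""
    w = [w0] * len(x)
    for _ in range(steps):
        y = sum(wi * xi for wi, xi in zip(w, x))   # linear response
        # Hebbian growth term eta*y*x minus the Oja decay term eta*y^2*w
        w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# k = 4 firing children among 8 potential presynaptic neurons
x = [1, 1, 1, 1, 0, 0, 0, 0]
w = oja_weights(x)
```

With \(k = 4\) the learned weights approach \(1/\sqrt{4} = 0.5\) and the weights on the silent inputs decay toward 0, as in the theorem's limit.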

The Winner-Take-All Assumption. Theorem 12 depends on Assumption 13 below, which hypothesizes a Winner-Take-All (WTA) module with certain abstract properties. This module is responsible for selecting a neuron to represent each new concept. It puts the selected neuron in a state that prepares it to learn the concept, by setting the engaged flag in that neuron to 1. It is also responsible for engaging the same neuron when the concept is presented in subsequent learning steps. In more detail, while the network is being trained, example concepts are “shown” to the network, according to a \(\sigma \)-bottom-up schedule as described in Sect. 4.2. We assume that, for every example concept c that is shown, exactly one neuron in the appropriate layer gets engaged; this layer is the one with the same number as the level of c in the concept hierarchy. Furthermore, the neuron in that layer that is engaged is one that has the largest incoming potential \(pot^u\):

Assumption 13

(Winner-Take-All Assumption) If a level \(\ell \) concept c is shown at time t, then at time \(t+\ell \), exactly one neuron u in layer \(\ell \) has its engaged state component equal to 1, that is, \(engaged^u(t+\ell ) = 1\). Moreover, u is chosen so that \(pot^u(t+\ell )\) is the highest potential at time \(t+\ell \) among all the layer \(\ell \) neurons.

Thus, the WTA module selects the neuron to “engage” for learning. For a concept c that is being shown for the first time, we showed that a new neuron is selected to represent c—one that has not previously been selected. This is because, if a neuron has never been engaged in learning, its incoming weights are all equal to the initial weight \(w = \frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}\), yielding a total incoming potential of kw. On the other hand, those neurons in the same layer that have previously been engaged in learning have incoming weights for all of c’s children that are strictly less than the initial weight w, which yields a strictly smaller incoming potential. Also, for a concept c that is being shown for a second or later time, we showed that the already-chosen representing neuron for c is selected again. This is because the total incoming potential for the previously-selected neuron is strictly greater than kw (as a result of previous learning), whereas the potential for other neurons in the same layer is at most kw.

In a complete network for solving the learning problem, the WTA module would be implemented by a sub-network, but we treated it abstractly in [4], and we continue that approach in this paper.

Connections with Our New Results. Here we consider how we might use our scaled result in Sect. 5.4 to decompose the proof of Theorem 12 in [4]. A large part of the proof in [4] consists of proving that the edge weights established as a result of a \(\sigma \)-bottom-up training schedule, for sufficiently large \(\sigma \), are within certain bounds. If these bounds match up with those in Sect. 5.4, we can use the results of that section to conclude that they are adequate for recognition.

The general definitions in Sect. 5.4 use a threshold of \(\frac{(r_1+r_2)k s}{2}\) and weights given by: \( weight(u,v) \in [w_1 s, w_2 s]\) if \(\mathcal {E}_{u,v}\) and \(weight(u,v) \in [0,\frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+b}}]\) otherwise. To make the results of [4] fit the constraints of Sect. 5.4, we can simply take \(w_1 = \frac{1}{1+\epsilon }\), \(w_2 = 1\), and the scaling factor \(s = \frac{1}{\sqrt{k}}\). The two constraints \(r_2 (2w_1 - 1) \ge r_1\) and \(r_2 \ge r_1 (2 w_2 - 1) + \frac{2}{k^b s}\) now translate into \(r_2 (\frac{1 - \epsilon }{1+\epsilon }) \ge r_1\) and \(r_2 \ge r_1 + \frac{2}{k^{b-\frac{1}{2}}}\). The first of these, \(r_2 (\frac{1 - \epsilon }{1+\epsilon }) \ge r_1\), follows from the assumption in [4] that \(\epsilon = \frac{r_2 - r_1}{r_2+r_1}\). The second inequality is similar to a roundoff assumption in [4] that we have omitted here.Footnote 6
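The translated constraints can be spot-checked numerically (our own arithmetic; the sample values of \(r_1, r_2, k, b\) below are illustrative):

```python
def translated_constraints(r1, r2, k, b):
    """Instantiate the Sect. 5.4 constraints with w1 = 1/(1+eps),
    w2 = 1, s = 1/sqrt(k), and eps = (r2 - r1)/(r2 + r1)."""
    eps = (r2 - r1) / (r2 + r1)
    w1, w2, s = 1.0 / (1.0 + eps), 1.0, k ** -0.5
    first_lhs = r2 * (2 * w1 - 1)            # constraint: first_lhs >= r1
    second_rhs = r1 * (2 * w2 - 1) + 2 / (k ** b * s)  # r2 >= second_rhs
    return first_lhs, second_rhs

first_lhs, second_rhs = translated_constraints(0.5, 0.8, 10, 3)
```

As the algebra predicts, the first constraint holds with equality (\(r_2 \frac{1-\epsilon}{1+\epsilon} = r_1\) exactly), and the second reduces to \(r_2 \ge r_1 + \frac{2}{k^{b-1/2}}\).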

Noisy Learning. In [4], we extended our noise-free learning algorithm to the case of “noisy learning”. There, instead of presenting all leaves of a concept c at every learning step, we presented only a subset of the leaves at each step. This subset is defined recursively with respect to the hierarchical concept structure of c and its descendants. The subset varies, and is chosen randomly at each learning step. Similar results hold as for the noise-free case, but with an increase in learning time.Footnote 7 The result about noisy learning in [4] assumes a parameter p giving the fraction of each set of children that are shown; a larger value of p yields a correspondingly shorter training time. The target weight for learned edges is \(\bar{w} = \frac{1}{\sqrt{pk+1-p}}\). The threshold is \(r_2k(\bar{w} - \delta )\), where \(\delta = \frac{(r_2-r_1)\bar{w}}{25}\).

The main result says that, after a certain time \(\sigma \) (larger than the \(\sigma \) used for noise-free learning) spent training for a tree concept hierarchy \(\mathcal C\), with high probability, the resulting network achieves \((r_1,r_2)\)-recognition for \(\mathcal C\). Here, a key lemma asserts that, with high probability, after time \(\sigma \), the weights are as follows: \( weight(u,v) \in [\bar{w} - \delta , \bar{w} + \delta ] \text { if } \mathcal {E},\) and otherwise \(weight(u,v) \in [0,\frac{1}{k^{2{{\,\mathrm{\ell _{max}}\,}}}}]\). To make these results fit the constraints of Sect. 5.4, it seems that we should modify the threshold slightly, by using the similar but simpler threshold \((\frac{(r_1+r_2)k}{2})\bar{w}\) in place of \(r_2k(\bar{w} - \delta )\). The weights can remain the same as above, but in case of \(\mathcal {E}_{u,v}\) we have \( weight(u,v) \in [(1-\frac{r_2-r_1}{25}) \bar{w}, (1+\frac{r_2-r_1}{25}) \bar{w}]\). Thus, we have scaled the basic threshold \(\frac{(r_1+r_2)k}{2}\) by multiplying it by \(\bar{w} = \frac{1}{\sqrt{pk+1-p}}\). To make the results fit the constraints of Sect. 5.4, we can take \(s = \bar{w}\), \(w_1 = 1 - \frac{r_2-r_1}{25}\), \(w_2 = 1 + \frac{r_2-r_1}{25}\), and \(b = {{\,\mathrm{\ell _{max}}\,}}\). One can easily verify that the new thresholds still fulfill the requirements for recognition. We do this in the full version.

1.2 F.2 General Concept Hierarchies

The situation for general hierarchies, with limited overlap, in feed-forward networks is similar to that for tree hierarchies. The same learning algorithm, based on Oja’s rule, still works in the presence of overlap, with little modification to the proofs. The only significant new issue to consider is how to choose an acceptable neuron to engage in learning, at each learning step. We continue to encapsulate this choice within a separate WTA service. As before, the WTA should always select an unused neuron (in the right layer) for a concept that is being shown for the first time. And for subsequent times when the same concept is shown, the WTA should choose the same neuron as it did the first time.

An Issue with the Previous Approach. Assumption 13, which we used for tree hierarchies, no longer suffices. For example, consider two concepts c and \(c'\) with \(level(c') = level(c)\), and suppose that there is exactly one concept d in the intersection \(children(c) \cap children(c')\). Suppose that concept c has been fully learned, so a rep(c) neuron has been defined, and then concept \(c'\) is shown for the first time. Then when \(c'\) is first shown, rep(c) will receive total incoming potential of approximately \(\frac{1}{\sqrt{k}}\), resulting from the firing of rep(d). On the other hand, any neuron that has not previously been engaged in learning will receive potential \(\frac{k}{k^{{{\,\mathrm{\ell _{max}}\,}}+1}} = \frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}}}\), based on k neurons each with initial weight \(\frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}\), which is smaller than \(\frac{1}{\sqrt{k}}\). Thus, Assumption 13 would select rep(c) in preference to any unused neuron. One might consider replacing Oja’s learning rule with some other rule, to try to retain Assumption 13, which works based just on comparing potentials. Another approach, which we present here, is to use a “smarter” WTA, that is, to modify Assumption 13 so that it takes more information into account when engaging a neuron.

Approach Using a Modified WTA Assumption. In the assumption below, w denotes the initial weight, \(\frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}\). \(N_{\ell }\) denotes the set of layer \(\ell \) neurons. We make the trivial assumption that \(o < 1\) for the noise-free case; for the noisy case, we strengthen that to \(o < p\), where p is the parameter indicating how many child concepts are chosen.

Assumption 14

(Revised Winner-Take-All Assumption). If a level \(\ell \) concept c is shown at time t, then at time \(t+\ell \), exactly one neuron \(u \in N_{\ell }\) has its engaged state component equal to 1, that is, \(engaged^u(t+\ell ) = 1\). Moreover, u is chosen so that \(pot^u(t+\ell )\) is the highest potential at time \(t+\ell \) among the layer \(\ell \) neurons that have strictly more than \(o \cdot k\) incoming neighbors that contribute potential that is \(\ge w\).

Thus, we are assuming that the WTA module is “smart enough” to select the neuron to engage based on a combination of two criteria: First, it rules out any neuron that has only a few incoming neighbors contributing potential \(\ge w\); this excludes neurons that have already started learning, but for a different concept. Second, it uses the same criterion as in Assumption 13, choosing the neuron with the highest potential from among the remaining candidate neurons.
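This two-stage selection might be sketched as follows (the function name and input encoding are our own, purely illustrative):

```python
def smart_wta(candidates, o, k, w):
    """Revised WTA (Assumption 14) as a two-stage selection.

    candidates: list of (neuron_id, contributions), where contributions
    lists the potential contributed by each incoming neighbor.
    Stage 1: keep only neurons with strictly more than o*k incoming
    neighbors contributing potential >= w (rules out neurons already
    trained for a different concept).
    Stage 2: among those, pick the neuron with the highest total potential.
    """
    eligible = [(nid, sum(contribs))
                for nid, contribs in candidates
                if sum(1 for p in contribs if p >= w) > o * k]
    return max(eligible, key=lambda t: t[1])[0] if eligible else None

# Toy scenario: k = 10 children, overlap bound o = 0.2, initial weight w.
w = 1e-4
trained = ("rep_c", [0.3, 0.3] + [w / 2] * 8)   # only 2 = o*k strong inputs
fresh = ("unused", [w] * 10)                    # all 10 inputs at weight w
winner = smart_wta([trained, fresh], o=0.2, k=10, w=w)
print(winner)  # "unused": rep_c has more total potential but is filtered out
```

Note that the previously trained neuron loses despite its much larger total potential, which is exactly the correction that Assumption 13 lacked.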

We claim that using Assumption 14 in the learning protocol yields appropriate choices for neurons to engage, as expressed by Lemma 19 below. Showing these properties depends on a characterization of the incoming weights for a neuron \(u \in N_{\ell }\) at any point during the learning protocol, as expressed by Lemma 18.

Lemma 18

During execution of the learning protocol, at a point after any finite number of concept showings, the following properties hold:

  1. If u has not previously been engaged for learning, then all of u’s incoming weights are equal to the initial weight w.

  2. If u has been engaged for learning a concept c, and has never been engaged for learning any other concept, then all of u’s incoming weights for reps of concepts in children(c) are strictly greater than w, and all of its other incoming weights are strictly less than w.

Proof

Property 1 is obvious: if a neuron is never engaged for learning, its incoming weights do not change. Property 2 follows from Oja’s rule.

Lemma 19

During execution of the learning protocol, the following properties hold for any concept showing:

  1. If a concept c is being shown for the first time, the neuron that gets engaged for learning c is one that was not previously engaged.

  2. If a concept c is being shown for the second or later time, the neuron that gets engaged for learning c is the same one that was engaged when c was shown for the first time.

Proof

We prove Properties 1 and 2 together, by strong induction on the number m of the concept showing. For the inductive step, suppose that concept c is being shown at the \(m^{th}\) concept showing. By (strong) induction, we can see that, for each concept that was previously shown, the same neuron was engaged in all of its showings. Therefore, the weights described in Lemma 18, Property 2, hold for all neurons that have been engaged in showings \(1,\ldots ,m-1\).

Claim: Consider any neuron u with \(layer(u) = level(c)\) that was previously engaged for learning a different concept \(c' \ne c\). Then u has at most \(o \cdot k\) incoming neighbors that contribute potential to u that is \(\ge w\), and so is not eligible for selection by the WTA.

Proof of Claim: Lemma 18, Property 2, implies that all the incoming weights to neuron u for reps of concepts in \(children(c')\) are strictly greater than w, and all of its other incoming weights are strictly less than w. Since \(|children(c) \cap children(c')| \le o \cdot k\), u has at most \(o \cdot k\) incoming neighbors that contribute potential that is \(\ge w\), as claimed.

End of proof of Claim

Now we prove Properties 1 and 2:

  1. If concept c is being shown for the first time, the neuron u that gets engaged for learning c is one that was not previously engaged. Assume for contradiction that the chosen neuron u was previously engaged. Then it must have been engaged for a different concept \(c' \ne c\), since this is the first time c is being shown. By the Claim, u is not eligible for selection by the WTA, a contradiction.

  2. If concept c is being shown for the second or later time, the neuron u that gets engaged for learning c is the same one that was engaged when c was shown for the first time. Arguing as for Property 1, again using the Claim, we can see that u cannot have been previously engaged for a concept \(c' \ne c\). So the only candidates for u are neurons that were not previously engaged, together with the (single) neuron that was previously engaged for c. The given WTA rule chooses u from among these candidates based on highest incoming potential. For neurons that were not previously engaged, Lemma 18, Property 1, implies that the incoming potential is exactly kw. For the single neuron that was previously engaged for c, Lemma 18, Property 2, implies that the incoming potential is strictly greater than kw. So the WTA rule selects the previously-engaged neuron.

With the new WTA assumption, the learning analysis for general hierarchies follows the same pattern as the analysis for tree hierarchies in [4], and yields the same time bound.

Theorem 15

(\((r_1,r_2)\)-Learning General Hierarchies). Let \(\mathcal N\) be the network described above, with maximum layer \(\ell '_{max}\), and with learning rate \(\eta = \frac{1}{4k}\). Let \(r_1, r_2\) be reals in [0, 1] with \(r_1 < r_2\). Let \(\epsilon = \frac{r_2-r_1}{r_1+r_2}\). Let \(\mathcal C\) be any general concept hierarchy, with maximum level \({{\,\mathrm{\ell _{max}}\,}}\le \ell '_{max}\). Assume the revised WTA (Assumption 14) and that the concepts in \(\mathcal C\) are presented according to a \(\sigma \)-bottom-up training schedule as defined in Sect. 4.2, where \(\sigma \) is \(O\left( {{\,\mathrm{\ell _{max}}\,}}\log (k) + \frac{1}{\epsilon }\right) \). Then \(\mathcal N\) \((r_1,r_2)\)-learns \(\mathcal C\).

Implementing Assumption 14 will require some mechanism beyond those used to implement the basic WTA satisfying Assumption 13. Such a mechanism could serve as a pre-processing step before the basic WTA: it could allow a layer \(\ell \) neuron u to fire (and somehow reflect its incoming potential) exactly if u has strictly more than \(o \cdot k\) incoming neighbors that contribute potential \(\ge w\) to u.

G Learning Algorithms for Networks with Feedback

Now we consider how concept hierarchies, with and without overlap, can be learned in layered networks with feedback. The learning algorithms described in Sect. F set the weights on the directed edges from each layer \(\ell \) to the next higher layer \(\ell +1\), that is, the “upward” edges. Now the learning algorithm must also set the weights on the directed edges from each layer \(\ell \) to the next lower layer \(\ell -1\), i.e., the “downward” edges.

One reasonable approach is to separate matters, first learning the weights on the upward edges and then the weights on the downward edges. Fortunately, we can rely on Lemma 9, which says that, if c is any concept in a concept hierarchy \(\mathcal C\), then \(c \in supp_{1} (leaves(c))\). That is, any concept is supported based only on its descendants, without any help from its parents. This lemma implies that learning of upward edges can proceed bottom-up, as in Sect. F. We give some details below.

1.1 G.1 Noise-Free Learning

As in Sect. F, we assume that the threshold is \(\frac{(r_1+r_2)\sqrt{k}}{2}\), the initial weight for each upward edge is \(w = \frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}\), and \(\epsilon = \frac{r_2 - r_1}{r_1+r_2}\). Here we also assume that the initial weight for each downward edge is w. We assume that the network starts in a state in which no neuron in layer \(\ge 1\) is firing.

Our main result, for noise-free learning, is:

Theorem 16

Let \(\mathcal N\) be the network defined in this section, with maximum layer \(\ell '_{max}\), and with learning rate \(\eta = \frac{1}{4k}\). Let \(r_1, r_2\) be reals in [0, 1] with \(r_1 < r_2\). Let \(\epsilon = \frac{r_2-r_1}{r_1+r_2}\).

Let \(\mathcal C\) be any concept hierarchy, with maximum level \({{\,\mathrm{\ell _{max}}\,}}\le \ell '_{max}\). Assume that the algorithm described in this section is executed: On the first pass, the concepts in \(\mathcal C\) are presented according to a \(\sigma \)-bottom-up presentation schedule, where \(\sigma \) is \(O\left( {{\,\mathrm{\ell _{max}}\,}}\log (k) + \frac{1}{\epsilon }\right) \). The second pass is as described in Sect. G.1. Then \(\mathcal N\) \((r_1,r_2,f)\)-learns \(\mathcal C\).

First Learning Pass. As a first pass, we carry out the learning protocol from Sect. F for all the concepts in the concept hierarchy \(\mathcal C\), working bottom-up. Learning each concept involves applying Oja’s rule for that concept, for enough steps to ensure that the weights of the upward edges end up within the bounds described in Sect. 5.4.
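The per-showing weight update in this pass is Oja’s rule. The following is a minimal self-contained sketch of its behavior (the vector encoding, input pattern, and iteration count are our own assumptions; the target value \(\frac{1}{\sqrt{k}}\) for the weights from firing children matches the range stated in Lemma 20):

```python
import numpy as np

def oja_step(wvec, x, eta):
    """One Oja's-rule update: w <- w + eta * y * (x - y * w), with y = w.x.
    Under repeated presentations of the same x, w converges to x/||x||."""
    y = wvec @ x
    return wvec + eta * y * (x - y * wvec)

# Illustration with assumed parameters: k = 4 firing child reps among n inputs.
n, k = 16, 4
eta = 1 / (4 * k)                     # learning rate from the theorems
x = np.zeros(n); x[:k] = 1.0          # the k child reps fire
wvec = np.full(n, 0.01)               # small uniform initial weights
for _ in range(2000):
    wvec = oja_step(wvec, x, eta)

# Weights on the firing inputs approach 1/sqrt(k) = 0.5; the rest decay to ~0.
print(wvec[:k])
```

The sketch shows why the “important” upward weights end up near \(\frac{1}{\sqrt{k}}\) while all other upward weights are driven toward 0.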

Consider the network after the first pass, when the weights of all the upward edges have reached their final values. At that point, we have that the network \((r_1,r_2)\)-recognizes the given concept hierarchy \(\mathcal C\), as described in Sect. F. Moreover, we obtain:

Lemma 20

The weights of the edges after the completion of the first learning pass are as follows:

  1. The weights of the upward edges from reps of children to reps of their parents are in the range \([\frac{1}{(1+\epsilon )\sqrt{k}}, \frac{1}{\sqrt{k}})\), and the weights of the other upward edges are in the range \([0,\frac{1}{2^{\ell _{max}+b}}]\).

  2. The weights of all downward edges are \(w = \frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+1}}\).

As a consequence of these weight settings, we can prove the following about the network resulting from the first pass:

Lemma 21

The following properties hold of the network that results from the completion of the first learning pass:

  1. Suppose c is any concept in C. Suppose that c is shown (that is, the set leaves(c) is presented) at time t, and no inputs fire at any other times. Then rep(c) fires at time \(t + level(c) = t + layer(rep(c))\), and does not fire at any earlier time.

  2. Suppose c is any concept in C. Suppose that c is shown at time t, and no inputs fire at any other times. Suppose \(c'\) is any other concept in C with \(level(c') = level(c)\). Then \(rep(c')\) does not fire at time \(t+level(c)\).

  3. Suppose that u is a neuron in the network that is not a rep of any concept in C. Suppose that precisely the set of level 0 concepts in C is presented at time t, and no inputs fire at any other times. Then neuron u never fires.

Proof

Property 1 follows from the analysis in [4], plus the fact that level(c) is the time it takes to propagate a wave of firing from the inputs to layer(rep(c)). Property 2 can be proved by induction on level(c), using the limited-overlap property. Property 3 can be proved by induction on the time \(t' \ge t\), analogously to Lemma 16.
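The timing claim in Property 1 can be checked on a toy instance: a minimal simulation of the feed-forward firing wave after the first pass, ignoring the still-tiny downward weights. All names and parameter values below are our own illustrative choices:

```python
import math

# Toy 2-level binary hierarchy with fully learned upward weights.
k = 2
wstar = 1 / math.sqrt(k)                     # learned child->parent weight
threshold = (0.9 + 1.0) * math.sqrt(k) / 2   # (r1+r2)*sqrt(k)/2, r1=0.9, r2=1

# children[u] maps each rep neuron to the reps (or leaves) feeding it.
children = {"c1": ["l1", "l2"], "c2": ["l3", "l4"], "r": ["c1", "c2"]}

def first_firing_times(presented, max_t=4):
    """For each rep neuron, the first time it fires when the given leaves
    are presented at time 0 (and nothing else ever fires)."""
    fired = {0: set(presented)}
    first = {}
    for t in range(1, max_t + 1):
        fired[t] = set()
        for u, kids in children.items():
            pot = wstar * sum(1 for d in kids if d in fired[t - 1])
            if pot >= threshold:
                fired[t].add(u)
                first.setdefault(u, t)
    return first

times = first_firing_times({"l1", "l2", "l3", "l4"})
print(times)  # {'c1': 1, 'c2': 1, 'r': 2}: each rep fires at its level
```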

Second Learning Pass. The second pass sets all the weights for the downward edges. Here, for simplicity, we set the weight of each edge to its final value in one learning step, rather than proceeding incrementally. We aim to set the weights of all the “important” downward edges, that is, those that connect the rep of a concept to the rep of any of its children, to \(\frac{f}{\sqrt{k}}\), and the weights of all other downward edges to 0.

We first set the weights on the “important” downward edges. For this, we proceed level by level, from 1 to \({{\,\mathrm{\ell _{max}}\,}}\). The purpose of the processing for level \(\ell \) is to set the weights on all the “important” downward edges from layer \(\ell \) to layer \(\ell -1\) to \(\frac{f}{\sqrt{k}}\), while leaving the weights of the other downward edges equal to the initial weight w.

For each particular level \(\ell \), we proceed sequentially through the level \(\ell \) concepts in C, one at a time, in any order. For each such level \(\ell \) concept c, we carry out the following three steps:

  1. Show concept c, that is, present the set leaves(c), at some time t. By Lemma 21, the reps of all children of c fire at time \(t + \ell - 1\), and rep(c) fires at time \(t+\ell \).

  2. Engage all the layer \(\ell -1\) neurons to learn their incoming downward weights at time \(t+\ell +1\), by setting their dengaged flags.

  3. Learning rule: At time \(t+\ell +1\), each dengaged layer \(\ell -1\) neuron u that fired at time \(t+\ell -1\) sets the weights of any incoming edges from layer \(\ell \) neurons that fired at time \(t+\ell \) (and hence contributed potential to u) to \(\frac{f}{\sqrt{k}}\). Neuron u does not modify the weights of other incoming edges.

Note that, in Step 3, each neuron u that fired at time \(t+\ell -1\) will set the weight of at most one incoming downward edge to \(\frac{f}{\sqrt{k}}\); this is the edge from rep(c), in case u is the rep of a child of c.

Also note that u does not reduce the weights of other incoming downward edges during this learning step. This is to allow u to receive potential from other layer \(\ell \) neurons when those concepts are processed. This is important because u may represent a concept with multiple parents, and must be able to receive potential from all parents when they are processed.

Finally, note that, to implement this learning rule, we need some mechanism to engage the right neurons at the right times. For now, we just treat this abstractly, as we did for learning in feed-forward networks in Sect. F.

At the end of the second pass, each neuron u resets the weights of all of its incoming downward edges that still have the initial weight w, to 0. The neurons can all do this in parallel.
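The per-neuron rule above, together with the final reset, might be sketched as follows (the function names and dictionary encoding are our own, purely illustrative):

```python
import math

def downward_update(downward_w, u_times, neighbor_times, t, ell, f, k):
    """Second-pass rule (Step 3) for one layer ell-1 neuron u, one-shot.

    downward_w:     {neighbor_id: weight} for u's incoming downward edges
    u_times:        set of times at which u fired
    neighbor_times: {neighbor_id: set of times it fired}
    If u fired at t+ell-1, set to f/sqrt(k) the weight of each incoming
    edge from a layer-ell neighbor that fired at t+ell; leave the rest alone.
    """
    if (t + ell - 1) not in u_times:          # u did not fire: no change
        return dict(downward_w)
    return {v: (f / math.sqrt(k) if (t + ell) in neighbor_times[v] else wt)
            for v, wt in downward_w.items()}

def final_reset(downward_w, w):
    """End of second pass: any downward weight still at initial w drops to 0."""
    return {v: (0.0 if wt == w else wt) for v, wt in downward_w.items()}

# Toy run: k = 4, f = 1; u (rep of a child of c) fired at time t+ell-1 = 1,
# rep_c fired at t+ell = 2, while rep_c2 never fired.
w0 = 1e-6
weights = {"rep_c": w0, "rep_c2": w0}
weights = downward_update(weights, u_times={1},
                          neighbor_times={"rep_c": {2}, "rep_c2": set()},
                          t=0, ell=2, f=1.0, k=4)
weights = final_reset(weights, w0)
print(weights)  # {'rep_c': 0.5, 'rep_c2': 0.0}
```

Note that leaving non-matching weights at w until the final reset is what lets a neuron with multiple parents collect the weight \(\frac{f}{\sqrt{k}}\) from each parent as that parent's concept is processed.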

Lemma 22

The weights of the edges after the completion of the second learning pass are as follows:

  1. The weights of the upward edges from reps of children to reps of their parents are in the range \([\frac{1}{(1+\epsilon )\sqrt{k}}, \frac{1}{\sqrt{k}})\), and the weights of the other upward edges are in the range \([0,\frac{1}{2^{\ell _{max}+b}}]\).

  2. The weights of the downward edges from reps of parents to reps of their children are \(\frac{f}{\sqrt{k}}\), and the weights of the other downward edges are 0.

Proof

Property 1 follows from Lemma 20 and the fact that the weights of the upward edges are unchanged during the second pass.

For Property 2, the second pass is designed to set precisely the claimed weights. This depends on the neurons firing at the expected times. This follows from Lemma 21, once we note that the three claims in that lemma remain true throughout the second pass. (The first two properties depend on upward weights only, which do not change during the second pass. Property 3 follows because only rep neurons have their incoming weights changing during the second pass.)

With these weights, we can now prove the main theorem:

Proof

(Of Theorem 16:) We use a scaled version of Theorem 7. Here we use \(w_1 = \frac{1}{1+\epsilon }\), \(w_2 = 1\), and a scaling factor \(s = \frac{1}{\sqrt{k}}\).

1.2 G.2 Noisy Learning

We can extend the results of the previous section to allow noisy learning in the first pass. For this, we use a threshold of \((\frac{(r_1+r_2)k}{2})\bar{w}\) and retain the initial edge weights of \(w = \frac{1}{k^{{{\,\mathrm{\ell _{max}}\,}}+ 1}}\). We define \(w_1 = 1-\frac{r_2-r_1}{25}\), \(w_2 = 1+\frac{r_2-r_1}{25}\), and the scaling factor s to be \(\bar{w} = \frac{1}{\sqrt{pk + 1 - p}}\).

The ideas are analogous to the noise-free case. The differences are:

  1. The first pass continues long enough to complete training of the weights of the upward edges using Oja’s rule.

  2. The weights of the upward edges from reps of children to reps of their parents are in the range \([(1-\frac{r_2-r_1}{25}) \bar{w}, (1+\frac{r_2-r_1}{25}) \bar{w}]\), and the weights of the other upward edges are in the range \([0,\frac{1}{k^{2{{\,\mathrm{\ell _{max}}\,}}}}]\).

  3. The weights of the downward edges are set to \(f \bar{w}\).

With these changes, we can obtain a theorem similar to Theorem 16 but with a larger training time, yielding \((r_1,r_2,f)\)-learning of \(\mathcal C\).

H Future Work

There are many possible directions for extending the work. For example:

Concept Recognition: It would be interesting to study recognition behavior after partial learning. The aim of the learning process is to increase weights sufficiently to guarantee recognition of a concept when partial information about its leaves is presented. Initially, even showing all the leaves of a concept c should not be enough to induce rep(c) to fire, since the initial weights are very low. At some point during the learning process, after the weights increase sufficiently, presenting all the leaves of c will guarantee that rep(c) fires. As learning continues, fewer and fewer of the leaves will be needed to guarantee firing. It would be interesting to quantify the relationship between amount of learning and the number of leaves needed for recognition.

Also, the definition of robustness that we have used in this paper involves just omitting some inputs. It would be interesting to also consider other types of noise, such as adding extraneous inputs. How well do our recognition algorithms handle this type of noise?

Another type of noise arises if we replace the deterministic threshold elements with neurons that fire stochastically, based on some type of sigmoid function of the incoming potential. How well do our recognition algorithms handle this type of noise? Some initial ideas in this direction appear in Appendix A, but more work is needed.

Learning of Concept Hierarchies: Our learning algorithms depend heavily on Winner-Take-All subnetworks. We have treated these abstractly in this paper, by giving formal assumptions about their required behavior. It remains to develop and analyze networks implementing the Winner-Take-All assumptions.

Another interesting issue involves possible flexibility in the order of learning concepts. In our algorithms, incomparable concepts can be learned in any order, but children must be completely learned before we start to learn their parents. We might also consider some interleaving in learning children and parents. Specifically, in order to determine rep(c) for a concept c, according to our learning algorithms, we would need for the reps of all of c’s children to be already determined, and for the children to be learned sufficiently so that their reps can be made to fire by presenting “enough” of their leaves.

But this does not mean that the child concepts must be completely learned, just that they have been learned sufficiently that it is possible to make them fire (say, when all, or almost all, of their leaves are presented). This suggests that it is possible to allow some interleaving of the learning steps for children and parents. This remains to be worked out.

Another issue involves noise in the learning process. In Sects. F and G, we have outlined results for noisy learning of weights of upward edges, in the various cases studied in this paper, but full details remain to be worked out. The approach should be analogous to that in [4], based on presenting randomly-chosen subsets of the leaves of a concept being learned. The key here should be to articulate simple lemmas about achieving approximate weights with high probability. It also remains to consider noise in the learning process for weights of downward edges.

Finally, our work on learning of weights of upward edges has so far relied on Oja’s learning rule. It would be interesting to consider different learning rules as well, comparing the guarantees that are provided by different rules, with respect to speed of learning and robustness to noise during the learning process.

Different Data Models, Different Network Assumptions, Different Representations: One can consider many variations on our assumptions. For example, what is the impact of loosening the very rigid assumptions about the shape of concept hierarchies? What happens to the results if we have limited connections between layers, rather than all-to-all connections? Such connections might be randomly chosen, as in [8]. Also, we have been considering a simplified representation model, in which each concept is represented by precisely one neuron; can the results be extended to accommodate more elaborate representations?

Experimental Work in Computer Vision: Finally, it would be interesting to try to devise experiments in computer vision that would reflect some of our theoretical results. For example, can the high-latency recognition behavior that we identified in Sect. E.2, involving extensive information flow up and down the hierarchy, be exhibited during recognition of visual scenes?

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Lynch, N., Mallmann-Trenn, F. (2023). Learning Hierarchically-Structured Concepts II: Overlapping Concepts, and Networks with Feedback. In: Rajsbaum, S., Balliu, A., Daymude, J.J., Olivetti, D. (eds) Structural Information and Communication Complexity. SIROCCO 2023. Lecture Notes in Computer Science, vol 13892. Springer, Cham. https://doi.org/10.1007/978-3-031-32733-9_4

  • Print ISBN: 978-3-031-32732-2

  • Online ISBN: 978-3-031-32733-9
