2024/08 - Encoding Object Similarity in SDRs

Ramy gives a presentation on the work from his internship. He shows how he uses the relative evidence scores for objects to create a similarity matrix and learn SDRs that encode these similarities as bit overlaps.

2 Likes

The presentation features an algorithm that generates three SDRs from three target overlaps (or similarities) between them.

I see two problems with such an algorithm (any such algorithm, not this one in particular):

  • The first is scaling: how difficulty grows when the algorithm is applied to thousands of target “objects” or SDRs. If it scales, e.g., quadratically, then you might have a problem.
  • Objects or representations aren’t available all at once. First the child learns 100 objects, then another 100, etc. Somehow you’d like already-known representations to stay consistent, to have some… persistence, so they remain recognizable later without having to be rewritten.
2 Likes

Hi @blimpyway and welcome to the TBP Discourse :slight_smile:

These are definitely valid points to consider with such an optimization approach.

Scaling issues

We have not observed scaling issues when optimizing SDRs for thousands of synthetic objects. Even though the number of target overlaps scales quadratically with the number of objects, we sample a small number of objects in every “minibatch” to calculate a representative gradient and optimize the SDRs, in a manner similar to stochastic gradient descent (not discussed in the presentation).

The actual calculation of the target overlaps relies on evidence scores that already exist in LM memory. To calculate the overlaps, we use the output of the evidence updates already computed in EvidenceGraphLM; we do not directly compare object graphs to determine similarity. EvidenceGraphLM does the heavy lifting of computing evidence scores across all existing objects by expanding the size of its hypothesis space to account for the added objects.

I’ve borrowed some common terminology from machine learning that relates to optimization (e.g., calculating gradients, minibatch, SGD), but there isn’t any kind of (deep) learning here. The optimization directly modifies the SDR representations to match the target overlap without training any hidden parameters.
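To make the idea concrete, here is a minimal sketch of this style of optimization. All sizes, the learning rate, the sigmoid relaxation, and the random stand-in targets are illustrative assumptions, not the EvidenceSDRGraphLM internals:

```python
import numpy as np

rng = np.random.default_rng(0)
n_objects, dim, k = 1000, 2048, 41           # illustrative sizes
target = rng.integers(0, k + 1, size=(n_objects, n_objects)).astype(float)
target = np.triu(target, 1) + np.triu(target, 1).T   # symmetric stand-in targets

dense = rng.normal(size=(n_objects, dim))    # real-valued "pre-SDR" vectors
lr, batch = 0.05, 64
for step in range(2000):
    # Sample a small minibatch of object pairs, as in SGD.
    i = rng.integers(0, n_objects, size=batch)
    j = rng.integers(0, n_objects, size=batch)
    si = 1.0 / (1.0 + np.exp(-dense[i]))     # sigmoid = soft binarization
    sj = 1.0 / (1.0 + np.exp(-dense[j]))
    err = (si * sj).sum(axis=1) - target[i, j]        # signed overlap error
    # Gradient of 0.5 * err**2 w.r.t. each dense vector, applied in place.
    np.subtract.at(dense, i, lr * err[:, None] * sj * si * (1 - si))
    np.subtract.at(dense, j, lr * err[:, None] * si * sj * (1 - sj))

# Final SDRs: keep the top-k entries of each optimized dense vector.
sdrs = np.zeros_like(dense)
rows = np.arange(n_objects)[:, None]
sdrs[rows, np.argsort(dense, axis=1)[:, -k:]] = 1.0
```

The sigmoid serves as a differentiable surrogate for binarization; only the representations themselves are updated, consistent with the point above that no hidden parameters are trained.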

Streaming Setup

We ran some experiments with synthetic objects, optimizing SDRs in a streaming setup where we do not assume that we have all the objects during optimization. This simulates the scenario you described of progressively increasing the number of objects as the agent explores the environment. Results are here. The overlap matrix grows as more objects are added, which causes a transient increase in overlap error until optimization quickly drives that error back down.

In the Monty-YCB setup, we do not assume that all the similarity values exist before optimizing SDRs. After each episode, Monty observes a new object and uses the relative evidence scores to fill in the values of a new row in the target similarity matrix. This can be seen here.
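A hypothetical sketch of that bookkeeping step (the function and variable names are mine, and the self-similarity value is an arbitrary placeholder):

```python
import numpy as np

def add_object(target, evidence_scores):
    """Grow the target similarity matrix by one row/column.

    evidence_scores: relative evidence of the new object vs. each known object.
    """
    n = target.shape[0]
    grown = np.zeros((n + 1, n + 1))
    grown[:n, :n] = target
    grown[n, :n] = evidence_scores   # the new row filled after this episode
    grown[:n, n] = evidence_scores   # keep the matrix symmetric
    grown[n, n] = 1.0                # placeholder self-similarity
    return grown

target = np.zeros((0, 0))                          # no objects yet
target = add_object(target, np.array([]))          # first object: nothing to compare
target = add_object(target, np.array([0.3]))       # second object vs. the first
target = add_object(target, np.array([0.7, 0.1]))  # third object vs. the first two
```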

We have yet to test this approach in the more challenging setup of unsupervised learning, where learning and inference occur together in the same episode. In that case, Monty will only add relative similarity scores against existing object graphs, and we cannot assume knowledge of similarity values for objects that do not yet exist in memory. I do not expect this to be much more challenging than the synthetic streaming-object experiments.

Related issues to consider here are representation drift and the stability of SDRs. EvidenceSDRGraphLM has a stability parameter that controls how much learned SDRs are allowed to change as we introduce new objects. This is targeted at solving some of the challenges you described with learning persistent object representations in a streaming manner. We have had some ideas for how Hebbian-style learning rules could be used to enable continual learning in this streaming setting, without revisiting all known evidence scores, but we realistically won’t get to exploring these anytime soon. It does suggest an avenue for how the brain might solve this issue, and something we might revisit.
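One way such a stability knob could work (an illustrative guess, not the actual parameter semantics in EvidenceSDRGraphLM) is to blend re-optimized representations of already-known objects back toward their previous values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_known, dim = 100, 2048
stability = 0.9   # 1.0 freezes old representations, 0.0 rewrites them freely

dense_old = rng.normal(size=(n_known, dim))              # before this round
dense_new = dense_old + rng.normal(size=(n_known, dim))  # after re-optimization
# Already-known objects only drift by (1 - stability) of the proposed change.
dense = stability * dense_old + (1 - stability) * dense_new
```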

Note that encoding similarity into SDRs is not a TBP priority according to the current roadmap. A broader discussion of neural elements in Monty can be found here.

2 Likes

Sounds thoroughly considered.

A question I would ask is: why? I thought that, in the real world, resemblance between X and Y is something to be inferred from observing/experiencing both X and Y, not something that is aimed for, and certainly not to any level of “precision” of overlap.

A side thought, also… what if there is no single SDR encoding a cup, an ant, a bike, etc.? Then “resemblance” and “identity” aren’t as straightforward as the theory assumes?

The similarity between objects is inferred from observing the objects, as you said. This measure of similarity is a byproduct of updating evidence scores as we observe features on an object. It is not something that we aim for. However, the mapping from a similarity score to overlap bits in SDRs is learned, because Monty is not natively implemented with SDRs.

The notion of similarity would be useful for representing compositionality. In Monty, every message between any two learning modules (LMs) is represented as a feature at a location. For a low-level LM observing a cup, the features can be morphological (e.g., point normal and principal curvature directions) and/or non-morphological features (e.g., hue), and the location in this case would be a location in the reference frame of the cup. For a higher-level LM that observes a scene representation (e.g., dinner set), the features would be different objects at different locations in the reference frame of the scene. Any learning module needs to be able to compute the similarity of the incoming features to compare the sensed features with the model features. It is easy to compute the similarity of hue or vectors, but a bit more involved for higher level features (e.g., objects).
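As a rough illustration of that last point (all functions here are hypothetical, not Monty’s API): similarity is trivial to define for low-level features, but object-level features need something learned, such as SDR overlaps:

```python
import numpy as np

def hue_similarity(h1, h2):
    """Hues live on a circle [0, 1); similarity is 1 minus the wrapped distance."""
    d = abs(h1 - h2) % 1.0
    return 1.0 - 2.0 * min(d, 1.0 - d)

def vector_similarity(v1, v2):
    """E.g., point normals or curvature directions: cosine similarity."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def object_similarity(sdr1, sdr2, k=40):
    """Higher-level 'feature = object': normalized overlap of learned SDRs."""
    return float(np.sum(sdr1 * sdr2)) / k
```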

I like to think of it as a measure of interchangeability, i.e., can we swap out one object for another in a high-level scene representation and still recognize the same scene? We would need to know how similar the objects are to perform such a computation. The image below is from another presentation during my internship.

Note that while we think similarity is useful for representing compositional objects, we are still not sure that SDRs can meaningfully encode similarities as bit overlaps. We do not want to assign meaning to individual bits of such a sparse representation.

Hope this clarifies things.

2 Likes

One thing that comes to mind is to generate real-valued embedding vectors representing “objects” in such a way that “more similar” objects have closer (in some metric) embeddings.

With sufficiently large vectors, just keeping the top-k values of each embedding to get the respective SDRs should be sufficient to propagate a measure of similarity into SDR space.
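A quick numpy check of this intuition (sizes arbitrary): perturbing an embedding slightly leaves its top-k SDR largely intact, while an unrelated embedding overlaps only at chance level, about k²/d bits:

```python
import numpy as np

def topk_sdr(v, k):
    """Binarize a dense embedding by keeping its top-k entries."""
    sdr = np.zeros(v.shape, dtype=np.int32)
    sdr[np.argsort(v)[-k:]] = 1
    return sdr

rng = np.random.default_rng(0)
d, k = 1024, 40
base = rng.normal(size=d)
near = base + 0.1 * rng.normal(size=d)   # small perturbation = "similar" object
far = rng.normal(size=d)                 # unrelated embedding

print(topk_sdr(base, k) @ topk_sdr(near, k))  # high overlap (close to k)
print(topk_sdr(base, k) @ topk_sdr(far, k))   # chance overlap (~ k*k/d ≈ 1.6)
```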

Regarding composability, I still think real-valued vectors are more useful (or at least more convenient) than SDRs, and they can still be “downgraded” to SDRs wherever an SDR representation is considered more useful.


E.g., one issue with overlapping two SDRs is that it doesn’t encode “order”: A + B == B + A.
Adding two float-vector embeddings has the same issue, but at least we know that positional encodings exist and do work. They are even able to encode different types of relationships, like ((A + R) % 1 + B) % 1, where R is an embedding representing an arbitrary, complex relationship (even a 7D spatial relation = relative position + rotation + scaling) between A and B.
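For concreteness, a small numpy demonstration of this modulo-1 binding and its inverse (all names and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
A, B, R = rng.random(d), rng.random(d), rng.random(d)

bound = ((A + R) % 1 + B) % 1     # equivalent to (A + R + B) % 1
# Unbinding: given B and R, recover A up to wrap-around on the unit circle.
A_rec = (bound - B - R) % 1
diff = np.abs(A_rec - A)
print(np.minimum(diff, 1 - diff).max())   # circular error on the order of 1e-16
```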

Some of these thoughts were shuffled here


A fundamental (and mostly rhetorical) question about the example pictures you shared: how do objects pop into existence? Let’s say you feed a “newborn” machine a bunch of images or videos. How do you expect it to start “believing” that there’s a cup, a cereal box, a plate, etc., encoding the first image as a “cup SDR” and the second as several SDRs (for cup, spoon, table, etc.)? How is this… segmentation supposed to emerge?


2 Likes

Hi @blimpyway, I came across this and wanted to share my thoughts.

One thing that comes to mind is to generate real-valued embedding vectors representing “objects” in such a way that “more similar” objects have closer (in some metric) embeddings.

With sufficiently large vectors, just keeping the top-k values of each embedding to get the respective SDRs should be sufficient to propagate a measure of similarity into SDR space.

I think what you’ve proposed is a clever way to derive SDRs from float vectors (I believe we also currently do this in the EvidenceSDRGraphLM class). I agree that having semantically similar objects map to nearby embeddings is often desirable — especially for retrieval, classification, or compression.

That said, I think directly optimizing for object similarity may be problematic, particularly in general-purpose systems like Monty. Whether two objects are “similar” is highly context-dependent. For example, a kitchen cup and a chemistry-lab beaker may share features and morphology but serve very different roles — I probably wouldn’t want to drink from the beaker. Conversely, some tasks might not care about differences between dissimilar-looking objects if they serve the same function. In the future, we might want to encode affordances in higher-level scenes, etc. These should come from higher-level feedback, because they describe how an object might be used (or appear) in a compositional scene. All of these different factors define the similarity of objects.

On the top-k SDR idea: I think the dimensionality details are important here. In low dimensions, there’s a chance of SDR collisions — distinct objects may share top-k elements, especially if k is small. One way to mitigate this is to choose k = \frac{d}{2}, which maximizes the number of possible combinations (since \binom{d}{k} is maximized at k = d/2). But this starts to challenge the idea that the representation is “sparse.”

On the other hand, if we keep k small and increase the overall embedding dimension to preserve sparsity, we run into the curse of dimensionality, i.e. distance metrics like Euclidean or cosine become less meaningful, as all vectors tend to become nearly equidistant. While it’s true that something like \binom{512}{16} \approx 10^{30} provides a vast space of SDRs, it’s hard to know ahead of time what the optimal dimensionality should be — especially as Monty continues to learn more concepts over its lifetime.
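The capacity numbers are easy to check directly (a quick calculation, nothing Monty-specific):

```python
from math import comb

d = 512
for k in (8, 16, 64, 128, 256):
    print(f"k={k:3d}  C({d},{k}) = {comb(d, k):.3e}")
# C(512,16) is indeed ~1e30; the count peaks at k = d/2 = 256,
# but 50% density is no longer "sparse".
```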

one issue with overlapping two SDRs is that it doesn’t encode “order”: A + B == B + A
Adding two float-vector embeddings has the same issue, but at least we know that positional encodings exist and do work. They are even able to encode different types of relationships, like ((A + R) % 1 + B) % 1, where R is an embedding representing an arbitrary, complex relationship (even a 7D spatial relation = relative position + rotation + scaling) between A and B

As for SDRs not encoding order: I think we may be able to learn things like relation, hierarchy, and order in a compositional manner, without them being baked into the SDR itself. Vector representations from models like word2vec (e.g., “king - man + woman = queen”) or GAN latent spaces are examples where these relations were emergent rather than explicitly designed mechanisms. So while float vectors do offer a path to encoding order — via positional embeddings or relational vectors like R — I think it may still be possible to learn these relational mechanisms on top of SDRs. Just my two cents, though.

4 Likes

Regarding the curse of dimensionality and SDRs… I don’t think this is a big issue, since overlap (the inverse of overlap, actually) is the “natural” way to measure similarity, and it works pretty well.
E.g., if one needs vector search, some engines (e.g., pynndescent) allow user-defined metrics in their API, so using an overlap metric to search for near neighbors in a big (millions of entries) database isn’t out of reach, even for quite large SDRs, e.g., 400 active bits out of 20k.
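A sketch of what that could look like with pynndescent (scaled down to 2,048-bit SDRs with 40 active bits so it runs quickly; the custom metric just needs to be numba-jittable):

```python
import numpy as np
import pynndescent
from numba import njit

d, k_active, n = 2048, 40, 2000
rng = np.random.default_rng(0)
sdrs = np.zeros((n, d), dtype=np.float32)
for row in sdrs:
    row[rng.choice(d, k_active, replace=False)] = 1.0

@njit
def overlap_distance(a, b):
    # "Inverse of overlap": 0 when two SDRs share all active bits, 1 when disjoint.
    return 1.0 - np.dot(a, b) / 40.0

index = pynndescent.NNDescent(sdrs, metric=overlap_distance)
neighbors, distances = index.query(sdrs[:5], k=10)
```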

1 Like

one issue with overlapping two SDRs is that it doesn’t encode “order”: A + B == B + A

That’s an interesting point. I should probably point out that in HTM, order is encoded as context by choosing which cell in a minicolumn is active. Therefore, the SDR representing “C” in the context of seeing “A” then “B” is different from the SDR representing “C” in the context of seeing “B” then “A”. Even the union of predictions represents a union of context-aware SDRs, so an explicit notion of order is not very important here. Order and positional encoding of inputs are needed in deep-learning attention mechanisms, but not with SDRs that represent context in minicolumns. For what it’s worth, LSTMs also don’t require positional encoding, because they are expected to learn some context in their memory representation through backpropagation.
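A toy illustration of that mechanism (not an HTM implementation; the random cell choice stands in for learned distal connections):

```python
import numpy as np

n_cols, cells_per_col, n_active = 1000, 16, 20
rng = np.random.default_rng(0)
c_columns = rng.choice(n_cols, n_active, replace=False)  # minicolumns for "C"

def contextual_sdr(columns, context_seed):
    # Pick one cell per active minicolumn; which cell depends on the context.
    cells = np.random.default_rng(context_seed).integers(0, cells_per_col, len(columns))
    return set(zip(columns.tolist(), cells.tolist()))    # (column, cell) pairs

c_after_ab = contextual_sdr(c_columns, context_seed=1)   # "C" after seeing A, B
c_after_ba = contextual_sdr(c_columns, context_seed=2)   # "C" after seeing B, A
print(len(c_after_ab & c_after_ba))  # few shared cells despite identical columns
```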

In TBT, and by extension Monty, we have a similar (and perhaps more flexible) mechanism of learning features at locations. Each feature is associated with a location in a reference frame (RF). You can imagine the simplest form of RF being a 1D RF representing the order of notes heard in a melody. But RFs can become more complex, representing higher dimensions and even abstract concepts.

how do objects pop out into existence?

Yes, this is a very interesting question that we have frequently visited during research meetings. In my opinion, approaches that rely on prediction error seem to be promising at detecting boundaries of objects or events. But we have recently discussed other very plausible approaches that rely on object behaviors. Stay tuned as we release a set of exciting videos touching on this topic.

2 Likes

Yeah, that should be an essential part of a world model.

1 Like

The issue I see with that is that in (H)TM there are two entirely different things considered as SDRs: the input/output SDR, whose size is the number of columns (e.g., 1000 bits), and what you call “context”, the most recent firing state of ALL cells within the TM, whose size is the number of columns multiplied by the height of each column (e.g., 1000 × 16 = 16k bits available for context).

One entity represents the information (or “symbols”, if you like) getting into or out of the TM’s “black box”; the other is its internal state, whatever is invisibly contained within the box.

So when we want to compress “symbol A followed by B” (assuming that is necessary for higher-order semantics) into a single output symbol C (only 1000 bits, not 1000 × 16), that is what overlapping alone cannot solve.

This assumption seems necessary if we want to avoid passing messages as large as the whole state of the “brain”.

Regarding objects: I’m a bit skeptical about the current object-centric approach, which makes a lot of assumptions about how objects should be represented and how they work, without first considering the question of how objects are (or should be) brought into existence.
This is simply because it is very likely that whatever mechanism creates mental object representations also has a heavy say in how these representations are encoded and how the objects interact/behave.

1 Like

The issue I see with that is that in (H)TM there are two entirely different things considered as SDRs

Yes, this is a good point to bring up. There are usually three representations. Only the latter two are SDRs.

  • The output of the spatial pooler: This is one cell per minicolumn, and it is not aware of any context. We have around 150 minicolumns in a cortical column, which means the size of this representation is around 150 cells (much less than 1000 in practice). This representation is not very sparse and does not get shared with other layers, but it drives the activations in the minicolumns. It is tempting to view the set of active minicolumns (e.g., 32 out of 150) as an SDR, but this is not accurate, given that it is never decoded or used as an SDR. Among other things, it is not very high-dimensional or sparse.

  • The full SDR in L4: This includes all the cells in all minicolumns. This is the real SDR that gets shared with other layers within a cortical column. It is much sparser than the output of the spatial pooler and encodes the context of previous representations. I would not treat it as a black box in this case.

  • The output SDR in L3: The SDR here would represent a more stable representation of a sequence of feature SDRs (i.e., temporal pooling). This is a separate SDR that gets shared with other columns.

It is worth noting that in the TBT, we encode context by associating features in L4 with locations (anchored to an object’s RF) in L6a. This association encodes the order you are referring to. It is analogous to positional encoding of features: the association ties a feature to a location without having to modify the feature SDR itself to embed the positional knowledge.
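A minimal sketch of that idea (the class and the melody example are mine, not Monty code): the same feature SDR can appear at several locations, and the order lives entirely in the association:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureAtLocation:
    feature_sdr: frozenset   # active-bit indices; unchanged across contexts
    location: tuple          # coordinates in the object's reference frame

note_c = frozenset({3, 17, 101, 440, 987})      # toy feature SDR for a note
# The same note SDR at two positions in a 1D "melody" reference frame:
third_note = FeatureAtLocation(note_c, (3,))
fifth_note = FeatureAtLocation(note_c, (5,))
```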

I’m a bit skeptical about the current object-centric approach

We assume for now that Monty learns about objects in an isolated manner (i.e., one at a time). This is akin to a child holding an object and devoting its attention to it to the exclusion of the rest of the world (the nearsightedness of infants may actually assist this). That being said, we have been discussing approaches that make use of object behaviors to break a parent object into children in a compositional hierarchy. I think these videos are queued to be released soon.

2 Likes