@sknudstrup reviews the research on compositional policies so far and presents his thoughts using a toy example of a lamp with a switch.
Still working my way through this now, so please forgive me if you guys cover this later in the video. But when talking about hierarchical goal-state generation, wouldn't it make more sense to distribute high-level planning across as wide a group of LMs as possible (or at least up to some threshold), rather than having a dedicated "make coffee" module?
You could then draw on a wider pool of low-level LMs for your motor-behavioral output and simply select for the shortest acyclic path from goal state to behavioral execution?
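To make that concrete, here's a rough sketch of the selection step I have in mind. Everything in it is hypothetical (the state names, the idea that each LM advertises which subgoals it can decompose a state into), and I'm treating alternative decompositions as OR-branches; conjunctive subgoals would need proper AND/OR handling. A plain BFS over the decomposition graph already gives you the shortest acyclic path, since BFS never revisits a node:

```python
from collections import deque

def shortest_acyclic_path(edges, goal_state, motor_primitives):
    """BFS from the goal state down to any directly executable primitive.

    edges: dict mapping a state to the subgoal states that some LM
           claims it can decompose it into (one hop per decomposition).
    motor_primitives: states a low-level LM can execute directly.
    Returns the shortest hop sequence, or None if nothing connects.
    """
    frontier = deque([(goal_state, [goal_state])])
    visited = {goal_state}
    while frontier:
        state, path = frontier.popleft()
        if state in motor_primitives:
            return path  # first primitive reached = shortest path
        for sub in edges.get(state, ()):
            if sub not in visited:  # never revisit, so the path is acyclic
                visited.add(sub)
                frontier.append((sub, path + [sub]))
    return None

# Toy example (all names made up):
edges = {
    "coffee_made": ["water_boiled", "grounds_in_filter"],
    "water_boiled": ["kettle_switch_pressed"],
    "grounds_in_filter": ["scoop_grounds"],
}
primitives = {"kettle_switch_pressed", "scoop_grounds"}
print(shortest_acyclic_path(edges, "coffee_made", primitives))
# ['coffee_made', 'water_boiled', 'kettle_switch_pressed']
```

The point being: no single LM owns "make coffee"; the plan just falls out of whichever chain of decompositions bottoms out in executable behavior fastest.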
Edit: @26:33. I think @vclay is 100% on point with her observation (regarding typing on a keyboard). I suspect what’s going on is that we slowly migrate learned (consistent/efficient model-based policy executions) into model-free storage. Then when we’re touch typing, it’s not that we’ve memorized where specific keys are at on the keyboard, but rather what specific hand/digit orientation + pose leads to what specific outcome. In this way, we’re outputting communication not through character selection, but through sequences of hand gestures (similar to how a deaf person might sign).
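In code terms, the migration I'm imagining is basically caching: deliberate with the model until the same plan keeps winning, then store the state-to-action mapping directly. This is purely my sketch of the idea, not anything from the video, and the names and the promotion threshold are invented:

```python
class PolicyCache:
    """Crude habit-formation sketch: promote repeatedly chosen
    model-based plans into a model-free lookup table."""

    def __init__(self, planner, promote_after=20):
        self.planner = planner          # expensive, deliberate model-based planner
        self.habits = {}                # state -> action (model-free storage)
        self.counts = {}                # (state, action) -> times chosen
        self.promote_after = promote_after

    def act(self, state):
        # Habit fires first: no deliberation once the mapping is cached.
        if state in self.habits:
            return self.habits[state]
        action = self.planner(state)    # slow path: plan from the model
        key = (state, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        # Promote once the same plan has proven itself repeatedly.
        if self.counts[key] >= self.promote_after:
            self.habits[state] = action
        return action
```

Under this picture, touch typing is what you get once almost every (intended character, current hand pose) pair has been promoted: the slow planner that once located keys spatially barely gets consulted anymore, and what's stored is pose-to-outcome, i.e. gestures.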