Monty on the MNIST dataset

Hi,

I’ve been experimenting with Monty for 2-D object recognition on the MNIST digits and would appreciate your advice on pushing the accuracy and speed higher.

What I changed so far

  • Principal-curvature stage
    • Replaced the 3-D quadric fitting with a 2-D Hessian-based approach.
    • Tangential vectors (Hessian eigen-vectors) are fed into pose_vectors[1:3].
  • Point-normal vector
    • In 2-D every patch shares the same “out-of-plane” normal, so I fix pose_vectors[0] = (0, 0, 1).
  • Environment
    • Adapted SaccadeOnImageEnvironment for flat images.
  • Dataset split
    • 60 samples per digit (0–9).
      • 30 for training
      • all 60 (the full mini-set) for evaluation.

Current results

Accuracy ≈ 55 %
Avg. recognition latency ≈ 1.2 s per image (≈ 0.025 s per step)

My target is ≥ 90 % accuracy at ≥ 13 FPS.

Questions & points for discussion

  1. LM configuration
  • Are there recommended LM hyper-parameters (tolerance, max match distance, etc.) that typically boost performance on small 2-D datasets?
  2. Rotation ambiguity (6 ↔ 9)
  • Would it be sensible to suppress 180° pose hypotheses during matching, or is there an existing mechanism to disambiguate such mirror digits inside Monty’s LM?
  3. Any other proven tricks for 2-D use-cases—e.g., descriptor dimensionality reduction, alternative KD-tree settings—that you have found effective?

Thanks in advance for any guidance!

Best regards,

Below are my code snippets
------------------------------------------------- experiment configuration --------------------------------------------

mnist_training = dict(
   experiment_class=MontySupervisedObjectPretrainingExperiment,
   experiment_args=ExperimentArgs(
       n_train_epochs=1,
       do_eval=False,
   ),
   logging_config=CSVLoggingConfig(
       output_dir="mnist/log",
       monty_log_level="BASIC",
       monty_handlers=[BasicCSVStatsHandler],
   ),
   monty_config=PatchAndViewMontyConfig(
       # Take 1 step at a time, following the drawing path of the letter
       motor_system_config=MotorSystemConfigInformedNoTransStepS1(),
       sensor_module_configs=omniglot_sensor_module_config,
   ),
   dataset_class=ED.EnvironmentDataset,
   dataset_args=MnistDatasetArgs(),
   train_dataloader_class=ED.MnistDataLoader,
   train_dataloader_args = get_mnist_train_dataloader(start_at_version = 0, number_ids = np.arange(0,10), num_versions=30)
)

mnist_inference = dict(
   experiment_class=MontyObjectRecognitionExperiment,
   experiment_args=ExperimentArgs(
       #model_name_or_path=pretrain_dir + "/mnist_training/",
       model_name_or_path = "mnist/log/mnist_training/pretrained",
       do_train=False,
       n_eval_epochs=1,
   ),
   logging_config=CSVLoggingConfig(
       output_dir="mnist/log",
       monty_log_level="BASIC",
       monty_handlers=[BasicCSVStatsHandler],
   ),

   monty_config=PatchAndViewMontyConfig(
       monty_class=MontyForEvidenceGraphMatching,
       learning_module_configs=dict(
           learning_module_0=dict(
               learning_module_class=EvidenceGraphLM,
               learning_module_args=dict(
                   # xyz values are in larger range so need to increase mmd
                   max_match_distance=5,
                   tolerances={
                       "patch": {
                           "principal_curvatures_log": np.ones(2),
                           "pose_vectors": np.ones(3) * 45,
                       }
                   },
                   # Point normal always points up, so they are not useful
                   feature_weights={
                       "patch": {
                           "pose_vectors": [0, 1, 0],
                       }
                   },
                   # We assume the letter is presented upright
                   initial_possible_poses=[[0, 0, 0]],
               ),
           )
       ),
       sensor_module_configs=omniglot_sensor_module_config,
   ),
   dataset_class=ED.EnvironmentDataset,
   dataset_args=MnistDatasetArgs(),
   eval_dataloader_class=ED.MnistDataLoader,
   eval_dataloader_args = get_mnist_eval_dataloader(start_at_version = 0, number_ids = np.arange(0,10), num_versions=60)
)

------------------------------------------------- experiment configuration --------------------------------------------

------------------------------------------- SaccadeOnImageEnvironment----------------------------------------


class TwoDimensionSaccadeOnImageEnvironment(EmbodiedEnvironment): # by skj for 2D image evaluation
   def __init__(self, patch_size=10, data_path=None):
       self.patch_size = patch_size
       self.rotation = qt.from_rotation_vector([np.pi / 2, 0.0, 0.0])
       self.state = 0
       self.data_path = data_path
       if self.data_path is None:
           self.data_path = os.path.join(os.environ["MONTY_DATA"], "mnist/samples/trainingSample")
       self.number_names = [
           a for a in os.listdir(self.data_path) if a[0] != "."
       ] 
       self.current_number = self.number_names[0]       
       self.number_version = 1

       self.current_image, self.current_loc = self.load_new_number_data()
         
       self.move_area = self.get_move_area()
      
       # Placeholders for the 3D scene point clouds (not used in this 2-D setup)
       self.current_scene_point_cloud = 0
       self.current_sf_scene_point_cloud = 0
       self._agents = [
           type(
               "FakeAgent",
               (object,),
               {"action_space_type": "distant_agent_no_translation"},
           )()
       ]
       self._valid_actions = ["look_up", "look_down", "turn_left", "turn_right"]

   @property
   def action_space(self):
       ……
   def add_object(self, *args, **kwargs):
	……

   def step(self, action: Action):
   
       if action.name in self._valid_actions:
           amount = action.rotation_degrees
       else:
           amount = 0
      
       if np.abs(amount) < 1:
           amount = 1
       # Make sure amount is int since we are moving using pixel indices
       amount = int(amount)
       # Override: always move exactly one pixel per step
       amount = 1
      
       query_loc = self.get_next_loc(action.name, amount)       
       self.current_loc = query_loc 
              
       patch = self.get_image_patch(
           self.current_image, self.current_loc, self.patch_size
       )               
       #print(action.name)

        # patch: (H, W) uint8, 0 = background, >0 = stroke pixels
       h, w = patch.shape
       yy, xx = np.mgrid[0:h, 0:w]
       zz = np.zeros_like(xx, dtype=np.float32)

        # Mark stroke pixels (value > 0) with semantic_id = 1
       sem_id = (patch > 0).astype(np.float32)
       semantic_3d = np.stack([xx, yy, zz, sem_id], axis=-1) \
               .astype(np.float32) \
               .reshape(-1, 4)  

       sensor_frame_data = semantic_3d.copy()

        # Depth map: 0.5 (foreground stroke) / 1.0 (background)
       depth = np.where(patch > 0, 0.5, 1.0).astype(np.float32)
        # world_camera: identity matrix since the scene is just a flat plane
       world_camera = np.eye(4, dtype=np.float32)
       obs = {
           "agent_id_0": {
               "patch": {
                   "depth": depth,
                   "semantic_3d": semantic_3d,
                   "sensor_frame_data": sensor_frame_data,
                   "world_camera": world_camera,
                   "rgba": np.stack([patch, patch, patch], axis=2),
                   "pixel_loc": self.current_loc,
               },
               "view_finder": {
                   "depth": self.current_image,
                   "semantic": np.array(patch, dtype=int),
               },
           }
       }       
       return obs

   def get_state(self):
       ……
   def switch_to_object(self, number_id, version_id):
       ……   
   def remove_all_objects(self):
       ……

   def reset(self):
       self.step_num = 0
       patch = self.get_image_patch(
           self.current_image, self.current_loc, self.patch_size
       )              
        depth = np.ones((patch.shape[0], patch.shape[1]))
       obs = {
           "agent_id_0": {
               "patch": {
                   "depth": depth,
                   "semantic": np.array(patch, dtype=int),
                   "rgba": np.stack(
                       [patch, patch, patch], axis=2
                   ),  # TODO: placeholder
                   "pixel_loc": self.current_loc,
               },
               "view_finder": {
                   "depth": self.current_image,
                   "semantic": np.array(patch, dtype=int),
               },
           }
       }       
       return obs

   def load_new_number_data(self):
       …… 
   def load_depth_data(self, depth_path, height, width):
       ……
   def process_depth_data(self, depth):
       ……
   def load_rgb_data(self, rgb_path):
       ……
   def get_move_area(self):
       ……
   def get_next_loc(self, action_name, amount):
       ……
   def get_image_patch(self, img, loc, patch_size):
       ……
   def close(self):
       ……

------------------------------------------- SaccadeOnImageEnvironment----------------------------------------

--------------------------------------------sensor_modules.py-------------------------------------------------------

   @staticmethod
   def get_hessian_eigens(img_patch: np.ndarray, center: int, σ=1.0):
       f = cv2.GaussianBlur(img_patch, (0, 0), σ)       # noise smoothing
       fxx = cv2.Sobel(f, cv2.CV_64F, 2, 0, ksize=3)
       fyy = cv2.Sobel(f, cv2.CV_64F, 0, 2, ksize=3)
       fxy = cv2.Sobel(f, cv2.CV_64F, 1, 1, ksize=3)
       H   = np.array([[fxx.flat[center], fxy.flat[center]],
                       [fxy.flat[center], fyy.flat[center]]])
       λ, V = np.linalg.eigh(H)              # eigh returns eigenvalues in ascending order
       idx  = np.argsort(-np.abs(λ))         # re-sort by |λ|, largest first
       return λ[idx][0], λ[idx][1], V[:, idx][:, 0], V[:, idx][:, 1], True
   ####################### by skj for 2D processing

   def extract_and_add_features(
           self,
           features: dict,
           gray_patch: np.ndarray,        # (H, W)  - gray patch used for shape features
           rgba_patch: np.ndarray,        # (H, W, 3)
           depth_patch: np.ndarray,       # (H, W)  - synthetic depth (0.5 / 1.0)
           center_flat_idx: int,          # row * W + col
           center_rowcol: int,            # patch center (row == col for square patches)
           sem_mask: np.ndarray,          # (H, W)  - on-object mask
       ):
       # ────────────────────────────────────────────────────────────
       # 1.  Morphological (shape) features
       # ────────────────────────────────────────────────────────────
       k1, k2, v1, v2, valid_pc = self.get_hessian_eigens(gray_patch, center_flat_idx)
       normal = np.array([0.0,0.0,1.0])
       morphological_features = {
           "pose_vectors": np.vstack([
               #np.append(grad_vec, 0.0),     # z=0 padding
               normal,
               np.append(v1, 0.0),
               np.append(v2, 0.0),
           ]),
           #"pose_fully_defined": pose_fully_defined,
           "pose_fully_defined": bool(abs(k1-k2) > self.pc1_is_pc2_threshold)

       }
       # ────────────────────────────────────────────────────────────
       # 2.  Non-morphological features (RGBA, HSV, depth statistics)
       # ────────────────────────────────────────────────────────────
       # Center pixel coordinates
       c = center_rowcol
       if "rgba" in self.features:
           features["rgba"] = rgba_patch[c, c]

       if "hsv" in self.features:
           rgb = rgba_patch[c, c] / 255.0
           hsv = skimage.color.rgb2hsv(rgb[np.newaxis, np.newaxis, :])[0, 0]
           features["hsv"] = hsv

       if "min_depth" in self.features:
           valid = depth_patch[sem_mask] < 1.0   # on-object
           features["min_depth"] = float(depth_patch[sem_mask][valid].min()) \
               if valid.any() else np.nan

       if "mean_depth" in self.features:
           valid = depth_patch[sem_mask] < 1.0
           features["mean_depth"] = float(depth_patch[sem_mask][valid].mean()) \
               if valid.any() else np.nan

       if any("curvature" in f for f in self.features) and valid_pc:
           if "principal_curvatures" in self.features:
               features["principal_curvatures"] = np.array([k1, k2])
           if "principal_curvatures_log" in self.features:
               features["principal_curvatures_log"] = log_sign(np.array([k1, k2]))
           if "gaussian_curvature" in self.features:
               features["gaussian_curvature"] = k1 * k2
           if "mean_curvature" in self.features:
               features["mean_curvature"] = (k1 + k2) / 2

       invalid_signals = False  # normal computation cannot fail in this 2-D setup
       return features, morphological_features, invalid_signals


   def observations_to_comunication_protocol(self, data, on_object_only=True):   
       patch_dict = data #by skj
       # 1) Gray patch (H, W) used for gradient / Hessian computation
       gray_patch = patch_dict["rgba"][:, :, 0].astype(np.float32) # by skj
       h, w = gray_patch.shape # by skj
       center_rowcol  = h // 2 # by skj
       center_flat_idx = center_rowcol * w + center_rowcol # by skj

       obs_3d = data["semantic_3d"]       
       sensor_frame_data = data["sensor_frame_data"]
       world_camera = data["world_camera"]
       rgba_feat = data["rgba"]
       depth_feat = data["depth"].reshape(data["depth"].size, 1).astype(np.float64)
       # Assuming squared patches
       center_row_col = rgba_feat.shape[0] // 2
       # Calculate center ID for flat semantic obs
       obs_dim = int(np.sqrt(obs_3d.shape[0]))
       half_obs_dim = obs_dim // 2
       center_id = half_obs_dim + obs_dim * half_obs_dim
       # Extract all specified features
       features = dict()
      
       if "object_coverage" in self.features:
           # Last dimension is semantic ID (integer >0 if on any object)
           features["object_coverage"] = sum(obs_3d[:, 3] > 0) / len(obs_3d[:, 3])
           assert (
               features["object_coverage"] <= 1.0
           ), "Coverage cannot be greater than 100%"
      
       rgba_patch  = patch_dict["rgba"]
       depth_patch = patch_dict["depth"]
       sem_mask    = (patch_dict["semantic_3d"][:, 3].reshape(gray_patch.shape) > 0)
      
       features, morph_feats, invalid = self.extract_and_add_features(
               features,
               gray_patch,
               rgba_patch,
               depth_patch,
               center_flat_idx,
               center_rowcol,
               sem_mask,
       )

       # 3) The on_object decision uses the 4th value (semantic id) of semantic_3d
       sem_3d = patch_dict["semantic_3d"]

       # 3D coordinates of the center pixel (x, y, z = 0)
       obs_center = sem_3d[center_flat_idx]        # [x, y, z, semantic_id]
       x, y, z = obs_center[:3]                    # z is 0
       semantic_id = obs_center[3]
       #print(semantic_id)
       # on_object flag
       morph_feats["on_object"] = float(semantic_id > 0)

       observed_state = State(
           location=np.array([x, y, z]),  
           morphological_features=morph_feats,
           non_morphological_features=features,
           confidence=1.0,
           use_state=bool(morph_feats["on_object"]) and not invalid,
           sender_id=self.sensor_module_id,
           sender_type="SM",
       )      
       return observed_state

--------------------------------------------sensor_modules.py-------------------------------------------------------

1 Like

Hey @skj9865! I’ve been lurking on your posts (about improving accuracy and speed on 2D datasets) and wanted to chime in. :slight_smile: I think @vclay will have more ideas (I have only played around with the 3D YCB dataset so far), but I’m really intrigued and excited that you are working on this.

Please pardon that I only have one concrete suggestion for now and a few questions…

One Suggestion for Speedup: This is not Monty-related, but I think your get_hessian_eigens() function could be sped up. Looking at the code you shared:

@staticmethod
   def get_hessian_eigens(img_patch: np.ndarray, center: int, σ=1.0):
       f = cv2.GaussianBlur(img_patch, (0, 0), σ)       # noise smoothing
       fxx = cv2.Sobel(f, cv2.CV_64F, 2, 0, ksize=3)
       fyy = cv2.Sobel(f, cv2.CV_64F, 0, 2, ksize=3)
       fxy = cv2.Sobel(f, cv2.CV_64F, 1, 1, ksize=3)
       H   = np.array([[fxx.flat[center], fxy.flat[center]],
                       [fxy.flat[center], fyy.flat[center]]])
       λ, V = np.linalg.eigh(H)              # eigh returns eigenvalues in ascending order
       idx  = np.argsort(-np.abs(λ))         # re-sort by |λ|, largest first
       return λ[idx][0], λ[idx][1], V[:, idx][:, 0], V[:, idx][:, 1], True
   ####################### by skj for 2D processing

It looks like the Hessian is a 2x2 matrix. For a 2x2 matrix, I think it is much faster to just use the closed-form solution, something like:

@staticmethod
def get_hessian_eigens(img_patch: np.ndarray, center: int, σ=1.0):
    f = cv2.GaussianBlur(img_patch, (0, 0), σ)
    fxx = cv2.Sobel(f, cv2.CV_64F, 2, 0, ksize=3).flat[center]
    fyy = cv2.Sobel(f, cv2.CV_64F, 0, 2, ksize=3).flat[center]
    fxy = cv2.Sobel(f, cv2.CV_64F, 1, 1, ksize=3).flat[center]

    # Eigenvalues (note: ordered by value here, not by |λ| as in your version)
    trace = fxx + fyy
    delta = np.sqrt((fxx - fyy) ** 2 + 4 * fxy ** 2)
    λ1 = 0.5 * (trace + delta)
    λ2 = 0.5 * (trace - delta)

    # Eigenvectors (float arrays so the in-place normalization below works)
    if fxy != 0:
        v1 = np.array([λ1 - fyy, fxy])
        v2 = np.array([λ2 - fyy, fxy])
    elif fxx >= fyy:
        v1 = np.array([1.0, 0.0])
        v2 = np.array([0.0, 1.0])
    else:
        v1 = np.array([0.0, 1.0])
        v2 = np.array([1.0, 0.0])
    v1 /= np.linalg.norm(v1)
    v2 /= np.linalg.norm(v2)

    return λ1, λ2, v1, v2, True

And 3Blue1Brown has a nice video explaining it here.

I’m not sure how much this will practically impact speed, since I don’t know how many times this is being called, and speeding up a 2x2 EVD may not be a big deal…
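
If you want to sanity-check the closed form against numpy before swapping it in, a standalone snippet (hypothetical values, not tied to the Monty code) could be:

import numpy as np

# Hypothetical second-derivative values at one pixel
fxx, fyy, fxy = 3.0, -1.0, 2.0
H = np.array([[fxx, fxy], [fxy, fyy]])

# Closed-form eigenvalues of a symmetric 2x2 matrix
trace = fxx + fyy
delta = np.sqrt((fxx - fyy) ** 2 + 4 * fxy ** 2)
closed_form = np.sort([0.5 * (trace - delta), 0.5 * (trace + delta)])

# Should match numpy's eigensolver
assert np.allclose(closed_form, np.sort(np.linalg.eigvalsh(H)))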

Few Questions

  • I think speed is closely proportional to the number of steps that Monty takes to “recognize” the object. Episodes that go up to max_eval_steps (I think the default is 500 steps) and then “time out” take the longest. In eval_stats.csv, it might be worth looking at which samples had a high number of num_steps. For example, if you consistently notice that a certain digit has a higher average num_steps, then we can focus on exploring what makes that digit difficult and try to reduce the steps there. As a corollary, you could reduce max_eval_steps to something smaller, but that would also negatively impact accuracy, which we don’t want.
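
For example, a quick way to slice eval_stats.csv by digit could look like this (just a sketch; I’m assuming column names like num_steps and primary_target_object, which may differ slightly in your Monty version):

import pandas as pd

df = pd.read_csv("eval_stats.csv")

# Average number of steps per target digit (assumed column names)
print(df.groupby("primary_target_object")["num_steps"].mean().sort_values(ascending=False))

# Episodes that probably timed out
print(df[df["num_steps"] >= 500][["primary_target_object", "num_steps"]])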

  • Related to the above, if we want to reduce num_steps, it means that we must quickly converge the possible hypotheses down to one. If you run your experiments with a logging_config that sets python_log_level = “DEBUG”, then I think it will show how many hypotheses are being considered at each step.
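
If CSVLoggingConfig inherits python_log_level from the base LoggingConfig (I believe it does, but please double-check in your version), that would just be something like:

logging_config=CSVLoggingConfig(
    output_dir="mnist/log",
    monty_log_level="BASIC",
    monty_handlers=[BasicCSVStatsHandler],
    python_log_level="DEBUG",   # assumed to be inherited from LoggingConfig
),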

  • For your MNIST classification, may I ask some specifics? (Sorry if I missed these in your code)

  1. Are you doing classification only (i.e., guessing the correct digit, 0–9), or also doing some pose estimation as well? If you are only doing classification, I feel like the maximum number of hypotheses should just be 10 and go down from there. :thinking:
  2. May I ask how much of the MNIST image your patch is seeing? For example, in the YCB dataset, the input image that we get is 64 pixels x 64 pixels. Since MNIST images are only 28 pixels x 28 pixels, are you seeing the entire image at once? Or a smaller window?
  • It may also be insightful to look at the digit “models” that were learned by Monty after pretraining to make sure that the learned digits are reasonable before evaluation. One time I had some nonsense pointclouds that were learned, and unsurprisingly the accuracy was very low during inference. :rofl: Here is an example code sample to help see the “models” learned by Monty, and what that might look like for a YCB object.
def load_object_model(
    model_name: str,
    object_name: str,
    features: Optional[Iterable[str]] = ("rgba",),
    checkpoint: Optional[int] = None,
    lm_id: int = 0,
) -> ObjectModel:
    """Load an object model from a pretraining experiment.

    Args:
        model_name (str): The name of the model to load (e.g., `dist_agent_1lm`).
        object_name (str): The name of the object to load (e.g., `mug`).
        features (Optional[Iterable[str]]): Per-point features to load. Defaults
          to ("rgba",).
        checkpoint (Optional[int]): The checkpoint to load. Defaults to None. Most
          pretraining experiments aren't checkpointed, so this is usually None.
        lm_id (int): The ID of the LM to load. Defaults to 0.

    Returns:
        ObjectModel: The loaded object model.

    Example:
        >>> model = load_object_model("dist_agent_1lm", "mug")
        >>> model -= [0, 1.5, 0]
        >>> rotation = R.from_euler("xyz", [0, 90, 0], degrees=True)
        >>> rotated = model.rotated(rotation)
        >>> print(model.rgba.shape)
        (1354, 4)
    """
    if checkpoint is None:
        model_path = DMC_PRETRAIN_DIR / model_name / "pretrained/model.pt"
    else:
        model_path = (
            DMC_PRETRAIN_DIR
            / model_name
            / f"pretrained/checkpoints/{checkpoint}/model.pt"
        )
    data = torch.load(model_path)
    data = data["lm_dict"][lm_id]["graph_memory"][object_name]["patch"]
    points = np.array(data.pos, dtype=float)
    feature_dict = {}  # stays empty if no features are requested (avoids a NameError below)
    if features:
        features = [features] if isinstance(features, str) else features
        for feature in features:
            if feature not in data.feature_mapping:
                print(f"WARNING: Feature {feature} not found in data.feature_mapping")
                continue
            idx = data.feature_mapping[feature]
            feature_data = np.array(data.x[:, idx[0] : idx[1]])
            if feature == "rgba":
                feature_data = feature_data / 255.0
            feature_dict[feature] = feature_data

    return ObjectModel(points, features=feature_dict)

class ObjectModel:
    """Mutable wrapper for object models.

    Args:
        pos (ArrayLike): The points of the object model as a sequence of points
          (i.e., has shape (n_points, 3)).
        features (Optional[Mapping]): The features of the object model. For
          convenience, the features become attributes of the ObjectModel instance.
    """

    def __init__(
        self,
        pos: ArrayLike,
        features: Optional[Mapping[str, ArrayLike]] = None,
    ):
        self.pos = np.asarray(pos, dtype=float)
        if features:
            for key, value in features.items():
                setattr(self, key, np.asarray(value))

    @property
    def x(self) -> np.ndarray:
        return self.pos[:, 0]

    @property
    def y(self) -> np.ndarray:
        return self.pos[:, 1]

    @property
    def z(self) -> np.ndarray:
        return self.pos[:, 2]

    def copy(self, deep: bool = True) -> "ObjectModel":
        return deepcopy(self) if deep else self

    def rotated(
        self,
        rotation: Union[R, ArrayLike],
        degrees: bool = False,
    ) -> "ObjectModel":
        """Rotate the object model.

        Args:
            rotation: Rotation to apply. May be one of
              - A `scipy.spatial.transform.Rotation` object.
              - A 3x3 rotation matrix.
              - A 3-element array of x, y, z euler angles.
            degrees (bool): Whether Euler angles are in degrees. Ignored
                if `rotation` is not a 1D array.

        Returns:
            ObjectModel: The rotated object model.
        """
        if isinstance(rotation, R):
            rot = rotation
        else:
            arr = np.asarray(rotation)
            if arr.shape == (3,):
                rot = R.from_euler("xyz", arr, degrees=degrees)
            elif arr.shape == (3, 3):
                rot = R.from_matrix(arr)
            else:
                raise ValueError(f"Invalid rotation argument: {rotation}")

        pos = rot.apply(self.pos)
        out = self.copy()
        out.pos = pos

        return out

    def __add__(self, translation: ArrayLike) -> "ObjectModel":
        translation = np.asarray(translation)
        out = deepcopy(self)
        out.pos += translation
        return out

    def __sub__(self, translation: ArrayLike) -> "ObjectModel":
        translation = np.asarray(translation)
        return self + (-translation)

Looking forward to your reply and good luck!

2 Likes

Thank you for taking the time to look over my notes—and for the 2 × 2 closed-form eigen-decomposition tip. I’ll swap that into get_hessian_eigens() and profile the impact.

Below are answers (and a few follow-up questions) to your points:

1 · Classification vs Pose

At the moment I’m letting Monty run in its default classification + pose-estimation mode; I haven’t found a switch that disables pose.
You mentioned that, with classification only, the hypothesis pool could start at 10 (one per digit). How would the number of hypotheses be only 10? Does this assume that the patch covers the entire image (patch size equal to the full image)?

2 · Patch size / view

I set the patch to 10 × 10 px. The sensor therefore “saccades” over the 28 × 28 image following the default motor policy.

3 · Learned digit models

Per your advice I visualised the pretrained graphs, and many of them are indeed “nonsense” pointclouds. I need to fix this up. Any suggestions would be helpful.

Thanks again—really appreciate the guidance!

Hi @skj9865

that’s exciting to see! And thanks @hlee for those detailed hints and ideas. A couple of thoughts and hints from my side:

Regarding Pose Estimation

Since the MNIST dataset (similar to anything that is about recognizing letters and numbers) does not require pose recognition, I would agree with your intuition that we should not have Monty try to infer the digits in a pose-invariant way. Normally, when we do 3D object recognition, we want the system to recognize the object independent of its pose. But as you point out, with numbers (like 6 and 9) and letters (like b and d or p and q) the orientation actually matters for the classification. (I could go off on a tangent here on how humans seem to have an extra mechanism to suppress mirror-symmetry recognition when reading text, but won’t unless you ask about it :smiley: )

In terms of Monty, the easiest way to achieve this would be to change initial_possible_poses="informed" to initial_possible_poses=[[0, 0, 0]]. From your config it looks like you are already doing that (like in the tutorial code). If you want a bit more granularity (being able to recognize the numbers when they are a bit tilted), you can simply add some more orientations to that list that are slight tilts from the default 0, 0, 0 orientation.
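
For example (just a sketch; which Euler axis corresponds to an in-plane tilt depends on how your environment maps the image into 3D):

initial_possible_poses=[
    [0, 0, 0],     # upright
    [0, 0, 10],    # slight in-plane tilts, in degrees
    [0, 0, -10],
    [0, 0, 20],
    [0, 0, -20],
],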

Regarding the Pretrained Models

Some thoughts on why they may look like nonsense:

  • make sure you have your training set up right and provide the right labels for the right digits (don’t merge multiple digits into one model)
  • Maybe try just showing one example of each digit. Since it looks like you are doing supervised training, Monty does not do inference on the location of the digit. Our MontySupervisedObjectPretrainingExperiment code assumes that the objects are always shown at the same location, but that may not always be the case in the MNIST dataset. You would either have to provide the location of each digit manually as a supervised signal so they align correctly, or not supervise the training and have Monty infer its location on the digits (as outlined in this tutorial: Unsupervised Continual Learning)
  • Related to the previous point, I had a look at your training config and am not sure you are actually showing 30 versions. It looks like the train_dataloader_args set num_versions=30, but in the experiment_args you specify to only run one training epoch. Maybe just something to double-check.

MISC comments

  • I agree with Hojae that it would make sense to inspect the .csv logs to see how many steps you usually take until object recognition. That can help to determine other parameters such as the x_percent_threshold, tolerances, and max_eval_steps.
  • Another thing that may help is to visualize what is happening in one episode. You could try running one or two inference episodes with the DetailedJSONLogger and visualizing which numbers and locations on them get high evidence and how Monty moves over them (see the config sketch after this list). Here are some details on how to do this: Logging and Analysis
  • Also Hojae’s tip on setting the log level to INFO or DEBUG and reading through what happens in an episode could give you some useful insights.
  • You are correct that even if you don’t infer pose, you should have more than 10 hypotheses since you are also inferring where on the digit your patch currently is.
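
For the detailed logging mentioned above, the config could look roughly like this (a sketch; DetailedJSONHandler and the "DETAILED" log level are what the benchmark configs use, but please check the Logging and Analysis docs for the exact setup in your version):

logging_config=LoggingConfig(
    output_dir="mnist/log",
    monty_log_level="DETAILED",                                  # per-step detail
    monty_handlers=[BasicCSVStatsHandler, DetailedJSONHandler],  # adds per-episode JSON stats
    python_log_level="INFO",
),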

I hope this is helpful! If you still have trouble with the accuracy, maybe you could share the github repo with your code and/or your logs :slight_smile:

Best wishes,
Viviane

1 Like

Hey @skj9865 - I had some more ideas for debugging:

  1. Train on 1 digit and visualize the object learned using load_object_model(). If the visualization of the object is nonsense, then you need to dig into the graphs and points added to the graph. I had a case where I was always adding points from the first episode in YCB dataset, which is part of the object at (0, 0, 0) orientation. When I did another episode at a different orientation, it was adding the same points from (0, 0, 0) but at the orientation specified. I could tell because I saw 14 different rotations of the same thing when I visualized. :rofl:

  2. If the above is good, then see the effect of training on 10 different samples of one digit. I actually think it might be better to just train on a single image of a digit (you may need to look through the dataset and pick a “good representative” picture) than training on multiple samples of 1’s, 2’s, etc. My reason is that some of these handwritings are pretty bad, and essentially just adding noise to the object model.

  3. For motor policy during pretraining, it might be simplest to do something like a sliding window of your 10x10 patch across the 28x28 image, similar to how convolutional kernels move. So, within 9 steps (with some overlap between patches), your model should see the entire object.
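
For instance, the patch locations for such a scan could be generated like this (just a sketch, independent of Monty's motor-policy classes):

img_size, patch_size, stride = 28, 10, 9   # stride 9 -> 3x3 = 9 patches with 1 px overlap
corners = [
    (row, col)
    for row in range(0, img_size - patch_size + 1, stride)
    for col in range(0, img_size - patch_size + 1, stride)
]
print(corners)   # [(0, 0), (0, 9), (0, 18), (9, 0), ..., (18, 18)]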

Once the pretraining is good (i.e. the object models look reasonable when loading from memory), then we can move onto better accuracy and speed. Good luck! Let us know how it goes. :slight_smile:

1 Like

Hi,

I followed your debugging suggestions and found that the training pipeline isn’t producing sensible graphs for most digits.

What I see:

Digit ‘0’
Graph snapshot: (image)
Comment: White nodes outline a clear “0”. Looks fine.

Digit ‘1’
Graph snapshot: (image)
Comment: Shape is recognisable but the stroke is tilted.

Digits ‘2’ to ‘9’
Graph snapshot: (not shown)
Comment: Graphs are essentially noise—no digit structure.

My guess is that DepthTo3DLocations (D2L), which is designed for RGB-D input, is still running and filtering the 2-D patches in a way that discards most strokes.
Because I don’t need any true 3-D processing for MNIST, I’d like to bypass D2L entirely.

Questions

  • Is DepthTo3DLocations strictly required for the LM to function, or can I disable it when working with flat images?

  • If it must stay, is there a recommended flag / minimal setting to make it use only the semantic mask and ignore the depth-based surface filtering?

Any pointers would be much appreciated!


Repo & quick-start

I pushed my current 2-D branch here:

Steps to reproduce:

1. Place MNIST samples

mv mnist ~/tbp/data/

2. Train

python benchmarks/run.py -e mnist_training

3. Inspect a graph model

cd benchmarks/mnist/log/mnist_training
python model_check.py --object_name 0_0 # examples: 0_0, 1_5, …, 9_4

The code is still rough, but it should run end-to-end.

Thanks in advance for any advice!

3 Likes

Hi @skj9865
I unfortunately just have time for a quick response today since we have our robot hackathon this week. I won’t be able to look more at your code for the next 2 weeks. But a couple of thoughts based on what you sent here:

  • It seems strange to define the black pixels (the space around the strokes) as part of the digit. I would suggest writing a custom sensor module that sets any observation whose center pixel is black to on_object=False, so that it will not become part of the object graph.
  • The DepthTo3DLocations transform is not required, and it is indeed written with 3D environments in mind. However, if you take it out, you may need to rewrite parts of the sensor module class to provide other ways of extracting principal curvatures and point normals (or, generally, pose_vectors).
  • I don’t quite understand why 0 and 1 would look reasonable but the other digits not. Are you showing each digit once, in its default orientation? Or are you merging graphs over time?
  • I don’t see why DepthTo3DLocations would be discarding any information. The main place I can think of where we discard sensory information is the feature change SM, which I don’t think you are using in your config. Did you change anything from the way sensory input is processed in the Omniglot dataset to the MNIST dataset?

Sorry if those are more questions than answers, I’ll need a bit of time to look at the code and logs to be able to tell what is going wrong.

2 Likes

Hi,

I finally have Monty producing sensible graph models for all ten MNIST digits and ran a first round of evaluation. Below is a short summary, plus two questions where I could use your advice. (I removed DepthTo3DLocations.)

Experiment set-up

Stage               Images / digit                        Total
Training            10                                    100
Inference / eval    60 (includes the 10 train imgs)       600

Results (see attached eval_stats.pdf)

Metric                                        Count / 600
True positive (mostly confused_mlh)           490
False negative (mostly patch_off_object)      110

eval_stats.pdf (57.9 KB)

The largest single error source is therefore patch_off_object (≈ 20% of samples).
confused_mlh cases look like close matches that failed late in the pipeline; I suspect those can be fixed with parameter tuning once the off-object issue is under control.

Questions

  1. Best way to suppress patch_off_object?
    I’m already using a semantic mask (patch > 200) and on_object_only=True, but ~20 % of patches still fall outside the stroke and terminate the episode.
  • Is there a recommended tweak to the exploration policy or delta thresholds to keep the sensor on-object more consistently?
  2. Turning ‘confused_mlh’ into correct matches
    Once the off-object cases are gone, I expect accuracy to climb above 90 %.
  • Which hyper-parameters (or thresholds) have you found most effective for tightening final matching?

Any pointers are highly appreciated—thanks!

--------Graph models for digits (only 1 sample per digit is used for training)------------

3 Likes

Hey skj, we’ve been talking in private but let me reiterate that I think you are doing some pretty cool stuff.

I am not as well versed in the project as Viviane and Hojae but I have an observation that might be of interest.

Depth does not exist in your dataset, so it is created the following way:

depth = np.where(patch > 0, 0.5, 1.0).astype(np.float32)

It serves its purpose well since you create depth out of thin air, which is needed here.

However, to my understanding it creates a bit of a problem: every location on the image will have the same normal vector. You might also end up with an overwhelming number of principal-curvature vectors that are equal (but I am less sure about that).

This probably makes it harder for the Learning Module to learn properly.

I wonder if creating more depth variation (i.e., making the center of a stroke closer than its edges) would solve some of the patch_off_object problems you are having.
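
For instance, a distance transform could turn the binary stroke mask into a graded depth (a sketch, assuming scipy is available; the constants would need tuning):

import numpy as np
from scipy.ndimage import distance_transform_edt

mask = patch > 0
dist = distance_transform_edt(mask)          # 0 off the stroke, grows toward the stroke center
if dist.max() > 0:
    dist = dist / dist.max()                 # normalize to [0, 1]
# Stroke centers end up closest (~0.3), stroke edges near 0.5, background stays at 1.0
depth = np.where(mask, 0.5 - 0.2 * dist, 1.0).astype(np.float32)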

2 Likes

Hi @skj9865
this looks promising! Thanks for sharing those updates. A couple of follow-up questions and thoughts to diagnose the remaining problems:

  • For the images of the learned graphs, you say that you just used one image per digit, but in the table at the top, you say you used 10 images/digit for training. Which one do you use when you evaluate? If you use 10 per digit, can you share what those graphs look like?
    Based on your eval_stats file, it looks like the first episodes are correct, and then they become confused. In your table, you mention that your evaluation set also includes the training images. Do the rows with correct classifications correspond to the digit versions you trained on? In that case, it is a generalization issue that may be solved to some extent with higher tolerance parameters.

  • Although from looking at your configs the tolerances and max_match_distance already look quite high so it may be an issue that requires hierarchy to solve. Looking at the parameters brings up another point though. You currently set max_match_distance to 5, which is quite a lot given that your digit coordinates seem to be in a range of ~20. This may be why your experiments don’t converge (taking the full 2000 monty_steps and having “_mlh” in the result). Try setting that a bit lower.

  • Regarding the patch_off_object episodes, you can try several things to get the patch onto the digit if it initially starts off the digit. You could try to move it around randomly, in a scanning pattern, or intelligently onto the stroke by using the entire image. We implemented those options for a 3D world, so you would probably have to write your own, but you can get an idea by looking at the touch_object function, which rotates the camera around its own axis (two directions) to locate the object and then moves closer, and the find_location_to_look_at function, which calculates a movement of the camera that will make it center on an area where the object exists.

  • Regarding @xavier’s concern, I don’t think it is an issue that the point normals all point in the same direction. In fact, this is what we would expect since it is just a 2D image. The important part is that they always point out of the image (not randomly in the opposite direction) and that we can use the curvature directions in a principled way. From what I can see in your implementation, you extract the principal curvature directions with Sobel filters, which looks like a good solution. Maybe one thing you could do is visualize the extracted curvature directions to make sure they look as expected.

  • Viviane

4 Likes

Hi skj, neat project! I’ll just say that the view initialization functions (like find_location_to_look_at) will require that your object is located in 3D space in a “reasonable” way. In short, it’s going to compute how much the sensor should pitch/yaw in order to land on a certain point in 3D space using basic trigonometry, so if the z-axis values in the observed patch are strange, then this function won’t work, and you’ll likely never end up with an on-object view if you started off-object.

I don’t know how your agent and object/digit are initialized positionally, but I’d recommend you set it up so that the image is around 0.2 m from the agent. For image areas where there is a stroke, have it return a depth value around 0.2 m, and for non-stroke areas, have the depth be >= 1.0 m. I’m not sure if that’ll work for your setup without looking at the code, but from the pretrained models (which should be in world coordinates), the z-values look close to zero, so if those are the 3D coordinates observed during an episode, then the near-zero z-values could be problematic for find_location_to_look_at.
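
In terms of the step() code you shared, that would be a one-line change along these lines (assuming the agent really is ~0.2 m from the image plane in your scene):

# Stroke pixels ~0.2 m away (matching the agent-to-image distance), background pushed to >= 1.0 m
depth = np.where(patch > 0, 0.2, 1.0).astype(np.float32)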

1 Like

Hi,

I’ve run into a new issue during my experiments, and I think something might be fundamentally off.

:red_exclamation_mark: Evidence not updating

As you can see in the screenshot above, no evidence updates are happening during inference.

Previously, I reported a result of 490 “true positives” out of 600 samples,
but now I realize that this count only reflected the number of possible matches, not actual evidence-based decisions.
Since evidence wasn’t being updated, all graph models remained active as candidates throughout the episode.

Now I’m trying to understand why this is happening.

:red_question_mark: Questions

1. Patch in graph_matching_loggers.py — could this be related?

When I first tried training digit graph models, I encountered this error:

UnboundLocalError: local variable 'episode_performance' referenced before assignment

To bypass this, I added the following workaround in graph_matching_loggers.py (inside BasicGraphMatchingLogger.update_overall_stats()):

if 'episode_performance' in locals():  # by skj
    pass
else:
    episode_performance = "no_match"

After adding this, I was able to create and save graph models successfully.
Could this change be preventing evidence from updating correctly?

2. Scale of graph model coordinates — does it affect evidence updates?

If you look at the digit graph models I shared earlier, you’ll notice that the x and y coordinate ranges are much larger than those in @hlee’s models.

Could this scale difference interfere with evidence update logic (e.g., matching thresholds or movement tolerances)?
If so, what coordinate range would be considered “normal” for reliable evidence updates?

Thanks in advance for your help!


:pushpin: Note on graph models

The digit graph visualizations I shared in the previous post were based on 1 training sample per digit.
For the actual experiment, I trained using 10 samples per digit, and those models look like this:

(digit ‘1’)
(digit ‘8’)
(digit ‘4’)

Hi,

I believe I’ve figured out why evidence updates weren’t working earlier.

Initially, I projected the MNIST image onto the x–y plane, assigning every z-value a constant (DC) depth.
But after switching the projection to the x–z plane and using the y-axis for depth, the evidence updates started working properly.
It seems Monty assumes depth values along the y-axis, and aligning with that expectation resolved the issue.
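
In terms of the step() code I shared earlier, the change amounts to something like this (simplified sketch):

# Before: image in the x-y plane, z fixed at 0
# semantic_3d = np.stack([xx, yy, zz, sem_id], axis=-1).reshape(-1, 4)

# After: image in the x-z plane, with the y-axis carrying the synthetic depth
depth_y = np.where(patch > 0, 0.5, 1.0).astype(np.float32)
semantic_3d = np.stack([xx, depth_y, yy, sem_id], axis=-1).astype(np.float32).reshape(-1, 4)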

With this corrected setup, I evaluated Monty’s performance on the MNIST dataset.
I trained using 10 samples per digit, and tested on the same training samples.

However, the accuracy was only about 50%, even on the training set.
The graph model structure appears similar to what I shared in my previous post.

Do you have any suggestions for improving accuracy on MNIST using Monty?
Any guidance on parameters or architecture adjustments would be greatly appreciated.

Thanks!

Hi @skj9865
I’m sorry I don’t have time to dive too deep into this right now since we are having a focus week, but as a quick note: I would try training on only one digit as mentioned earlier. If you are not doing continual learning where you infer the digit’s location along with its ID, they will not line up correctly in the trained graphs. Alternatively, you would have to provide their relative locations as part of the supervised data. As you can see from the learned graphs, they are currently not very informative about the digits because you are not aligning the observations.
Also, did you try the updated max_match_distance I suggested? Generally it would be useful to see the resulting .csv file to be able to tell what is happening. Are your episodes timing out? Are you getting no_match? Are you detecting the wrong object?

2 Likes

Thank you for your advice.

As you suggested, I will train on only one digit and observe the results.

Also, I did set max_match_distance to a small value (0.3), as recommended.

According to the .csv results, I had:

43 correct matches

48 confused matches

9 no_match cases

Currently, I’m using MotorSystemConfigNaiveScanSpiral as my motor_config.

Do you think this is an appropriate choice for training and inference on the MNIST dataset?

I tried switching to other motor configurations, but encountered errors—so I’ve been using this one for now.

Also, I’m attaching the graph model for digit ‘0’, trained with 10 samples.
The second and third plots show the distribution of principal curvature directions.

Why does the model of ‘0’ look so thick? Don’t the strokes in the images look much thinner?
For max_match_distance I would recommend something more like 1 or 2. Since you have pixel locations here, locations are always integer values (as far as I can see from your plots), so I don’t think it makes sense to set it to smaller than 1.

1 Like

In my opinion, the reason the graph model of digit ‘0’ looks so thick is because Monty builds the model using multiple versions of the digit ‘0’.

It seems that during training, Monty does not compare the newly observed input with the existing graph model. So, if multiple variations of the digit ‘0’ are used for training under the same label, Monty may accumulate them all into a single graph, rather than aligning or consolidating them.

This behavior might be due to something I overlooked when adapting Monty for 2D object recognition.

1 Like

Like I previously mentioned, this is because you are doing supervised learning. If you use the MontySupervisedObjectPretrainingExperiment class, you are explicitly overwriting the object ID, location, and orientation with ground truth labels. However, in your setup, it seems like you are always setting the object position and orientation to 0,0,0, even though they will not always be presented in exactly the same location and orientation. You either have to provide Monty with the correct labels (by setting dataloader.primary_target correctly to include offsets of where digits are shown and their orientations) or don’t use supervised training. If you use the default learning setup where Monty does not receive labels, Monty will infer the object’s location and orientation and use that to correctly update the graphs.

You can have a look at this tutorial for more details on how the unsupervised learning setup works: Unsupervised Continual Learning
You can look at this tutorial for more details on the supervised training: Pretraining a Model

Since I assume it will be a bit more involved to provide ground truth location and orientation information in your setup, I would, again, recommend only training on one example per digit. This way you will not run into the issue of incorrectly aligning the graphs because of false labels. You could also try to run unsupervised learning after doing supervised learning on one example. This way you would tell the model in the first epoch that there are 10 distinct objects and give it the correct labels for them. Then on the following epochs you would not provide any labels to Monty, but Monty can still update its graphs based on what it infers. Monty will ideally infer the object ID and pose and then align the new observations with the existing graphs using its internal pose hypothesis.
To do this you could use your existing pretraining config, run it for 1 epoch, then use your existing mnist_inference config but, in addition to do_eval=True, also set do_train=True and specify the number of training epochs you want to run. Since your mnist_inference experiment uses MontyObjectRecognitionExperiment, no labels will be provided, so there will be no incorrect alignment (unless Monty incorrectly infers the pose). Since you are limiting Monty to recognizing the object at orientation 0,0,0 you may still run into some offsets there, but at least for the location it should be better.
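
Concretely, the experiment_args for that second stage would look something like this (a sketch reusing the paths from your configs):

experiment_args=ExperimentArgs(
    # Start from the supervised pretraining on one example per digit
    model_name_or_path="mnist/log/mnist_training/pretrained",
    do_train=True,      # unsupervised graph updates (MontyObjectRecognitionExperiment provides no labels)
    do_eval=True,
    n_train_epochs=3,   # however many unsupervised passes you want
    n_eval_epochs=1,
),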

  • Viviane
3 Likes

Hey @skj9865, great progress! I feel like you’re close - thanks for sharing your updates.

To answer some questions from where I last left off before our hackathon:

1. Patch in graph_matching_loggers.py — could this be related?

Likely not. It looks like a scope issue (i.e., a variable not defined in scope), and it should not prevent evidence from updating correctly.

2. Scale of graph model coordinates — does it affect evidence updates?

I think this could affect the performance! Physical distances in Monty are in meters (I believe due to Habitat), and I could imagine this would be a problem if your image is 0.2 m from the agent but the image itself is 28 meters by 28 meters. :upside_down_face:

^ Update: I realize the evidence update issue has been resolved, but it may still be worth looking into the scale of the graph model coordinates. It seems like there was also some discussion on max_match_distance, which would also be relevant.

But after switching the projection to the x–z plane and using the y-axis for depth, the evidence updates started working properly.

Great catch! Sorry again: Monty assumes the y-axis to be what we typically label the z-axis (in mathematics)… again because of Habitat. :upside_down_face:

I trained using 10 samples per digit, and tested on the same training samples.

I think Viviane said this already, but it should be better to just train on 1 sample per digit (where the chosen sample is as close as possible to “neat handwriting” / an average-looking example of the digit).

It seems that during training, Monty does not compare the newly observed input with the existing graph model. So, if multiple variations of the digit ‘0’ are used for training under the same label, Monty may accumulate them all into a single graph, rather than aligning or consolidating them.

(This is also my reasoning for just training on 1 sample per digit)

Best of luck!

2 Likes

Hi,

I attempted to run unsupervised learning on the MNIST dataset.

As you suggested earlier, I generated a pretrained model using 1 sample per digit, and then ran mnist_inference with do_train=True to continue updating the digit models.

However, I encountered the following error:
AttributeError: 'GridObjectModel' object has no attribute '_location_scale_factor'

Due to this error, I was unable to proceed further.

Separately, I also tried running unsupervised learning from scratch (without any pretrained model), using the surf_agent_2obj_unsupervised setup. In this case, the above error did not occur.

Below is my configuration for the unsupervised learning experiment:

mnist_inference = dict(
    experiment_class=MontyObjectRecognitionExperiment,
    experiment_args=ExperimentArgs(
        #model_name_or_path=pretrain_dir + "/mnist_training/",
        #model_name_or_path = "mnist/log/mnist_training/pretrained",
        do_train=True,
        do_eval=False,
        n_train_epochs=3,
        n_eval_epochs=1,
        max_total_steps=1000,
    ),
    #logging_config=LoggingConfig(),
    logging_config=CSVLoggingConfig(
            output_dir="mnist/log",
            monty_log_level="BASIC",
            monty_handlers=[BasicCSVStatsHandler],                 
        ),

    monty_config=PatchAndViewMontyConfig(
        motor_system_config=MotorSystemConfigNaiveScanSpiral(),
        monty_class=MontyForEvidenceGraphMatching,
        learning_module_configs=dict(
            learning_module_0=dict( 
                learning_module_class=EvidenceGraphLM,
                learning_module_args=dict(               
                    max_match_distance=1,
                    tolerances={
                        "patch": {
                            "principal_curvatures_log": np.ones(2),
                            "pose_vectors": np.ones(3) * 45,
                        }
                    },
                    # Point normal always points up, so they are not useful
                    feature_weights={
                        "patch": {
                            "pose_vectors": [0, 1, 0],
                        }
                    },
                    # We assume the letter is presented upright
                    #initial_possible_poses=[[0, 0, 0]],
                ),
            )
        ),
        sensor_module_configs=mnist_sensor_module_config,
    ),
    dataset_class=ED.EnvironmentDataset,
    dataset_args=MnistDatasetArgs(),
    train_dataloader_class=ED.MnistDataLoader,
    train_dataloader_args = get_mnist_train_dataloader(start_at_version = 0, number_ids = np.arange(0,2), num_versions=1),
    eval_dataloader_class=ED.MnistDataLoader,
    eval_dataloader_args = get_mnist_eval_dataloader(start_at_version = 0, number_ids = np.arange(0,2), num_versions=10)
)

I expected that during training, Monty would build graph models for new digits as they appear, and then use those models for recognition.

However, during training I encountered the following log:

---Updating memory of learning_module_0---
INFO:root:new_object0 not in memory ()
INFO:root:Adding a new graph to memory.
INFO:root:init object model with id new_object0
INFO:root:building graph from 170 observations
INFO:root:Too many observations outside of grid (0.59%). Skipping update of grids.
INFO:root:Grid too small for given locations. Not building a model for new_object0
INFO:root:

I’m not sure what this message means. Could it be due to the fact that I’m using MotorSystemConfigNaiveScanSpiral()?

If you have any insights or suggestions, I’d really appreciate your help!