
When a Robot Understands—but Can’t Execute

Written by Zackary Frazier, posted on 2026-04-25

  • ROBOTICS
  • AI

The robot usually knew what I wanted.

I’d look at a cube, and it would move to it almost every time.

Then it would fail to pick it up.

That pattern repeated often enough to become the main result.


Overview

VISIONGRIP is a gaze-driven robotic manipulation system built for a graduate assistive robotics course.

I spent about four months building it from scratch without prior experience in robotic arms, UR hardware, or imitation learning. The goal wasn’t to produce a polished system—it was to run something real and identify where it fails.

The core question:

Can gaze alone drive robotic grasping?

No joystick, no UI—just: look → act.


System

Hardware

  • UR12e arm
  • Robotiq Hand-E gripper
  • Three cameras (including wrist-mounted)

Software

  • Webcam-based gaze tracking
  • LeRobot imitation learning pipeline
  • ROS2 + MoveIt2

Interaction loop

  1. User fixates (~2 seconds)
  2. System selects a target object
  3. Policy generates a motion trajectory
  4. Robot executes and attempts a grasp

In practice, this required manual resets, camera repositioning, and occasional recalibration between runs.
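
The loop above can be sketched in Python. Everything here is a hypothetical stand-in for illustration: `detect_fixation`, `scene`, `policy.act`, and `robot.execute` are not the project's actual APIs, and the dwell/tolerance values are assumptions, not the system's tuned parameters.

```python
import time

FIXATION_SECS = 2.0     # dwell time that counts as a deliberate fixation (~2 s)
GAZE_TOLERANCE = 0.05   # max normalized gaze drift to still count as "fixated"

def detect_fixation(gaze_stream, dwell=FIXATION_SECS, tol=GAZE_TOLERANCE):
    """Return the gaze point once it holds within `tol` for `dwell` seconds.

    `gaze_stream` yields (x, y) gaze estimates in normalized screen
    coordinates; this interface is illustrative, not the actual tracker API.
    """
    anchor, t0 = None, None
    for x, y in gaze_stream:
        if anchor is None or abs(x - anchor[0]) > tol or abs(y - anchor[1]) > tol:
            anchor, t0 = (x, y), time.monotonic()  # gaze moved: restart dwell timer
        elif time.monotonic() - t0 >= dwell:
            return anchor                          # held steady long enough
    return None                                    # stream ended without a fixation

def run_episode(gaze_stream, scene, policy, robot):
    gaze_xy = detect_fixation(gaze_stream)            # 1. user fixates
    target = scene.nearest_object(gaze_xy)            # 2. select target object
    trajectory = policy.act(scene.observe(), target)  # 3. policy generates motion
    robot.execute(trajectory)                         # 4. execute and attempt grasp
```

The dwell-timer pattern is the key design choice: gaze is noisy, so a target is only committed after it has been held steady, which keeps stray glances from triggering the arm.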


What worked

The system consistently moved the robot to the correct object.

Approach success was effectively ~100%, meaning:

  • gaze → target selection worked
  • policy → arm motion toward the object worked

The robot could reliably get its gripper to the vicinity of the intended cube.


Where it failed

Failures occurred at the moment of physical interaction—when the gripper tried to actually grasp the object.

Typical behavior:

  • the arm moves to the correct cube
  • the gripper aligns only approximately
  • the grasp attempt fails (misses or fails to secure the object)

Results:

  • 3/13 (~23%) success in a controlled setup
  • 0/24 when the camera setup was slightly misaligned

So the issue wasn’t selecting or reaching the object—it was executing a precise, successful grasp.


Why it failed

1. Sensitivity to camera geometry (direct impact on arm accuracy)

The cameras were not fixed. Each session introduced small differences in:

  • angle
  • height
  • position

These differences directly affected how the robot moved its arm to grasp the cube.

Specifically:

  • Top-down camera misalignment → XY positioning errors

    • The robot would move its gripper to the wrong spot on the table
    • Result: the gripper closes next to the cube instead of around it
  • Side-view camera misalignment → Z (height) errors

    • The robot would stop too high or too low relative to the cube
    • Result: the gripper closes above the cube or collides awkwardly

So the issue wasn’t just perception—it translated into systematic errors in the arm’s final pose during the grasp attempt.

Because the policy depended on consistent camera geometry, even small shifts caused the arm to execute the wrong motion at the critical moment.
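
The scale of the problem is easy to estimate with basic geometry. For a nominally top-down camera at height h, tilting the optical axis by an angle θ shifts where that axis meets the table by roughly h·tan(θ). The numbers below are illustrative, not measurements from the project setup:

```python
import math

def xy_shift_from_tilt(camera_height_m, tilt_deg):
    """Approximate ground-plane shift caused by a small camera tilt.

    For a top-down camera at height h, a tilt of theta moves the point
    where the optical axis meets the table by about h * tan(theta).
    """
    return camera_height_m * math.tan(math.radians(tilt_deg))

# A barely visible 2-degree tilt on a camera 1 m above the table:
shift = xy_shift_from_tilt(1.0, 2.0)   # ~0.035 m, i.e. ~3.5 cm
```

A few centimeters of apparent XY offset is comparable to the size of a small cube, which is consistent with the observed failure mode of the gripper closing next to the object rather than around it.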


2. Learned a shortcut instead of interaction

A repeated motion pattern emerged:

forward → down → close → up

This persisted even late in training.

Given that the input was essentially (x, y) coordinates from a top-down view, the model likely learned:

“execute this fixed motion at this coordinate”

instead of:

“adapt the gripper motion based on the object’s actual position and contact”

This explains the failure mode:

  • The arm reaches the right area
  • But the final motion is not adjusted based on real contact or small errors
  • So small misalignments lead to failed grasps
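
The learned shortcut behaves like an open-loop script. The sketch below is a hypothetical reconstruction of that behavior, not the trained policy itself; waypoint heights and the (x, y, z, gripper) tuple format are assumptions for illustration:

```python
def shortcut_policy(target_xy, hover_z=0.15, grasp_z=0.02):
    """The fixed forward -> down -> close -> up pattern the policy
    appears to have learned: the same waypoint sequence at every
    coordinate, with no feedback from vision or contact at the end.
    """
    x, y = target_xy
    return [
        (x, y, hover_z, "open"),   # forward: move above the coordinate
        (x, y, grasp_z, "open"),   # down: descend to a fixed table height
        (x, y, grasp_z, "close"),  # close: grasp blindly at that spot
        (x, y, hover_z, "close"),  # up: lift, whether or not the cube is held
    ]
```

Because nothing in the sequence re-observes the object after the initial coordinate is chosen, any error in that coordinate propagates unchanged to the grasp itself.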

Model behavior (≈290 episodes)

Three policies were trained:

  • Diffusion
  • ACT (Action Chunking Transformer)
  • SmolVLA

ACT

  • Smooth, stable trajectories
  • Most consistent execution
  • But still failed at the grasp stage

Diffusion

  • Less smooth
  • Occasionally recovered from failed grasps and retried
  • Suggests some capacity for adapting behavior after failure (not fully explored)

SmolVLA

  • Unstable gripper control
  • Repeated open/close actions
  • Sometimes grasped successfully, then immediately dropped the object

Each model exposed a different failure mode, but all struggled with reliable grasp execution.


Key takeaway

The system can:

  • infer user intent
  • move the arm to the correct object

But it cannot reliably:

  • execute a precise grasp under small variations

Intent is solved well enough.
Execution under real-world variability is not.


What I’d change next

Stabilize or remove perception dependencies

  • Physically fix camera positions to eliminate variation

If that’s not possible:

  • use viewpoint-invariant approaches (e.g., VISTA) so the policy doesn’t depend on exact camera placement

Move beyond coordinate inputs

Replace (x, y) with richer visual representations:

  • segmentation
  • learned visual features

Shift from:

“move to this coordinate”

to:

“grasp this object”

(e.g., using models like Segment Anything)


Focus data on the failure region (grasping)

Collect more data specifically where the system breaks:

  • near-contact states
  • failed grasp attempts
  • recovery behaviors

Right now, the model is weakest exactly where precision matters most.


Improve repeatability

Reduce reliance on:

  • manual resets
  • recalibration
  • careful setup

The system should produce consistent results across runs without manual intervention.


Closing

In a tightly controlled setup, performance would likely improve.

But assistive robotics doesn’t operate under perfect conditions:

  • cameras shift
  • setups vary
  • calibration drifts

Those variations directly affect how the arm moves—and whether a grasp succeeds.

This project didn’t solve that problem, but it made the failure modes clear, consistent, and measurable.