When a Robot Understands—but Can’t Execute
Written by Zackary Frazier, posted on 2026-04-25
- ROBOTICS
- AI
The robot usually knew what I wanted.
I’d look at a cube, and it would move to it almost every time.
Then it would fail to pick it up.
That pattern repeated often enough to become the main result.
Overview
VISIONGRIP is a gaze-driven robotic manipulation system built for a graduate assistive robotics course.
I spent about four months building it from scratch, without prior experience with robotic arms, UR hardware, or imitation learning. The goal wasn't to produce a polished system; it was to run something real and identify where it fails.
The core question:
Can gaze alone drive robotic grasping?
No joystick, no UI—just: look → act.
System
Hardware
- UR12e arm
- Robotiq Hand-E gripper
- Three cameras (including wrist-mounted)
Software
- Webcam-based gaze tracking
- LeRobot imitation learning pipeline
- ROS2 + MoveIt2
Interaction loop
- User fixates (~2 seconds)
- System selects a target object
- Policy generates a motion trajectory
- Robot executes and attempts a grasp
In practice, this required manual resets, camera repositioning, and occasional recalibration between runs.
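For concreteness, here is a minimal sketch of what a dwell-based selection step like this could look like. The ~2-second fixation threshold comes from the loop above; the function, the gaze-radius tolerance, and the object representation are hypothetical illustrations, not code from the project.

```python
import time

DWELL_SECONDS = 2.0     # fixation threshold from the interaction loop above
GAZE_RADIUS_PX = 60     # hypothetical tolerance around an object's pixel centroid


def select_target(gaze_stream, objects):
    """Return the object fixated for DWELL_SECONDS, or None if the stream ends first.

    gaze_stream: iterable of (x, y) gaze points in image coordinates
    objects:     list of dicts with an 'id' and a pixel 'centroid'
    """
    candidate, dwell_start = None, None
    for gx, gy in gaze_stream:
        # Which object, if any, does the current gaze point land on?
        hit = None
        for obj in objects:
            cx, cy = obj["centroid"]
            if (gx - cx) ** 2 + (gy - cy) ** 2 <= GAZE_RADIUS_PX ** 2:
                hit = obj
                break

        if hit is None or candidate is None or hit["id"] != candidate["id"]:
            # Gaze moved to a new object (or off all objects): restart the dwell timer.
            candidate = hit
            dwell_start = time.monotonic() if hit else None
        elif time.monotonic() - dwell_start >= DWELL_SECONDS:
            # Same object held long enough: this is the selected target.
            return candidate
    return None
```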
What worked
The system consistently moved the robot to the correct object.
Approach success was effectively ~100%, meaning:
- gaze → target selection worked
- policy → arm motion toward the object worked
The robot could reliably get its gripper to the vicinity of the intended cube.
Where it failed
Failures occurred at the moment of physical interaction—when the gripper tried to actually grasp the object.
Typical behavior:
- the arm moves to the correct cube
- the gripper aligns approximately
- the grasp attempt fails (misses or fails to secure the object)
Results:
- 3/13 (~23%) success in a controlled setup
- 0/24 when the camera setup was slightly misaligned
So the issue wasn’t selecting or reaching the object—it was executing a precise, successful grasp.
Why it failed
1. Sensitivity to camera geometry (direct impact on arm accuracy)
The cameras were not fixed. Each session introduced small differences in:
- angle
- height
- position
These differences directly affected how the robot moved its arm to grasp the cube.
Specifically:
Top-down camera misalignment → XY positioning errors
- The robot would move its gripper to the wrong spot on the table
- Result: the gripper closes next to the cube instead of around it
Side-view camera misalignment → Z (height) errors
- The robot would stop too high or too low relative to the cube
- Result: the gripper closes above the cube or collides awkwardly
So the issue wasn’t just perception—it translated into systematic errors in the arm’s final pose during the grasp attempt.
Because the policy depended on consistent camera geometry, even small shifts caused the arm to execute the wrong motion at the critical moment.
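To give a sense of scale, here is a back-of-the-envelope sketch of how a small, unnoticed camera tilt turns into centimetre-level error on the table. The numbers are illustrative assumptions, not measurements from this setup.

```python
import math

# Illustrative assumptions, not measurements: a roughly top-down camera ~0.8 m
# above the table, and a 2-degree tilt that went unnoticed between sessions.
camera_height_m = 0.8
tilt_deg = 2.0

# If the policy implicitly assumes last session's geometry, a pure tilt of the
# optical axis shifts where a table point appears to be by about height * tan(tilt).
xy_error_mm = 1000 * camera_height_m * math.tan(math.radians(tilt_deg))
print(f"projected XY error: ~{xy_error_mm:.0f} mm")   # ≈ 28 mm

# The Hand-E has on the order of 50 mm of stroke, so ~28 mm of systematic offset
# on a small cube is enough for the fingers to close beside it instead of around it.
```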
2. The policy learned a shortcut instead of contact-aware interaction
A repeated motion pattern emerged:
forward → down → close → up
This persisted even late in training.
Given that the input was essentially (x, y) coordinates from a top-down view, the model likely learned:
“execute this fixed motion at this coordinate”
instead of:
“adapt the gripper motion based on the object’s actual position and contact”
This explains the failure mode:
- The arm reaches the right area
- But the final motion is not adjusted based on real contact or small errors
- So small misalignments lead to failed grasps
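Written out, the hypothesis is that the policy behaves roughly like the open-loop script below. The helper names and offsets are hypothetical; the point is that once (x, y) is fixed, nothing in the motion responds to contact or to where the cube actually is.

```python
# What the policy effectively learned, expressed as an open-loop script.
# 'arm', 'gripper', and the offsets are placeholders, not the project's real API.

def shortcut_grasp(arm, gripper, x, y, table_z=0.0, hover=0.10, descend=0.02):
    arm.move_to(x, y, table_z + hover)      # forward: approach above the coordinate
    arm.move_to(x, y, table_z + descend)    # down: fixed descent, no contact check
    gripper.close()                         # close: works only if the cube sits
                                            # exactly where the coordinate says
    arm.move_to(x, y, table_z + hover)      # up: retract regardless of grasp outcome
```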
Model behavior (≈290 episodes)
Three policies were trained:
- Diffusion
- ACT (Action Chunking Transformer)
- SmolVLA
ACT
- Smooth, stable trajectories
- Most consistent execution
- But still failed at the grasp stage
Diffusion
- Less smooth
- Occasionally recovered from failed grasps and retried
- Suggests some capacity for adapting behavior after failure (not fully explored)
SmolVLA
- Unstable gripper control
- Repeated open/close actions
- Sometimes grasped successfully, then immediately dropped the object
Each model exposed a different failure mode, but all struggled with reliable grasp execution.
Key takeaway
The system can:
- infer user intent
- move the arm to the correct object
But it cannot reliably:
- execute a precise grasp under small variations
Intent is solved well enough.
Execution under real-world variability is not.
What I’d change next
Stabilize or remove perception dependencies
- Physically fix camera positions to eliminate variation
If that’s not possible:
- use viewpoint-invariant approaches (e.g., VISTA) so the policy doesn’t depend on exact camera placement
Move beyond coordinate inputs
Replace (x, y) with richer visual representations:
- segmentation
- learned visual features
Shift from:
“move to this coordinate”
to:
“grasp this object”
(e.g., using models like Segment Anything)
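As a rough sketch of what "grasp this object" could look like with Segment Anything, the snippet below uses the fixated pixel as a point prompt and hands a mask to the policy instead of a bare coordinate. It assumes the segment-anything package and a downloaded ViT-B checkpoint; the file name and helper are placeholders, not part of this project.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Placeholder checkpoint path; download from the Segment Anything release.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)


def mask_from_gaze(image_rgb, gaze_xy):
    """Use the fixated pixel as a point prompt and return the highest-scoring mask."""
    predictor.set_image(image_rgb)                 # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([gaze_xy]),          # the gaze point, in pixel coordinates
        point_labels=np.array([1]),                # 1 = foreground prompt
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]           # object mask to feed the policy
```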
Focus data on the failure region (grasping)
Collect more data specifically where the system breaks:
- near-contact states
- failed grasp attempts
- recovery behaviors
Right now, the model is weakest exactly where precision matters most.
Improve repeatability
Reduce reliance on:
- manual resets
- recalibration
- careful setup
The system should produce consistent results across runs without manual intervention.
Closing
In a tightly controlled setup, performance would likely improve.
But assistive robotics doesn’t operate under perfect conditions:
- cameras shift
- setups vary
- calibration drifts
Those variations directly affect how the arm moves—and whether a grasp succeeds.
This project didn’t solve that problem, but it made the failure modes clear, consistent, and measurable.
