When a Robot Understands—but Can’t Execute
Written by Zackary Frazier, posted on 2026-04-25
- ROBOTICS
- AI
The robot usually knew what I wanted.
I’d look at a cube, and it would move to it almost every time.
Then it would fail to pick it up.
That pattern repeated often enough to become the main result.
Overview
VISIONGRIP is a gaze-driven robotic manipulation system built for a graduate assistive robotics course.
I spent about four months building it from scratch, without prior experience with robotic arms, UR hardware, or imitation learning. The goal wasn't to produce a polished system; it was to run something real and identify where it fails.
The core question:
Can gaze alone drive robotic grasping?
No joystick, no UI—just: look → act.
System
Hardware
- UR12e arm
- Robotiq Hand-E gripper
- Three cameras (including wrist-mounted)
Software
- Webcam-based gaze tracking
- LeRobot imitation learning pipeline
- ROS2 + MoveIt2
Interaction loop
- User fixates (~2 seconds)
- System selects a target object
- Policy generates a motion trajectory
- Robot executes and attempts a grasp
In practice, this required manual resets, camera repositioning, and occasional recalibration between runs.
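For concreteness, here is a minimal sketch of what a dwell-based selection step like this could look like. The ~2-second fixation threshold comes from the loop above; the function, the gaze-radius tolerance, and the object representation are hypothetical illustrations, not code from the project.

```python
import time

DWELL_SECONDS = 2.0     # fixation threshold from the interaction loop above
GAZE_RADIUS_PX = 60     # hypothetical tolerance around an object's pixel centroid


def select_target(gaze_stream, objects):
    """Return the object fixated for DWELL_SECONDS, or None if the stream ends first.

    gaze_stream: iterable of (x, y) gaze points in image coordinates
    objects:     list of dicts with an 'id' and a pixel 'centroid'
    """
    candidate, dwell_start = None, None
    for gx, gy in gaze_stream:
        # Which object, if any, does the current gaze point land on?
        hit = None
        for obj in objects:
            cx, cy = obj["centroid"]
            if (gx - cx) ** 2 + (gy - cy) ** 2 <= GAZE_RADIUS_PX ** 2:
                hit = obj
                break

        if hit is None or candidate is None or hit["id"] != candidate["id"]:
            # Gaze moved to a new object (or off all objects): restart the dwell timer.
            candidate = hit
            dwell_start = time.monotonic() if hit else None
        elif time.monotonic() - dwell_start >= DWELL_SECONDS:
            # Same object held long enough: this is the selected target.
            return candidate
    return None
```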
What worked
The system consistently moved the robot to the correct object.
Approach success was effectively ~100%, meaning:
- gaze → target selection worked
- policy → arm motion toward the object worked
The robot could reliably get its gripper to the vicinity of the intended cube.
Where it failed
Failures occurred at the moment of physical interaction—when the gripper tried to actually grasp the object.
Typical behavior:
- the arm moves to the correct cube
- the gripper aligns approximately
- the grasp attempt fails (misses or fails to secure the object)
Results:
- 3/13 (~23%) success in a controlled setup
- 0/24 when the camera setup was slightly misaligned
So the issue wasn’t selecting or reaching the object—it was executing a precise, successful grasp.
Why it failed
1. Sensitivity to camera geometry (direct impact on arm accuracy)
The cameras were not fixed. Each session introduced small differences in:
- angle
- height
- position
These differences directly affected how the robot moved its arm to grasp the cube.
Specifically:
Top-down camera misalignment → XY positioning errors
- The robot would move its gripper to the wrong spot on the table
- Result: the gripper closes next to the cube instead of around it
Side-view camera misalignment → Z (height) errors
- The robot would stop too high or too low relative to the cube
- Result: the gripper closes above the cube or collides awkwardly
So the issue wasn’t just perception—it translated into systematic errors in the arm’s final pose during the grasp attempt.
Because the policy depended on consistent camera geometry, even small shifts caused the arm to execute the wrong motion at the critical moment.
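To give a sense of scale, here is a back-of-the-envelope sketch of how a small, unnoticed camera tilt turns into centimetre-level error on the table. The numbers are illustrative assumptions, not measurements from this setup.

```python
import math

# Illustrative assumptions, not measurements: a roughly top-down camera ~0.8 m
# above the table, and a 2-degree tilt that went unnoticed between sessions.
camera_height_m = 0.8
tilt_deg = 2.0

# If the policy implicitly assumes last session's geometry, a pure tilt of the
# optical axis shifts where a table point appears to be by about height * tan(tilt).
xy_error_mm = 1000 * camera_height_m * math.tan(math.radians(tilt_deg))
print(f"projected XY error: ~{xy_error_mm:.0f} mm")   # ≈ 28 mm

# The Hand-E has on the order of 50 mm of stroke, so ~28 mm of systematic offset
# on a small cube is enough for the fingers to close beside it instead of around it.
```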
2. The policy learned a shortcut instead of contact-aware interaction
A repeated motion pattern emerged:
forward → down → close → up
This persisted even late in training.
Given that the input was essentially (x, y) coordinates from a top-down view, the model likely learned:
“execute this fixed motion at this coordinate”
instead of:
“adapt the gripper motion based on the object’s actual position and contact”
This explains the failure mode:
- The arm reaches the right area
- But the final motion is not adjusted based on real contact or small errors
- So small misalignments lead to failed grasps
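Written out, the hypothesis is that the policy behaves roughly like the open-loop script below. The helper names and offsets are hypothetical; the point is that once (x, y) is fixed, nothing in the motion responds to contact or to where the cube actually is.

```python
# What the policy effectively learned, expressed as an open-loop script.
# 'arm', 'gripper', and the offsets are placeholders, not the project's real API.

def shortcut_grasp(arm, gripper, x, y, table_z=0.0, hover=0.10, descend=0.02):
    arm.move_to(x, y, table_z + hover)      # forward: approach above the coordinate
    arm.move_to(x, y, table_z + descend)    # down: fixed descent, no contact check
    gripper.close()                         # close: works only if the cube sits
                                            # exactly where the coordinate says
    arm.move_to(x, y, table_z + hover)      # up: retract regardless of grasp outcome
```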
Model behavior (≈290 episodes)
Three policies were trained:
- Diffusion
- ACT (Action Chunking Transformer)
- SmolVLA
ACT
- Smooth, stable trajectories
- Most consistent execution
- But still failed at the grasp stage
Diffusion
- Less smooth
- Occasionally recovered from failed grasps and retried
- Suggests some capacity for adapting behavior after failure (not fully explored)
SmolVLA
- Unstable gripper control
- Repeated open/close actions
- Sometimes grasped successfully, then immediately dropped the object
Each model exposed a different failure mode, but all struggled with reliable grasp execution.
Key takeaway
The system can:
- infer user intent
- move the arm to the correct object
But it cannot reliably:
- execute a precise grasp under small variations
Intent is solved well enough.
Execution under real-world variability is not.
What I’d change next
Stabilize or remove perception dependencies
- Physically fix camera positions to eliminate variation
If that’s not possible:
- use viewpoint-invariant approaches (e.g., VISTA) so the policy doesn’t depend on exact camera placement
Move beyond coordinate inputs
Replace (x, y) with richer visual representations:
- segmentation
- learned visual features
Shift from:
“move to this coordinate”
to:
“grasp this object”
(e.g., using models like Segment Anything)
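As a rough sketch of what "grasp this object" could look like with Segment Anything, the snippet below uses the fixated pixel as a point prompt and hands a mask to the policy instead of a bare coordinate. It assumes the segment-anything package and a downloaded ViT-B checkpoint; the file name and helper are placeholders, not part of this project.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Placeholder checkpoint path; download from the Segment Anything release.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)


def mask_from_gaze(image_rgb, gaze_xy):
    """Use the fixated pixel as a point prompt and return the highest-scoring mask."""
    predictor.set_image(image_rgb)                 # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([gaze_xy]),          # the gaze point, in pixel coordinates
        point_labels=np.array([1]),                # 1 = foreground prompt
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]           # object mask to feed the policy
```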
Focus data on the failure region (grasping)
Collect more data specifically where the system breaks:
- near-contact states
- failed grasp attempts
- recovery behaviors
Right now, the model is weakest exactly where precision matters most.
Improve repeatability
Reduce reliance on:
- manual resets
- recalibration
- careful setup
The system should produce consistent results across runs without manual intervention.
Closing
In a tightly controlled setup, performance would likely improve.
But assistive robotics doesn’t operate under perfect conditions:
- cameras shift
- setups vary
- calibration drifts
Those variations directly affect how the arm moves—and whether a grasp succeeds.
This project didn’t solve that problem, but it made the failure modes clear, consistent, and measurable.
