In late 2022, a simple architectural decision transformed large language models from research demos into the most widely adopted AI technology in history. The decision wasn't about model size, training data, or compute. It was about adding a feedback loop.
Reinforcement learning from human feedback — RLHF — took a pre-trained language model that could generate plausible text and gave it a mechanism to learn which outputs humans actually wanted. A human reads two responses, picks the better one, and that preference signal trains the model to produce better outputs over time.
We've spent the last several months building the equivalent system for robots.
What RLHF Looks Like When the Model Has a Body
The core idea translates directly. You have a powerful pre-trained model. It generates outputs. A human evaluates those outputs. The evaluation becomes a training signal that steers the model toward better performance.
For language models, the output is text and the evaluation is a preference comparison. For robots, the output is physical motion and the evaluation is simpler but harder-earned: the human watches the robot attempt a task and reports whether it succeeded or failed.
Simpler because it's a binary signal rather than a nuanced comparison. Harder-earned because each evaluation requires a real robot to execute a real task in the real world, taking minutes rather than milliseconds, with physical consequences for failure.
The system we built runs this loop on bimanual robot arms performing household manipulation tasks. A vision-language-action foundation model — pre-trained on large-scale multi-robot demonstration data — generates motor commands from camera images and language instructions. The robot executes those commands on physical hardware. A human operator observes the outcome and provides a success or failure label. That label is stored as a reward signal alongside the complete episode trajectory, building a dataset structured for reinforcement learning.
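The shape of that dataset can be sketched in a few lines. This is an illustrative schema only, with made-up field names, not the actual storage format; the point is how a single binary label becomes a sparse terminal reward attached to a full trajectory:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    image: bytes        # encoded camera frame at this timestep
    proprio: List[float]  # joint positions plus gripper state
    action: List[float]   # normalized action the model emitted

@dataclass
class Episode:
    task: str             # language instruction, e.g. "fold the towel"
    steps: List[Step] = field(default_factory=list)
    success: Optional[bool] = None  # human label, set after the episode

    def label(self, success: bool) -> None:
        """Attach the human success/failure judgment."""
        self.success = success

    def rewards(self) -> List[float]:
        # The binary outcome becomes a terminal reward of 1.0 or 0.0,
        # zero everywhere else -- the signal RL will later consume.
        assert self.success is not None, "episode not yet labeled"
        return [0.0] * (len(self.steps) - 1) + [1.0 if self.success else 0.0]
```

One labeled episode is therefore a complete (observation, action, reward) trajectory, which is exactly the unit a reinforcement learning algorithm expects.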
The Three Phases
RLHF, whether for language or robotics, has three distinct phases. Being precise about which phase you're in matters, because the challenges are completely different.
Phase 1: Build the infrastructure
For language models, this meant building the data pipeline for human preference collection, the reward model training loop, and the PPO training integration. For robots, it means building the real-time inference pipeline, the hardware safety systems, the observation formatting, the action translation, and the episode management. This phase is pure engineering, and it's where most attempts fail — not because the algorithms are wrong, but because the integration breaks in ways that are invisible until you run on real hardware.
Phase 2: Collect feedback data
Run the model, evaluate the outputs, accumulate the training signal. For language models, this was the phase of hiring annotators and building labeling interfaces. For robots, this is the phase of running episodes on physical hardware and labeling outcomes. The model's behavior doesn't change during this phase — you're building the dataset that will drive learning.
Phase 3: Close the loop
Use the collected feedback to actually modify the model's behavior. For language models, this is when PPO training runs against the reward model. For robots, this is when the RL agent begins steering the policy based on accumulated reward signals.
We're currently in Phase 2. We've completed Phase 1 — which, for robots, turned out to be the hardest part — and are running data collection episodes on physical hardware. Phase 3 is implemented but not yet connected to the live system.
We're being explicit about this because overstating progress is common in this space, and because each phase has distinct value. Phase 1 is the engineering foundation that everything else depends on. Phase 2 produces data that has value independent of Phase 3. Phase 3 is where behavior changes, but it's meaningless without the first two.
Why Phase 1 Is the Hard Part for Robots
For language models, Phase 1 was relatively contained. The model runs on GPUs, inputs and outputs are text, and the infrastructure is software. For robots, Phase 1 requires solving a cascade of physical-world integration problems that don't exist in the language domain.
Real-time control at 30Hz
The foundation model must generate action predictions fast enough to maintain a smooth control loop. Each inference produces a chunk of 50 future actions, and the system must query the model, receive the response, post-process the actions, and send commands to the hardware — all within 33 milliseconds per control step.
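The structure of that loop can be sketched as follows. The constants match the numbers in this post (30Hz, 50-action chunks, re-inference every 10 steps); `infer` and `execute` are placeholders for model inference and motor dispatch, and the details are a simplification of the real pipeline:

```python
import time

CONTROL_HZ = 30
STEP_BUDGET = 1.0 / CONTROL_HZ   # ~33 ms per control step
CHUNK_LEN = 50                   # actions produced per inference
REPLAN_EVERY = 10                # steps between fresh inferences

def control_loop(infer, execute, n_steps):
    """Toy receding-horizon loop: every REPLAN_EVERY steps, fetch a
    fresh CHUNK_LEN-action chunk and execute from its start."""
    chunk = infer()
    idx = 0
    for step in range(n_steps):
        t0 = time.monotonic()
        if step > 0 and step % REPLAN_EVERY == 0:
            chunk = infer()  # fresh chunk: a boundary transition
            idx = 0
        execute(chunk[idx])
        idx += 1
        # Everything above must fit inside the 33 ms budget; sleep off
        # whatever is left to hold the 30 Hz cadence. Overruns here
        # show up on hardware as jerky motion.
        leftover = STEP_BUDGET - (time.monotonic() - t0)
        if leftover > 0:
            time.sleep(leftover)
```

The hard constraint is the budget line: inference, post-processing, and command dispatch all share the same 33 milliseconds.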
Observation formatting
The model expects images in a specific resolution, color space, and tensor layout. Our cameras output a different format. Proprioceptive state needs normalization. Gripper values need unit conversion. Getting any of these wrong doesn't produce an error — it produces actions that look like model failures. We spent more time debugging observation formatting than any algorithmic component.
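A minimal sketch of this conversion layer is below. The resolution, channel order, and normalization statistics here are assumptions for illustration, not the actual model's contract; the structure is what matters, because every one of these steps fails silently if done wrong:

```python
import numpy as np

IMG_SIZE = 224               # assumed model input resolution
PROPRIO_MEAN = np.zeros(7)   # placeholder stats; in practice these
PROPRIO_STD = np.ones(7)     # come from the training distribution

def format_observation(bgr_frame: np.ndarray, proprio: np.ndarray) -> dict:
    """Convert a raw camera frame and joint state into model inputs."""
    # BGR (typical camera/OpenCV output) -> RGB (typical model input).
    rgb = bgr_frame[..., ::-1]
    # Nearest-neighbor resize via indexing, to keep this dependency-free.
    ys = np.linspace(0, rgb.shape[0] - 1, IMG_SIZE).astype(int)
    xs = np.linspace(0, rgb.shape[1] - 1, IMG_SIZE).astype(int)
    rgb = rgb[ys][:, xs]
    # HWC uint8 in [0, 255] -> CHW float32 in [0, 1].
    img = rgb.astype(np.float32).transpose(2, 0, 1) / 255.0
    # Proprioceptive state is normalized per-dimension.
    state = (proprio - PROPRIO_MEAN) / PROPRIO_STD
    return {"image": img, "state": state.astype(np.float32)}
```

A swapped channel order or a missing normalization here produces no exception, only a policy that moves strangely, which is why this layer absorbed so much debugging time.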
Action translation and safety
The model outputs actions in a normalized space. The hardware operates in radians and meters. Between the model's output and the motor command, actions must be denormalized, velocity-limited, and position-clamped. The velocity limiting is particularly critical at action chunk boundaries, where the model generates a fresh sequence that may not smoothly continue from the previous one. We routinely see requested velocity changes of ten times the safe limit being clamped down at these transitions.
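The denormalize-then-limit-then-clamp chain can be sketched in a few lines. The limits below are illustrative placeholders; real values are per-joint and hardware-specific:

```python
import numpy as np

# Illustrative limits -- real values are per-joint and hardware-specific.
ACTION_LOW, ACTION_HIGH = -3.14, 3.14  # joint position bounds (radians)
MAX_DELTA = 0.05                       # max position change per 33 ms step

def safe_command(normalized_action, prev_command):
    """Denormalize a model action from [-1, 1], then velocity-limit and
    position-clamp it before it reaches the motors."""
    a = np.asarray(normalized_action, dtype=np.float64)
    # [-1, 1] -> hardware units.
    target = ACTION_LOW + (a + 1.0) * 0.5 * (ACTION_HIGH - ACTION_LOW)
    # Velocity limiting: cap the per-step change. At chunk boundaries
    # the requested delta can be roughly 10x this limit.
    delta = np.clip(target - prev_command, -MAX_DELTA, MAX_DELTA)
    # Position clamping as a final backstop.
    return np.clip(prev_command + delta, ACTION_LOW, ACTION_HIGH)
```

The ordering matters: limiting velocity before clamping position means a wild model output degrades into a slow move toward a bounded target rather than a violent jump.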
Episode management
The training loop must handle episode boundaries, human input prompts, trajectory recording, and hardware resets — all while maintaining the safety guarantees.
None of these problems are algorithmically interesting. All of them are load-bearing.
What We Observe on Real Hardware
Running this system on physical hardware has surfaced observations that wouldn't be visible in simulation.
Chunk-boundary dynamics
The foundation model generates action chunks of 50 timesteps, with new inference every 10 steps. The transitions between chunks produce significant discontinuities. Our velocity limiter fires on nearly every transition, sometimes clamping multiple joints simultaneously. This is a fundamental characteristic of current flow-matching VLA architectures, not a bug in our system.
Gripper sensitivity
The gripper joints (operating in millimeters rather than radians) are the most frequently velocity-limited. Small changes in the model's normalized output correspond to large relative movements in the gripper's physical range. This suggests that gripper control may benefit from separate tuning or a different normalization scheme than the arm joints.
Policy consistency
Across episodes on the same task with the same initial conditions, the model produces meaningfully different trajectories. This variance comes from the stochastic noise initialization in the flow-matching denoising process — and it's precisely this noise that the RL agent will learn to control in Phase 3.
That last observation is the connection point between data collection and learning. The RL approach we're implementing doesn't modify the foundation model's weights. Instead, it learns to select the noise that initializes the model's action generation. Better noise selection produces better actions without touching the pre-trained capabilities. This is analogous to how RLHF for language models often uses a thin steering layer rather than full model retraining.
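One simple way to picture noise-space steering is best-of-k selection: sample several candidate initialization noises, score each with a learned value estimate, and hand the winner to the frozen model's denoiser. This is a simplified stand-in for the actual RL agent, and `score_fn` here is an arbitrary placeholder for the learned critic:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_noise(obs_embedding, score_fn, k=8, noise_dim=4):
    """Pick the best of k candidate noises for a given observation.
    The foundation model's weights are never touched; only the noise
    that seeds its action generation is chosen."""
    candidates = rng.standard_normal((k, noise_dim))
    scores = [score_fn(obs_embedding, z) for z in candidates]
    return candidates[int(np.argmax(scores))]

# Usage with a toy critic that happens to prefer small-magnitude noise:
best = select_noise(np.zeros(2), lambda s, z: -float(np.linalg.norm(z)))
```

The same interface generalizes to a trained policy that outputs the noise directly instead of ranking samples; either way, the pre-trained model only ever sees a noise vector of the shape it already expects.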
The Comparison to Language Model RLHF
What's the same
A pre-trained foundation model generates outputs. Those outputs are evaluated by humans. The evaluations become training signal for an RL algorithm. The RL algorithm steers the model toward higher-reward behavior through a lightweight adaptation mechanism rather than full retraining.
What's different
The feedback is binary (success/failure) rather than comparative. Each evaluation requires minutes of physical execution. Safety constraints are physical — a bad action can damage hardware. Sample efficiency requirements are orders of magnitude more demanding.
What's harder for robots
The infrastructure. Getting text in and text out of a language model is a solved problem. Getting camera images, proprioceptive state, and motor commands in and out of a physical robot at 30Hz with safety guarantees is not.
What's easier for robots
The reward signal. Determining whether a towel is folded is more straightforward than determining whether a paragraph is helpful, accurate, and harmless. Binary task success is a cleaner signal than human preference over text.
What's Next
The immediate priority is connecting the reinforcement learning agent to the live system. This requires two specific engineering tasks: loading the RL agent (which learns in the model's noise space) into the execution pipeline, and modifying the inference server to accept externally provided noise rather than sampling it randomly. Both components are implemented — they need to be wired together and tested.
Once the loop is closed, we expect to run on the order of 50 to 100 evaluated episodes before we can measure whether the feedback signal is producing behavioral improvement. Published results from the RL approach we're using have demonstrated meaningful improvement within this range on comparable tasks.
The Bigger Picture
Eighteen months ago, the components of this system existed in separate research labs as individual contributions. The foundation models were at Physical Intelligence. The reinforcement learning methods were at UC Berkeley. The hardware platform was at Trossen Robotics. The data tooling was at Hugging Face.
Today a small team can assemble these into a working system on commercially available hardware. Not because any single component is new, but because the ecosystem has matured to the point where integration is possible.
We believe RLHF for robots — the full loop of pre-training, fine-tuning, and real-world reinforcement learning from human feedback — will become the standard approach for deploying capable manipulation systems in unstructured environments. The same way it became the standard approach for deploying capable language systems.
The research is open. The hardware is available. The hard part is the integration. That's what we're building.
The foundation model, reinforcement learning method, and robot hardware referenced in this post are open-source research contributions from their respective teams. Our work builds on these contributions and focuses on the integration required to operate them as a unified system on physical hardware.
About Sidekick Robotics
Sidekick Robotics is building robots that learn from practice. We're developing the AI Sidekick for Physical Work — robots that learn and perform the complex tasks that keep hospitals and care communities running.
Media Contact: founders@sidekickrobotics.ai