Reinforcement Learning for Flow-Matching Policies with Density Transport

1University of Pennsylvania
RLDT Pipeline

RLDT Pipeline. RLDT constructs a transport field φ(a) on the action manifold and uses the expected posterior a† to compute targets for intermediate denoising steps of πθ.

Abstract

We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models.

Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach — Reinforcement Learning with Density Transport (RLDT) — constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD), then fine-tunes a pretrained flow matching policy to align with this field.

Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time.

Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation.

Method

RLDT recasts RL policy improvement as a probability transport problem on the action manifold. A flow-matching policy first maps Gaussian noise to an action distribution; RLDT then steers that distribution toward high-reward regions without collapsing its multimodal structure.

A flow-matching policy maps Gaussian noise to a bimodal action distribution; the RL challenge is that the resulting samples spread over action space and miss the reward peaks.

Step 1 — SVGD Transport Field

RLDT constructs a transport field $\phi^*(a)$ via SVGD that combines two forces: an attractive term pulling samples toward high-reward regions of the Q-function, and a repulsive kernel term that prevents particle collapse and preserves diversity. The animation below shows how these two components interact to move action samples onto the reward peaks while keeping them spread across both modes.

The SVGD field decomposes into Q-gradient attraction (orange) and kernel repulsion (blue); their combination $\phi^*$ (yellow) moves particles to reward peaks without mode collapse.

Step 2 — Expected Posterior on the Action Manifold

Because flow-matching generates actions over many ODE steps, backpropagating gradients through the full denoising chain is unstable. Instead, RLDT maps each intermediate denoising sample $a_{t,\tau}$ to its expected posterior $a^\dagger_{t,\tau}$ — a predicted final action that lies on the action manifold. Applying the transport field to $a^\dagger$ induces a proportional update in the velocity network $v_\theta$, propagating reward signal across all ODE steps without a reverse ODE.

The expected target $a^\dagger_{t,\tau}$ (pink diamond) projects any intermediate denoising state onto the action manifold; the transport field applied there back-propagates stably into the network without unrolling the ODE.

Step 3 — Training Objective

The policy is updated with three losses: an RL loss that aligns the velocity field with the transport direction $\phi^*(a^\dagger)$; a consistency loss that keeps flow trajectories straight by penalizing the gap between $a^\dagger_{t,\tau}$ and the final action sample; and a constraint loss (Fisher divergence from the frozen pretrained policy) that prevents the policy from drifting too far during fine-tuning. The animation below summarizes the full pipeline.

Full RLDT pipeline: sample noise → flow matching → expected target $a^\dagger$ → SVGD transport $\phi^*$ → update $\theta$.

Contributions

  • We formulate a novel RL algorithm for fine-tuning flow-matching policies by modeling policy improvement as a transport of action probabilities towards regions of high reward. Our formulation bypasses density estimation and backpropagation through time, leading to better performance than baselines and stable training.
  • We tackle the problem of backpropagation through the denoising process of flow-matching policies by estimating the expected policy action from intermediate denoising steps. Since the transport field is defined on the action manifold, this prediction enables gradient back-propagation through the flow-matching backbone over the entire denoising process.

Experiments

We evaluate RLDT across three benchmark settings of increasing complexity: dense-reward locomotion (OpenAI Gym — Walker2d, Hopper, HalfCheetah), long-horizon sparse-reward manipulation (Furniture-Bench — One-Leg and Lamp with Low/Medium randomness), and vision-based robot manipulation (Robomimic — Square and Transport). We compare against DPPO, ReinFlow, FPO++, and QAM.

Dense Reward Tasks (OpenAI Gym)

RLDT achieves the highest rewards across all Gym baselines by a significant margin. DPPO results match those reported in the original paper, while ReinFlow underperforms on HalfCheetah and Hopper. QAM performance varies across tasks but always underperforms RLDT.

Dense Reward Results on OpenAI Gym

Dense reward tasks on OpenAI Gym. Mean and standard deviation over 3 seeds.


Sparse Reward Tasks (Robomimic & Furniture-Bench)

On Robomimic (Square), RLDT starts with a higher initial success rate and reaches ~90%. On Transport, RLDT converges to performance comparable to DPPO while ReinFlow significantly underperforms. On the challenging Furniture-Bench Lamp-Med task, RLDT reaches ~70% success rate versus DPPO's ~30%.

Sparse Reward Results on Robomimic and Furniture-Bench

Sparse reward tasks. Left: state-based Robomimic benchmark. Right: Furniture-Bench manipulation tasks. Averaged over 3 seeds.


Related Work

Prior approaches to RL for diffusion and flow-matching policies include:

DPPO, ReinFlow and FPO optimize lightweight lower bounds on the policy log-likelihood, which leads to biased policy gradients.

QAM solves the reverse ODE to obtain per-step adjoint targets, incurring sensitivity to ODE solver accuracy.

RLDT differs by constructing a transport field directly on the action manifold, enabling principled gradient propagation without density estimation or architectural constraints.

BibTeX