Understanding Reinforcement Learning in High-Degree-of-Freedom Robotic Systems

Full Version (Including: Research Statement, Experimental Overview, and Progress)

Roman Aguilera-Arevalo

Abstract

High-degree-of-freedom robotic systems pose significant challenges for control due to extreme dimensionality, nonlinear dynamics, and sensitivity to perturbations.

While reinforcement learning (RL), particularly model-free methods such as Proximal Policy Optimization (PPO), has shown empirical success in these regimes, the underlying structure of what is learned remains unclear.

This work investigates the hypothesis that PPO-based policies primarily learn behaviors that generate reference trajectories, rather than explicit goal representations in state space.

We study this using parameterized N-link manipulators, enabling systematic scaling of degrees of freedom while preserving task structure.

Hypothesis

PPO policies may function as implicit trajectory generators conditioned on goal observations, rather than encoding explicit geometric representations of goal location.

Under this view, successful control emerges from learned motion patterns aligned with reward structure, rather than symbolic or geometric goal reasoning.

Significance

This work aims to understand whether reinforcement learning produces goal-directed reasoning or emergent trajectory-based heuristics in high-dimensional control problems.

Implications include:

Practical guidance for controls engineers, enabling faster selection of better-performing algorithms
Interpretability of RL policies
Benchmark design for high-DOF robotics
Understanding scalability limits of policy gradient methods

Results

PPO successfully learns stable reaching policies for systems up to 32 DOF under constrained goal distributions. Higher-DOF systems may also be able to learn stable reaching policies, but this could not be tested because of functional limitations in the simulator at the time.

For single-goal and 1D goal distributions, policies produce smooth and repeatable trajectories.

When extended to 2D goal distributions, behavior shifts qualitatively toward workspace-wide sweeping motion patterns.

At this stage, we interpret this as a potential failure of generalization under goal distribution shift. Further statistical analysis is required to determine whether this arises from trajectory memorization, exploration dynamics, or reward exploitation.
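
One simple statistic that could help separate direct reaching from workspace-wide sweeping is a path-efficiency ratio for the end-effector trajectory. The sketch below is illustrative only: the function name and the metric itself are assumptions introduced here, not the analysis used in this work.

```python
import math

def path_efficiency(points):
    """Straight-line displacement divided by traveled path length.

    `points` is a hypothetical list of (x, y) end-effector positions
    recorded over one episode. A value near 1.0 indicates a direct
    reach; values near 0.0 indicate long, sweeping motion.
    """
    if len(points) < 2:
        raise ValueError("need at least two trajectory points")
    # Sum of segment lengths along the recorded trajectory.
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path_length == 0.0:
        return 1.0  # the end effector never moved
    return math.dist(points[0], points[-1]) / path_length
```

Comparing the distribution of this ratio across single-goal, 1D, and 2D goal settings would be one way to quantify the qualitative shift described above.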

Background

Robotic control in high-dimensional systems is difficult due to nonlinear dynamics, redundancy, and sensitivity to small control perturbations.

Classical methods such as inverse kinematics, Jacobian-based control, and trajectory optimization scale poorly as DOF increases.
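
For concreteness, a Jacobian-transpose update for a planar chain can be sketched as follows. This is a minimal illustration of the classical approach, assuming a planar arm with revolute joints, relative joint angles, and equal link lengths; the function names and the gain value are hypothetical and are not taken from this work.

```python
import math

def joint_positions(angles, link_length):
    """Positions of the base, each joint, and the end effector."""
    x = y = heading = 0.0
    points = [(x, y)]
    for angle in angles:
        heading += angle  # relative angles accumulate into a heading
        x += link_length * math.cos(heading)
        y += link_length * math.sin(heading)
        points.append((x, y))
    return points

def reach_error(angles, goal, link_length):
    """Euclidean distance from the end effector to the goal."""
    ex, ey = joint_positions(angles, link_length)[-1]
    return math.hypot(goal[0] - ex, goal[1] - ey)

def jacobian_transpose_step(angles, goal, link_length, gain=0.3):
    """One update theta <- theta + gain * J^T * (goal - end_effector)."""
    points = joint_positions(angles, link_length)
    ex, ey = points[-1]
    err_x, err_y = goal[0] - ex, goal[1] - ey
    updated = []
    for i, theta in enumerate(angles):
        px, py = points[i]  # pivot of joint i
        # Column i of the Jacobian: rotating joint i moves the end
        # effector perpendicular to the vector from pivot to end effector.
        jx, jy = -(ey - py), ex - px
        updated.append(theta + gain * (jx * err_x + jy * err_y))
    return updated
```

Each step requires one Jacobian column per joint, and the update is purely local, so redundancy and near-singular configurations have to be handled separately as N grows.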

RL replaces explicit modeling with data-driven optimization, but its internal behavioral structure remains poorly understood.

Motivation

Most RL robotics work evaluates performance across environments rather than isolating how a single system behaves as dimensionality changes.

This makes it difficult to determine whether improvements reflect true generalization or structural properties of high-dimensional spaces.

This work isolates DOF as a controlled variable in a fixed environment family.

Experimental Setup

We construct a parameterized family of N-link manipulators, allowing systematic variation in DOF while preserving identical task definitions.
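
A minimal sketch of such a family is shown below. It assumes a planar chain of revolute joints with equal link lengths that sum to a fixed total reach, so the workspace stays the same as N changes; these choices and all names are illustrative assumptions rather than the simulator used here.

```python
import math

def forward_kinematics(joint_angles, total_reach=1.0):
    """Return (x, y) for the base, every joint, and the end effector.

    Assumes relative joint angles and N equal links whose lengths sum
    to `total_reach`, where N = len(joint_angles).
    """
    n = len(joint_angles)
    if n == 0:
        raise ValueError("the manipulator needs at least one link")
    link_length = total_reach / n
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]  # base of the arm at the origin
    for angle in joint_angles:
        heading += angle  # accumulate relative angles
        x += link_length * math.cos(heading)
        y += link_length * math.sin(heading)
        points.append((x, y))
    return points

def end_effector(joint_angles, total_reach=1.0):
    """Position of the tip of the chain."""
    return forward_kinematics(joint_angles, total_reach)[-1]
```

Under this parameterization, `end_effector([0.0] * 4)` and `end_effector([0.0] * 32)` describe arms with the same reach but different DOF, which is what lets the task definition stay fixed while N varies.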

We evaluate performance across:

Number of links (N)
Goal distribution structure (point, line, 2D region)
Obstacle presence
Dynamic complexity
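
The three goal-distribution structures listed above can be illustrated with a small sampler. The specific geometry (a fixed point, a horizontal segment, and a disc inside the reachable workspace) and the function name are assumptions made for this sketch; the exact distributions used in the experiments are not specified here.

```python
import math
import random

def sample_goal(kind, rng, reach=1.0):
    """Sample a 2D goal from a hypothetical point, line, or region distribution."""
    if kind == "point":
        # Single fixed goal inside the workspace.
        return (0.5 * reach, 0.5 * reach)
    if kind == "line":
        # 1D distribution: goals along a horizontal segment.
        x = rng.uniform(-0.5 * reach, 0.5 * reach)
        return (x, 0.5 * reach)
    if kind == "region":
        # 2D distribution: uniform over a disc of radius 0.9 * reach.
        # The square root makes the samples uniform in area.
        r = 0.9 * reach * math.sqrt(rng.random())
        phi = rng.uniform(0.0, 2.0 * math.pi)
        return (r * math.cos(phi), r * math.sin(phi))
    raise ValueError(f"unknown goal distribution: {kind!r}")
```

Passing a seeded `random.Random` instance keeps goal sequences reproducible across runs with different N.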

Tasks:

Single-point reaching
Reach-and-hold
Obstacle avoidance
Pendulum swing-up
Balance stabilization

Algorithms: PPO, Trust Region Policy Optimization (TRPO), Soft Actor-Critic (SAC)

Control Representation

Current experiments use Δθ (joint-angle increment) control rather than torque-level control.

This is motivated by instability observed in early torque-based experiments, where nonlinear coupling caused rapid divergence before stable learning signals could form.

Δθ control is used as an intermediate abstraction to stabilize learning and isolate representational behavior before introducing full dynamics.

This allows spatial learning behavior to be evaluated independently of the instabilities introduced by full dynamics.
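
A Δθ control step can be sketched as a purely kinematic update of the joint angles. The action scaling, the per-step bound, and the joint limits below are assumed values chosen for illustration, and the function name is hypothetical.

```python
import math

def apply_delta_theta(joint_angles, action, max_step=0.05, joint_limit=math.pi):
    """Apply one joint-angle increment per joint.

    Each action component is clipped to [-1, 1] and scaled to at most
    `max_step` radians, so a single step cannot move a joint far. No
    torques, velocities, or inertial coupling are modeled.
    """
    if len(action) != len(joint_angles):
        raise ValueError("action and joint_angles must have the same length")
    new_angles = []
    for theta, a in zip(joint_angles, action):
        a = max(-1.0, min(1.0, a))  # clip the policy output
        theta = theta + max_step * a  # bounded increment
        theta = max(-joint_limit, min(joint_limit, theta))  # joint limits
        new_angles.append(theta)
    return new_angles
```

Because the state after each step depends only on the previous angles and the bounded increment, there is no mechanism for the rapid divergence seen with torque-level control, at the cost of ignoring the dynamics.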

Current Work

Trajectory statistics in 2D goal distributions
Time-to-solution scaling vs DOF
Comparison with RRT planners (up to 18 DOF)
Robustness metrics under distribution shift
Transition to torque-level RL