MyArxiv
Robotics 33
☆ LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model
Vision-Language-Action (VLA) models have recently demonstrated strong generalization capabilities in robotic manipulation. Some existing VLA approaches attempt to improve action accuracy by explicitly generating linguistic reasoning traces or future visual observations before action execution. However, explicit reasoning typically incurs non-negligible inference latency, which constrains the temporal resolution required for robotic manipulation. Moreover, such reasoning is confined to the linguistic space, imposing a representational bottleneck that struggles to faithfully capture ineffable physical attributes. To mitigate these limitations, we propose LaST$_0$, a framework that enables efficient reasoning before acting through a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize. Specifically, we introduce a token-efficient latent CoT space that models future visual dynamics, 3D structural information, and robot proprioceptive states, and further extends these representations across time to enable temporally consistent implicit reasoning trajectories. Furthermore, LaST$_0$ adopts a dual-system architecture implemented via a Mixture-of-Transformers design, where a reasoning expert conducts low-frequency latent inference and an acting expert generates high-frequency actions conditioned on robotics-oriented latent representations. To facilitate coordination, LaST$_0$ is trained with heterogeneous operation frequencies, enabling adaptive switching between reasoning and action inference rates during deployment. Across ten simulated and six real-world manipulation tasks, LaST$_0$ improves mean success rates by 8% and 13% over prior VLA methods, respectively, while achieving substantially faster inference. Project website: https://sites.google.com/view/last0
☆ Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration
Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
comment: Project Page: https://cordex-manipulation.github.io/
☆ RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
☆ Driving on Registers
We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.
☆ Compensation Effect Amplification Control (CEAC): A movement-based approach for coordinated position and velocity control of the elbow of upper-limb prostheses
Despite advances in upper-limb (UL) prosthetic design, achieving intuitive control of intermediate joints, such as the wrist and elbow, remains challenging, particularly for continuous and velocity-modulated movements. We introduce a novel movement-based control paradigm, termed Compensation Effect Amplification Control (CEAC), that leverages users' trunk flexion and extension as input for controlling prosthetic elbow velocity. Considering that the trunk can be both a functional and compensatory joint when performing upper-limb actions, CEAC amplifies the natural coupling between trunk and prosthesis while introducing a controlled delay that allows users to modulate both the position and velocity of the prosthetic joint. We evaluated CEAC in a generic drawing task performed by twelve able-bodied participants using a supernumerary prosthesis with an active elbow. Additionally, a multiple-target-reaching task was performed by a subset of ten participants. Results demonstrate task performance comparable to that obtained with natural arm movements, even when gesture velocity or drawing size were varied, while maintaining ergonomic trunk postures. Analysis revealed that CEAC effectively restores coordinated joint action and distributes movement effort between trunk and elbow, enabling intuitive trajectory control without requiring extreme compensatory movements. Overall, CEAC offers a promising control strategy for intermediate joints of UL prostheses, particularly in tasks requiring continuous and precise coordination.
☆ DAVOS: An Autonomous Vehicle Operating System in the Vehicle Computing Era
Vehicle computing represents a fundamental shift in how autonomous vehicles are designed and deployed, transforming them from isolated transportation systems into mobile computing platforms that support both safety-critical, real-time driving and data-centric services. In this setting, vehicles simultaneously support real-time driving pipelines and a growing set of data-driven applications, placing increased responsibility on the vehicle operating system to coordinate computation, data movement, storage, and access. These demands highlight recurring system considerations related to predictable execution, data and execution protection, efficient handling of high-rate sensor data, and long-term system evolvability, commonly summarized as Safety, Security, Efficiency, and Extensibility (SSEE). Existing vehicle operating systems and runtimes address these concerns in isolation, resulting in fragmented software stacks that limit coordination between autonomy workloads and vehicle data services. This paper presents DAVOS, the Delaware Autonomous Vehicle Operating System, a unified vehicle operating system architecture designed for the vehicle computing context. DAVOS provides a cohesive operating system foundation that supports both real-time autonomy and extensible vehicle computing within a single system framework.
☆ The RoboSense Challenge: Sense Anything, Navigate Anywhere, Adapt Across Platforms IROS 2025
Autonomous systems are increasingly deployed in open and dynamic environments -- from city streets to aerial and indoor spaces -- where perception models must remain reliable under sensor noise, environmental variation, and platform shifts. However, even state-of-the-art methods often degrade under unseen conditions, highlighting the need for robust and generalizable robot sensing. The RoboSense 2025 Challenge is designed to advance robustness and adaptability in robot perception across diverse sensing scenarios. It unifies five complementary research tracks spanning language-grounded decision making, socially compliant navigation, sensor configuration generalization, cross-view and cross-modal correspondence, and cross-platform 3D perception. Together, these tasks form a comprehensive benchmark for evaluating real-world sensing reliability under domain shifts, sensor failures, and platform discrepancies. RoboSense 2025 provides standardized datasets, baseline models, and unified evaluation protocols, enabling large-scale and reproducible comparison of robust perception methods. The challenge attracted 143 teams from 85 institutions across 16 countries, reflecting broad community engagement. By consolidating insights from 23 winning solutions, this report highlights emerging methodological trends, shared design principles, and open challenges across all tracks, marking a step toward building robots that can sense reliably, act robustly, and adapt across platforms in real-world environments.
comment: Official IROS 2025 RoboSense Challenge Report; 51 pages, 37 figures, 5 tables; Competition Website at https://robosense2025.github.io/
☆ When to Act: Calibrated Confidence for Reliable Human Intention Prediction in Assistive Robotics
Assistive devices must determine both what a user intends to do and how reliable that prediction is before providing support. We introduce a safety-critical triggering framework based on calibrated probabilities for multimodal next-action prediction in Activities of Daily Living. Raw model confidence often fails to reflect true correctness, posing a safety risk. Post-hoc calibration aligns predicted confidence with empirical reliability and reduces miscalibration by about an order of magnitude without affecting accuracy. The calibrated confidence drives a simple ACT/HOLD rule that acts only when reliability is high and withholds assistance otherwise. This turns the confidence threshold into a quantitative safety parameter for assisted actions and enables verifiable behavior in an assistive control loop.
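The ACT/HOLD rule described above is straightforward to prototype. Below is a minimal sketch assuming temperature scaling as the post-hoc calibration method (the abstract does not name one) and a held-out validation split; all function names and the threshold value are illustrative, not the authors' code.

```python
# Hypothetical sketch: temperature scaling + ACT/HOLD threshold.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature T that minimizes NLL on a validation split."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        p = softmax(val_logits / T)
        nll = -np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

def act_or_hold(logits, T, tau=0.9):
    """ACT only when calibrated confidence clears the safety threshold tau."""
    p = softmax(logits / T)
    conf = p.max(axis=1)
    return np.where(conf >= tau, "ACT", "HOLD"), p.argmax(axis=1)
```

Here `tau` plays the role of the quantitative safety parameter the abstract mentions: raising it withholds assistance more often in exchange for higher reliability of the actions that are triggered.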
☆ SKATER: Synthesized Kinematics for Advanced Traversing Efficiency on a Humanoid Robot via Roller Skate Swizzles
Although recent years have seen significant progress in humanoid robot walking and running, the frequent foot strikes with the ground during these gaits inevitably generate high instantaneous impact forces, leading to exacerbated joint wear and poor energy utilization. Roller skating, a sport with substantial biomechanical value, achieves fast and continuous sliding through rational use of body inertia with minimal kinetic energy loss. Therefore, this study proposes a novel humanoid robot with each foot equipped with a row of four passive wheels for roller skating. A deep reinforcement learning control framework is also developed for the swizzle gait, with a reward function designed around the intrinsic characteristics of roller skating. The learned policy is first analyzed in simulation and then deployed on the physical robot to demonstrate the smoothness and efficiency of the swizzle gait over a traditional bipedal walking gait in terms of Impact Intensity and Cost of Transport during locomotion. Reductions of $75.86\%$ and $63.34\%$ in these two metrics indicate that roller skating is a superior locomotion mode for enhanced energy efficiency and joint longevity.
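For readers unfamiliar with the reported metrics: Cost of Transport has a standard dimensionless definition, sketched below; the paper's exact Impact Intensity formula is not given in the abstract, so the contact-force proxy here is only an assumption for illustration.

```python
# Standard dimensionless Cost of Transport, plus an assumed proxy for
# "Impact Intensity" (peak contact force normalized by body weight);
# the paper's exact metric definition is not given in the abstract.
import numpy as np

def cost_of_transport(power_w, mass_kg, speed_ms, g=9.81):
    """CoT = P / (m * g * v): energy spent per unit weight per unit distance."""
    return np.mean(power_w) / (mass_kg * g * np.mean(speed_ms))

def impact_intensity_proxy(contact_forces_n, mass_kg, g=9.81):
    """Assumed proxy: peak ground-reaction force over body weight."""
    return np.max(contact_forces_n) / (mass_kg * g)
```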
☆ Zero Wrench Control via Wrench Disturbance Observer for Learning-free Peg-in-hole Assembly
This paper proposes a Dynamic Wrench Disturbance Observer (DW-DOB) designed to achieve highly sensitive zero-wrench control in contact-rich manipulation. By embedding task-space inertia into the observer's nominal model, DW-DOB cleanly separates intrinsic dynamic reactions from true external wrenches. This preserves sensitivity to small forces and moments while ensuring robust regulation of contact wrenches. A passivity-based analysis further demonstrates that DW-DOB guarantees stable interactions under dynamic conditions, addressing the shortcomings of conventional observers that fail to compensate for inertial effects. Peg-in-hole experiments at industrial tolerances (H7/h6) validate the approach, yielding deeper and more compliant insertions with minimal residual wrenches and outperforming a conventional wrench disturbance observer and a PD baseline. These results highlight DW-DOB as a practical learning-free solution for high-precision zero-wrench control in contact-rich tasks.
☆ SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning
Sequential-Horizon Vision-and-Language Navigation (SH-VLN) presents a challenging scenario in which agents must sequentially execute multi-task navigation guided by complex, long-horizon language instructions. Current vision-and-language navigation models exhibit significant performance degradation with such multi-task instructions, as information overload impairs the agent's ability to attend to observationally relevant details. To address this problem, we propose SeqWalker, a navigation model built on a hierarchical planning framework. SeqWalker features: i) a High-Level Planner that dynamically distills the global instruction into contextually relevant sub-instructions based on the agent's current visual observations, thus reducing cognitive load; ii) a Low-Level Planner incorporating an Exploration-Verification strategy that leverages the inherent logical structure of instructions for trajectory error correction. To evaluate SH-VLN performance, we also extend the IVLN dataset and establish a new benchmark. Extensive experiments demonstrate the superiority of the proposed SeqWalker.
☆ Nightmare Dreamer: Dreaming About Unsafe States And Planning Ahead
Reinforcement Learning (RL) has shown remarkable success in real-world applications, particularly in robotics control. However, RL adoption remains limited due to insufficient safety guarantees. We introduce Nightmare Dreamer, a model-based Safe RL algorithm that addresses safety concerns by leveraging a learned world model to predict potential safety violations and plan actions accordingly. Nightmare Dreamer achieves nearly zero safety violations while maximizing rewards, and outperforms model-free baselines on Safety Gymnasium tasks using only image observations, achieving nearly a 20x improvement in efficiency.
comment: RSS'25: Multi-Objective Optimization and Planning in Robotics Workshop: 5 pages, 8 figures
☆ Optimizing Path Planning using Deep Reinforcement Learning for UGVs in Precision Agriculture
This study focuses on optimizing path planning for unmanned ground vehicles (UGVs) in precision agriculture using deep reinforcement learning (DRL) techniques in continuous action spaces. The research begins with a review of traditional grid-based methods, such as A* and Dijkstra's algorithms, and discusses their limitations in dynamic agricultural environments, highlighting the need for adaptive learning strategies. The study then explores DRL approaches, including Deep Q-Networks (DQN), which demonstrate improved adaptability and performance in two-dimensional simulations. Enhancements such as Double Q-Networks and Dueling Networks are evaluated to further improve decision-making. Building on these results, the focus shifts to continuous action space models, specifically Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3), which are tested in increasingly complex environments. Experiments conducted in a three-dimensional environment using ROS and Gazebo demonstrate the effectiveness of continuous DRL algorithms in navigating dynamic agricultural scenarios. Notably, the pretrained TD3 agent achieves a 95 percent success rate in dynamic environments, demonstrating the robustness of the proposed approach in handling moving obstacles while ensuring safety for both crops and the robot.
☆ Model of Spatial Human-Agent Interaction with Consideration for Others
Communication robots often need to initiate conversations with people in public spaces. At the same time, such robots must not disturb pedestrians. To handle these two requirements, an agent needs to estimate the communication desires of others based on their behavior and then adjust its own communication activities accordingly. In this study, we construct a computational spatial interaction model that considers others. Consideration is expressed as a quantitative parameter: the amount of adjustment of one's internal state to the estimated internal state of the other. To validate the model, we conducted experiments in which a human and a virtual robot interacted in a VR environment. The results show that when the participant moved toward the target, a virtual robot with a low consideration value inhibited the participant's movement, while a robot with a higher consideration value did not. When the participant approached the robot, the robot exhibited approaching behavior regardless of the consideration value, thus decreasing the participant's movement. These results appear to verify the proposed model's ability to clarify interactions with consideration for others.
☆ UniBiDex: A Unified Teleoperation Framework for Robotic Bimanual Dexterous Manipulation
We present UniBiDex, a unified teleoperation framework for robotic bimanual dexterous manipulation that supports both VR-based and leader-follower input modalities. UniBiDex enables real-time contact-rich dual-arm teleoperation by integrating heterogeneous input devices into a shared control stack with consistent kinematic treatment and safety guarantees. The framework employs null-space control to optimize bimanual configurations, ensuring smooth, collision-free, and singularity-aware motion across tasks. We validate UniBiDex on a long-horizon kitchen-tidying task involving five sequential manipulation subtasks, demonstrating higher task success rates, smoother trajectories, and improved robustness compared to strong baselines. By releasing all hardware and software components as open source, we aim to lower the barrier to collecting large-scale, high-quality human demonstration datasets and accelerate progress in robot learning.
☆ Discrete Fourier Transform-based Point Cloud Compression for Efficient SLAM in Featureless Terrain
Simultaneous Localization and Mapping (SLAM) is an essential technology for the efficiency and reliability of unmanned robotic exploration missions. While onboard computational capability and communication bandwidth are critically limited, the point cloud data handled by SLAM are large in size, drawing attention to data compression methods. To address this problem, we propose a new method for compressing point cloud maps that exploits the Discrete Fourier Transform (DFT). The proposed technique converts the Digital Elevation Model (DEM) into a frequency-domain 2D image and omits its high-frequency components, targeting the exploration of gradual terrain such as planetary surfaces and deserts. In contrast to highly structured terrain such as artificial environments, high-frequency components contribute little to the representation of gradual terrain, so the method compresses the data effectively without significant degradation of the point cloud. We evaluated the method in terms of compression rate and accuracy using camera sequences of two terrains with different elevation profiles.
comment: Author's version of a manuscript accepted at the 11th International Conference on Automation, Robotics, and Applications (ICARA). (c) IEEE
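The compression scheme described above reduces to a few lines of NumPy. This sketch keeps only a low-frequency block of the DEM's 2D spectrum; the square cutoff shape and the absence of coefficient quantization are simplifying assumptions rather than the paper's exact encoding.

```python
# Minimal sketch of DFT-based DEM compression: keep only the low-frequency
# center of the shifted 2D spectrum and discard the rest.
import numpy as np

def compress_dem(dem, keep_ratio=0.1):
    """Return the retained low-frequency coefficients of a 2D elevation map."""
    spec = np.fft.fftshift(np.fft.fft2(dem))   # low frequencies at the center
    h, w = dem.shape
    kh, kw = int(h * keep_ratio), int(w * keep_ratio)
    cy, cx = h // 2, w // 2
    return spec[cy - kh:cy + kh + 1, cx - kw:cx + kw + 1], dem.shape

def decompress_dem(kept, shape):
    """Zero-pad the kept coefficients back to full size and invert the FFT."""
    h, w = shape
    spec = np.zeros((h, w), dtype=complex)
    kh, kw = kept.shape[0] // 2, kept.shape[1] // 2
    cy, cx = h // 2, w // 2
    spec[cy - kh:cy + kh + 1, cx - kw:cx + kw + 1] = kept
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))
```

Because only the small `kept` block needs to be transmitted, the compression rate is set directly by `keep_ratio`, matching the abstract's premise that high frequencies contribute little on gradual terrain.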
☆ Data-Driven Terramechanics Approach Towards a Realistic Real-Time Simulator for Lunar Rovers
High-fidelity simulators for the lunar surface provide a digital environment for extensive testing of rover operations and mission planning. However, current simulators focus on either visual realism or physical accuracy, which limits their capability to replicate lunar conditions comprehensively. This work addresses that gap by combining high visual fidelity with realistic terrain interaction for a realistic representation of rovers on the lunar surface. Because direct simulation of wheel-soil interactions is computationally expensive, a data-driven approach was adopted, using regression models for slip and sinkage from data collected in both full-rover and single-wheel experiments and simulations. The resulting regression-based terramechanics model accurately reproduced steady-state and dynamic slip, as well as sinkage behavior, on flat terrain and slopes up to 20 degrees, with validation against field test results. Additionally, improvements were made to enhance the realism of terrain deformation and wheel trace visualization. This method supports real-time applications that require physically plausible terrain response alongside high visual fidelity.
comment: Author's version of a manuscript accepted at the International Conference on Space Robotics 2025 (iSpaRo 2025). (c) IEEE
☆ Design and Development of Modular Limbs for Reconfigurable Robots on the Moon
In this paper, we present the development of 4-DOF robot limbs, which we call Moonbots, designed to connect in various configurations with each other and wheel modules, enabling adaptation to different environments and tasks. These modular components are intended primarily for robotic systems in space exploration and construction on the Moon in our Moonshot project. Such modular robots add flexibility and versatility for space missions where resources are constrained. Each module is driven by a common actuator characterized by a high torque-to-speed ratio, supporting both precise control and dynamic motion when required. This unified actuator design simplifies development and maintenance across the different module types. The paper describes the hardware implementation, the mechanical design of the modules, and the overall software architecture used to control and coordinate them. Additionally, we evaluate the control performance of the actuator under various load conditions to characterize its suitability for modular robot applications. To demonstrate the adaptability of the system, we introduce nine functional configurations assembled from the same set of modules: 4DOF-limb, 8DOF-limb, vehicle, dragon, minimal, quadruped, cargo, cargo-minimal, and bike. These configurations reflect different locomotion strategies and task-specific behaviors, offering a practical foundation for further research in reconfigurable robotic systems.
comment: Author's version of a manuscript accepted at the International Conference on Space Robotics 2025 (iSpaRo 2025). (c) IEEE
☆ Multiagent Reinforcement Learning with Neighbor Action Estimation
Multiagent reinforcement learning is a prominent paradigm for collaborative decision-making within complex systems. However, existing approaches often rely on explicit action exchange between agents to evaluate action value functions, which is frequently impractical in real-world engineering environments due to communication constraints, latency, energy consumption, and reliability requirements. This paper proposes an enhanced multiagent reinforcement learning framework that employs action estimation neural networks to infer agent behaviors. By integrating a lightweight action estimation module, each agent infers neighboring agents' behaviors using only locally observable information, enabling collaborative policy learning without explicit action sharing. This approach is fully compatible with standard TD3 algorithms and scalable to larger multiagent systems. At the engineering application level, the framework has been implemented and validated in a dual-arm robotic manipulation task in which two robotic arms collaboratively lift objects. Experimental results demonstrate that this approach significantly enhances the robustness and deployment feasibility of real-world robotic systems while reducing dependence on information infrastructure. Overall, this research advances the development of decentralized multiagent artificial intelligence systems while enabling AI to operate effectively in dynamic, information-constrained real-world environments.
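A minimal PyTorch sketch of the action-estimation idea follows, under the usual centralized-training assumption that true neighbor actions are available as regression targets during training but not at execution time; all dimensions and the network shape are illustrative, not the paper's.

```python
# Illustrative sketch: each agent estimates its neighbors' actions from its
# own local observation; a TD3 critic can consume the estimate instead of
# communicated actions.
import torch
import torch.nn as nn

class NeighborActionEstimator(nn.Module):
    def __init__(self, obs_dim, n_neighbors, act_dim, hidden=128):
        super().__init__()
        self.n_neighbors, self.act_dim = n_neighbors, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_neighbors * act_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, local_obs):
        return self.net(local_obs).view(-1, self.n_neighbors, self.act_dim)

# During centralized training, regress onto the true neighbor actions;
# at execution time only local_obs is needed.
est = NeighborActionEstimator(obs_dim=16, n_neighbors=1, act_dim=6)
obs = torch.randn(32, 16)
true_neighbor_actions = torch.rand(32, 1, 6) * 2 - 1
loss = nn.MSELoss()(est(obs), true_neighbor_actions)
loss.backward()
```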
☆ Fast Continuum Robot Shape and External Load State Estimation on SE(3) ICRA 2026
Previous on-manifold approaches to continuum robot state estimation have typically adopted simplified Cosserat rod models, which cannot directly account for actuation inputs or external loads. We introduce a general framework that incorporates uncertainty models for actuation (e.g., tendon tensions), applied forces and moments, process noise, boundary conditions, and arbitrary backbone measurements. By adding temporal priors across time steps, our method additionally performs joint estimation in both the spatial (arclength) and temporal domains, enabling full \textit{spacetime} state estimation. Discretizing the arclength domain yields a factor graph representation of the continuum robot model, which can be exploited for fast batch sparse nonlinear optimization in the style of SLAM. The framework is general and applies to a broad class of continuum robots; as illustrative cases, we show (i) tendon-driven robots in simulation, where we demonstrate real-time kinematics with uncertainty, tip force sensing from position feedback, and distributed load estimation from backbone strain, and (ii) a surgical concentric tube robot in experiment, where we validate accurate kinematics and tip force estimation, highlighting potential for surgical palpation.
comment: Public preprint for ICRA 2026
♻ ☆ $π_0$: A Vision-Language-Action Flow Model for General Robot Control
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.
comment: See project website for videos: https://physicalintelligence.company/blog/pi0 Published in RSS 2025
♻ ☆ Uncertainty-Aware Robotic World Model Makes Offline Model-Based Reinforcement Learning Work on Real Robots
Reinforcement Learning (RL) has achieved impressive results in robotics, yet high-performing pipelines remain highly task-specific, with little reuse of prior data. Offline Model-based RL (MBRL) offers greater data efficiency by training policies entirely from existing datasets, but suffers from compounding errors and distribution shift in long-horizon rollouts. Although existing methods have shown success in controlled simulation benchmarks, robustly applying them to the noisy, biased, and partially observed datasets typical of real-world robotics remains challenging. We present a principled pipeline for making offline MBRL effective on physical robots. Our RWM-U extends autoregressive world models with epistemic uncertainty estimation, enabling temporally consistent multi-step rollouts with uncertainty effectively propagated over long horizons. We combine RWM-U with MOPO-PPO, which adapts uncertainty-penalized policy optimization to the stable, on-policy PPO framework for real-world control. We evaluate our approach on diverse manipulation and locomotion tasks in simulation and on a real quadruped and a real humanoid, training policies entirely from offline datasets. The resulting policies consistently outperform model-free and uncertainty-unaware model-based baselines, and fusing real-world data into model learning further yields robust policies that surpass online model-free baselines trained solely in simulation.
♻ ☆ ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow on representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. Rectified Flow policies achieved an average net episode-reward growth of 135.36% after fine-tuning on challenging legged locomotion tasks, while using fewer denoising steps and saving 82.63% of wall time compared to the state-of-the-art diffusion RL fine-tuning method DPPO [43]. Shortcut Model policies achieved an average net increase of 40.34% in success rate on state-based and visual manipulation tasks after fine-tuning with ReinFlow at four or even one denoising step, with performance comparable to fine-tuned DDIM policies while saving an average of 23.20% in computation time. Project webpage: https://reinflow.github.io/
comment: 38 pages
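The core trick, injecting learnable noise so the K-step flow becomes a discrete-time Markov process with exact likelihoods, can be sketched as follows. Conditioning on observations is omitted for brevity, and the network shapes are assumptions; this is not the authors' implementation.

```python
# Not the authors' code: a minimal noisy-flow policy whose K denoising
# steps form a Markov chain with exact Gaussian per-step log-likelihoods.
import torch
import torch.nn as nn

class NoisyFlowPolicy(nn.Module):
    def __init__(self, act_dim=6, hidden=128, K=4):
        super().__init__()
        self.K, self.act_dim = K, act_dim
        self.vel = nn.Sequential(                      # velocity field v(a, t)
            nn.Linear(act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))
        self.log_sigma = nn.Parameter(torch.full((act_dim,), -1.0))

    def sample_with_logprob(self, batch=32):
        a = torch.randn(batch, self.act_dim)           # a_0 ~ N(0, I)
        dt, logp = 1.0 / self.K, torch.zeros(batch)
        for k in range(self.K):
            t = torch.full((batch, 1), k * dt)
            mean = a + self.vel(torch.cat([a, t], dim=-1)) * dt
            dist = torch.distributions.Normal(mean, self.log_sigma.exp())
            a = dist.rsample()                         # learnable exploration noise
            logp = logp + dist.log_prob(a).sum(-1)     # exact chain log-prob
        return a, logp                                 # logp feeds the RL objective
```

Because each transition is Gaussian, the chain's log-probability is a simple sum, which is what makes standard policy-gradient updates applicable to the fine-tuned flow.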
♻ ☆ ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination
Visual navigation is a fundamental capability for autonomous home-assistance robots, enabling long-horizon tasks such as object search. While recent methods have leveraged Large Language Models (LLMs) to incorporate commonsense reasoning and improve exploration efficiency, their planning remains constrained by textual representations, which cannot adequately capture spatial occupancy or scene geometry--critical factors for navigation decisions. We explore whether Vision-Language Models (VLMs) can achieve mapless visual navigation using only onboard RGB/RGB-D streams, unlocking their potential for spatial perception and planning. We achieve this through an imagination-powered navigation framework, ImagineNav++, which imagines future observation images from candidate robot views and translates navigation planning into a simple best-view image selection problem for VLMs. First, a future-view imagination module distills human navigation preferences to generate semantically meaningful viewpoints with high exploration potential. These imagined views then serve as visual prompts for the VLM to identify the most informative viewpoint. To maintain spatial consistency, we develop a selective foveation memory mechanism, which hierarchically integrates keyframe observations via a sparse-to-dense framework, constructing a compact yet comprehensive memory for long-term spatial reasoning. This approach transforms goal-oriented navigation into a series of tractable point-goal navigation tasks. Extensive experiments on open-vocabulary object and instance navigation benchmarks show that ImagineNav++ achieves SOTA performance in mapless settings, even surpassing most map-based methods, highlighting the importance of scene imagination and memory in VLM-based spatial reasoning.
comment: 17 pages, 10 figures. arXiv admin note: text overlap with arXiv:2410.09874
♻ ☆ Cognitive-Hierarchy Guided End-to-End Planning for Autonomous Driving
While end-to-end autonomous driving has advanced significantly, prevailing methods remain fundamentally misaligned with human cognitive principles in both perception and planning. In this paper, we propose CogAD, a novel end-to-end autonomous driving model that emulates the hierarchical cognition mechanisms of human drivers. CogAD implements dual hierarchical mechanisms: global-to-local context processing for human-like perception and intent-conditioned multi-mode trajectory generation for cognitively-inspired planning. The proposed method demonstrates three principal advantages: comprehensive environmental understanding through hierarchical perception, robust planning exploration enabled by multi-level planning, and diverse yet reasonable multi-modal trajectory generation facilitated by dual-level uncertainty modeling. Extensive experiments on nuScenes and Bench2Drive demonstrate that CogAD achieves state-of-the-art performance in end-to-end planning, exhibiting particular superiority in long-tail scenarios and robust generalization to complex real-world driving conditions.
♻ ☆ Solving Robotics Tasks with Prior Demonstration via Exploration-Efficient Deep Reinforcement Learning
This paper proposes an exploration-efficient Deep Reinforcement Learning with Reference policy (DRLR) framework for learning robotics tasks that incorporates demonstrations. The DRLR framework is developed based on an algorithm called Imitation Bootstrapped Reinforcement Learning (IBRL). We propose to improve IBRL by modifying the action selection module. The proposed action selection module provides a calibrated Q-value, which mitigates the bootstrapping error that otherwise leads to inefficient exploration. Furthermore, to prevent the RL policy from converging to a sub-optimal policy, SAC is used as the RL policy instead of TD3. The effectiveness of our method in mitigating bootstrapping error and preventing overfitting is empirically validated on two robotics tasks, bucket loading and drawer opening, which require extensive interactions with the environment. Simulation results also demonstrate the robustness of the DRLR framework across tasks with both low and high state-action dimensions and varying demonstration qualities. To evaluate the developed framework on a real-world industrial robotics task, the bucket loading task is deployed on a real wheel loader. The sim2real results validate the successful deployment of the DRLR framework.
comment: This paper has been accepted for Journal publication in Frontiers in Robotics and AI
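IBRL-style action selection executes whichever proposal, imitation or RL, the critic scores higher. The sketch below uses the twin-critic minimum as a conservative score; whether this matches the paper's calibration is an assumption, and the policy/critic handles are placeholders.

```python
# Placeholder sketch: IBRL-style proposal selection with a conservative
# twin-critic score, written for a single (unbatched) observation.
import torch

def select_action(obs, il_policy, rl_policy, q1, q2):
    """Execute whichever proposal the conservative critic prefers."""
    candidates = [il_policy(obs), rl_policy(obs)]
    scores = [torch.min(q1(obs, a), q2(obs, a)) for a in candidates]
    return candidates[0] if scores[0] >= scores[1] else candidates[1]
```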
♻ ☆ Properties of Lyapunov Subcenter Manifolds in Conservative Mechanical Systems
Multi-body mechanical systems have rich internal dynamics, whose solutions can be exploited as efficient control targets. Yet, solutions non-trivially depend on system parameters, obscuring feasible properties for use as target trajectories. For periodic regulation tasks in robotics applications, we investigate properties of nonlinear normal modes (NNMs) collected in Lyapunov subcenter manifolds (LSMs) of conservative mechanical systems. Using a time-symmetry of conservative mechanical systems, we show that mild non-resonance conditions guarantee LSMs to be Eigenmanifolds, in which NNMs are guaranteed to oscillate between two points of zero velocity. We also prove the existence of a unique generator, which is a connected, 1D manifold that collects these points of zero velocity for a given Eigenmanifold. Furthermore, we show that an additional spatial symmetry provides LSMs with yet stronger properties of Rosenberg manifolds. Here all brake trajectories pass through a unique equilibrium configuration, which can be favorable for control applications. These theoretical results are numerically confirmed on two mechanical systems: a double pendulum and a 5-link pendulum.
comment: 18 pages, 27 figures, submitted to Automatica
♻ ☆ SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection
Developing robust drone detection systems is often constrained by the limited availability of large-scale annotated training data and the high costs associated with real-world data collection. However, leveraging synthetic data generated via game engine-based simulations provides a promising and cost-effective solution to overcome this issue. Therefore, we present SynDroneVision, a synthetic dataset specifically designed for RGB-based drone detection in surveillance applications. Featuring diverse backgrounds, lighting conditions, and drone models, SynDroneVision offers a comprehensive training foundation for deep learning algorithms. To evaluate the dataset's effectiveness, we perform a comparative analysis across a selection of recent YOLO detection models. Our findings demonstrate that SynDroneVision is a valuable resource for real-world data enrichment, achieving notable enhancements in model performance and robustness, while significantly reducing the time and costs of real-world data acquisition. SynDroneVision will be publicly released upon paper acceptance.
♻ ☆ RoboReward: General-Purpose Vision-Language Reward Models for Robotics
A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotics, obtaining such rewards typically requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real robot tasks is poorly understood. In this work, we aim to close this gap by introducing (1) RoboReward, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, we propose a negative-example data augmentation pipeline that generates calibrated negatives and near-misses via counterfactual relabeling of successful episodes and temporal clipping to create partial-progress outcomes from the same videos. Using this framework, we build a large training and evaluation dataset spanning diverse tasks and embodiments to test whether state-of-the-art VLMs can reliably provide rewards for robot learning. Our evaluation of open and proprietary VLMs finds that no model excels across tasks, highlighting substantial room for improvement. We then train general-purpose 4B- and 8B-parameter models that outperform much larger VLMs in assigning rewards for short-horizon robotic tasks. Finally, we deploy the 8B model in real-robot reinforcement learning and find that it improves policy learning over Gemini Robotics-ER 1.5 while narrowing the gap to RL training with human-provided rewards. We release the full dataset, trained reward models, and evaluation suite on our website to advance the development of general-purpose reward models in robotics: https://crfm.stanford.edu/helm/robo-reward-bench (project website).
♻ ☆ Human-Aided Trajectory Planning for Automated Vehicles through Teleoperation and Arbitration Graphs
Teleoperation enables remote human support of automated vehicles in scenarios where the automation cannot find an appropriate solution. Remote assistance concepts, in which operators provide discrete inputs to aid specific automation modules such as planning, are gaining interest due to their reduced workload on the human remote operator and improved safety. However, these concepts are challenging to implement and maintain due to their deep integration and interaction with the automated driving system. In this paper, we propose a solution that facilitates the implementation of remote assistance concepts that intervene on the planning level and extend the operational design domain of the vehicle at runtime. Using arbitration graphs, a modular decision-making framework, we integrate remote assistance into an existing automated driving system without modifying the original software components. Our simulative implementation demonstrates this approach in two use cases, allowing operators to adjust planner constraints and enable trajectory generation beyond nominal operational design domains.
comment: 7 pages, 8 figures, presented at IEEE Intelligent Vehicles Symposium 2025, video demonstration available at https://www.youtube.com/watch?v=fVSO-YOeGMk
♻ ☆ Bio-Skin: A Cost-Effective Thermostatic Tactile Sensor with Multi-Modal Force and Temperature Detection IROS2025
Tactile sensors can significantly enhance the perception of humanoid robotic systems by providing contact information that facilitates human-like interactions. However, existing commercial tactile sensors focus on improving the resolution and sensitivity of single-modal detection with high-cost components and densely integrated designs, incurring complex manufacturing processes and unaffordable prices. In this work, we present Bio-Skin, a cost-effective multi-modal tactile sensor that utilizes single-axis Hall-effect sensors for planar normal force measurement and bar-shaped piezoresistors for 2D shear force measurement. A thermistor coupled with a heating wire is integrated into a silicone body to achieve temperature sensation and a thermostatic function analogous to human skin. We also present a cross-reference framework to validate the two modalities of the force sensing signal, improving sensing fidelity in a complex electromagnetic environment. Bio-Skin has a multi-layer design, and each layer is manufactured sequentially and subsequently integrated, offering a fast production pathway. After calibration, Bio-Skin demonstrates performance metrics, including signal-to-range ratio, sampling rate, and measurement range, comparable to those of current commercial products at one-tenth of the cost. The sensor's real-world performance is evaluated using an Allegro hand in object grasping tasks, while its temperature regulation functionality is assessed in a material detection task.
comment: This work has been published by IROS2025
♻ ☆ Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.
comment: Survey; 40 pages, 7 figures, 9 tables; GitHub Repo at https://github.com/worldbench/awesome-spatial-intelligence
♻ ☆ FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis
Due to the deformability of garments, generating a large amount of high-quality data for robotic garment manipulation tasks is highly challenging. In this paper, we present a synthetic garment dataset that can be used for robotic garment folding. We begin by constructing geometric garment templates based on keypoints and applying generative models to generate realistic texture patterns. Leveraging these keypoint annotations, we generate folding demonstrations in simulation and train folding policies via closed-loop imitation learning. To improve robustness, we propose KG-DAgger, which uses a keypoint-based strategy to generate demonstration data for recovering from failures. KG-DAgger significantly improves the model performance, boosting the real-world success rate by 25\%. After training with 15K trajectories (about 2M image-action pairs), the model achieves a 75\% success rate in the real world. Experiments in both simulation and real-world settings validate the effectiveness of our proposed framework.
Robotics 40
☆ Choreographing a World of Dynamic Objects
Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing such phenomena. Traditional rule-based graphics pipelines for creating these dynamics rely on category-specific heuristics and are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits the universality of video generative models through a distillation-based pipeline that extracts the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness through experiments generating a diverse range of multi-body 4D dynamics, show its advantages over existing methods, and demonstrate its applicability to generating robotics manipulation policies. Project page: https://yanzhelyu.github.io/chord
☆ Embedding Autonomous Agents in Resource-Constrained Robotic Platforms
Many embedded devices operate under resource constraints and in dynamic environments, requiring local decision-making capabilities. Enabling devices to make independent decisions in such environments can improve the responsiveness of the system and reduce the dependence on constant external control. In this work, we integrate an autonomous agent, programmed using AgentSpeak, with a small two-wheeled robot that explores a maze using its own decision-making and sensor data. Experimental results show that the agent successfully solved the maze in 59 seconds using 287 reasoning cycles, with decision phases taking less than one millisecond. These results indicate that the reasoning process is efficient enough for real-time execution on resource-constrained hardware. This integration demonstrates how high-level agent-based control can be applied to resource-constrained embedded systems for autonomous operation.
comment: This is an open-access, author-archived version of a manuscript published in European Conference on Multi-Agent Systems 2025
☆ Hierarchical GNN-Based Multi-Agent Learning for Dynamic Queue-Jump Lane and Emergency Vehicle Corridor Formation
Emergency vehicles require rapid passage through congested traffic, yet existing strategies fail to adapt to dynamic conditions. We propose a novel hierarchical graph neural network (GNN)-based multi-agent reinforcement learning framework to coordinate connected vehicles for emergency corridor formation. Our approach uses a high-level planner for global strategy and low-level controllers for trajectory execution, utilizing graph attention networks to scale with variable agent counts. Trained via Multi-Agent Proximal Policy Optimization (MAPPO), the system reduces emergency vehicle travel time by 28.3% compared to baselines and 44.6% compared to uncoordinated traffic in simulations. The design achieves near-zero collision rates (0.3%) while maintaining 81% of background traffic efficiency. Ablation and generalization studies confirm the framework's robustness across diverse scenarios. These results demonstrate the effectiveness of combining GNNs with hierarchical learning for intelligent transportation systems.
comment: 16 Pages, 5 Figures, 9 Tables, submitted to IEEE TITS
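The scaling claim above rests on graph attention, which pools a variable number of connected vehicles into a fixed-size embedding. Below is a generic single-head GAT-style layer (after Velickovic et al., 2018), not the authors' exact architecture; the adjacency matrix is assumed to include self-loops so every node attends to at least itself.

```python
# Generic single-head graph-attention aggregation over vehicle nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        """x: (N, in_dim) node features; adj: (N, N) 0/1 adjacency with self-loops."""
        h = self.W(x)                                   # (N, out_dim)
        N = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                           h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))  # (N, N) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))      # attend to neighbors only
        alpha = torch.softmax(e, dim=-1)                # per-node attention weights
        return alpha @ h                                # weighted neighbor mixture
```

Because the layer operates on whatever adjacency it is given, the same weights handle scenes with different numbers of vehicles, which is the property the framework exploits.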
☆ Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test
As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Built upon 609 robot manipulation sequences, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, achieving a high Pearson correlation (>0.93) between the overall score and human preference and establishing a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamics Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to $\approx$ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.
☆ CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos
Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. Furthermore, we propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling the effective transfer of skills from human videos to robotic execution. Project page: https://lin-shan.com/CLAP/.
comment: Project page: https://lin-shan.com/CLAP/
☆ Stable Language Guidance for Vision-Language-Action Models
Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical ``modality collapse'' phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose \textbf{Residual Semantic Steering (RSS)}, a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) \textbf{Monte Carlo Syntactic Integration}, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) \textbf{Residual Affordance Steering}, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
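The dual-stream decoding step can be pictured as classifier-free-guidance-style logit arithmetic: score actions with and without the language input and amplify the difference. The sketch below is one reading of the abstract; the model call signature, the use of an empty instruction for the vision-only stream, and the guidance weight are all assumptions.

```python
# One possible reading of "Residual Affordance Steering" as logit arithmetic.
import torch

def residual_steered_logits(model, obs, instruction, beta=0.5):
    logits_full = model(obs, instruction)   # vision + language stream
    logits_vision = model(obs, "")          # visual affordance prior only
    # Amplify the residual that language contributes beyond the visual prior.
    return logits_full + beta * (logits_full - logits_vision)
```

Subtracting the vision-only stream is what prevents the strong visual prior from drowning out sparse linguistic signals, the "modality collapse" the abstract describes.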
☆ CoINS: Counterfactual Interactive Navigation via Skill-Aware VLM
Recent Vision-Language Models (VLMs) have demonstrated significant potential in robotic planning. However, they typically function as semantic reasoners, lacking an intrinsic understanding of the specific robot's physical capabilities. This limitation is particularly critical in interactive navigation, where robots must actively modify cluttered environments to create traversable paths. Existing VLM-based navigators are predominantly confined to passive obstacle avoidance, failing to reason about when and how to interact with objects to clear blocked paths. To bridge this gap, we propose Counterfactual Interactive Navigation via Skill-aware VLM (CoINS), a hierarchical framework that integrates skill-aware reasoning and robust low-level execution. Specifically, we fine-tune a VLM, named InterNav-VLM, which incorporates skill affordance and concrete constraint parameters into the input context and grounds them into a metric-scale environmental representation. By internalizing the logic of counterfactual reasoning through fine-tuning on the proposed InterNav dataset, the model learns to implicitly evaluate the causal effects of object removal on navigation connectivity, thereby determining interaction necessity and target selection. To execute the generated high-level plans, we develop a comprehensive skill library through reinforcement learning, specifically introducing traversability-oriented strategies to manipulate diverse objects for path clearance. A systematic benchmark in Isaac Sim is proposed to evaluate both the reasoning and execution aspects of interactive navigation. Extensive simulations and real-world experiments demonstrate that CoINS significantly outperforms representative baselines, achieving a 17\% higher overall success rate and over 80\% improvement in complex long-horizon scenarios compared to the best-performing baseline
comment: 17 pages, 13 figures
☆ An Event-Based Opto-Tactile Skin
This paper presents a neuromorphic, event-driven tactile sensing system for soft, large-area skin, based on the Dynamic Vision Sensors (DVS) integrated with a flexible silicone optical waveguide skin. Instead of repetitively scanning embedded photoreceivers, this design uses a stereo vision setup comprising two DVS cameras looking sideways through the skin. Such a design produces events as changes in brightness are detected, and estimates press positions on the 2D skin surface through triangulation, utilizing Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to find the center of mass of contact events resulting from pressing actions. The system is evaluated over a 4620 mm2 probed area of the skin using a meander raster scan. Across 95 % of the presses visible to both cameras, the press localization achieved a Root-Mean-Squared Error (RMSE) of 4.66 mm. The results highlight the potential of this approach for wide-area flexible and responsive tactile sensors in soft robotics and interactive environments. Moreover, we examined how the system performs when the amount of event data is strongly reduced. Using stochastic down-sampling, the event stream was reduced to 1/1024 of its original size. Under this extreme reduction, the average localization error increased only slightly (from 4.66 mm to 9.33 mm), and the system still produced valid press localizations for 85 % of the trials. This reduction in pass rate is expected, as some presses no longer produce enough events to form a reliable cluster for triangulation. These results show that the sensing approach remains functional even with very sparse event data, which is promising for reducing power consumption and computational load in future implementations. The system exhibits a detection latency distribution with a characteristic width of 31 ms.
comment: Accepted for publication in Frontiers in Neuromorphic Engineering. 23 pages, 9 figures
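For illustration, a minimal sketch of the clustering-and-triangulation step described above. The function names, the DBSCAN parameters, and the ray parametrization are assumptions of this sketch, not details from the paper:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def press_center(events_xy, eps=3.0, min_samples=10):
    """Cluster one camera's brightness-change events (N x 2 pixel coords)
    and return the center of mass of the largest cluster; DBSCAN labels
    noise events -1, which are ignored."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(events_xy)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return None  # too few events to form a reliable cluster
    biggest = np.bincount(valid).argmax()
    return events_xy[labels == biggest].mean(axis=0)

def triangulate(p0, d0, p1, d1):
    """Intersect, in a least-squares sense, the two bearing rays
    p_i + t_i * d_i cast from the two sideways-looking cameras into the
    2D skin plane to recover the press position."""
    A = np.column_stack([d0, -d1])
    t = np.linalg.lstsq(A, p1 - p0, rcond=None)[0]
    return p0 + t[0] * d0
```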
☆ Towards Safe Autonomous Driving: A Real-Time Motion Planning Algorithm on Embedded Hardware
Ensuring the functional safety of Autonomous Vehicles (AVs) requires motion planning modules that not only operate within strict real-time constraints but also maintain controllability in case of system faults. Existing safeguarding concepts, such as Online Verification (OV), provide safety layers that detect infeasible planning outputs. However, they lack an active mechanism to ensure safe operation in the event that the main planner fails. This paper presents a first step toward an active safety extension for fail-operational Autonomous Driving (AD). We deploy a lightweight sampling-based trajectory planner on an automotive-grade, embedded platform running a Real-Time Operating System (RTOS). The planner continuously computes trajectories under constrained computational resources, forming the foundation for future emergency planning architectures. Experimental results demonstrate deterministic timing behavior with bounded latency and minimal jitter, validating the feasibility of trajectory planning on safety-certifiable hardware. The study highlights both the potential and the remaining challenges of integrating active fallback mechanisms as an integral part of next-generation safeguarding frameworks. The code is available at: https://github.com/TUM-AVS/real-time-motion-planning
comment: 7 pages, submitted to the IEEE Intelligent Vehicles Symposium (IV 2026), Detroit, MI, United States
☆ Bayesian Monocular Depth Refinement via Neural Radiance Fields
Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate superior performance on key metrics in experiments using indoor scenes from the SUN RGB-D dataset.
comment: IEEE 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025). Oral presentation; Best Presenter Award
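For readers who want the core of step (3), a minimal sketch of per-pixel precision-weighted (inverse-variance) fusion under a Gaussian noise model; array names are hypothetical:

```python
import numpy as np

def fuse_depths(d_mono, var_mono, d_nerf, var_nerf):
    """Per-pixel Bayesian fusion of two noisy depth maps under a Gaussian
    model. Each estimate is weighted by its precision (inverse variance),
    so the NeRF depth dominates where its rendering uncertainty is low,
    injecting fine detail, while the monocular prior preserves global
    structure elsewhere."""
    w_mono = 1.0 / var_mono
    w_nerf = 1.0 / var_nerf
    d_fused = (w_mono * d_mono + w_nerf * d_nerf) / (w_mono + w_nerf)
    var_fused = 1.0 / (w_mono + w_nerf)
    return d_fused, var_fused
```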
☆ Integrating Sample Inheritance into Bayesian Optimization for Evolutionary Robotics
In evolutionary robotics, robot morphologies are designed automatically using evolutionary algorithms. This creates a body-brain optimization problem, where both morphology and control must be optimized together. A common approach is to include controller optimization for each morphology, but starting from scratch for every new body may require a high controller learning budget. We address this by using Bayesian optimization for controller optimization, exploiting its sample efficiency and strong exploration capabilities, and using sample inheritance as a form of Lamarckian inheritance. Under a deliberately low controller learning budget for each morphology, we investigate two types of sample inheritance: (1) transferring all the parent's samples to the offspring to be used as prior without evaluating them, and (2) reevaluating the parent's best samples on the offspring. Both are compared to a baseline without inheritance. Our results show that reevaluation performs best, with prior-based inheritance also outperforming no inheritance. Analysis reveals that while the learning budget is too low for a single morphology, generational inheritance compensates for this by accumulating learned adaptations across generations. Furthermore, inheritance mainly benefits offspring morphologies that are similar to their parents. Finally, we demonstrate the critical role of the environment, with more challenging environments resulting in more stable walking gaits. Our findings highlight that inheritance mechanisms can boost performance in evolutionary robotics without needing large learning budgets, offering an efficient path toward more capable robot design.
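As a concrete illustration of the prior-based variant, a minimal sketch in which the offspring's surrogate is seeded with the parent's samples without re-evaluating them (the reevaluation variant would instead re-run `evaluate` on the parent's best samples first). The function names, the UCB acquisition, and the candidate-sampling scheme are assumptions of this sketch, not the paper's exact setup:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def optimize_controller(evaluate, parent_X, parent_y, low, high, budget=20):
    # Inherited prior: the parent's (controller parameters, fitness)
    # samples seed the offspring's surrogate without being re-evaluated.
    X, y = [np.asarray(x) for x in parent_X], list(parent_y)
    gp = GaussianProcessRegressor(normalize_y=True)
    rng = np.random.default_rng(0)
    for _ in range(budget):  # deliberately low per-morphology budget
        gp.fit(np.array(X), np.array(y))
        cand = rng.uniform(low, high, size=(256, len(low)))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(mu + sigma)]  # simple UCB acquisition
        X.append(x_next)
        y.append(evaluate(x_next))  # fitness measured on the offspring body
    return X[int(np.argmax(y))]
```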
☆ Generational Replacement and Learning for High-Performing and Diverse Populations in Evolvable Robots
Evolutionary Robotics offers the possibility to design robots to solve a specific task automatically by optimizing their morphology and control together. However, this co-optimization of body and control is challenging, because controllers need some time to adapt to the evolving morphology - which may make it difficult for new and promising designs to enter the evolving population. A solution to this is to add intra-life learning, defined as an additional controller optimization loop, to each individual in the evolving population. A related problem is the lack of diversity often seen in evolving populations as evolution narrows the search down to a few promising designs too quickly. This problem can be mitigated by implementing full generational replacement, where offspring robots replace the whole population. This solution for increasing diversity usually comes at the cost of lower performance compared to using elitism. In this work, we show that combining such generational replacement with intra-life learning can increase diversity while retaining performance. We also highlight the importance of performance metrics when studying learning in morphologically evolving robots, showing that evaluating according to function evaluations versus according to generations of evolution can give different conclusions.
☆ PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or a few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1 s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild. Project website at https://point-world.github.io/.
☆ Dual-Attention Heterogeneous GNN for Multi-robot Collaborative Area Search via Deep Reinforcement Learning
In multi-robot collaborative area search, a key challenge is to dynamically balance the two objectives of exploring unknown areas and covering specific targets to be rescued. Existing methods are often constrained by homogeneous graph representations, thus failing to model and balance these distinct tasks. To address this problem, we propose a Dual-Attention Heterogeneous Graph Neural Network (DA-HGNN) trained using deep reinforcement learning. Our method constructs a heterogeneous graph that incorporates three entity types: robot nodes, frontier nodes, and interesting nodes, as well as their historical states. The dual-attention mechanism comprises the relational-aware attention and type-aware attention operations. The relational-aware attention captures the complex spatio-temporal relationships among robots and candidate goals. Building on this relational-aware heterogeneous graph, the type-aware attention separately computes the relevance between robots and each goal type (frontiers vs. points of interest), thereby decoupling the exploration and coverage objectives within the unified task. Extensive experiments conducted in interactive 3D scenarios within the iGibson simulator, leveraging the Gibson and MatterPort3D datasets, validate the superior scalability and generalization capability of the proposed approach.
☆ Systematic Evaluation of Depth Backbones and Semantic Cues for Monocular Pseudo-LiDAR 3D Detection
Monocular 3D object detection offers a low-cost alternative to LiDAR, yet remains less accurate due to the difficulty of estimating metric depth from a single image. We systematically evaluate how depth backbones and feature engineering affect a monocular Pseudo-LiDAR pipeline on the KITTI validation split. Specifically, we compare NeWCRFs (supervised metric depth) against Depth Anything V2 Metric-Outdoor (Base) under an identical pseudo-LiDAR generation and PointRCNN detection protocol. NeWCRFs yields stronger downstream 3D detection, achieving 10.50% AP$_{3D}$ at IoU$=0.7$ on the Moderate split using grayscale intensity (Exp 2). We further test point-cloud augmentations using appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence). Contrary to the expectation that semantics would substantially close the gap, these features provide only marginal gains, and mask-based sampling can degrade performance by removing contextual geometry. Finally, we report a depth-accuracy-versus-distance diagnostic using ground-truth 2D boxes (including Ped/Cyc), highlighting that coarse depth correctness does not fully predict strict 3D IoU. Overall, under an off-the-shelf LiDAR detector, depth-backbone choice and geometric fidelity dominate performance, outweighing secondary feature injection.
comment: 7 pages, 4 figures
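For context, the pseudo-LiDAR generation step common to such pipelines is a pinhole back-projection of the estimated depth map; a minimal sketch (the paper's exact preprocessing may differ):

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a metric depth map into a 3D point cloud in the camera
    frame via the inverse pinhole model; the resulting points are then fed
    to an off-the-shelf LiDAR detector such as PointRCNN."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def add_intensity(points, gray):
    """Optionally append a per-point appearance cue (e.g. grayscale
    intensity, as in the paper's Exp 2) as a fourth channel."""
    return np.concatenate([points, gray.reshape(-1, 1)], axis=1)
```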
☆ Locomotion Beyond Feet
Most locomotion methods for humanoid robots focus on leg-based gaits, yet natural bipeds frequently rely on hands, knees, and elbows to establish additional contacts for stability and support in complex environments. This paper introduces Locomotion Beyond Feet, a comprehensive system for whole-body humanoid locomotion across extremely challenging terrains, including low-clearance spaces under chairs, knee-high walls, knee-high platforms, and steep ascending and descending stairs. Our approach addresses two key challenges: contact-rich motion planning and generalization across diverse terrains. To this end, we combine physics-grounded keyframe animation with reinforcement learning. Keyframes encode human knowledge of motor skills, are embodiment-specific, and can be readily validated in simulation or on hardware, while reinforcement learning transforms these references into robust, physically accurate motions. We further employ a hierarchical framework consisting of terrain-specific motion-tracking policies, failure recovery mechanisms, and a vision-based skill planner. Real-world experiments demonstrate that Locomotion Beyond Feet achieves robust whole-body locomotion and generalizes across obstacle sizes, obstacle instances, and terrain sequences.
comment: Project website: https://locomotion-beyond-feet.github.io/
☆ From Score to Sound: An End-to-End MIDI-to-Motion Pipeline for Robotic Cello Performance
Robot musicians require precise control to obtain proper note accuracy, sound quality, and musical expression. Performance of string instruments, such as violin and cello, presents a significant challenge due to the precise control required over bow angle and pressure to produce the desired sound. While prior robotic cellists focus on accurate bowing trajectories, these works often rely on expensive motion capture techniques, and fail to sightread music in a human-like way. We propose a novel end-to-end MIDI score to robotic motion pipeline which converts musical input directly into collision-aware bowing motions for a UR5e robot cellist. Through use of the Universal Robot Freedrive feature, our robotic musician can achieve human-like sound without the need for motion capture. Additionally, this work records live joint data via Real-Time Data Exchange (RTDE) as the robot plays, providing labeled robotic playing data from a collection of five standard pieces to the research community. To demonstrate the effectiveness of our method in comparison to human performers, we introduce the Musical Turing Test, in which a collection of 132 human participants evaluate our robot's performance against a human baseline. Human reference recordings are also released, enabling direct comparison for future studies. This evaluation technique establishes the first benchmark for robotic cello performance. Finally, we outline a residual reinforcement learning methodology to improve upon baseline robotic controls, highlighting future opportunities for improved string-crossing efficiency and sound quality.
☆ A Reinforcement Learning-Based Model for Mapping and Goal-Directed Navigation Using Multiscale Place Fields
Autonomous navigation in complex and partially observable environments remains a central challenge in robotics. Several bio-inspired models of mapping and navigation based on place cells in the mammalian hippocampus have been proposed. This paper introduces a new robust model that employs parallel layers of place fields at multiple spatial scales, a replay-based reward mechanism, and dynamic scale fusion. Simulations show that the model improves path efficiency and accelerates learning compared to single-scale baselines, highlighting the value of multiscale spatial representations for adaptive robot navigation.
comment: 11 pages, 8 figures. Submitted to IEEE Transactions on Cognitive and Developmental Systems
☆ A Vision-Language-Action Model with Visual Prompt for OFF-Road Autonomous Driving
Efficient trajectory planning in off-road terrains presents a formidable challenge for autonomous vehicles, often necessitating complex multi-step pipelines. However, traditional approaches exhibit limited adaptability in dynamic environments. To address these limitations, this paper proposes OFF-EMMA, a novel end-to-end multimodal framework designed to overcome the deficiencies of insufficient spatial perception and unstable reasoning in visual-language-action (VLA) models for off-road autonomous driving scenarios. The framework explicitly annotates input images through the design of a visual prompt block and introduces a chain-of-thought with self-consistency (COT-SC) reasoning strategy to enhance the accuracy and robustness of trajectory planning. The visual prompt block utilizes semantic segmentation masks as visual prompts, enhancing the spatial understanding ability of pre-trained visual-language models for complex terrains. The COT-SC strategy effectively mitigates the error impact of outliers on planning performance through a multi-path reasoning mechanism. Experimental results on the RELLIS-3D off-road dataset demonstrate that OFF-EMMA significantly outperforms existing methods, reducing the average L2 error of the Qwen backbone model by 13.3% and decreasing the failure rate from 16.52% to 6.56%.
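A minimal sketch of the self-consistency idea, assuming a `sample_trajectory` callable that runs one reasoning path of the planner; the paper's exact aggregation rule is not specified here, and the element-wise median is one common outlier-robust choice:

```python
import numpy as np

def self_consistent_trajectory(sample_trajectory, n_paths=5):
    """Run the VLA planner several times and aggregate the resulting
    trajectories. Taking the element-wise median across reasoning paths
    suppresses the influence of outlier rollouts on the final plan."""
    trajs = np.stack([sample_trajectory() for _ in range(n_paths)])  # (N, T, 2)
    return np.median(trajs, axis=0)                                  # (T, 2)
```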
☆ Transformer-based Multi-agent Reinforcement Learning for Separation Assurance in Structured and Unstructured Airspaces
Conventional optimization-based metering depends on strict adherence to precomputed schedules, which limits the flexibility required for the stochastic operations of Advanced Air Mobility (AAM). In contrast, multi-agent reinforcement learning (MARL) offers a decentralized, adaptive framework that can better handle uncertainty, as required for safe aircraft separation assurance. Despite this advantage, current MARL approaches often overfit to specific airspace structures, limiting their adaptability to new configurations. To improve generalization, we recast the MARL problem in a relative polar state space and train a transformer encoder model across diverse traffic patterns and intersection angles. The learned model provides speed advisories to resolve conflicts while maintaining aircraft near their desired cruising speeds. In our experiments, we evaluated encoder depths of 1, 2, and 3 layers in both structured and unstructured airspaces, and found that a single encoder configuration outperformed deeper variants, yielding near-zero near mid-air collision rates and shorter loss-of-separation infringements than the deeper configurations. Additionally, we showed that the same configuration outperforms a baseline model designed purely with attention. Together, our results suggest that the newly formulated state representation, novel design of neural network architecture, and proposed training strategy provide an adaptable and scalable decentralized solution for aircraft separation assurance in both structured and unstructured airspaces.
comment: 9 pages, 4 figures, 4 tables. Presented at SESAR Innovation Days 2025
☆ Enhanced-FQL($λ$), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay
This paper introduces a fuzzy reinforcement learning framework, Enhanced-FQL($λ$), that integrates novel Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with Fuzzified Bellman Equation (FBE) for continuous control tasks. The proposed approach employs an interpretable fuzzy rule base instead of complex neural architectures, while maintaining competitive performance through two key innovations: a fuzzified Bellman equation with eligibility traces for stable multi-step credit assignment, and a memory-efficient segment-based experience replay mechanism for enhanced sample efficiency. Theoretical analysis proves convergence of the proposed method under standard assumptions. Extensive evaluations in continuous control domains demonstrate that Enhanced-FQL($λ$) achieves superior sample efficiency and reduced variance compared to n-step fuzzy TD and fuzzy SARSA($λ$) baselines, while maintaining substantially lower computational complexity than deep RL alternatives such as DDPG. The framework's inherent interpretability, combined with its computational efficiency and theoretical convergence guarantees, makes it particularly suitable for safety-critical applications where transparency and resource constraints are essential.
comment: Submitted to ECC26 conference
☆ UNIC: Learning Unified Multimodal Extrinsic Contact Estimation
Contact-rich manipulation requires reliable estimation of extrinsic contacts, i.e., the interactions between a grasped object and its environment that provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.
☆ Autonomous Reasoning for Spacecraft Control: A Large Language Model Framework with Group Relative Policy Optimization
This paper presents a learning-based guidance-and-control approach that couples a reasoning-enabled Large Language Model (LLM) with Group Relative Policy Optimization (GRPO). A two-stage procedure consisting of Supervised Fine-Tuning (SFT) to learn formatting and control primitives, followed by GRPO for interaction-driven policy improvement, trains controllers for each environment. The framework is demonstrated on four control problems spanning a gradient of dynamical complexity, from canonical linear systems through nonlinear oscillatory dynamics to three-dimensional spacecraft attitude control with gyroscopic coupling and thrust constraints. Results demonstrate that an LLM with explicit reasoning, optimized via GRPO, can synthesize feasible stabilizing policies under consistent training settings across both linear and nonlinear systems. The two-stage training methodology enables models to generate control sequences while providing human-readable explanations of their decision-making process. This work establishes a foundation for applying GRPO-based reasoning to autonomous control systems, with potential applications in aerospace and other safety-critical domains.
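For reference, the group-relative advantage at the heart of GRPO can be sketched in a few lines; `rewards` holds the scalar returns of a group of rollouts sampled for the same prompt:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each rollout's return is standardized against
    its own group's mean and standard deviation, removing the need for a
    learned value critic when weighting the policy-gradient update."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```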
☆ Online Action-Stacking Improves Reinforcement Learning Performance for Air Traffic Control
We introduce online action-stacking, an inference-time wrapper for reinforcement learning policies that produces realistic air traffic control commands while allowing training on a much smaller discrete action space. Policies are trained with simple incremental heading or level adjustments, together with an action-damping penalty that reduces instruction frequency and leads agents to issue commands in short bursts. At inference, online action-stacking compiles these bursts of primitive actions into domain-appropriate compound clearances. Using Proximal Policy Optimisation and the BluebirdDT digital twin platform, we train agents to navigate aircraft along lateral routes, manage climb and descent to target flight levels, and perform two-aircraft collision avoidance under a minimum separation constraint. In our lateral navigation experiments, action stacking greatly reduces the number of issued instructions relative to a damped baseline and achieves comparable performance to a policy trained with a 37-dimensional action space, despite operating with only five actions. These results indicate that online action-stacking helps bridge a key gap between standard reinforcement learning formulations and operational ATC requirements, and provides a simple mechanism for scaling to more complex control scenarios.
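A minimal sketch of the inference-time wrapper, assuming the policy emits signed heading increments in degrees; the real system also handles level changes and hold actions:

```python
def compile_heading_clearance(burst):
    """Compile a burst of primitive heading actions (signed degrees, e.g.
    [+5, +5, +5]) into one compound ATC-style clearance. This runs only at
    inference; training still uses the small discrete action space."""
    delta = sum(burst)
    if delta == 0:
        return None  # increments cancelled out; issue no instruction
    side = "right" if delta > 0 else "left"
    return f"turn {side} by {abs(delta)} degrees"
```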
☆ Correcting Autonomous Driving Object Detection Misclassifications with Automated Commonsense Reasoning
Autonomous Vehicle (AV) technology has been heavily researched and sought after, yet there are no SAE Level 5 AVs available today in the marketplace. We contend that over-reliance on machine learning technology is the main reason. Use of automated commonsense reasoning technology, we believe, can help achieve SAE Level 5 autonomy. In this paper, we show how automated commonsense reasoning technology can be deployed in situations where there are not enough data samples available to train a deep learning-based AV model that can handle certain abnormal road scenarios. Specifically, we consider two situations where (i) a traffic signal is malfunctioning at an intersection and (ii) all the cars ahead are slowing down and steering away due to an unexpected obstruction (e.g., animals on the road). We show that in such situations, our commonsense reasoning-based solution accurately detects traffic light colors and obstacles not correctly captured by the AV's perception model. We also provide a pathway for efficiently invoking commonsense reasoning by measuring uncertainty in the computer vision model and using commonsense reasoning to handle uncertain scenarios. We describe our experiments conducted using the CARLA simulator and the results obtained. The main contribution of our research is to show that automated commonsense reasoning effectively corrects AV-based object detection misclassifications and that hybrid models provide an effective pathway to improving AV perception.
comment: In Proceedings ICLP 2025, arXiv:2601.00047
♻ ☆ Supercomputing for High-speed Avoidance and Reactive Planning in Robots
This paper presents SHARP (Supercomputing for High-speed Avoidance and Reactive Planning), a proof-of-concept study demonstrating how high-performance computing (HPC) can enable millisecond-scale responsiveness in robotic control. While modern robots face increasing demands for reactivity in human-robot shared workspaces, onboard processors are constrained by size, power, and cost. Offloading to HPC offers massive parallelism for trajectory planning, but its feasibility for real-time robotics remains uncertain due to network latency and jitter. We evaluate SHARP in a stress-test scenario where a 7-DOF manipulator must dodge high-speed foam projectiles. Using a hash-distributed multi-goal A* search implemented with MPI on both local and remote HPC clusters, the system achieves mean planning latencies of 22.9 ms (local) and 30.0 ms (remote, ~300 km away), with avoidance success rates of 84% and 88%, respectively. These results show that when round-trip latency remains within the tens-of-milliseconds regime, HPC-side computation is no longer the bottleneck, enabling avoidance well below human reaction times. The SHARP results motivate hybrid control architectures: low-level reflexes remain onboard for safety, while bursty, high-throughput planning tasks are offloaded to HPC for scalability. By reporting per-stage timing and success rates, this study provides a reproducible template for assessing real-time feasibility of HPC-driven robotics. Collectively, SHARP reframes HPC offloading as a viable pathway toward dependable, reactive robots in dynamic environments.
comment: Error in the graph size calculation, recalculated and resubmitted
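The load-balancing idea behind a hash-distributed multi-goal A* can be sketched with mpi4py; this skeleton omits the paper's message batching, termination detection, and goal handling, and the state type is hypothetical:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def owner(state):
    # Use a hash that is deterministic across processes (e.g. on tuples of
    # ints; Python salts str hashes per process) so every rank agrees on
    # the single rank that owns each state's open/closed bookkeeping.
    return hash(state) % size

def route_successors(successors):
    # HDA*-style: ship each expanded successor to its owning rank, which
    # performs duplicate detection and priority-queue insertion locally.
    for state, g_cost in successors:
        comm.send((state, g_cost), dest=owner(state), tag=0)
```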
♻ ☆ Uncertainty-Aware Robotic World Model Makes Offline Model-Based Reinforcement Learning Work on Real Robots
Reinforcement Learning (RL) has achieved impressive results in robotics, yet high-performing pipelines remain highly task-specific, with little reuse of prior data. Offline Model-based RL (MBRL) offers greater data efficiency by training policies entirely from existing datasets, but suffers from compounding errors and distribution shift in long-horizon rollouts. Although existing methods have shown success in controlled simulation benchmarks, robustly applying them to the noisy, biased, and partially observed datasets typical of real-world robotics remains challenging. We present a principled pipeline for making offline MBRL effective on physical robots. Our RWM-U extends autoregressive world models with epistemic uncertainty estimation, enabling temporally consistent multi-step rollouts with uncertainty effectively propagated over long horizons. We combine RWM-U with MOPO-PPO, which adapts uncertainty-penalized policy optimization to the stable, on-policy PPO framework for real-world control. We evaluate our approach on diverse manipulation and locomotion tasks in simulation and on a real quadruped and a humanoid, training policies entirely from offline datasets. The resulting policies consistently outperform model-free and uncertainty-unaware model-based baselines, and fusing real-world data in model learning further yields robust policies that surpass online model-free baselines trained solely in simulation.
♻ ☆ Real-time Velocity Profile Optimization for Time-Optimal Maneuvering with Generic Acceleration Constraints
The computation of time-optimal velocity profiles along prescribed paths, subject to generic acceleration constraints, is a crucial problem in robot trajectory planning, with particular relevance to autonomous racing. However, the existing methods either support arbitrary acceleration constraints at high computational cost or use conservative box constraints for computational efficiency. We propose FBGA, a new Forward-Backward algorithm with Generic Acceleration constraints, which achieves both high accuracy and low computation time. FBGA operates forward and backward passes to maximize the velocity profile in short, discretized path segments, while satisfying user-defined performance limits. Tested on five racetracks and two vehicle classes, FBGA handles complex, non-convex acceleration constraints with custom formulations. Its maneuvers and lap times closely match optimal control baselines (within 0.11%-0.36%), while being up to three orders of magnitude faster. FBGA maintains high accuracy even with coarse discretization, making it well-suited for online multi-query trajectory planning. Our open-source C++ implementation is available at: https://anonymous.4open.science/r/FB_public_RAL.
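For orientation, the classic two-pass profile under box acceleration limits, which FBGA generalizes by replacing the fixed `a_max`/`d_max` with user-defined, possibly non-convex per-segment limits; a minimal numpy sketch:

```python
import numpy as np

def forward_backward_profile(ds, v_cap, a_max, d_max):
    """Two-pass time-optimal velocity profile under box acceleration limits.

    ds[i] is the arc length between path samples i and i+1; v_cap is the
    pointwise speed limit (e.g. from curvature). The forward pass enforces
    the acceleration limit, the backward pass the deceleration limit, and
    the pointwise minimum of the two passes is the maximal feasible profile
    for this model."""
    v = v_cap.astype(float).copy()
    for i in range(len(v) - 1):                    # forward: accel-feasible
        v[i + 1] = min(v[i + 1], np.sqrt(v[i] ** 2 + 2.0 * a_max * ds[i]))
    for i in range(len(v) - 2, -1, -1):            # backward: decel-feasible
        v[i] = min(v[i], np.sqrt(v[i + 1] ** 2 + 2.0 * d_max * ds[i]))
    return v
```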
♻ ☆ The Combined Problem of Online Task Assignment and Lifelong Path Finding in Logistics Warehouses: Rule-Based Systems Matter
We study the combined problem of online task assignment and lifelong path finding, which is crucial for the logistics industries. However, most literature either (1) focuses on lifelong path finding assuming a given task assigner, or (2) studies the offline version of this problem where tasks are known in advance. We argue that, to maximize the system throughput, the online version that integrates these two components should be tackled directly. To this end, we introduce a formal framework of the combined problem and its solution concept. Then, we design a rule-based lifelong planner under a practical robot model that works well even in environments with severe local congestion. Upon that, we automate the search for the task assigner with respect to the underlying path planner. Simulation experiments conducted in warehouse scenarios at Meituan, one of the largest shopping platforms in China, demonstrate that (a) in terms of time efficiency, our system requires only 83.77% of the execution time needed for the currently deployed system at Meituan, outperforming other SOTA algorithms by 8.09%; (b) in terms of economic efficiency, ours can achieve the same throughput with only 60% of the agents currently in use. The code and demos are available at https://github.com/Fernadoo/Online-TAPF.
comment: In Proceedings ICLP 2025, arXiv:2601.00047
♻ ☆ Tackling the Kidnapped Robot Problem via Sparse Feasible Hypothesis Sampling and Reliable Batched Multi-Stage Inference
This paper addresses the Kidnapped Robot Problem (KRP), a core localization challenge: relocalizing a robot in a known map without a prior pose estimate, after localization loss or at SLAM initialization. For this purpose, a passive 2-D global relocalization framework is proposed. It estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, thereby enhancing the long-term autonomy of mobile robots. The proposed framework casts global relocalization as a non-convex problem and solves it via the multi-hypothesis scheme with batched multi-stage inference and early termination, balancing completeness and efficiency. The Rapidly-exploring Random Tree (RRT), under traversability constraints, asymptotically covers the reachable space to generate sparse, uniformly distributed feasible positional hypotheses, fundamentally reducing the sampling space. The hypotheses are preliminarily ordered by the proposed Scan Mean Absolute Difference (SMAD), a coarse beam-error level metric that facilitates the early termination by prioritizing high-likelihood candidates. The SMAD computation is optimized for non-panoramic scans. The Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for reliable orientation selection at hypothesized positions and accurate final pose evaluation to mitigate degradation in conventional likelihood-field metrics under translational uncertainty induced by sparse hypotheses, as well as non-panoramic LiDAR scans and environmental changes. Real-world experiments on a resource-constrained mobile robot with non-panoramic LiDAR scans show that the proposed framework achieves competitive performance in both global relocalization success rate and computational efficiency.
comment: 10 pages, 8 figures. This work has been submitted to the IEEE for possible publication
♻ ☆ Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. We introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a vision-language model pre-trained for Physical AI, with a diffusion-based trajectory decoder that generates dynamically feasible trajectories in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to enforce reasoning-action consistency and optimize reasoning quality. AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. Model weights are available at https://huggingface.co/nvidia/Alpamayo-R1-10B with inference code at https://github.com/NVlabs/alpamayo.
♻ ☆ From Human Intention to Action Prediction: Intention-Driven End-to-End Autonomous Driving
While end-to-end autonomous driving has achieved remarkable progress in geometric control, current systems remain constrained by a command-following paradigm that relies on simple navigational instructions. Transitioning to genuinely intelligent agents requires the capability to interpret and fulfill high-level, abstract human intentions. However, this advancement is hindered by the lack of dedicated benchmarks and semantic-aware evaluation metrics. In this paper, we formally define the task of Intention-Driven End-to-End Autonomous Driving and present Intention-Drive, a comprehensive benchmark designed to bridge this gap. We construct a large-scale dataset featuring complex natural language intentions paired with high-fidelity sensor data. To overcome the limitations of conventional trajectory-based metrics, we introduce the Imagined Future Alignment (IFA), a novel evaluation protocol leveraging generative world models to assess the semantic fulfillment of human goals beyond mere geometric accuracy. Furthermore, we explore the solution space by proposing two distinct paradigms: an end-to-end vision-language planner and a hierarchical agent-based framework. The experiments reveal a critical dichotomy where existing models exhibit satisfactory driving stability but struggle significantly with intention fulfillment. Notably, the proposed frameworks demonstrate superior alignment with human intentions.
♻ ☆ Physics-Driven Data Generation for Contact-Rich Manipulation via Trajectory Optimization
We present a low-cost data generation pipeline that integrates physics-based simulation, human demonstrations, and model-based planning to efficiently generate large-scale, high-quality datasets for contact-rich robotic manipulation tasks. Starting with a small number of embodiment-flexible human demonstrations collected in a virtual reality simulation environment, the pipeline refines these demonstrations using optimization-based kinematic retargeting and trajectory optimization to adapt them across various robot embodiments and physical parameters. This process yields a diverse, physically consistent dataset that enables cross-embodiment data transfer, and offers the potential to reuse legacy datasets collected under different hardware configurations or physical parameters. We validate the pipeline's effectiveness by training diffusion policies from the generated datasets for challenging contact-rich manipulation tasks across multiple robot embodiments, including a floating Allegro hand and bimanual robot arms. The trained policies are deployed zero-shot on hardware for bimanual iiwa arms, achieving high success rates with minimal human input. Project website: https://lujieyang.github.io/physicsgen/.
♻ ☆ OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation
Embodied navigation presents a core challenge for intelligent robots, requiring the comprehension of visual environments, natural language instructions, and autonomous exploration. Existing models often fall short in offering a unified solution across diverse navigation paradigms, resulting in low success rates and limited generalization. We introduce OmniNav, a unified framework addressing instruct-goal, object-goal, point-goal navigation, and frontier-based exploration within a single architecture. Our approach features a lightweight, low-latency policy that accurately predicts continuous-space waypoints (coordinates and orientations). This policy surpasses action-chunk methods in precision and supports real-world deployment at control frequencies up to 5 Hz. Architecturally, OmniNav employs a fast-slow system design: a fast module generates waypoints using short-horizon visual context and subtasks, while a slow module performs deliberative planning with long-horizon observations and candidate frontiers to select subsequent subgoals and subtasks. This collaboration enhances path efficiency and maintains trajectory coherence, particularly in exploration and memory-intensive scenarios. Crucially, we identify that the primary bottleneck isn't merely navigation policy learning, but a robust understanding of general instructions and objects. To boost generalization, OmniNav integrates large-scale, general-purpose training datasets, including those for image captioning and visual recognition, into a joint multi-task regimen. This significantly improves success rates and robustness. Extensive experiments confirm OmniNav's state-of-the-art performance across various navigation benchmarks, with real-world deployment further validating its efficacy. OmniNav provides practical insights for embodied navigation, charting a scalable path towards versatile, highly generalizable robotic intelligence.
♻ ☆ Real-time Sampling-based Model Predictive Control based on Reverse Kullback-Leibler Divergence and Its Adaptive Acceleration
Sampling-based model predictive control (MPC) has the potential for use in a wide variety of robotic systems. However, its unstable updates and poor convergence render it unsuitable for real-time control of robotic systems. This study addresses this challenge with a novel approach based on reverse Kullback-Leibler divergence, which has a mode-seeking property and is likely to find one of the locally optimal solutions early. Using this approach, a weighted maximum likelihood estimation with positive and negative weights is obtained and solved using the mirror descent (MD) algorithm. Negative weights eliminate unnecessary actions, but a practical implementation needs to be designed to avoid interference between the positive and negative updates based on rejection sampling. In addition, Nesterov's acceleration method for the proposed MD is modified with a heuristic step size that adapts to the noise estimated in the update amounts. Real-time simulations show that the proposed method can solve a wider variety of tasks statistically than the conventional method. In addition, higher degrees-of-freedom tasks can be solved by the improved acceleration even with a CPU only. The real-world applicability of the proposed method is also demonstrated by optimizing the operability in a variable impedance control of a force-driven mobile robot. https://youtu.be/D8bFMzct1XM
comment: 18 pages, 16 figures
♻ ☆ Adaptive Anomaly Recovery for Telemanipulation: A Diffusion Model Approach to Vision-Based Tracking
Dexterous telemanipulation critically relies on the continuous and stable tracking of the human operator's commands to ensure robust operation. Vision-based tracking methods are widely used but have low stability due to anomalies such as occlusions, inadequate lighting, and loss of sight. Traditional filtering, regression, and interpolation methods are commonly used to compensate for explicit information such as angles and positions. These approaches are restricted to low-dimensional data and often result in information loss compared to the original high-dimensional image and video data. Recent advances in diffusion-based approaches, which can operate on high-dimensional data, have achieved remarkable success in video reconstruction and generation. However, these methods have not been fully explored in continuous control tasks in robotics. This work introduces the Diffusion-Enhanced Telemanipulation (DET) framework, which incorporates the Frame-Difference Detection (FDD) technique to identify and segment anomalies in video streams. These anomalous clips are replaced after reconstruction using diffusion models, ensuring robust telemanipulation performance under challenging visual conditions. We validated this approach in various anomaly scenarios and compared it with the baseline methods. Experiments show that DET achieves an average RMSE reduction of 17.2% compared to the cubic spline and 51.1% compared to FFT-based interpolation for different occlusion durations.
comment: This work has been submitted to the IEEE for possible publication
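A minimal sketch of the frame-differencing idea behind FDD, assuming raw frames as numpy arrays; the threshold choice and the exact statistic are assumptions here, not the paper's. Flagged segments would then be replaced by diffusion-model reconstructions:

```python
import numpy as np

def detect_anomalous_frames(frames, threshold):
    """Flag frames whose mean absolute difference from the previous frame
    is abnormally large, e.g. due to sudden occlusion or loss of sight."""
    flags = [False]  # first frame has no predecessor to compare against
    for prev, cur in zip(frames[:-1], frames[1:]):
        diff = np.mean(np.abs(cur.astype(float) - prev.astype(float)))
        flags.append(diff > threshold)
    return flags
```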
♻ ☆ Nonholonomic Robot Parking by Feedback -- Part I: Modular Strict CLF Designs
It has been known in the robotics literature since about 1995 that, in polar coordinates, the nonholonomic unicycle is asymptotically stabilizable by smooth feedback, even globally. We introduce a modular design framework that selects the forward velocity to decouple the radial coordinate, allowing the steering subsystem to be stabilized independently. Within this structure, we develop families of feedback laws using passivity, backstepping, and integrator forwarding. Each law is accompanied by a strict control Lyapunov function, including barrier variants that enforce angular constraints. These strict CLFs provide constructive class KL convergence estimates and enable eigenvalue assignment at the target equilibrium. The framework generalizes and extends prior modular and nonmodular approaches, while preparing the ground for inverse optimal and adaptive redesigns in the sequel paper.
comment: arXiv admin note: text overlap with arXiv:2509.25575
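For context, one common polar-coordinate unicycle model from this literature (signs and symbols vary between papers and may differ from this one's):

```latex
\dot{\rho} = -v\cos\alpha, \qquad
\dot{\alpha} = -\omega + \frac{v}{\rho}\sin\alpha, \qquad
\dot{\phi} = \frac{v}{\rho}\sin\alpha
```

Here $\rho$ is the distance to the goal, $\alpha$ the heading error relative to the line of sight, $\phi$ the line-of-sight angle, and $(v, \omega)$ the forward and angular velocity inputs. Choosing $v$ proportional to $\rho\cos\alpha$ decouples the radial dynamics, which is the structure the modular designs above exploit.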
♻ ☆ Deep Reinforcement Learning for Bipedal Locomotion: A Brief Survey
Bipedal robots are gaining global recognition due to their potential applications and advancements in artificial intelligence, particularly through Deep Reinforcement Learning (DRL). While DRL has significantly advanced bipedal locomotion, the development of a unified framework capable of handling a wide range of tasks remains an ongoing challenge. This survey systematically categorises, compares, and analyses existing DRL frameworks for bipedal locomotion, organising them into end-to-end and hierarchical control schemes. End-to-end frameworks are evaluated based on their learning approaches, while hierarchical frameworks are examined in terms of layered structures that integrate learning-based or traditional model-based methods. We provide a detailed evaluation of the composition, strengths, limitations, and capabilities of each framework. Additionally, this survey identifies key research gaps and proposes future directions aimed at creating a more integrated and efficient framework for bipedal locomotion, with wide-ranging applications in real-world environments.
comment: Published in Artificial Intelligence Review
♻ ☆ Enhancing Sampling-based Planning with a Library of Paths
Path planning for 3D solid objects is a challenging problem, requiring a search in a six-dimensional configuration space, which is, nevertheless, essential in many robotic applications such as bin-picking and assembly. The commonly used sampling-based planners, such as Rapidly-exploring Random Trees, struggle with narrow passages where the sampling probability is low, increasing the time needed to find a solution. In scenarios like robotic bin-picking, various objects must be transported through the same environment. However, traditional planners start from scratch each time, losing valuable information gained during the planning process. We address this by using a library of past solutions, allowing the reuse of previous experiences even when planning for a new, previously unseen object. Paths for a set of objects are stored, and when planning for a new object, we find the most similar one in the library and use its paths as approximate solutions, adjusting for possible mutual transformations. The configuration space is then sampled along the approximate paths. Our method is tested in various narrow passage scenarios and compared with state-of-the-art methods from the OMPL library. Results show significant speedups for our method (up to an 85% decrease in required time), which often finds a solution in cases where the other planners fail. Our implementation of the proposed method is released as an open-source package.
♻ ☆ Generative Models From and For Sampling-Based MPC: A Bootstrapped Approach For Adaptive Contact-Rich Manipulation
We present a generative predictive control (GPC) framework that amortizes sampling-based Model Predictive Control (SPC) by bootstrapping it with conditional flow-matching models trained on SPC control sequences collected in simulation. Unlike prior work relying on iterative refinement or gradient-based solvers, we show that meaningful proposal distributions can be learned directly from noisy SPC data, enabling more efficient and informed sampling during online planning. We further demonstrate, for the first time, the application of this approach to real-world contact-rich loco-manipulation with a quadruped robot. Extensive experiments in simulation and on hardware show that our method improves sample efficiency, reduces planning horizon requirements, and generalizes robustly across task variations.
comment: 9 pages, 4 figures
Robotics 50
☆ FIRE-VLM: A Vision-Language-Driven Reinforcement Learning Framework for UAV Wildfire Tracking in a Physics-Grounded Fire Digital Twin
Wildfire monitoring demands autonomous systems capable of reasoning under extreme visual degradation, rapidly evolving physical dynamics, and scarce real-world training data. Existing UAV navigation approaches rely on simplified simulators and supervised perception pipelines, and lack embodied agents interacting with physically realistic fire environments. We introduce FIRE-VLM, the first end-to-end vision-language model (VLM) guided reinforcement learning (RL) framework trained entirely within a high-fidelity, physics-grounded wildfire digital twin. Built from USGS Digital Elevation Model (DEM) terrain, LANDFIRE fuel inventories, and semi-physical fire-spread solvers, this twin captures terrain-induced runs, wind-driven acceleration, smoke plume occlusion, and dynamic fuel consumption. Within this environment, a PPO agent with dual-view UAV sensing is guided by a CLIP-style VLM. Wildfire-specific semantic alignment scores, derived from a single prompt describing active fire and smoke plumes, are integrated as potential-based reward shaping signals. Our contributions are: (1) a GIS-to-simulation pipeline for constructing wildfire digital twins; (2) a VLM-guided RL agent for UAV firefront tracking; and (3) a wildfire-aware reward design that combines physical terms with VLM semantics. Across five digital-twin evaluation tasks, our VLM-guided policy reduces time-to-detection by up to a factor of six, increases time-in-FOV, and is, to our knowledge, the first RL-based UAV wildfire monitoring system demonstrated in kilometer-scale, physics-grounded digital-twin fires.
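A minimal sketch of the potential-based shaping term, where `phi_prev` and `phi_cur` are CLIP-style image-prompt alignment scores of successive UAV views (e.g. against a prompt describing active fire and smoke plumes); the shaping weight `beta` is an assumption of this sketch:

```python
def shaped_reward(r_env, phi_prev, phi_cur, gamma=0.99, beta=0.1):
    """Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s), added to
    the environment reward. This form preserves the optimal policy of the
    underlying MDP while densifying sparse wildfire-tracking rewards."""
    return r_env + beta * (gamma * phi_cur - phi_prev)
```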
☆ Cost-Effective Radar Sensors for Field-Based Water Level Monitoring with Sub-Centimeter Accuracy
Water level monitoring is critical for flood management, water resource allocation, and ecological assessment, yet traditional methods remain costly and limited in coverage. This work explores radar-based sensing as a low-cost alternative for water level estimation, leveraging its non-contact nature and robustness to environmental conditions. Commercial radar sensors are evaluated in real-world field tests, applying statistical filtering techniques to improve accuracy. Results show that a single radar sensor can achieve centimeter-scale precision with minimal calibration, making it a practical solution for autonomous water monitoring using drones and robotic platforms.
comment: 10 pages, 6 figures. Preliminary results presented as a poster at an academic conference
☆ Towards Zero-Knowledge Task Planning via a Language-based Approach
In this work, we introduce and formalize the Zero-Knowledge Task Planning (ZKTP) problem, i.e., formulating a sequence of actions to achieve some goal without task-specific knowledge. Additionally, we present a first investigation and approach for ZKTP that leverages a large language model (LLM) to decompose natural language instructions into subtasks and generate behavior trees (BTs) for execution. If errors arise during task execution, the approach also uses an LLM to adjust the BTs on-the-fly in a refinement loop. Experimental validation in the AI2-THOR simulator demonstrates our approach's effectiveness in improving overall task performance compared to alternative approaches that leverage task-specific knowledge. Our work demonstrates the potential of LLMs to effectively address several aspects of the ZKTP problem, providing a robust framework for automated behavior generation with no task-specific setup.
☆ Modeling and Control for UAV with Off-center Slung Load
The unmanned aerial vehicle (UAV) with slung load system is a classic air transportation system. In practical applications, the suspension point of the slung load does not always align with the center of mass (CoM) of the UAV due to mission requirements or mechanical interference. This offset creates coupling in the system's nonlinear dynamics, which leads to a complicated motion control problem. In existing research, modeling of the system is performed about the UAV's CoM. In this work we use the point of suspension instead. Based on the new model, a cascade control strategy is developed. In the middle-loop controller, the acceleration of the suspension point is used to regulate the swing angle of the slung load without the need for considering the coupling between the slung load and the UAV. Using the off-center reference frame, an inner-loop controller is designed to track the UAV's attitude without needing to simplify the coupling effects. We prove local exponential stability of the closed-loop system using a Lyapunov approach. Finally, simulations and experiments are conducted to validate the proposed control system.
☆ Lunar Rover Cargo Transport: Mission Concept and Field Test
In future operations on the lunar surface, automated vehicles will be required to transport cargo between known locations. Such vehicles must be able to navigate precisely in safe regions to avoid natural hazards, human-constructed infrastructure, and dangerous dark shadows. Rovers must be able to park their cargo autonomously within a small tolerance to achieve a successful pickup and delivery. In this field test, Lidar Teach and Repeat provides an ideal autonomy solution for transporting cargo in this way. A one-tonne path-to-flight rover was driven in a semi-autonomous remote-control mode to create a network of safe paths. Once the route was taught, the rover immediately repeated the entire network of paths autonomously while carrying cargo. The closed-loop performance is accurate enough to align the vehicle to the cargo and pick it up. This field report describes a two-week deployment at the Canadian Space Agency's Analogue Terrain, culminating in a simulated lunar operation to evaluate the system's capabilities. Successful cargo collection and delivery were demonstrated in harsh environmental conditions.
comment: 15 Pages, 13 Figures, to appear in IEEE Transactions on Field Robotics
☆ Revisiting Continuous-Time Trajectory Estimation via Gaussian Processes and the Magnus Expansion
Continuous-time state estimation has been shown to be an effective means of (i) handling asynchronous and high-rate measurements, (ii) introducing smoothness to the estimate, (iii) post hoc querying the estimate at times other than those of the measurements, and (iv) addressing certain observability issues related to scanning-while-moving sensors. A popular means of representing the trajectory in continuous time is via a Gaussian process (GP) prior, with the prior's mean and covariance functions generated by a linear time-varying (LTV) stochastic differential equation (SDE) driven by white noise. When the state comprises elements of Lie groups, previous works have resorted to a patchwork of local GPs each with a linear time-invariant SDE kernel, which while effective in practice, lacks theoretical elegance. Here we revisit the full LTV GP approach to continuous-time trajectory estimation, deriving a global GP prior on Lie groups via the Magnus expansion, which offers a more elegant and general solution. We provide a numerical comparison between the two approaches and discuss their relative merits.
comment: 21 pages, 12 figures
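For reference, the Magnus expansion expresses the transition matrix of the LTV system $\dot{x} = A(t)x$ as a single matrix exponential (a standard result; the paper's notation may differ):

```latex
x(t) = \exp\!\big(\Omega(t)\big)\,x(0), \qquad
\Omega(t) = \int_0^t A(\tau_1)\,d\tau_1
          + \frac{1}{2}\int_0^t\!\!\int_0^{\tau_1}\big[A(\tau_1),A(\tau_2)\big]\,d\tau_2\,d\tau_1 + \cdots
```

Here $[A, B] = AB - BA$ is the matrix commutator; when $A(t)$ commutes with its own integral, all higher-order terms vanish and the familiar exponential of the integrated matrix is recovered, which is the structure that makes a global GP prior on Lie groups tractable.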
☆ Nonlinear Spectral Modeling and Control of Soft-Robotic Muscles from Data
Artificial muscles are essential for compliant musculoskeletal robotics but complicate control due to nonlinear multiphysics dynamics. Hydraulically amplified electrostatic (HASEL) actuators, a class of soft artificial muscles, offer high performance but exhibit memory effects and hysteresis. Here we present a data-driven reduction and control strategy grounded in spectral submanifold (SSM) theory. In the adiabatic regime, where inputs vary slowly relative to intrinsic transients, trajectories rapidly converge to a low-dimensional slow manifold. We learn an explicit input-to-output map on this manifold from forced-response trajectories alone, avoiding decay experiments that can trigger hysteresis. We deploy the SSM-based model for real-time control of an antagonistic HASEL-clutch joint. This approach yields a substantial reduction in tracking error compared to feedback-only and feedforward-only baselines under identical settings. This record-and-control workflow enables rapid characterization and high-performance control of soft muscles and muscle-driven joints without detailed physics-based modeling.
☆ A High-Fidelity Digital Twin for Robotic Manipulation Based on 3D Gaussian Splatting
Developing high-fidelity, interactive digital twins is crucial for enabling closed-loop motion planning and reliable real-world robot execution, which are essential to advancing sim-to-real transfer. However, existing approaches often suffer from slow reconstruction, limited visual fidelity, and difficulties in converting photorealistic models into planning-ready collision geometry. We present a practical framework that constructs high-quality digital twins within minutes from sparse RGB inputs. Our system employs 3D Gaussian Splatting (3DGS) for fast, photorealistic reconstruction as a unified scene representation. We enhance 3DGS with visibility-aware semantic fusion for accurate 3D labelling and introduce an efficient, filter-based geometry conversion method to produce collision-ready models seamlessly integrated with a Unity-ROS2-MoveIt physics engine. In experiments with a Franka Emika Panda robot performing pick-and-place tasks, we demonstrate that this enhanced geometric accuracy effectively supports robust manipulation in real-world trials. These results demonstrate that 3DGS-based digital twins, enriched with semantic and geometric consistency, offer a fast, reliable, and scalable path from perception to manipulation in unstructured environments.
comment: Under review at the Journal of Robot Learning
☆ Limited Linguistic Diversity in Embodied AI Datasets
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
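A minimal sketch of the kind of lexical statistics such an audit computes; the metric names and definitions here are illustrative assumptions, not the paper's exact measures.

```python
from collections import Counter

def instruction_diversity(instructions):
    """Descriptive language statistics: lexical variety (type-token ratio),
    exact-duplicate rate, and template-likeness via the share of the
    single most common instruction."""
    tokens = [tok for s in instructions for tok in s.lower().split()]
    counts = Counter(s.strip().lower() for s in instructions)
    n = len(instructions)
    return {
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
        "duplicate_rate": 1.0 - len(counts) / max(n, 1),
        "top_instruction_share": counts.most_common(1)[0][1] / n if n else 0.0,
    }

print(instruction_diversity([
    "pick up the red block", "pick up the blue block", "pick up the red block",
]))
```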
☆ Dual-quaternion learning control for autonomous vehicle trajectory tracking with safety guarantees
We propose a learning-based trajectory tracking controller for autonomous robotic platforms whose motion can be described kinematically on $\mathrm{SE}(3)$. The controller is formulated in the dual quaternion framework and operates at the velocity level, assuming direct command of angular and linear velocities, as is standard in many aerial vehicles and omnidirectional mobile robots. Gaussian Process (GP) regression is integrated into a geometric feedback law to learn and compensate online for unknown, state-dependent disturbances and modeling imperfections affecting both attitude and position, while preserving the algebraic structure and coupling properties inherent to rigid-body motion. The proposed approach does not rely on explicit parametric models of the unknown effects, making it well-suited for robotic systems subject to sensor-induced disturbances, unmodeled actuation couplings, and environmental uncertainties. A Lyapunov-based analysis establishes probabilistic ultimate boundedness of the pose tracking error under bounded GP uncertainty, providing formal stability guarantees for the learning-based controller. Simulation results demonstrate accurate and smooth trajectory tracking in the presence of realistic, localized disturbances, including correlated rotational and translational effects arising from magnetometer perturbations. These results illustrate the potential of combining geometric modeling and probabilistic learning to achieve robust, data-efficient pose control for autonomous robotic systems.
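For readers unfamiliar with the dual quaternion framework used above, the sketch below shows the standard (real, dual) pair representation and its product, which composes SE(3) poses while coupling rotation and translation; this is generic background, not the paper's controller.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) convention."""
    w1, v1 = a[0], a[1:]
    w2, v2 = b[0], b[1:]
    return np.concatenate([[w1 * w2 - v1 @ v2],
                           w1 * v2 + w2 * v1 + np.cross(v1, v2)])

def dq_mul(dq1, dq2):
    """Dual-quaternion product (r1 + eps*d1)(r2 + eps*d2):
    real part r1*r2, dual part r1*d2 + d1*r2 (since eps^2 = 0)."""
    r1, d1 = dq1
    r2, d2 = dq2
    return qmul(r1, r2), qmul(r1, d2) + qmul(d1, r2)

def pose_to_dq(q, t):
    """Unit dual quaternion from rotation quaternion q and translation t:
    real = q, dual = 0.5 * t_quat * q."""
    t_quat = np.concatenate([[0.0], t])
    return q, 0.5 * qmul(t_quat, q)

# Compose a 90-degree yaw with a 1 m x-translation.
q_yaw = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
dq = dq_mul(pose_to_dq(q_yaw, np.zeros(3)),
            pose_to_dq(np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])))
print(dq)
```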
☆ HEXAR: a Hierarchical Explainability Architecture for Robots
As robotic systems become increasingly complex, the need for explainable decision-making becomes critical. Existing explainability approaches in robotics typically either focus on individual modules, which can be difficult to query from the perspective of high-level behaviour, or employ monolithic approaches, which do not exploit the modularity of robotic architectures. We present HEXAR (Hierarchical EXplainability Architecture for Robots), a novel framework that provides a plug-in, hierarchical approach to generate explanations about robotic systems. HEXAR consists of specialised component explainers using diverse explanation techniques (e.g., LLM-based reasoning, causal models, feature importance) tailored to specific robot modules, orchestrated by an explainer selector that chooses the most appropriate one for a given query. We implement and evaluate HEXAR on a TIAGo robot performing assistive tasks in a home environment, comparing it against end-to-end and aggregated baseline approaches across 180 scenario-query variations. We observe that HEXAR significantly outperforms baselines in root cause identification, incorrect information exclusion, and runtime, offering a promising direction for transparent autonomous systems.
comment: 8 pages, 6 figures
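To make the "explainer selector" idea concrete, here is a toy dispatch sketch: component explainers score their relevance to a query and the best one answers. The keyword scoring and module names are illustrative assumptions, not HEXAR's implementation.

```python
class ComponentExplainer:
    def __init__(self, name, keywords, explain_fn):
        self.name, self.keywords, self.explain_fn = name, keywords, explain_fn

    def relevance(self, query):
        # Crude relevance score: count of matched keywords in the query.
        q = query.lower()
        return sum(kw in q for kw in self.keywords)

class ExplainerSelector:
    def __init__(self, explainers):
        self.explainers = explainers

    def explain(self, query):
        # Route the query to the most relevant component explainer.
        best = max(self.explainers, key=lambda e: e.relevance(query))
        return best.name, best.explain_fn(query)

selector = ExplainerSelector([
    ComponentExplainer("navigation", ["path", "goal", "move"],
                       lambda q: "replanned around a detected obstacle"),
    ComponentExplainer("perception", ["see", "detect", "object"],
                       lambda q: "object confidence was below threshold"),
])
print(selector.explain("why did you move around the table?"))
```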
☆ A Fast Semidefinite Convex Relaxation for Optimal Control Problems With Spatio-Temporal Constraints
Solving optimal control problems (OCPs) for autonomous agents operating under spatial and temporal constraints quickly and accurately is essential in applications ranging from eco-driving of autonomous vehicles to quadrotor navigation. However, the nonlinear programs approximating the OCPs are inherently nonconvex due to the coupling between the dynamics and the event timing, and they are therefore challenging to solve. Most approaches address this challenge by predefining waypoint times or by resorting to nonconvex trajectory optimization, which simplifies the problem but often yields suboptimal solutions. To significantly improve the numerical properties, we propose a formulation with a time-scaling direct multiple shooting scheme that partitions the prediction horizon into segments aligned with characteristic time constraints. Moreover, we develop a fast semidefinite-programming-based convex relaxation that exploits the sparsity pattern of the lifted formulation. Comprehensive simulation studies demonstrate the solution optimality and computational efficiency. Furthermore, real-world experiments on a quadrotor waypoint flight task with constrained open time windows validate the practical applicability of the approach in complex environments.
comment: 9 pages, 6 figures
☆ SOP: A Scalable Online Post-Training System for Vision-Language-Action Models
Vision-language-action (VLA) models achieve strong generalization through large-scale pre-training, but real-world deployment requires expert-level task proficiency in addition to broad generality. Existing post-training approaches for VLA models are typically offline, single-robot, or task-specific, limiting effective on-policy adaptation and scalable learning from real-world interaction. We introduce a Scalable Online Post-training (SOP) system that enables online, distributed, multi-task post-training of generalist VLA models directly in the physical world. SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner, and asynchronously receives updated policies. This design supports prompt on-policy correction, scales experience collection through parallel deployment, and preserves generality during adaptation. SOP is agnostic to the choice of post-training algorithm; we instantiate it with both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP). Across a range of real-world manipulation tasks including cloth folding, box assembly, and grocery restocking, we show that SOP substantially improves the performance of large pretrained VLA models while maintaining a single shared policy across tasks. Effective post-training can be achieved within hours of real-world interaction, and performance scales near-linearly with the number of robots in the fleet. These results suggest that tightly coupling online learning with fleet-scale deployment is instrumental to enabling efficient, reliable, and scalable post-training of generalist robot policies in the physical world.
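The closed-loop fleet/learner coupling described above can be pictured as a producer-consumer loop: robots stream experience to a central learner and asynchronously pull updated policy versions. The toy sketch below illustrates only that data flow; all names and the threading setup are illustrative assumptions, not the SOP system's architecture.

```python
import queue
import threading
import time

experience = queue.Queue()
policy_version = 0
lock = threading.Lock()

def robot(robot_id):
    for step in range(3):
        with lock:
            v = policy_version                      # pull the latest policy
        experience.put((robot_id, v, f"episode-{step}"))  # stream on-policy data
        time.sleep(0.01)

def learner():
    global policy_version
    for _ in range(6):
        rid, v, ep = experience.get()               # consume fleet experience
        with lock:
            policy_version += 1                     # push an updated policy

threads = [threading.Thread(target=robot, args=(i,)) for i in range(2)]
threads.append(threading.Thread(target=learner))
for t in threads: t.start()
for t in threads: t.join()
print("final policy version:", policy_version)
```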
☆ PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms
A fundamental requirement for full autonomy is the ability to sustain accurate navigation in the absence of external data, such as GNSS signals or visual information. In these challenging environments, the platform must rely exclusively on inertial sensors, leading to pure inertial navigation. However, the inherent noise and other error terms of the inertial sensors in such real-world scenarios will cause the navigation solution to drift over time. Although conventional deep-learning models have emerged as a possible approach to inertial navigation, they are inherently black-box in nature. Furthermore, they struggle to learn effectively with limited supervised sensor data and often fail to preserve physical principles. To address these limitations, we propose PiDR, a physics-informed inertial dead-reckoning framework for autonomous platforms operating under pure inertial navigation. PiDR offers transparency by explicitly integrating inertial navigation principles into the network training process through a physics-informed residual component. PiDR plays a crucial role in mitigating abrupt trajectory deviations even under limited or sparse supervision. We evaluated PiDR on real-world datasets collected by a mobile robot and an autonomous underwater vehicle. We obtained more than 29% positioning improvement on both datasets, demonstrating the ability of PiDR to generalize across different platforms operating in various environments and dynamics. Thus, PiDR offers a robust, lightweight, yet effective architecture and can be deployed on resource-constrained platforms, enabling real-time pure inertial navigation in adverse scenarios.
comment: 11 pages and 7 figures
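One common way to realize a physics-informed residual objective is to add a penalty for violating a known kinematic relation to the usual supervised loss. The sketch below uses a simple position-velocity consistency term; the specific terms and weights are assumptions for illustration, not PiDR's exact loss.

```python
import numpy as np

def physics_informed_loss(pos_pred, pos_true, vel_pred, dt, w_phys=0.1):
    """Supervised data term plus a residual penalizing violations of the
    kinematic relation p[k+1] = p[k] + v[k] * dt over a trajectory.
    pos_pred, pos_true, vel_pred: arrays of shape (T, dim)."""
    data_loss = np.mean((pos_pred - pos_true) ** 2)
    kinematic_residual = pos_pred[1:] - (pos_pred[:-1] + vel_pred[:-1] * dt)
    physics_loss = np.mean(kinematic_residual ** 2)
    return data_loss + w_phys * physics_loss
```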
☆ Validating Generalist Robots with Situation Calculus and STL Falsification
Generalist robots are becoming a reality, capable of interpreting natural language instructions and executing diverse operations. However, their validation remains challenging because each task induces its own operational context and correctness specification, exceeding the assumptions of traditional validation methods. We propose a two-layer validation framework that combines abstract reasoning with concrete system falsification. At the abstract layer, situation calculus models the world and derives weakest preconditions, enabling constraint-aware combinatorial testing to systematically generate diverse, semantically valid world-task configurations with controllable coverage strength. At the concrete layer, these configurations are instantiated for simulation-based falsification with STL monitoring. Experiments on tabletop manipulation tasks show that our framework effectively uncovers failure cases in the NVIDIA GR00T controller, demonstrating its promise for validating general-purpose robot autonomy.
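For context on the STL monitoring used in the concrete layer: an STL monitor computes a signed robustness value over a simulated trace, and falsification searches for configurations driving it negative. The sketch below shows the standard robustness semantics for "always" and "eventually" on a sampled signal; it is generic background, not the paper's toolchain.

```python
def robustness_always(signal, predicate):
    """G phi: worst-case margin of the predicate over the trace."""
    return min(predicate(x) for x in signal)

def robustness_eventually(signal, predicate):
    """F phi: best-case margin of the predicate over the trace."""
    return max(predicate(x) for x in signal)

# e.g., require gripper height stays above 0.02 m; negative => falsified.
trace = [0.10, 0.05, 0.01, 0.04]
print(robustness_always(trace, lambda h: h - 0.02))   # -0.01 -> violation found
```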
☆ A Bi-directional Adaptive Framework for Agile UAV Landing
Autonomous landing on mobile platforms is crucial for extending quadcopter operational flexibility, yet conventional methods are often too inefficient for highly dynamic scenarios. The core limitation lies in the prevalent ``track-then-descend'' paradigm, which treats the platform as a passive target and forces the quadcopter to perform complex, sequential maneuvers. This paper challenges that paradigm by introducing a bi-directional cooperative landing framework that redefines the roles of the vehicle and the platform. The essential innovation is transforming the problem from a single-agent tracking challenge into a coupled system optimization. Our key insight is that the mobile platform is not merely a target, but an active agent in the landing process. It proactively tilts its surface to create an optimal, stable terminal attitude for the approaching quadcopter. This active cooperation fundamentally breaks the sequential model by parallelizing the alignment and descent phases. Concurrently, the quadcopter's planning pipeline focuses on generating a time-optimal and dynamically feasible trajectory that minimizes energy consumption. This bi-directional coordination allows the system to execute the recovery in an agile manner, characterized by aggressive trajectory tracking and rapid state synchronization within transient windows. The framework, validated in dynamic scenarios, significantly improves the efficiency, precision, and robustness of autonomous quadrotor recovery in complex and time-constrained missions.
comment: This work has been submitted to the IEEE Robotics and Automation Letters (RA-L) for possible publication
☆ Learning to Act Robustly with View-Invariant Latent Actions
Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
comment: Website: https://joon-stack.github.io/VILA/
☆ Parameter-Robust MPPI for Safe Online Learning of Unknown Parameters
Robots deployed in dynamic environments must remain safe even when key physical parameters are uncertain or change over time. We propose Parameter-Robust Model Predictive Path Integral (PRMPPI) control, a framework that integrates online parameter learning with probabilistic safety constraints. PRMPPI maintains a particle-based belief over parameters via Stein Variational Gradient Descent, evaluates safety constraints using Conformal Prediction, and optimizes both a nominal performance-driven and a safety-focused backup trajectory in parallel. This yields a controller that is cautious at first, improves performance as parameters are learned, and ensures safety throughout. Simulation and hardware experiments demonstrate higher success rates, lower tracking error, and more accurate parameter estimates than baselines.
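For readers unfamiliar with the MPPI family that PRMPPI extends, the sketch below shows the standard MPPI update: perturb a nominal control sequence, score the rollouts, and return the exponentially weighted average. The cost function is user-supplied; this is the generic rule, not PRMPPI's parameter-belief or conformal-prediction machinery.

```python
import numpy as np

def mppi_step(u_nom, rollout_cost, n_samples=256, sigma=0.5, lam=1.0,
              rng=np.random.default_rng(0)):
    """One MPPI update over a control sequence u_nom of shape (T,) or (T, m)."""
    noise = rng.normal(0.0, sigma, size=(n_samples,) + u_nom.shape)
    costs = np.array([rollout_cost(u_nom + eps) for eps in noise])
    weights = np.exp(-(costs - costs.min()) / lam)   # softmin over rollouts
    weights /= weights.sum()
    return u_nom + np.tensordot(weights, noise, axes=1)

# Toy example: iteratively drive a 10-step control sequence toward 1.0.
u = np.zeros(10)
for _ in range(20):
    u = mppi_step(u, lambda useq: np.sum((useq - 1.0) ** 2))
print(u.round(2))
```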
☆ LOST-3DSG: Lightweight Open-Vocabulary 3D Scene Graphs with Semantic Tracking in Dynamic Environments
Tracking objects that move within dynamic environments is a core challenge in robotics. Recent research has advanced this topic significantly; however, many existing approaches remain inefficient due to their reliance on heavy foundation models. To address this limitation, we propose LOST-3DSG, a lightweight open-vocabulary 3D scene graph designed to track dynamic objects in real-world environments. Our method adopts a semantic approach to entity tracking based on word2vec and sentence embeddings, enabling an open-vocabulary representation while avoiding the necessity of storing dense CLIP visual features. As a result, LOST-3DSG achieves superior performance compared to approaches that rely on high-dimensional visual embeddings. We evaluate our method through qualitative and quantitative experiments conducted in a real 3D environment using a TIAGo robot. The results demonstrate the effectiveness and efficiency of LOST-3DSG in dynamic object tracking. Code and supplementary material are publicly available on the project website at https://lab-rococo-sapienza.github.io/lost-3dsg/.
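A minimal sketch of semantic entity tracking via embedding similarity, in the spirit of the approach above: previously tracked entities are greedily matched to current detections by cosine similarity. The greedy rule and threshold are illustrative assumptions, not LOST-3DSG's exact association logic.

```python
import numpy as np

def match_entities(prev_embs, curr_embs, min_sim=0.6):
    """Greedy one-to-one matching between (N, d) previous embeddings and
    (M, d) current embeddings; returns {prev_index: curr_index}."""
    prev = prev_embs / np.linalg.norm(prev_embs, axis=1, keepdims=True)
    curr = curr_embs / np.linalg.norm(curr_embs, axis=1, keepdims=True)
    sim = prev @ curr.T
    matches = {}
    for i in np.argsort(-sim.max(axis=1)):        # most confident rows first
        j = int(np.argmax(sim[i]))
        if sim[i, j] >= min_sim and j not in matches.values():
            matches[int(i)] = j
    return matches
```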
☆ Warm-Starting Collision-Free Model Predictive Control With Object-Centric Diffusion
Acting in cluttered environments requires predicting and avoiding collisions while still achieving precise control. Conventional optimization-based controllers can enforce physical constraints, but they struggle to produce feasible solutions quickly when many obstacles are present. Diffusion models can generate diverse trajectories around obstacles, yet prior approaches lacked a general and efficient way to condition them on scene structure. In this paper, we show that combining diffusion-based warm-starting, conditioned on a latent object-centric representation of the scene, with a collision-aware model predictive controller (MPC) yields reliable and efficient motion generation under strict time limits. Our approach conditions a diffusion transformer on the system state, task, and surroundings, using an object-centric slot attention mechanism to provide a compact obstacle representation suitable for control. The sampled trajectories are refined by an optimal control problem that enforces rigid-body dynamics and signed-distance collision constraints, producing feasible motions in real time. On benchmark tasks, this hybrid method achieved markedly higher success rates and lower latency than sampling-based planners or either component alone. Real-robot experiments with a torque-controlled Panda confirm reliable and safe execution with MPC.
comment: An open-source implementation is provided at https://cozy-fairy-0e0139.netlify.app/
☆ Soft Responsive Materials Enhance Humanoid Safety
Humanoid robots are envisioned as general-purpose platforms in human-centered environments, yet their deployment is limited by vulnerability to falls and the risks posed by rigid metal-plastic structures to people and surroundings. We introduce a soft-rigid co-design framework that leverages non-Newtonian fluid-based soft responsive materials to enhance humanoid safety. The material remains compliant during normal interaction but rapidly stiffens under impact, absorbing and dissipating fall-induced forces. Physics-based simulations guide protector placement and thickness and enable learning of active fall policies. Applied to a 42 kg life-size humanoid, the protector markedly reduces peak impact and allows repeated falls without hardware damage, including drops from 3 m and tumbles down long staircases. Across diverse scenarios, the approach improves robot robustness and environmental safety. By uniting responsive materials, structural co-design, and learning-based control, this work advances interact-safe, industry-ready humanoid robots.
comment: 40 pages, 11 figures
☆ Reinforcement Learning for Follow-the-Leader Robotic Endoscopic Navigation via Synthetic Data
Autonomous navigation is crucial for both medical and industrial endoscopic robots, enabling safe and efficient exploration of narrow tubular environments without continuous human intervention, where avoiding contact with the inner walls has been a longstanding challenge for prior approaches. We present a follow-the-leader endoscopic robot based on a flexible continuum structure designed to minimize contact between the endoscope body and intestinal walls, thereby reducing patient discomfort. To achieve this objective, we propose a vision-based deep reinforcement learning framework guided by monocular depth estimation. A realistic intestinal simulation environment was constructed in NVIDIA Omniverse to train and evaluate autonomous navigation strategies. Furthermore, thousands of synthetic intraluminal images were generated using NVIDIA Replicator to fine-tune the Depth Anything model, enabling dense three-dimensional perception of the intestinal environment with a single monocular camera. Subsequently, we introduce a geometry-aware reward and penalty mechanism to enable accurate lumen tracking. Compared with the original Depth Anything model, our method improves $\delta_1$ depth accuracy by 39.2% and reduces the navigation J-index by 0.67 relative to the second-best method, demonstrating the robustness and effectiveness of the proposed approach.
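A geometry-aware reward of the kind described typically combines a lumen-centering term with a wall-proximity penalty. The sketch below is one plausible form; the terms, weights, and safety margin are assumptions for illustration, not the paper's reward.

```python
import numpy as np

def lumen_reward(tip_pos, lumen_center, wall_points, safe_dist=3.0):
    """Reward centering the endoscope tip on the lumen axis and penalize
    approaching the wall. tip_pos, lumen_center: (3,); wall_points: (N, 3)."""
    centering = -np.linalg.norm(tip_pos - lumen_center)
    d_wall = np.min(np.linalg.norm(wall_points - tip_pos, axis=1))
    wall_penalty = max(0.0, safe_dist - d_wall) ** 2   # active only near the wall
    return centering - 10.0 * wall_penalty
```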
☆ Closing the Reality Gap: Zero-Shot Sim-to-Real Deployment for Dexterous Force-Based Grasping and Manipulation
Human-like dexterous hands with multiple fingers offer human-level manipulation capabilities, but training control policies that can directly deploy on real hardware remains difficult due to contact-rich physics and imperfect actuation. We close this gap with a practical sim-to-real reinforcement learning (RL) framework that utilizes dense tactile feedback combined with joint torque sensing to explicitly regulate physical interactions. To enable effective sim-to-real transfer, we introduce (i) a computationally fast tactile simulation that computes distances between dense virtual tactile units and the object via parallel forward kinematics, providing high-rate, high-resolution touch signals needed by RL; (ii) a current-to-torque calibration that eliminates the need for torque sensors on dexterous hands by mapping motor current to joint torque; and (iii) actuator dynamics modeling to bridge the actuation gaps with randomization of non-ideal effects such as backlash and torque-speed saturation. Using an asymmetric actor-critic PPO pipeline trained entirely in simulation, our policies deploy directly to a five-finger hand. The resulting policies demonstrated two essential skills: (1) command-based, controllable grasp force tracking, and (2) reorientation of objects in the hand, both of which were robustly executed without fine-tuning on the robot. By combining tactile and torque in the observation space with effective sensing/actuation modeling, our system provides a practical solution to achieve reliable dexterous manipulation. To our knowledge, this is the first demonstration of controllable grasping on a multi-finger dexterous hand trained entirely in simulation and transferred zero-shot to real hardware.
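The simplest form of a current-to-torque calibration is a per-joint linear fit against a bench reference; the sketch below shows that baseline (the actual mapping in such systems may be richer, e.g. friction- or velocity-dependent, so treat this as an illustrative assumption).

```python
import numpy as np

def calibrate_current_to_torque(current, torque):
    """Fit tau ~ k * i + b by least squares from bench measurements
    taken against a reference torque sensor."""
    A = np.stack([current, np.ones_like(current)], axis=1)
    (k, b), *_ = np.linalg.lstsq(A, torque, rcond=None)
    return k, b

i = np.array([0.0, 0.5, 1.0, 1.5, 2.0])     # motor current samples [A]
tau = 0.8 * i + 0.05                        # synthetic bench data [Nm]
k, b = calibrate_current_to_torque(i, tau)
print(f"tau ~ {k:.2f} * i + {b:.2f}")
```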
☆ M-SEVIQ: A Multi-band Stereo Event Visual-Inertial Quadruped-based Dataset for Perception under Rapid Motion and Challenging Illumination
Agile locomotion in legged robots poses significant challenges for visual perception. Traditional frame-based cameras often fail in these scenarios because they produce blurred images, particularly under low-light conditions. In contrast, event cameras capture changes in brightness asynchronously, offering low latency, high temporal resolution, and high dynamic range. These advantages make them suitable for robust perception during rapid motion and under challenging illumination. However, existing event camera datasets exhibit limitations in stereo configurations and multi-band sensing domains under various illumination conditions. To address this gap, we present M-SEVIQ, a multi-band stereo event visual and inertial quadruped dataset collected using a Unitree Go2 equipped with stereo event cameras, a frame-based camera, an inertial measurement unit (IMU), and joint encoders. This dataset contains more than 30 real-world sequences captured across different velocity levels, illumination wavelengths, and lighting conditions. In addition, comprehensive calibration data, including intrinsic, extrinsic, and temporal alignments, are provided to facilitate accurate sensor fusion and benchmarking. Our M-SEVIQ can be used to support research in agile robot perception, sensor fusion, semantic segmentation, and multi-modal vision in challenging environments.
comment: 6 pages, 7 figures
☆ Advancing Assistive Robotics: Multi-Modal Navigation and Biophysical Monitoring for Next-Generation Wheelchairs
Assistive electric-powered wheelchairs (EPWs) have become essential mobility aids for people with disabilities such as amyotrophic lateral sclerosis (ALS), post-stroke hemiplegia, and dementia-related mobility impairment. This work presents a novel multi-modal EPW control system designed to prioritize patient needs while allowing seamless switching between control modes. Four complementary interfaces, namely joystick, speech, hand gesture, and electrooculography (EOG), are integrated with a continuous vital sign monitoring framework measuring heart rate variability, oxygen saturation (SpO2), and skin temperature. This combination enables greater patient independence while allowing caregivers to maintain real-time supervision and early intervention capability. Two-point calibration of the biophysical sensors against clinical reference devices resulted in root mean square errors of at most 2 bpm for heart rate, 0.5 °C for skin temperature, and 1% for SpO2. Experimental evaluation involved twenty participants with mobility impairments executing a total of 500 indoor navigation commands. The achieved command recognition accuracies were 99% for joystick control, 97% ± 2% for speech, and 95% ± 3% for hand gesture, with an average closed-loop latency of 20 ± 0.5 ms. Caregivers receive real-time alerts through an Android application following encrypted cloud transmission of physiological data. By integrating multi-modal mobility control with cloud-enabled health monitoring and reporting latency and energy budgets, the proposed prototype addresses key challenges in assistive robotics, contributes toward compliance with ISO 7176-31 and IEC 80601-2-78 safety standards, and establishes a foundation for future adaptive machine learning enhancements.
☆ Unified Meta-Representation and Feedback Calibration for General Disturbance Estimation
Precise control in modern robotic applications remains an open issue due to unknown time-varying disturbances. Existing meta-learning-based approaches require a shared representation of environmental structures, which lacks flexibility for realistic non-structural disturbances. Moreover, representation error and distribution shifts can severely degrade prediction accuracy. This work presents a generalizable disturbance estimation framework that builds on meta-learning and feedback-calibrated online adaptation. By extracting features from a finite time window of past observations, a unified representation that effectively captures general non-structural disturbances can be learned without predefined structural assumptions. The online adaptation process is subsequently calibrated by a state-feedback mechanism to attenuate the learning residual originating from representation and generalizability limitations. Theoretical analysis shows that simultaneous convergence of both the online learning error and the disturbance estimation error can be achieved. Through the unified meta-representation, our framework effectively estimates multiple rapidly changing disturbances, as demonstrated by quadrotor flight experiments. See the project page for video, supplementary material and code: https://nonstructural-metalearn.github.io.
comment: 8 pages, 10 figures
☆ Towards Zero-Shot Point Cloud Registration Across Diverse Scales, Scenes, and Sensor Setups ICCV 2025
Some deep learning-based point cloud registration methods struggle with zero-shot generalization, often requiring dataset-specific hyperparameter tuning or retraining for new environments. We identify three critical limitations: (a) fixed user-defined parameters (e.g., voxel size, search radius) that fail to generalize across varying scales, (b) learned keypoint detectors that exhibit poor cross-domain transferability, and (c) absolute coordinates that amplify scale mismatches between datasets. To address these three issues, we present BUFFER-X, a training-free registration framework that achieves zero-shot generalization through: (a) geometric bootstrapping for automatic hyperparameter estimation, (b) distribution-aware farthest point sampling to replace learned detectors, and (c) patch-level coordinate normalization to ensure scale consistency. Our approach employs hierarchical multi-scale matching to extract correspondences across local, middle, and global receptive fields, enabling robust registration in diverse environments. For efficiency-critical applications, we introduce BUFFER-X-Lite, which reduces total computation time by 43% (relative to BUFFER-X) through early exit strategies and fast pose solvers while preserving accuracy. We evaluate on a comprehensive benchmark comprising 12 datasets spanning object-scale, indoor, and outdoor scenes, including cross-sensor registration between heterogeneous LiDAR configurations. Results demonstrate that our approach generalizes effectively without manual tuning or prior knowledge of test domains. Code: https://github.com/MIT-SPARK/BUFFER-X.
comment: 18 pages, 15 figures. Extended version of our ICCV 2025 highlight paper [arXiv:2503.07940]. arXiv admin note: substantial text overlap with arXiv:2503.07940
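For reference, plain farthest point sampling, the detector-free baseline that BUFFER-X's distribution-aware variant builds on, looks as follows; this is the textbook algorithm, not the paper's modified version.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Select k indices from an (N, 3) cloud so that each new point is
    the one farthest from all points chosen so far."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(points)))]
    d = np.linalg.norm(points - points[idx[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))          # farthest remaining point
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(idx)

pts = np.random.default_rng(1).uniform(size=(1000, 3))
print(farthest_point_sampling(pts, 8))
```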
☆ Optimizing Control-Friendly Trajectories with Self-Supervised Residual Learning
For modern, intricate robotic systems, real-world physics can be modeled analytically only up to a certain level of precision. As a result, tracking aggressive trajectories accurately can be challenging due to residual physics left unaccounted for during controller synthesis. This paper presents a self-supervised residual learning and trajectory optimization framework to address these challenges. First, unknown dynamic effects on the closed-loop model are learned and treated as residuals of the nominal dynamics, jointly forming a hybrid model. We show that learning with analytic gradients can be achieved using only trajectory-level data while enjoying accurate long-horizon prediction with an arbitrary integration step size. Subsequently, a trajectory optimizer is developed to compute the optimal reference trajectory while minimizing the residual physics along it. The result is trajectories that are friendly to the downstream control layer. Agile quadrotor flight illustrates that, by utilizing the hybrid dynamics, the proposed optimizer outputs aggressive motions that can be precisely tracked.
comment: 10 pages, 9 figures
☆ Loop Closure using AnyLoc Visual Place Recognition in DPV-SLAM
Loop closure is crucial for maintaining the accuracy and consistency of visual SLAM. We propose a method to improve loop closure performance in DPV-SLAM. Our approach integrates AnyLoc, a learning-based visual place recognition technique, as a replacement for the classical Bag of Visual Words (BoVW) loop detection method. In contrast to BoVW, which relies on handcrafted features, AnyLoc utilizes deep feature representations, enabling more robust image retrieval across diverse viewpoints and lighting conditions. Furthermore, we propose an adaptive mechanism that dynamically adjusts the similarity threshold based on environmental conditions, removing the need for manual tuning. Experiments on both indoor and outdoor datasets demonstrate that our method significantly outperforms the original DPV-SLAM in terms of loop closure accuracy and robustness. The proposed method offers a practical and scalable solution for enhancing loop closure performance in modern SLAM systems.
comment: Accepted at IEEE/SICE International Symposium on System Integration (SII) 2026. 6 pages, 14 figures
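The adaptive-threshold idea above can be sketched with a simple statistical rule: accept a loop candidate only when its descriptor similarity stands out from the database's similarity distribution. The particular statistic (mean plus a margin of standard deviations) is an illustrative assumption, not the paper's exact mechanism.

```python
import numpy as np

def detect_loop(query_desc, db_descs, margin=2.0):
    """Return (best_index, similarity) if the best cosine match exceeds
    an adaptive threshold derived from the similarity distribution,
    otherwise None. db_descs: (N, d); query_desc: (d,)."""
    sims = db_descs @ query_desc / (
        np.linalg.norm(db_descs, axis=1) * np.linalg.norm(query_desc) + 1e-9)
    thresh = sims.mean() + margin * sims.std()   # adapts to current conditions
    best = int(np.argmax(sims))
    return (best, float(sims[best])) if sims[best] > thresh else None
```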
☆ Analysis of Various Manipulator Configurations Based on Multi-Objective Black-Box Optimization
Various 6-degree-of-freedom (DOF) and 7-DOF manipulators have been developed to date. Over a long history, their joint configurations and link length ratios have been determined empirically. In recent years, the development of robotic foundation models has become increasingly active, leading to the continuous proposal of various manipulators to support these models. However, none of these manipulators share exactly the same structure, as the order of joints and the ratio of link lengths differ among robots. Therefore, in order to discuss the optimal structure of a manipulator, we performed multi-objective optimization from the perspectives of end-effector reachability and joint torque. We analyze where existing manipulator structures stand within the sampling results of the optimization and provide insights for future manipulator design.
comment: Accepted to Advanced Robotics, website: https://haraduka.github.io/bbo-manip-design
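Multi-objective black-box optimization of the kind described reduces, after sampling candidate designs, to extracting the non-dominated (Pareto) set over the objective values. A minimal filter is sketched below; the objective names are illustrative (e.g., negated reachability so that both objectives are minimized), not the paper's exact formulation.

```python
import numpy as np

def pareto_front(objectives):
    """Return indices of non-dominated rows of an (N, M) objective matrix
    under minimization, e.g. columns [-reachability, peak_torque]."""
    n = len(objectives)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # i is dominated if some design is no worse in all objectives
        # and strictly better in at least one.
        dominates_i = (np.all(objectives <= objectives[i], axis=1) &
                       np.any(objectives < objectives[i], axis=1))
        if dominates_i.any():
            keep[i] = False
    return np.where(keep)[0]

samples = np.random.default_rng(0).uniform(size=(200, 2))
print(pareto_front(samples))
```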
☆ Learning to Nudge: A Scalable Barrier Function Framework for Safe Robot Interaction in Dense Clutter
Robots operating in everyday environments must navigate and manipulate within densely cluttered spaces, where physical contact with surrounding objects is unavoidable. Traditional safety frameworks treat contact as unsafe, restricting robots to collision avoidance and limiting their ability to function in dense, everyday settings. As the number of objects grows, model-based approaches for safe manipulation become computationally intractable; meanwhile, learned methods typically tie safety to the task at hand, making them hard to transfer to new tasks without retraining. In this work we introduce Dense Contact Barrier Functions (DCBF). Our approach bypasses the computational complexity of explicitly modeling multi-object dynamics by instead learning a composable, object-centric function that implicitly captures the safety constraints arising from physical interactions. Trained offline on interactions with a few objects, the learned DCBF composes across arbitrary object sets at runtime, producing a single global safety filter that scales linearly and transfers across tasks without retraining. We validate our approach through simulated experiments in dense clutter, demonstrating its ability to enable collision-free navigation and safe, contact-rich interaction in suitable settings.
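The composition step can be pictured as taking a min over per-object barrier values and filtering actions against a discrete-time barrier condition. The sketch below is a generic sampling-based stand-in (a real filter would solve a QP, and the per-object barriers here are user-supplied rather than learned as in DCBF); `step` is an assumed one-step dynamics function.

```python
import numpy as np

def composed_barrier(x, per_object_barriers):
    """Compose per-object barrier values with a min: the composed h(x) is
    nonnegative iff every object-level constraint is satisfied."""
    return min(h(x) for h in per_object_barriers)

def safety_filter(x, u_nominal, step, barriers, alpha=0.5, n_candidates=64,
                  rng=np.random.default_rng(0)):
    """Pick the candidate action closest to u_nominal that satisfies the
    discrete-time CBF condition h(x') >= (1 - alpha) * h(x)."""
    h0 = composed_barrier(x, barriers)
    candidates = u_nominal + rng.normal(0, 0.1,
                                        size=(n_candidates,) + u_nominal.shape)
    feasible = [u for u in candidates
                if composed_barrier(step(x, u), barriers) >= (1 - alpha) * h0]
    if not feasible:
        return u_nominal           # fall back; a real filter would optimize
    return min(feasible, key=lambda u: np.linalg.norm(u - u_nominal))
```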
☆ CageDroneRF: A Large-Scale RF Benchmark and Toolkit for Drone Perception
We present CageDroneRF (CDRF), a large-scale benchmark for Radio-Frequency (RF) drone detection and identification built from real-world captures and systematically generated synthetic variants. CDRF addresses the scarcity and limited diversity of existing RF datasets by coupling extensive raw recordings with a principled augmentation pipeline that (i) precisely controls Signal-to-Noise Ratio (SNR), (ii) injects interfering emitters, and (iii) applies frequency shifts with label-consistent bounding-box transformations for detection. This dataset spans a wide range of contemporary drone models, many unavailable in current public datasets, and acquisition conditions, derived from data collected at the Rowan University campus and within a controlled RF-cage facility. CDRF is released with interoperable open-source tools for data generation, preprocessing, augmentation, and evaluation that also operate on existing public benchmarks. CDRF enables standardized benchmarking for classification, open-set recognition, and object detection, supporting rigorous comparisons and reproducible pipelines. By releasing this comprehensive benchmark and tooling, CDRF aims to accelerate progress toward robust, generalizable RF perception models.
☆ Effective Online 3D Bin Packing with Lookahead Parcels Using Monte Carlo Tree Search
Online 3D Bin Packing (3D-BP) with robotic arms is crucial for reducing transportation and labor costs in modern logistics. While Deep Reinforcement Learning (DRL) has shown strong performance, it often fails to adapt to real-world short-term distribution shifts, which arise as different batches of goods arrive sequentially, causing performance drops. We argue that the short-term lookahead information available in modern logistics systems is key to mitigating this issue, especially during distribution shifts. We formulate online 3D-BP with lookahead parcels as a Model Predictive Control (MPC) problem and adapt the Monte Carlo Tree Search (MCTS) framework to solve it. Our framework employs a dynamic exploration prior that automatically balances a learned RL policy and a robust random policy based on the lookahead characteristics. Additionally, we design an auxiliary reward to penalize long-term spatial waste from individual placements. Extensive experiments on real-world datasets show that our method consistently outperforms state-of-the-art baselines, achieving over 10% gains under distributional shifts, a 4% average improvement in online deployment, and more than 8% in the best case, demonstrating the effectiveness of our framework.
☆ Making Infeasible Tasks Feasible: Planning to Reconfigure Disconnected 3D Environments with Movable Objects
Several planners have been developed to compute dynamically feasible, collision-free robot paths from an initial to a goal configuration. A key assumption in these works is that the goal region is reachable; an assumption that often fails in practice when environments are disconnected. Motivated by this limitation, we consider known 3D environments comprising objects, also called blocks, that form distinct navigable support surfaces (planes), and that are either non-movable (e.g., tables) or movable (e.g., boxes). These surfaces may be mutually disconnected due to height differences, holes, or lateral separations. Our focus is on tasks where the robot must reach a goal region residing on an elevated plane that is unreachable. Rather than declaring such tasks infeasible, an effective strategy is to enable the robot to interact with the environment, rearranging movable objects to create new traversable connections; a problem known as Navigation Among Movable Objects (NAMO). Existing NAMO planners typically address 2D environments, where obstacles are pushed aside to clear a path. These methods cannot directly handle the considered 3D setting; in such cases, obstacles must be placed strategically to bridge these physical disconnections. We address this challenge by developing BRiDGE (Block-based Reconfiguration in Disconnected 3D Geometric Environments), a sampling-based planner that incrementally builds trees over robot and object configurations to compute feasible plans specifying which objects to move, where to place them, and in what order, while accounting for a limited number of movable objects. To accelerate planning, we introduce non-uniform sampling strategies. We show that our method is probabilistically complete and we provide extensive numerical and hardware experiments validating its effectiveness.
♻ ☆ Characterizing the Robustness of Black-Box LLM Planners Under Perturbed Observations with Adaptive Stress Testing
Large language models (LLMs) have recently demonstrated success in decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in environments where sensors are noisy or unreliable. Characterizing the behavior of LLM planners to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We specifically investigate the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing order of details, modifying access to few-shot examples, etc. Unique to our work, the second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied shows that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. However, manually covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime.
comment: 30 pages, 24 figures, 6 tables
♻ ☆ Indicating Robot Vision Capabilities with Augmented Reality
Research indicates that humans can mistakenly assume that robots and humans have the same field of view, possessing an inaccurate mental model of robots. This misperception may lead to failures during human-robot collaboration tasks where robots might be asked to complete impossible tasks about out-of-view objects. The issue is more severe when robots do not have a chance to scan the scene to update their world model while focusing on assigned tasks. To help align humans' mental models of robots' vision capabilities, we propose four field-of-view indicators in augmented reality and conduct a human-subjects experiment (N=41) to evaluate them in a collaborative assembly task regarding accuracy, confidence, task efficiency, and workload. These indicators span a spectrum of positions: two in the robot's eye and head space -- deepening the eye sockets and adding blocks to the two sides of the eyes (i.e., egocentric) -- and two anchored in the robot's task space -- extending blocks from the sides of the eyes down to the table and placing blocks directly on the table (i.e., allocentric). Results showed that, when placed directly in the task space, the allocentric indicator yields the highest accuracy, although with a delay in interpreting the robot's field of view. When placed at the robot's eyes, the egocentric indicator of deeper eye sockets, which could also be realized as a physical alteration, likewise increased accuracy. Across all indicators, participants' confidence was high while cognitive load remained low. Finally, we contribute six guidelines for practitioners to apply our augmented reality indicators or physical alterations to align humans' mental models with robots' vision capabilities.
♻ ☆ FICO: Finite-Horizon Closed-Loop Factorization for Unified Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics and AI, yet most existing formulations treat planning and execution separately and address variants of the problem in an ad hoc manner. This paper presents a system-level framework for MAPF that integrates planning and execution, generalizes across variants, and explicitly models uncertainties. At its core is the MAPF system, a formal model that casts MAPF as a control design problem encompassing classical and uncertainty-aware formulations. To solve it, we introduce Finite-Horizon Closed-Loop Factorization (FICO), a factorization-based algorithm inspired by receding-horizon control that exploits compositional structure for efficient closed-loop operation. FICO enables real-time responses -- commencing execution within milliseconds -- while scaling to thousands of agents and adapting seamlessly to execution-time uncertainties. Extensive case studies demonstrate that it reduces computation time by up to two orders of magnitude compared with open-loop baselines, while delivering significantly higher throughput under stochastic delays and agent arrivals. These results establish a principled foundation for analyzing and advancing MAPF through system-level modeling, factorization, and closed-loop design.
♻ ☆ Evaluating Gemini Robotics Policies in a Veo World Simulator
Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.
♻ ☆ DDBot: Differentiable Physics-based Digging Robot for Unknown Granular Materials
Automating the manipulation of granular materials poses significant challenges due to complex contact dynamics, unpredictable material properties, and intricate system states. Existing approaches often fail to achieve efficiency and accuracy in such tasks. To fill the research gap, this article studies the small-scale and high-precision granular material digging task with unknown physical properties. A key scientific problem addressed is the feasibility of applying first-order gradient-based optimization to complex differentiable granular material simulation and overcoming associated numerical instability. A new framework, named differentiable digging robot (DDBot), is proposed to manipulate granular materials, including sand and soil. Specifically, we equip DDBot with a differentiable physics-based simulator, tailored for granular material manipulation, powered by GPU-accelerated parallel computing and automatic differentiation. DDBot can perform efficient differentiable system identification and high-precision digging skill optimization for unknown granular materials, which is enabled by a differentiable skill-to-action mapping, a task-oriented demonstration method, gradient clipping and line search-based gradient descent. Experimental results show that DDBot can efficiently (converge within 5 to 20 minutes) identify unknown granular material dynamics and optimize digging skills, with high-precision results in zero-shot real-world deployments, highlighting its practicality. Benchmark results against state-of-the-art baselines also confirm the robustness and efficiency of DDBot in such digging tasks.
comment: Published as a regular paper by the IEEE Transactions on Robotics
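The two stabilizers named above for first-order optimization through differentiable simulation, gradient clipping and line search, combine as in the generic sketch below; this is standard numerical machinery, not DDBot's full optimizer.

```python
import numpy as np

def clipped_linesearch_gd(f, grad, x0, iters=50, clip=1.0, beta=0.5):
    """Gradient descent with norm clipping (to tame exploding simulation
    gradients) and backtracking line search (to guarantee descent)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm > clip:
            g = g * (clip / norm)        # clip exploding gradients
        t, fx = 1.0, f(x)
        # Backtrack until an Armijo-style sufficient-decrease condition holds.
        while f(x - t * g) > fx - 0.5 * t * (g @ g) and t > 1e-8:
            t *= beta
        x = x - t * g
    return x

print(clipped_linesearch_gd(lambda x: np.sum(x ** 2),
                            lambda x: 2 * x, np.array([3.0, -4.0])))
```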
♻ ☆ An Informative Planning Framework for Target Tracking and Active Mapping in Dynamic Environments with ASVs
Mobile robot platforms are increasingly being used to automate information gathering tasks such as environmental monitoring. Efficient target tracking in dynamic environments is critical for applications such as search and rescue and pollutant cleanups. In this letter, we study active mapping of floating targets that drift due to environmental disturbances such as wind and currents. This is a challenging problem as it involves predicting both spatial and temporal variations in the map due to changing conditions. We introduce an integrated framework combining dynamic occupancy grid mapping and an informative planning approach to actively map and track freely drifting targets with an autonomous surface vehicle. A key component of our adaptive planning approach is a spatiotemporal prediction network that predicts target position distributions over time. We further propose a planning objective for target tracking that leverages these predictions. Simulation experiments show that this planning objective improves target tracking performance compared to existing methods that consider only entropy reduction as the planning objective. Finally, we validate our approach in field tests, showcasing its ability to track targets in real-world monitoring scenarios.
comment: Accepted for publication in Robotics and Automation Letters (RA-L)
♻ ☆ Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies Towards Visual Robustness
Visuomotor policies trained on human expert demonstrations have recently shown strong performance across a wide range of robotic manipulation tasks. However, these policies remain highly sensitive to domain shifts stemming from background or robot embodiment changes, which limits their generalization capabilities. In this paper, we present ARRO, a novel visual representation that leverages zero-shot open-vocabulary segmentation and object detection models to efficiently mask out task-irrelevant regions of the scene in real time without requiring additional training, modeling of the setup, or camera calibration. By filtering visual distractors and overlaying virtual guides during both training and inference, ARRO improves robustness to scene variations and reduces the need for additional data collection. We extensively evaluate ARRO with Diffusion Policy on a range of tabletop manipulation tasks in both simulation and real-world environments, and further demonstrate its compatibility and effectiveness with generalist robot policies, such as Octo, OpenVLA and Pi Zero. Across all settings in our evaluation, ARRO yields consistent performance gains, allows for selective masking to choose between different objects, and shows robustness even to challenging segmentation conditions. Videos showcasing our results are available at: https://augmented-reality-for-robots.github.io/
♻ ☆ RobotDiffuse: Diffusion-Based Motion Planning for Redundant Manipulators with the ROP Obstacle Avoidance Dataset
Redundant manipulators, with their higher Degrees of Freedom (DoFs), offer enhanced kinematic performance and versatility, making them suitable for applications like manufacturing, surgical robotics, and human-robot collaboration. However, motion planning for these manipulators is challenging due to increased DoFs and complex, dynamic environments. While traditional motion planning algorithms struggle with high-dimensional spaces, deep learning-based methods often face instability and inefficiency in complex tasks. This paper introduces RobotDiffuse, a diffusion model-based approach for motion planning in redundant manipulators. By integrating physical constraints with a point cloud encoder and replacing the U-Net structure with an encoder-only transformer, RobotDiffuse improves the model's ability to capture temporal dependencies and generate smoother, more coherent motion plans. We validate the approach using a complex simulator and release a new dataset, Robot-obtalcles-panda (ROP), with 35M robot poses and 0.14M obstacle avoidance scenarios. In our experiments, RobotDiffuse attains the highest overall score, demonstrating its effectiveness and the promise of diffusion models for motion planning tasks. The dataset can be accessed at https://github.com/ACRoboT-buaa/RobotDiffuse.
♻ ☆ VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.
♻ ☆ Steering Flexible Linear Objects in Planar Environments by Two Robot Hands Using Euler's Elastica Solutions
The manipulation of flexible objects such as cables, wires and fresh food items by robot hands forms a special challenge in robot grasp mechanics. This paper considers the steering of flexible linear objects in planar environments by two robot hands. The flexible linear object, modeled as an elastic non-stretchable rod, is manipulated by varying the gripping endpoint positions while keeping equal endpoint tangents. The flexible linear object shape has a closed form solution in terms of the grasp endpoint positions and tangents, called Euler's elastica. This paper obtains the elastica solutions under the optimal control framework, then uses the elastica solutions to obtain closed-form criteria for non self-intersection, stability and obstacle avoidance of the flexible linear object. The new tools are incorporated into a planning scheme for steering flexible linear objects in planar environments populated by sparsely spaced obstacles. The scheme is fully implemented and demonstrated with detailed examples.
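As background for the elastica solutions used above: the classical planar Euler elastica reduces to a pendulum-type ODE for the tangent angle along arc length. The statement below is the standard textbook form (bending stiffness $EI$, endpoint force of magnitude $F$ applied at angle $\phi$), not the paper's optimal-control derivation.

```latex
% Classical planar elastica: tangent angle \theta(s) along arc length s.
EI\,\theta''(s) + F\sin\bigl(\theta(s)-\phi\bigr) = 0,
\qquad
x(s)=\int_0^s \cos\theta(\sigma)\,\mathrm{d}\sigma,\quad
y(s)=\int_0^s \sin\theta(\sigma)\,\mathrm{d}\sigma .
```

The rod's shape follows by integrating the tangent direction, which is why closed-form criteria on the endpoint positions and tangents become possible.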
♻ ☆ Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next action(s) forward, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA provides strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, we observe that CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
♻ ☆ Efficient Swept Volume-Based Trajectory Generation for Arbitrary-Shaped Ground Robot Navigation
Navigating an arbitrary-shaped ground robot safely in cluttered environments remains a challenging problem. Existing trajectory planners that account for the robot's physical geometry suffer severely from intractable runtimes. To achieve both computational efficiency and Continuous Collision Avoidance (CCA) for arbitrary-shaped ground robot planning, we propose a novel coarse-to-fine navigation framework that significantly accelerates planning. In the first stage, a sampling-based method selectively generates distinct topological paths that guarantee a minimum inflated margin. In the second stage, a geometry-aware front-end strategy is designed to discretize these topologies into full-state robot motion sequences while concurrently partitioning the paths into SE(2) sub-problems and simpler R^2 sub-problems for back-end optimization. In the final stage, an SVSDF-based optimizer generates trajectories tailored to these sub-problems and seamlessly splices them into a continuous final motion plan. Extensive benchmark comparisons show that the proposed method is one to several orders of magnitude faster than the cutting-edge methods in runtime while maintaining a high planning success rate and ensuring CCA.
♻ ☆ RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence
While data-driven imitation learning has revolutionized robotic manipulation, current approaches remain constrained by the scarcity of large-scale, diverse real-world demonstrations. Consequently, the ability of existing models to generalize across long-horizon bimanual tasks and mobile manipulation in unstructured environments remains limited. To bridge this gap, we present RoboMIND 2.0, a comprehensive real-world dataset comprising over 310K dual-arm manipulation trajectories collected across six distinct robot embodiments and 739 complex tasks. Crucially, to support research in contact-rich and spatially extended tasks, the dataset incorporates 12K tactile-enhanced episodes and 20K mobile manipulation trajectories. Complementing this physical data, we construct high-fidelity digital twins of our real-world environments, releasing an additional 20K-trajectory simulated dataset to facilitate robust sim-to-real transfer. To fully exploit the potential of RoboMIND 2.0, we propose the MIND-2 system, a hierarchical dual-system framework optimized via offline reinforcement learning. MIND-2 integrates a high-level semantic planner (MIND-2-VLM) to decompose abstract natural language instructions into grounded subgoals, coupled with a low-level Vision-Language-Action executor (MIND-2-VLA), which generates precise, proprioception-aware motor actions.
♻ ☆ RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes. See the project page at https://zhoues.github.io/RoboTracer.
comment: Project page: https://zhoues.github.io/RoboTracer
♻ ☆ AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans
Visual Language Navigation is a task that challenges robots to navigate in realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity to navigation tasks that mimic the real-world. To support exploration of this task, we also present AdaVLN simulator and AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a "freeze-time" mechanism for both the navigation task and simulator, which pauses world state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
♻ ☆ DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Project website: https://darkeqa-benchmark.github.io/
comment: Submitted to IEEE Robotics and Automation Letters (RA-L)
Robotics 33
☆ CycleVLA: Proactive Self-Correcting Vision-Language-Action Models via Subtask Backtracking and Minimum Bayes Risk Decoding
Current work on robot failure detection and correction typically operates in a post hoc manner, analyzing errors and applying corrections only after failures occur. This work introduces CycleVLA, a system that equips Vision-Language-Action models (VLAs) with proactive self-correction: the capability to anticipate incipient failures and recover before they fully manifest during execution. CycleVLA achieves this by integrating a progress-aware VLA that flags critical subtask transition points where failures most frequently occur, a VLM-based failure predictor and planner that triggers subtask backtracking upon predicted failure, and a test-time scaling strategy based on Minimum Bayes Risk (MBR) decoding to improve retry success after backtracking. Extensive experiments show that CycleVLA improves performance for both well-trained and under-trained VLAs, and that MBR serves as an effective zero-shot test-time scaling strategy for VLAs. Project Page: https://dannymcy.github.io/cyclevla/
comment: Project Page: https://dannymcy.github.io/cyclevla/
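MBR decoding itself is a generic recipe: sample several candidate action sequences, then execute the one with the lowest expected risk against the rest. A minimal sketch under an L2 risk (our choice of risk function, not necessarily the paper's):

```python
import numpy as np

def mbr_select(candidates):
    """candidates: (N, T, D) array of N sampled action trajectories.

    The risk of candidate i is its mean L2 distance to the other samples,
    so MBR picks the 'consensus' (medoid-like) trajectory.
    """
    n = len(candidates)
    flat = candidates.reshape(n, -1)
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    risk = dists.sum(axis=1) / (n - 1)
    return candidates[int(np.argmin(risk))]

# Usage: draw N rollouts from the VLA after backtracking, execute the winner.
```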
☆ Differential Barometric Altimetry for Submeter Vertical Localization and Floor Recognition Indoors
Accurate altitude estimation and reliable floor recognition are critical for mobile robot localization and navigation within complex multi-storey environments. In this paper, we present a robust, low-cost vertical estimation framework leveraging differential barometric sensing integrated within a fully ROS-compliant software package. Our system simultaneously publishes real-time altitude data from both a stationary base station and a mobile sensor, enabling precise and drift-free vertical localization. Empirical evaluations conducted in challenging scenarios -- such as fully enclosed stairwells and elevators -- demonstrate that our proposed barometric pipeline achieves sub-meter vertical accuracy (RMSE: 0.29 m) and perfect (100%) floor-level identification. In contrast, our results confirm that standalone height estimates, obtained solely from visual- or LiDAR-based SLAM odometry, are insufficient for reliable vertical localization. The proposed ROS-compatible barometric module thus provides a practical and cost-effective solution for robust vertical awareness in real-world robotic deployments. The implementation of our method is released as open source at https://github.com/witsir/differential-barometric.
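The differential principle is standard even though the package's internals are not shown here: a fixed base barometer and the robot's barometer see the same weather-induced pressure drift, so subtracting the base reading cancels it and the height difference follows from the hypsometric equation. A textbook simplification (not necessarily the exact pipeline used):

```python
import math

R_D = 287.05   # specific gas constant for dry air [J/(kg*K)]
G0  = 9.80665  # standard gravity [m/s^2]

def relative_altitude(p_base, p_mobile, temp_c=20.0):
    """Hypsometric height difference between a fixed base barometer and a
    mobile one. Using the simultaneous base reading cancels slow weather
    drift common to both sensors.

    p_base, p_mobile: pressures in Pa; temp_c: mean layer temperature.
    """
    t_kelvin = temp_c + 273.15
    return (R_D * t_kelvin / G0) * math.log(p_base / p_mobile)

# A drop of ~12 Pa corresponds to roughly 1 m of ascent near sea level:
print(relative_altitude(101325.0, 101313.0))  # ~1.0 m
```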
☆ SingingBot: An Avatar-Driven System for Robotic Face Singing Performance
Equipping robotic faces with singing capabilities is crucial for empathetic Human-Robot Interaction. However, existing robotic face driving research primarily focuses on conversations or mimicking static expressions, struggling to meet the high demands for continuous emotional expression and coherence in singing. To address this, we propose a novel avatar-driven framework for appealing robotic singing. We first leverage portrait video generation models embedded with extensive human priors to synthesize vivid singing avatars, providing reliable expression and emotion guidance. Subsequently, these facial features are transferred to the robot via semantic-oriented mapping functions that span a wide expression space. Furthermore, to quantitatively evaluate the emotional richness of robotic singing, we propose the Emotion Dynamic Range metric to measure the emotional breadth within the Valence-Arousal space, revealing that a broad emotional spectrum is crucial for appealing performances. Comprehensive experiments prove that our method achieves rich emotional expressions while maintaining lip-audio synchronization, significantly outperforming existing approaches.
☆ Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots
Strawberry harvesting robots face persistent challenges such as poorly integrated visual perception, fruit-gripper misalignment, empty grasping, and strawberry slippage from the gripper due to insufficient gripping force, all of which compromise harvesting stability and efficiency in orchard environments. To overcome these issues, this paper proposes a visual fault diagnosis and self-recovery framework that integrates multi-task perception with corrective control strategies. At the core of this framework is SRR-Net, an end-to-end multi-task perception model that simultaneously performs strawberry detection, segmentation, and ripeness estimation, thereby unifying visual perception with fault diagnosis. Based on this integrated perception, a relative error compensation method built on simultaneous target-gripper detection is designed to address positional misalignment, correcting deviations when the error exceeds the tolerance threshold. To mitigate empty-grasping and fruit-slippage faults, an early-abort strategy is implemented: a micro-optical camera embedded in the end-effector provides real-time visual feedback, enabling grasp detection during the deflating stage and strawberry slip prediction during snap-off through a MobileNet V3-Small classifier and a time-series LSTM classifier. Experiments demonstrate that SRR-Net maintains high perception accuracy. For detection, it achieves a precision of 0.895 and recall of 0.813 on strawberries, and 0.972/0.958 on hands. In segmentation, it yields a precision of 0.887 and recall of 0.747 for strawberries, and 0.974/0.947 for hands. For ripeness estimation, SRR-Net attains a mean absolute error of 0.035, while simultaneously supporting multi-task perception and sustaining a competitive inference speed of 163.35 FPS.
☆ Realistic adversarial scenario generation via human-like pedestrian model for autonomous vehicle control parameter optimisation
Autonomous vehicles (AVs) are rapidly advancing and are expected to play a central role in future mobility. Ensuring their safe deployment requires reliable interaction with other road users, not least pedestrians. Direct testing on public roads is costly and unsafe for rare but critical interactions, making simulation a practical alternative. Within simulation-based testing, adversarial scenarios are widely used to probe safety limits, but many prioritise difficulty over realism, producing exaggerated behaviours that may result in AV controllers that are overly conservative. We propose an alternative method, instead using a cognitively inspired pedestrian model featuring both inter-individual and intra-individual variability to generate behaviourally plausible adversarial scenarios. We provide a proof-of-concept demonstration of this method's potential for AV control optimisation through closed-loop testing and tuning of an AV controller. Our results show that replacing the rule-based CARLA pedestrian with the human-like model yields more realistic gap acceptance patterns and smoother vehicle decelerations. Unsafe interactions occur only for certain pedestrian individuals and conditions, underscoring the importance of human variability in AV testing. Adversarial scenarios generated by this model can be used to optimise AV control towards safer and more efficient behaviour. Overall, this work illustrates how incorporating human-like road user models into simulation-based adversarial testing can enhance the credibility of AV evaluation and provide a practical basis for behaviourally informed controller optimisation.
☆ Genie Sim 3.0: A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot
The development of robust and generalizable robot learning models is critically contingent upon the availability of large-scale, diverse training data and reliable evaluation benchmarks. Collecting data in the physical world poses prohibitive costs and scalability challenges, and prevailing simulation benchmarks frequently suffer from fragmentation, narrow scope, or insufficient fidelity to enable effective sim-to-real transfer. To address these challenges, we introduce Genie Sim 3.0, a unified simulation platform for robotic manipulation. We present Genie Sim Generator, a large language model (LLM)-powered tool that constructs high-fidelity scenes from natural language instructions. Its principal strength resides in rapid and multi-dimensional generalization, facilitating the synthesis of diverse environments to support scalable data collection and robust policy evaluation. We introduce the first benchmark that pioneers the application of LLMs to automated evaluation: it leverages LLMs to mass-generate evaluation scenarios and employs a Vision-Language Model (VLM) to establish an automated assessment pipeline. We also release an open-source dataset comprising more than 10,000 hours of synthetic data across over 200 tasks. Through systematic experimentation, we validate the robust zero-shot sim-to-real transfer capability of our open-source dataset, demonstrating that synthetic data can serve as an effective substitute for real-world data under controlled conditions for scalable policy training. For code and dataset details, please refer to: https://github.com/AgibotTech/genie_sim.
☆ VIT-Ped: Visionary Intention Transformer for Pedestrian Behavior Analysis
Pedestrian intention prediction is one of the key technologies in the transition from level 3 to level 4 autonomous driving. To understand pedestrian crossing behaviour, several elements and features should be taken into consideration to make the roads of tomorrow safer for everybody. We introduce transformer- and video-vision-transformer-based algorithms of different sizes that use different data modalities. We evaluate our algorithms on the popular pedestrian behaviour dataset JAAD, reaching and surpassing SOTA performance on metrics such as accuracy, AUC, and F1-score. The advantages brought by different model design choices are investigated via extensive ablation studies.
☆ Deep Robust Koopman Learning from Noisy Data
Koopman operator theory has emerged as a leading data-driven approach that relies on a judicious choice of observable functions to realize global linear representations of nonlinear systems in the lifted observable space. However, real-world data is often noisy, making it difficult to obtain an accurate and unbiased approximation of the Koopman operator. The Koopman operator generated from noisy datasets is typically corrupted by noise-induced bias that severely degrades prediction and downstream tracking performance. In order to address this drawback, this paper proposes a novel autoencoder-based neural architecture to jointly learn the appropriate lifting functions and the reduced-bias Koopman operator from noisy data. The architecture initially learns Koopman basis functions that are consistent with both the forward and backward temporal dynamics of the system. Subsequently, by utilizing the learned forward and backward temporal dynamics, the Koopman operator is synthesized with reduced bias, making the method more robust to noise compared to existing techniques. Theoretical analysis is used to demonstrate significant bias reduction in the presence of training noise. Dynamics prediction and tracking control simulations are conducted for multiple serial manipulator arms, including performance comparisons with leading alternative designs, to demonstrate its robustness under various noise levels. Experimental studies with the Franka FR3 7-DoF manipulator arm are further used to demonstrate the effectiveness of the proposed approach in a practical setting.
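The paper's contribution is a neural architecture, but the linear-algebra core of forward-backward debiasing can be illustrated with the classical forward-backward DMD estimate, which takes a matrix geometric mean of the forward operator and the inverse of the backward one (this is fbDMD, not the authors' method):

```python
import numpy as np
from scipy.linalg import sqrtm

def fb_koopman(X, Y):
    """Forward-backward estimate of a linear (Koopman) operator.

    X, Y: (n_features, n_snapshots) lifted snapshot matrices with Y ~ K @ X.
    Fitting both time directions and averaging geometrically cancels much
    of the bias that measurement noise induces in a one-directional fit.
    """
    A_f = Y @ np.linalg.pinv(X)              # forward fit:  x_{k+1} = A_f x_k
    A_b = X @ np.linalg.pinv(Y)              # backward fit: x_k = A_b x_{k+1}
    K = sqrtm(A_f @ np.linalg.pinv(A_b))     # geometric mean of A_f and A_b^-1
    return np.real(K)
```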
☆ What you reward is what you learn: Comparing rewards for online speech policy optimization in public HRI
Designing policies that are both efficient and acceptable for conversational service robots in open and diverse environments is non-trivial. Unlike fixed, hand-tuned parameters, online learning can adapt to non-stationary conditions. In this paper, we study how to adapt a social robot's speech policy in the wild. During a 12-day in-situ deployment with over 1,400 public encounters, we cast online policy optimization as a multi-armed bandit problem and use Thompson sampling to select among six actions defined by speech rate (slow/normal/fast) and verbosity (concise/detailed). We compare three complementary binary rewards--Ru (user rating), Rc (conversation closure), and Rt (>=2 turns)--and show that each induces distinct arm distributions and interaction behaviors. We complement the online results with offline evaluations that analyze contextual factors (e.g., crowd level, group size) using video-annotated data. Taken together, we distill ready-to-use design lessons for deploying online optimization of speech policies in real public HRI settings.
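For readers unfamiliar with the setup, Bernoulli Thompson sampling over the six rate-verbosity arms takes only a few lines; the arm encoding below is illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
ARMS = [(rate, verb) for rate in ("slow", "normal", "fast")
                     for verb in ("concise", "detailed")]
alpha = np.ones(len(ARMS))  # Beta(1, 1) prior over each arm's binary reward
beta  = np.ones(len(ARMS))

def select_arm():
    """Thompson sampling: draw one posterior sample per arm, act greedily."""
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

def update(arm, reward):
    """reward is a binary signal, e.g. Ru (user rating) or Rt (>=2 turns)."""
    alpha[arm] += reward
    beta[arm]  += 1 - reward

# Per encounter: a = select_arm(); speak with ARMS[a]; then update(a, r).
```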
☆ Learning Diffusion Policy from Primitive Skills for Robot Manipulation AAAI2026
Diffusion policies (DP) have recently shown great promise for generating actions in robotic manipulation. However, existing approaches often rely on global instructions to produce short-term control signals, which can result in misalignment in action generation. We conjecture that the primitive skills, referred to as fine-grained, short-horizon manipulations, such as ``move up'' and ``open the gripper'', provide a more intuitive and effective interface for robot learning. To bridge this gap, we propose SDP, a skill-conditioned DP that integrates interpretable skill learning with conditional action planning. SDP abstracts eight reusable primitive skills across tasks and employs a vision-language model to extract discrete representations from visual observations and language instructions. Based on them, a lightweight router network is designed to assign a desired primitive skill for each state, which helps construct a single-skill policy to generate skill-aligned actions. By decomposing complex tasks into a sequence of primitive skills and selecting a single-skill policy, SDP ensures skill-consistent behavior across diverse tasks. Extensive experiments on two challenging simulation benchmarks and real-world robot deployments demonstrate that SDP consistently outperforms SOTA methods, providing a new paradigm for skill-based robot learning with diffusion policies.
comment: Accepted to AAAI2026
☆ From Metrics to Meaning: Insights from a Mixed-Methods Field Experiment on Retail Robot Deployment
We report a mixed-methods field experiment of a conversational service robot deployed under everyday staffing discretion in a live bedding store. Over 12 days we alternated three conditions--Baseline (no robot), Robot-only, and Robot+Fixture--and video-annotated the service funnel from passersby to purchase. An explanatory sequential design then used six post-experiment staff interviews to interpret the quantitative patterns. Quantitatively, the robot increased stopping per passerby (highest with the fixture), yet clerk-led downstream steps per stopper--clerk approach, store entry, assisted experience, and purchase--decreased. Interviews explained this divergence: clerks avoided interrupting ongoing robot-customer talk, struggled with ambiguous timing amid conversational latency, and noted child-centered attraction that often satisfied curiosity at the doorway. The fixture amplified visibility but also anchored encounters at the threshold, creating a well-defined micro-space where needs could ``close'' without moving inside. We synthesize these strands into an integrative account spanning from a customer's initial show of interest to their entry into the store, and derive actionable guidance. The results advance the understanding of interactions between customers, staff members, and the robot and offer practical recommendations for deploying service robots in high-touch retail.
☆ CausalNav: A Long-term Embodied Navigation System for Autonomous Mobile Robots in Dynamic Outdoor Scenarios
Autonomous language-guided navigation in large-scale outdoor environments remains a key challenge in mobile robotics, due to difficulties in semantic reasoning, dynamic conditions, and long-term stability. We propose CausalNav, the first scene graph-based semantic navigation framework tailored for dynamic outdoor environments. We construct a multi-level semantic scene graph using LLMs, referred to as the Embodied Graph, that hierarchically integrates coarse-grained map data with fine-grained object entities. The constructed graph serves as a retrievable knowledge base for Retrieval-Augmented Generation (RAG), enabling semantic navigation and long-range planning under open-vocabulary queries. By fusing real-time perception with offline map data, the Embodied Graph supports robust navigation across varying spatial granularities in dynamic outdoor environments. Dynamic objects are explicitly handled in both the scene graph construction and hierarchical planning modules. The Embodied Graph is continuously updated within a temporal window to reflect environmental changes and support real-time semantic navigation. Extensive experiments in both simulation and real-world settings demonstrate superior robustness and efficiency.
comment: Accepted by IEEE Robotics and Automation Letters (RA-L)
☆ DisCo-FLoc: Using Dual-Level Visual-Geometric Contrasts to Disambiguate Depth-Aware Visual Floorplan Localization
Since floorplan data is readily available, long-term persistent, and robust to changes in visual appearance, visual Floorplan Localization (FLoc) has garnered significant attention. Existing methods either ingeniously match geometric priors or utilize sparse semantics to reduce FLoc uncertainty. However, they still suffer from ambiguous FLoc caused by repetitive structures within minimalist floorplans. Moreover, expensive but limited semantic annotations restrict their applicability. To address these issues, we propose DisCo-FLoc, which utilizes dual-level visual-geometric contrasts to disambiguate depth-aware visual FLoc without requiring additional semantic labels. Our solution begins with a ray regression predictor tailored for ray-casting-based FLoc, predicting a series of FLoc candidates using depth estimation expertise. In addition, a novel contrastive learning method with position-level and orientation-level constraints is proposed to strictly match depth-aware visual features with the corresponding geometric structures in the floorplan. Such matches can effectively eliminate FLoc ambiguity and select the optimal imaging pose from the FLoc candidates. Exhaustive comparative studies on two standard visual FLoc benchmarks demonstrate that our method outperforms the state-of-the-art semantic-based method, achieving significant improvements in both robustness and accuracy.
comment: 7 pages, 4 figures
☆ AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving
End-to-end autonomous driving has rapidly progressed, enabling joint perception and planning in complex environments. In the planning stage, state-of-the-art (SOTA) end-to-end autonomous driving models decouple planning into parallel lateral and longitudinal predictions. While effective, this parallel design can lead to i) coordination failures between the planned path and speed, and ii) underutilization of the drive path as a prior for longitudinal planning, thus redundantly encoding static information. To address this, we propose a novel cascaded framework that explicitly conditions longitudinal planning on the drive path, enabling coordinated and collision-aware lateral and longitudinal planning. Specifically, we introduce a path-conditioned formulation that explicitly incorporates the drive path into longitudinal planning. Building on this, the model predicts longitudinal displacements along the drive path rather than full 2D trajectory waypoints. This design simplifies longitudinal reasoning and more tightly couples it with lateral planning. Additionally, we introduce a planning-oriented data augmentation strategy that simulates rare safety-critical events, such as vehicle cut-ins, by adding agents and relabeling longitudinal targets to avoid collision. Evaluated on the challenging Bench2Drive benchmark, our method sets a new SOTA, achieving a driving score of 89.07 and a success rate of 73.18%, demonstrating significantly improved coordination and safety.
comment: under review
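The path-conditioned formulation implies a simple post-processing step: longitudinal displacements predicted along the drive path are mapped back to 2D waypoints by arc-length interpolation. A sketch of that mapping (our own interpolation, not the paper's code):

```python
import numpy as np

def displacements_to_waypoints(path_xy, s_pred):
    """Map longitudinal displacements along a drive path to 2D waypoints.

    path_xy: (N, 2) densely sampled drive path from the lateral planner.
    s_pred: (T,) cumulative arc-length displacements from the
            longitudinal head (assumed within the path's total length).
    """
    seg = np.linalg.norm(np.diff(path_xy, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])   # arc length at each vertex
    xs = np.interp(s_pred, s, path_xy[:, 0])
    ys = np.interp(s_pred, s, path_xy[:, 1])
    return np.stack([xs, ys], axis=1)             # (T, 2) waypoints
```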
☆ Simulations and Advancements in MRI-Guided Power-Driven Ferric Tools for Wireless Therapeutic Interventions
Designing a robotic system that functions effectively within the specific environment of a Magnetic Resonance Imaging (MRI) scanner requires solving numerous technical issues, such as maintaining the robot's precision and stability under strong magnetic fields. This research focuses on enhancing MRI's role in medical imaging, especially in its application to guide intravascular interventions using robot-assisted devices. A newly developed computational system is introduced, designed for seamless integration with the MRI scanner, including a computational unit and user interface. This system processes MR images to delineate the vascular network, establishing virtual paths and boundaries within vessels to prevent procedural damage. Key findings reveal the system's capability to create tailored magnetic field gradient patterns for device control, considering the vessel's geometry and safety norms, and adapting to different blood flow characteristics for finer navigation. Additionally, the system's modeling aspect assesses the safety and feasibility of navigating pre-set vascular paths. In conclusion, this system, based on the Qt framework and C/C++, with specialized software modules, represents a major step forward in merging imaging technology with robotic aid, significantly enhancing precision and safety in intravascular procedures.
comment: 10 pages, 7 figures
☆ Explicit World Models for Reliable Human-Robot Collaboration AAAI-26
This paper addresses the topic of robustness under sensing noise, ambiguous instructions, and human-robot interaction. We take a radically different tack to the issue of reliable embodied AI: instead of focusing on formal verification methods aimed at achieving model predictability and robustness, we emphasise the dynamic, ambiguous and subjective nature of human-robot interactions that requires embodied AI systems to perceive, interpret, and respond to human intentions in a manner that is consistent, comprehensible and aligned with human expectations. We argue that when embodied agents operate in human environments that are inherently social, multimodal, and fluid, reliability is contextually determined and only has meaning in relation to the goals and expectations of humans involved in the interaction. This calls for a fundamentally different approach to achieving reliable embodied AI that is centred on building and updating an accessible "explicit world model" representing the common ground between human and AI, that is used to align robot behaviours with human expectations.
comment: Accepted to AAAI-26 Bridge Program B10: Making Embodied AI Reliable with Testing and Formal Verification
☆ Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems
Real-time lane detection in embedded systems encounters significant challenges due to subtle and sparse visual signals in RGB images, often under tight constraints on computational resources and power consumption. Although deep learning models for lane detection can be categorized into segmentation-based, anchor-based, and curve-based methods, there remains a scarcity of universally applicable optimization techniques tailored for low-power embedded environments. To overcome this, we propose an innovative Covariance Distribution Optimization (CDO) module specifically designed for efficient, real-time applications. The CDO module aligns lane feature distributions closely with ground-truth labels, significantly enhancing detection accuracy without increasing computational complexity. Evaluations were conducted on six diverse models across all three method categories, including two optimized for real-time applications and four state-of-the-art (SOTA) models, tested comprehensively on three major datasets: CULane, TuSimple, and LLAMAS. Experimental results demonstrate accuracy improvements ranging from 0.01% to 1.5%. The proposed CDO module is characterized by ease of integration into existing systems without structural modifications and utilizes existing model parameters to facilitate ongoing training, thus offering substantial benefits in performance, power efficiency, and operational flexibility in embedded systems.
☆ AMC26: High-performance DOb for robust position control
This paper presents a new high-performance disturbance observer (HPDOb) that significantly improves disturbance estimation accuracy and robustness in motion control systems, surpassing the capabilities of conventional disturbance observers (DObs). The proposed observer is analysed and synthesised in the discrete-time domain, providing a realistic representation of its dynamic behaviour and enabling enhanced controller design for practical applications. The core contribution of the HPDOb is a novel synthesis method that incorporates higher-order truncation error dynamics into disturbance estimation. Unlike conventional DObs, which are limited to zero-order truncation error, the HPDOb achieves first-order truncation error, yielding markedly improved estimation accuracy and robustness against disturbances in motion control systems. Simulations and experiments verify the stability and performance of the HPDOb.
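For context, the conventional first-order discrete-time DOb that the HPDOb improves on amounts to low-pass filtering the portion of the input unexplained by the nominal model; the higher-order truncation-error terms that define the HPDOb are not reproduced in this sketch:

```python
def dob_step(d_hat, u, v, v_prev, m_n, g, Ts):
    """One update of a conventional first-order discrete disturbance observer.

    d_hat: previous disturbance estimate; u: applied input;
    v, v_prev: current/previous velocity; m_n: nominal inertia;
    g: observer bandwidth [rad/s]; Ts: sample time [s].
    """
    accel = (v - v_prev) / Ts           # backward-difference acceleration
    residual = u - m_n * accel          # input unexplained by nominal model
    return d_hat + g * Ts * (residual - d_hat)   # first-order low-pass
```

The estimate converges to the low-pass-filtered disturbance; the backward difference is exactly the zero-order truncation error the paper targets.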
☆ Learning and Optimizing the Efficacy of Spatio-Temporal Task Allocation under Temporal and Resource Constraints
Complex multi-robot missions often require heterogeneous teams to jointly optimize task allocation, scheduling, and path planning to improve team performance under strict constraints. We formalize these complexities into a new class of problems, dubbed Spatio-Temporal Efficacy-optimized Allocation for Multi-robot systems (STEAM). STEAM builds upon trait-based frameworks that model robots using their capabilities (e.g., payload and speed), but goes beyond the typical binary success-failure model by explicitly modeling the efficacy of allocations as trait-efficacy maps. These maps encode how the aggregated capabilities assigned to a task determine performance. Further, STEAM accommodates spatio-temporal constraints, including a user-specified time budget (i.e., maximum makespan). To solve STEAM problems, we contribute a novel algorithm named Efficacy-optimized Incremental Task Allocation Graph Search (E-ITAGS) that simultaneously optimizes task performance and respects time budgets by interleaving task allocation, scheduling, and path planning. Motivated by the fact that trait-efficacy maps are difficult, if not impossible, to specify, E-ITAGS efficiently learns them using a realizability-aware active learning module. Our approach is realizability-aware since it explicitly accounts for the fact that not all combinations of traits are realizable by the robots available during learning. Further, we derive experimentally-validated bounds on E-ITAGS' suboptimality with respect to efficacy. Detailed numerical simulations and experiments using an emergency response domain demonstrate that E-ITAGS generates allocations of higher efficacy compared to baselines, while respecting resource and spatio-temporal constraints. We also show that our active learning approach is sample efficient and establishes a principled tradeoff between data and computational efficiency.
comment: Journal extension of our conference paper arXiv:2404.07902, which was accepted by ISRR 2024
☆ InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
Prevalent Vision-Language-Action (VLA) models are typically built upon Multimodal Large Language Models (MLLMs) and demonstrate exceptional proficiency in semantic understanding, but they inherently lack the capability to deduce physical world dynamics. Consequently, recent approaches have shifted toward World Models, typically formulated via video prediction; however, these methods often suffer from a lack of semantic grounding and exhibit brittleness when handling prediction errors. To synergize semantic understanding with dynamic predictive capabilities, we present InternVLA-A1. This model employs a unified Mixture-of-Transformers architecture, coordinating three experts for scene understanding, visual foresight generation, and action execution. These components interact seamlessly through a unified masked self-attention mechanism. Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. We pre-train these models on hybrid synthetic-real datasets spanning InternData-A1 and Agibot-World, covering over 533M frames. This hybrid training strategy effectively harnesses the diversity of synthetic simulation data while minimizing the sim-to-real gap. We evaluated InternVLA-A1 across 12 real-world robotic tasks and a simulation benchmark. It significantly outperforms leading models such as $\pi_0$ and GR00T N1.5, achieving a 14.5% improvement in daily tasks and a 40%-73.3% boost in dynamic settings, such as conveyor-belt sorting.
comment: Homepage: https://internrobotics.github.io/internvla-a1.github.io/
♻ ☆ LIMOncello: Iterated Error-State Kalman Filter on the SGal(3) Manifold for Fast LiDAR-Inertial Odometry
This work introduces LIMOncello, a tightly coupled LiDAR-Inertial Odometry system that models 6-DoF motion on the $\mathrm{SGal}(3)$ manifold within an iterated error-state Kalman filter backend. Compared to state representations defined on $\mathrm{SO}(3)\times\mathbb{R}^6$, the use of $\mathrm{SGal}(3)$ provides a coherent and numerically stable discrete-time propagation model that helps limit drift in low-observability conditions. LIMOncello also includes a lightweight incremental i-Octree mapping backend that enables faster updates and substantially lower memory usage than incremental kd-tree style map structures, without relying on locality-restricted search heuristics. Experiments on multiple real-world datasets show that LIMOncello achieves competitive accuracy while improving robustness in geometrically sparse environments. The system maintains real-time performance with stable memory growth and is released as an extensible open-source implementation at https://github.com/CPerezRuiz335/LIMOncello.
♻ ☆ FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation
Handling fragile objects remains a major challenge for robotic manipulation. Tactile sensing and soft robotics can improve delicate object handling, but typically involve high integration complexity or slow response times. We address these issues through FORTE, an easy-to-fabricate tactile sensing system. FORTE uses 3D-printed fin-ray grippers with internal air channels to provide low-latency force and slip feedback. This feedback allows us to apply just enough force to grasp objects without damaging them. We accurately estimate grasping forces from 0-8 N with an average error of 0.2 N, and detect slip events within 100 ms of occurring. FORTE can grasp a wide range of slippery, fragile, and deformable objects, including raspberries and potato chips with 92% success and achieves 93% accuracy in detecting slip events. These results highlight FORTE's potential as a robust solution for delicate robotic manipulation. Project page: https://merge-lab.github.io/FORTE/
♻ ☆ Volume-Consistent Kneading-Based Deformation Manufacturing for Material-Efficient Shaping
Conventional subtractive manufacturing inevitably involves material loss during geometric realization, while additive manufacturing still suffers from limitations in surface quality, process continuity, and productivity when fabricating complex geometries. To address these challenges, this paper proposes a volume-consistent kneading-based forming method for plastic materials, enabling continuous and controllable three-dimensional deformation under mass conservation. An integrated kneading-based manufacturing system is developed, in which geometry-aware kneading command generation, layer-wise kneading execution, and in-process point-cloud scanning are tightly coupled to form a closed-loop workflow of scanning, forming, and feedback compensation. Target geometries are analyzed through layer-wise point-cloud processing and classified into enveloping and non-enveloping types. Accordingly, an Envelope Shaping First strategy and a Similar Gradient Method are adopted to ensure stable material flow and continuous deformation. An RMSE-based compensation scheme is further introduced to correct systematic geometric deviations induced by elastic rebound and material redistribution. Experimental validation on five representative geometries demonstrates high geometric fidelity, with material utilization consistently exceeding 98%. The results indicate that kneading-based forming provides a promising alternative manufacturing paradigm for low-waste, customizable production.
comment: 39 pages, 31 figures
♻ ☆ Compositional Diffusion with Guided Search for Long-Horizon Planning
Generative models have emerged as powerful tools for planning, with compositional approaches offering particular promise for modeling long-horizon task distributions by composing together local, modular generative models. This compositional paradigm spans diverse domains, from multi-step manipulation planning to panoramic image synthesis to long video generation. However, compositional generative models face a critical challenge: when local distributions are multimodal, existing composition methods average incompatible modes, producing plans that are neither locally feasible nor globally coherent. We propose Compositional Diffusion with Guided Search (CDGS), which addresses this mode averaging problem by embedding search directly within the diffusion denoising process. Our method explores diverse combinations of local modes through population-based sampling, prunes infeasible candidates using likelihood-based filtering, and enforces global consistency through iterative resampling between overlapping segments. CDGS matches oracle performance on seven robot manipulation tasks, outperforming baselines that lack compositionality or require long-horizon training data. The approach generalizes across domains, enabling coherent text-guided panoramic images and long videos through effective local-to-global message passing. More details: https://cdgsearch.github.io/
comment: 38 pages, 18 figures
♻ ☆ Interconnection and Damping Assignment Passivity-Based Control using Sparse Neural ODEs
Interconnection and Damping Assignment Passivity-Based Control (IDA-PBC) is a nonlinear control technique that assigns a port-Hamiltonian (pH) structure to a controlled system using a state-feedback law. While IDA-PBC has been extensively studied and applied to many systems, its practical implementation often remains confined to academic examples and, almost exclusively, to stabilization tasks. The main limitation of IDA-PBC stems from the complexity of analytically solving a set of partial differential equations (PDEs), referred to as the matching conditions, which enforce the pH structure of the closed-loop system. However, this is extremely challenging, especially for complex physical systems and tasks. In this work, we propose a novel numerical approach for designing IDA-PBC controllers without solving the matching PDEs exactly. We cast the IDA-PBC problem as the learning of a neural ordinary differential equation. In particular, we rely on sparse dictionary learning to parametrize the desired closed-loop system as a sparse linear combination of nonlinear state-dependent functions. Optimization of the controller parameters is achieved by solving a multi-objective optimization problem whose cost function is composed of a generic task-dependent cost and a matching condition-dependent cost. Our numerical results show that the proposed method enables (i) IDA-PBC to be applicable to complex tasks beyond stabilization, such as the discovery of periodic oscillatory behaviors, (ii) the derivation of closed-form expressions of the controlled system, including residual terms in case of approximate matching, and (iii) stability analysis of the learned controller.
♻ ☆ Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models
The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy
comment: 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/
♻ ☆ MOON: Multi-Objective Optimization-Driven Object-Goal Navigation Using a Variable-Horizon Set-Orienteering Planner
This paper proposes MOON (Multi-Objective Optimization-driven Object-goal Navigation), a novel framework designed for efficient navigation in large-scale, complex indoor environments. While existing methods often rely on local heuristics, they frequently fail to address the strategic trade-offs between competing objectives in vast areas. To overcome this, we formulate the task as a multi-objective optimization problem (MOO) that balances frontier-based exploration with the exploitation of observed landmarks. Our prototype integrates three key pillars: (1) QOM [IROS05] for discriminative landmark encoding; (2) StructNav [RSS23] to enhance the navigation pipeline; and (3) a variable-horizon Set Orienteering Problem (SOP) formulation for globally coherent planning. To further support the framework's scalability, we provide a detailed theoretical foundation for the budget-constrained SOP formulation and the data-driven mode-switching strategy that enables long-horizon resource allocation. Additionally, we introduce a high-speed neural planner that distills the expert solver into a transformer-based model, reducing decision latency by a factor of nearly 10 while maintaining high planning quality.
comment: 9 pages, 7 figures, technical report
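As a toy stand-in for the variable-horizon SOP solver, a budget-constrained greedy heuristic conveys the reward-versus-travel-cost trade-off the planner optimizes (illustrative only; the paper distills an expert solver into a transformer-based model):

```python
import numpy as np

def greedy_orienteering(start, points, rewards, budget):
    """Budget-constrained greedy heuristic for orienteering-style planning.

    Repeatedly visits the unvisited point with the best reward-per-distance
    ratio that still fits within the remaining travel budget.
    """
    pos, remaining, order = np.asarray(start, float), budget, []
    todo = {i: np.asarray(p, float) for i, p in enumerate(points)}
    while todo:
        dists = {i: np.linalg.norm(p - pos) for i, p in todo.items()}
        feasible = {i: rewards[i] / max(d, 1e-9)
                    for i, d in dists.items() if d <= remaining}
        if not feasible:
            break
        best = max(feasible, key=feasible.get)
        remaining -= dists[best]
        pos = todo.pop(best)
        order.append(best)
    return order  # indices of visited landmarks/frontiers, in order
```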
♻ ☆ Multi-Robot Data-Free Continual Communicative Learning (CCL) from Black-Box Visual Place Recognition Models
In emerging multi-robot societies, heterogeneous agents must continually extract and integrate local knowledge from one another through communication, even when their internal models are completely opaque. Existing approaches to continual or collaborative learning for visual place recognition (VPR) largely assume white-box access to model parameters or shared training datasets, which is unrealistic when robots encounter unknown peers in the wild. This paper introduces Continual Communicative Learning (CCL), a data-free multi-robot framework in which a traveler robot (student) continually improves its VPR capability by communicating with black-box teacher models via a constrained query--response channel. We repurpose Membership Inference Attacks (MIA), originally developed as privacy attacks on machine learning models, as a constructive communication primitive to reconstruct pseudo-training sets from black-box VPR teachers without accessing their parameters or raw data. To overcome the intrinsic communication bottleneck caused by the low sampling efficiency of black-box MIA, we propose a prior-based query strategy that leverages the student's own VPR prior to focus queries on informative regions of the embedding space, thereby reducing the knowledge transfer (KT) cost. Experimental results on a standard multi-session VPR benchmark demonstrate that the proposed CCL framework yields substantial performance gains for low-performing robots under modest communication budgets, highlighting CCL as a promising building block for scalable and fault-tolerant multi-robot systems.
comment: 6 pages, 4 figures, technical report
♻ ☆ RoboBPP: Benchmarking Robotic Online Bin Packing with Physics-based Simulation
Physical feasibility in 3D bin packing is a key requirement in modern industrial logistics and robotic automation. With the growing adoption of industrial automation, online bin packing has gained increasing attention. However, inconsistencies in problem settings, test datasets, and evaluation metrics have hindered progress in the field, and there is a lack of a comprehensive benchmarking system. Direct testing on real hardware is costly, and building a realistic simulation environment is also challenging. To address these limitations, we introduce RoboBPP, a benchmarking system designed for robotic online bin packing. RoboBPP integrates a physics-based simulator to assess physical feasibility. In our simulation environment, we introduce a robotic arm and boxes at real-world scales to replicate real industrial packing workflows. By simulating conditions that arise in real industrial applications, we ensure that evaluated algorithms are practically deployable. In addition, prior studies often rely on synthetic datasets whose distributions differ from real-world industrial data. To address this issue, we collect three datasets from real industrial workflows, including assembly-line production, logistics packing, and furniture manufacturing. The benchmark comprises three carefully designed test settings and extends existing evaluation metrics with new metrics for structural stability and operational safety. We design a scoring system and derive a range of insights from the evaluation results. RoboBPP is fully open-source and is equipped with visualization tools and an online leaderboard, providing a reproducible and extensible foundation for future research and industrial applications (https://robot-bin-packing-benchmark.github.io).
comment: Under review at the International Journal of Robotics Research (IJRR)
♻ ☆ RNBF: Real-Time RGB-D Based Neural Barrier Functions for Safe Robotic Navigation
Autonomous safe navigation in unstructured and novel environments poses significant challenges, especially when environment information can only be provided through low-cost vision sensors. Although safe reactive approaches have been proposed to ensure robot safety in complex environments, many base their theory on the assumption that the robot has prior knowledge of obstacle locations and geometries. In this paper, we present a real-time, vision-based framework that constructs continuous, first-order differentiable Signed Distance Fields (SDFs) of unknown environments entirely online, without any pre-training, and is fully compatible with established SDF-based reactive controllers. To achieve robust performance under practical sensing conditions, our approach explicitly accounts for noise in affordable RGB-D cameras, refining the neural SDF representation online for smoother geometry and stable gradient estimates. We validate the proposed method in simulation and real-world experiments using a Fetch robot. Videos and supplementary material are available at https://satyajeetburla.github.io/rnbf/.
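A learned SDF plugs directly into standard reactive safety filters: with barrier h(x) = SDF(x) - margin, a nominal command is minimally corrected so that h cannot decay faster than a chosen rate. A generic closed-form sketch of such a filter (not the specific controller used in the paper):

```python
import numpy as np

def sdf_safety_filter(u_nom, sdf_val, sdf_grad, alpha=1.0, margin=0.1):
    """CBF-style safety filter for a velocity-controlled robot.

    Barrier: h(x) = SDF(x) - margin. Enforce grad_h . u >= -alpha * h by
    projecting the nominal command u_nom onto the safe half-space; this is
    the closed-form solution of the usual single-constraint CBF-QP.
    """
    u_nom = np.asarray(u_nom, float)
    g = np.asarray(sdf_grad, float)
    h = sdf_val - margin
    slack = g @ u_nom + alpha * h
    if slack >= 0.0:                    # nominal command is already safe
        return u_nom
    return u_nom - (slack / (g @ g + 1e-9)) * g   # minimal correction
```

This is exactly where the paper's smooth geometry and stable gradient estimates matter: the filter consumes both the SDF value and its gradient at every control step.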
♻ ☆ SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling
Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action descriptions specifically for surgical robots. We then built SurgWorld on the most advanced physical AI world model and SATA; it is able to generate diverse, generalizable, and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudo-kinematics from synthetic surgical videos, producing synthetic paired video-action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data-efficient surgical robot policies.
♻ ☆ Openpi Comet: Competition Solution For 2025 BEHAVIOR Challenge
The 2025 BEHAVIOR Challenge is designed to rigorously track progress toward solving long-horizon tasks by physical agents in simulated environments. BEHAVIOR-1K focuses on everyday household tasks that people most want robots to assist with; these tasks introduce long-horizon mobile manipulation challenges in realistic settings, bridging the gap between current research and real-world, human-centric applications. This report presents our solution to the 2025 BEHAVIOR Challenge, which took a very close 2nd place and substantially outperformed the rest of the submissions. Building on $\pi_{0.5}$, we focus on systematically building our solution by studying the effects of training techniques and data. Through careful ablation studies, we reveal scaling benefits in both the pre-training and post-training phases, leading to a validation Q-score of 0.345, significantly surpassing previous state-of-the-art performance. We summarize our practical lessons and design recommendations that we hope will provide actionable insights for the broader embodied AI community when adapting powerful foundation models to complex embodied scenarios. Project page: https://github.com/mli0603/openpi-comet
comment: Post-challenge bug fix
Robotics 21
☆ VisuoTactile 6D Pose Estimation of an In-Hand Object using Vision and Tactile Sensor Data ICRA 2022
Knowledge of the 6D pose of an object can benefit in-hand object manipulation. In-hand 6D object pose estimation is challenging because of heavy occlusion produced by the robot's grippers, which can have an adverse effect on methods that rely on vision data only. Many robots are equipped with tactile sensors at their fingertips that could be used to complement vision data. In this paper, we present a method that uses both tactile and vision data to estimate the pose of an object grasped in a robot's hand. To address challenges like lack of standard representation for tactile data and sensor fusion, we propose the use of point clouds to represent object surfaces in contact with the tactile sensor and present a network architecture based on pixel-wise dense fusion. We also extend NVIDIA's Deep Learning Dataset Synthesizer to produce synthetic photo-realistic vision data and corresponding tactile point clouds. Results suggest that using tactile data in addition to vision data improves the 6D pose estimate, and our network generalizes successfully from synthetic training to real physical robots.
comment: Accepted for publication in IEEE Robotics and Automation Letters (RA-L), January 2022. Presented at ICRA 2022. This is the author's version of the manuscript
☆ DemoBot: Efficient Learning of Bimanual Manipulation with Dexterous Hands From Third-Person Human Videos
This work presents DemoBot, a learning framework that enables a dual-arm, multi-finger robotic system to acquire complex manipulation skills from a single unannotated RGB-D video demonstration. The method extracts structured motion trajectories of both hands and objects from raw video data. These trajectories serve as motion priors for a novel reinforcement learning (RL) pipeline that learns to refine them through contact-rich interactions, thereby eliminating the need to learn from scratch. To address the challenge of learning long-horizon manipulation skills, we introduce: (1) Temporal-segment based RL to enforce temporal alignment of the current state with demonstrations; (2) Success-Gated Reset strategy to balance the refinement of readily acquired skills and the exploration of subsequent task stages; and (3) Event-Driven Reward curriculum with adaptive thresholding to guide the RL learning of high-precision manipulation. The novel video processing and RL framework successfully achieved long-horizon synchronous and asynchronous bimanual assembly tasks, offering a scalable approach for direct skill acquisition from human videos.
☆ Action-Sketcher: From Reasoning to Action via Visual Sketches for Long-Horizon Robotic Manipulation
Long-horizon robotic manipulation is increasingly important for real-world deployment, requiring spatial disambiguation in complex layouts and temporal resilience under dynamic interaction. However, existing end-to-end and hierarchical Vision-Language-Action (VLA) policies often rely on text-only cues while keeping plan intent latent, which undermines referential grounding in cluttered or underspecified scenes, impedes effective task decomposition of long-horizon goals with closed-loop interaction, and limits causal explanation by obscuring the rationale behind action choices. To address these issues, we first introduce Visual Sketch, an interpretable visual intermediate that renders points, boxes, arrows, and typed relations in the robot's current views to externalize spatial intent and connect language to scene geometry. Building on Visual Sketch, we present Action-Sketcher, a VLA framework that operates in a cyclic See-Think-Sketch-Act workflow coordinated by an adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. To enable scalable training and evaluation, we curate a diverse corpus with interleaved images, text, Visual Sketch supervision, and action sequences, and train Action-Sketcher with a multi-stage curriculum recipe that combines interleaved sequence alignment for modality unification, language-to-sketch consistency for precise linguistic grounding, and imitation learning augmented with sketch-to-action reinforcement for robustness. Extensive experiments on cluttered scenes and multi-object tasks, in simulation and on real-world tasks, show improved long-horizon success, stronger robustness to dynamic scene changes, and enhanced interpretability via editable sketches and step-wise plans. Project website: https://action-sketcher.github.io
comment: 26 pages, 14 figures
☆ HanoiWorld: A Joint Embedding Predictive Architecture Based World Model for Autonomous Vehicle Control
Current attempts at reinforcement learning for autonomous vehicle control are data-demanding, while the resulting policies are under-performing, unstable, unable to grasp and anchor on the concept of safety, and prone to over-concentrating on noise features due to the nature of pixel reconstruction. Self-supervised learning approaches that learn high-dimensional representations by leveraging the Joint Embedding Predictive Architecture (JEPA) offer an effective alternative, as the idea mimics the human brain's natural ability to acquire new skills using imagination and minimal samples of observations. This study introduces HanoiWorld, a JEPA-based world model that uses a recurrent neural network (RNN) for long-horizon planning with efficient inference time. Experiments conducted on the Highway-Env package across different environments showcase an effective capability for safety-aware driving planning, with a competitive collision rate in comparison with SOTA baselines.
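Since the HanoiWorld abstract hinges on JEPA-style latent prediction rather than pixel reconstruction, here is a minimal sketch of what such a training step looks like in general; the encoder/predictor modules, action conditioning, and EMA update rule are generic assumptions, not the paper's implementation:

```python
# Minimal JEPA-style training step: predict the *latent* of the next
# observation instead of reconstructing pixels. Illustrative only.
import torch
import torch.nn.functional as F

def jepa_step(encoder, target_encoder, predictor, obs_t, action_t, obs_next):
    z_t = encoder(obs_t)                      # online latent
    with torch.no_grad():
        z_next = target_encoder(obs_next)     # EMA target latent (no grad)
    z_pred = predictor(torch.cat([z_t, action_t], dim=-1))
    return F.mse_loss(z_pred, z_next)         # no pixel reconstruction loss

@torch.no_grad()
def ema_update(target_encoder, encoder, tau=0.996):
    """Slowly track the online encoder to avoid representation collapse."""
    for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(tau).add_((1 - tau) * p)
```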
☆ AIMS: An Adaptive Integration of Multi-Sensor Measurements for Quadrupedal Robot Localization
This paper addresses the problem of accurate localization for quadrupedal robots operating in narrow tunnel-like environments. Due to the long and homogeneous characteristics of such scenarios, LiDAR measurements often provide weak geometric constraints, making traditional sensor fusion methods susceptible to accumulated motion estimation errors. To address these challenges, we propose AIMS, an adaptive LiDAR-IMU-leg odometry fusion method for robust quadrupedal robot localization in degenerate environments. The proposed method is formulated within an error-state Kalman filtering framework, where LiDAR and leg odometry measurements are integrated with IMU-based state prediction, and measurement noise covariance matrices are adaptively adjusted based on online degeneracy-aware reliability assessment. Experimental results obtained in narrow corridor environments demonstrate that the proposed method improves localization accuracy and robustness compared with state-of-the-art approaches.
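To make the degeneracy-aware adaptation in AIMS concrete, here is a minimal sketch of one filter update in which the LiDAR measurement covariance is inflated when the scan geometry is weakly constrained; the degeneracy score and inflation rule below are illustrative assumptions, not the paper's formulation:

```python
# Sketch of a degeneracy-aware measurement update in a Kalman-style filter.
import numpy as np

def degeneracy_score(H):
    """Smallest singular value of the measurement Jacobian H: near-zero
    values indicate weakly constrained directions (e.g., along a tunnel)."""
    return np.linalg.svd(H, compute_uv=False).min()

def adaptive_update(x, P, z, h, H, R0, sigma_min=1e-2, k=5.0):
    """One EKF/ESKF-style update where the LiDAR noise covariance R is
    inflated under degeneracy, shifting trust to the IMU/leg-odometry
    prediction. The inflation rule is an assumed heuristic."""
    s = degeneracy_score(H)
    inflation = 1.0 + k * max(0.0, (sigma_min - s) / sigma_min)
    R = R0 * inflation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    x = x + K @ (z - h(x))
    P = (np.eye(P.shape[0]) - K @ H) @ P
    return x, P
```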
☆ DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving
Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.
comment: 10 pages, 4 figures; Project Website: https://drivinggen-bench.github.io/
☆ Online Estimation and Manipulation of Articulated Objects
From refrigerators to kitchen drawers, humans interact with articulated objects effortlessly every day while completing household chores. For automating these tasks, service robots must be capable of manipulating arbitrary articulated objects. Recent deep learning methods have been shown to predict valuable priors on the affordance of articulated objects from vision. In contrast, many other works estimate object articulations by observing the articulation motion, but this requires the robot to already be capable of manipulating the object. In this article, we propose a novel approach combining these methods by using a factor graph for online estimation of articulation which fuses learned visual priors and proprioceptive sensing during interaction into an analytical model of articulation based on Screw Theory. With our method, a robotic system makes an initial prediction of articulation from vision before touching the object, and then quickly updates the estimate from kinematic and force sensing during manipulation. We evaluate our method extensively in both simulations and real-world robotic manipulation experiments. We demonstrate several closed-loop estimation and manipulation experiments in which the robot was capable of opening previously unseen drawers. In real hardware experiments, the robot achieved a 75% success rate for autonomous opening of unknown articulated objects.
comment: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in Autonomous Robots, and is available online at [Link will be updated when available]
☆ Sampling Strategy Design for Model Predictive Path Integral Control on Legged Robot Locomotion
Model Predictive Path Integral (MPPI) control has emerged as a powerful sampling-based optimal control method for complex, nonlinear, and high-dimensional systems. However, directly applying MPPI to legged robotic systems presents several challenges. This paper systematically investigates the role of sampling strategy design within the MPPI framework for legged robot locomotion. Based upon the idea of structured control parameterization, we explore and compare multiple sampling strategies within the framework, including both unstructured and spline-based approaches. Through extensive simulations on a quadruped robot platform, we evaluate how different sampling strategies affect control smoothness, task performance, robustness, and sample efficiency. The results provide new insights into the practical implications of sampling design for deploying MPPI on complex legged systems.
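As a concrete illustration of the sampling strategies compared above, the sketch below contrasts unstructured per-step noise with spline-parameterized sampling inside a generic MPPI loop; the noise scales and knot layout are placeholders rather than the paper's setup:

```python
# Minimal MPPI sampling sketch: unstructured vs. spline-based strategies.
import numpy as np

def mppi_weights(costs, lam=1.0):
    """Information-theoretic MPPI weights: softmax of negative cost."""
    beta = costs.min()
    w = np.exp(-(costs - beta) / lam)
    return w / w.sum()

def sample_unstructured(K, T, m, sigma):
    return sigma * np.random.randn(K, T, m)       # i.i.d. noise per step

def sample_spline(K, T, m, sigma, n_knots=5):
    """Sample noise only at a few knots and interpolate in time, giving
    smoother control perturbations with fewer effective parameters."""
    knots = sigma * np.random.randn(K, n_knots, m)
    t_knots = np.linspace(0, T - 1, n_knots)
    t = np.arange(T)
    out = np.empty((K, T, m))
    for k in range(K):
        for i in range(m):
            out[k, :, i] = np.interp(t, t_knots, knots[k, :, i])
    return out
```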
♻ ☆ How Robot Dogs See the Unseeable: Improving Visual Interpretability via Peering for Exploratory Robots
In vegetated environments, such as forests, exploratory robots play a vital role in navigating complex, cluttered environments where human access is limited and traditional equipment struggles. Visual occlusion from obstacles, such as foliage, can severely obstruct a robot's sensors, impairing scene understanding. We show that "peering", a characteristic side-to-side movement used by insects to overcome their visual limitations, can also allow robots to markedly improve visual reasoning under partial occlusion. This is accomplished by applying core signal processing principles, specifically optical synthetic aperture sensing, together with the vision reasoning capabilities of modern large multimodal models. Peering enables real-time, high-resolution, and wavelength-independent perception, which is crucial for vision-based scene understanding across a wide range of applications. The approach is low-cost and immediately deployable on any camera-equipped robot. We investigated different peering motions and occlusion masking strategies, demonstrating that, unlike peering, state-of-the-art multi-view 3D vision techniques fail in these conditions due to their high susceptibility to occlusion. Our experiments were carried out on an industrial-grade quadrupedal robot. However, the ability to peer is not limited to such platforms, but potentially also applicable to bipedal, hexapod, wheeled, or crawling platforms. Robots that can effectively see through partial occlusion will gain superior perception abilities - including enhanced scene understanding, situational awareness, camouflage breaking, and advanced navigation in complex environments.
♻ ☆ General Dynamic Goal Recognition using Goal-Conditioned and Meta Reinforcement Learning AAMAS 2026
Understanding an agent's goal through its behavior is a common AI problem called Goal Recognition (GR). This task becomes particularly challenging in dynamic environments where goals are numerous and ever-changing. We introduce the General Dynamic Goal Recognition (GDGR) problem, a broader definition of GR aimed at real-time adaptation of GR systems. This paper presents two novel approaches to tackle GDGR: (1) GC-AURA, generalizing to new goals using Model-Free Goal-Conditioned Reinforcement Learning, and (2) Meta-AURA, adapting to novel environments with Meta-Reinforcement Learning. We evaluate these methods across diverse environments, demonstrating their ability to achieve rapid adaptation and high GR accuracy under dynamic and noisy conditions. This work is a significant step forward in enabling GR in dynamic and unpredictable real-world environments.
comment: Accepted for publication at AAMAS 2026
♻ ☆ Standing Tall: Sim to Real Fall Classification and Lead Time Prediction for Bipedal Robots
This paper extends a previously proposed fall prediction algorithm to a real-time (online) setting, with implementations in both hardware and simulation. The system is validated on the full-sized bipedal robot Digit, where the real-time version achieves performance comparable to the offline implementation while maintaining a zero false positive rate, an average lead time (defined as the difference between the true and predicted fall time) of 1.1s (well above the required minimum of 0.2s), and a maximum lead time error of just 0.03s. It also achieves a high recovery rate of 0.97, demonstrating its effectiveness in real-world deployment. In addition to the real-time implementation, this work identifies key limitations of the original algorithm, particularly under omnidirectional faults, and introduces a fine-tuned strategy to improve robustness. The enhanced algorithm shows measurable improvements across all evaluated metrics, including a 0.05 reduction in average false positive rate and a 1.19s decrease in the maximum error of the average predicted lead time.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Phase-based Nonlinear Model Predictive Control for Humanoid Walking Stabilization with Single and Double Support Time Adjustments
The contact sequence of humanoid walking consists of single and double support phases (SSP and DSP), and their coordination through proper duration and dynamic transition based on the robot's state is crucial for maintaining walking stability. Numerous studies have investigated phase duration optimization as an effective means of improving walking stability. This paper presents a phase-based Nonlinear Model Predictive Control (NMPC) framework that jointly optimizes Zero Moment Point (ZMP) modulation, step location, SSP duration (step timing), and DSP duration within a single formulation. Specifically, the proposed framework reformulates the nonlinear DCM (Divergent Component of Motion) error dynamics into a phase-consistent representation and incorporates them as dynamic constraints within the NMPC. The proposed framework also guarantees ZMP input continuity during contact-phase transitions and disables footstep updates during the DSP, thereby enabling dynamically reliable balancing control regardless of whether the robot is in SSP or DSP. The effectiveness of the proposed method is validated through extensive simulation and hardware experiments, demonstrating improved balance performance under external disturbances.
comment: 15 pages, 9 figures
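For readers unfamiliar with the DCM dynamics that the phase-based NMPC above constrains, the standard linear-inverted-pendulum form (textbook material, not copied from the paper) is:

```latex
\xi = c + \frac{\dot{c}}{\omega_0}, \qquad
\dot{\xi} = \omega_0 \left( \xi - p_{\mathrm{ZMP}} \right), \qquad
\omega_0 = \sqrt{g / z_c},
```

where c is the center of mass, p_ZMP the zero moment point, and z_c the (constant) CoM height. Because the DCM xi is repelled from the ZMP, stabilization must coordinate ZMP modulation, step location, and phase durations, which is exactly the coupling the proposed NMPC optimizes jointly.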
♻ ☆ H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos
Large-scale pre-training using videos has proven effective for robot learning. However, the models pre-trained on such data can be suboptimal for robot learning due to the significant visual gap between human hands and those of different robots. To remedy this, we propose H2R, a simple data augmentation technique that detects human hand keypoints, synthesizes robot motions in simulation, and composites rendered robots into egocentric videos. This process explicitly bridges the visual gap between human and robot embodiments during pre-training. We apply H2R to augment large-scale egocentric human video datasets such as Ego4D and SSv2, replacing human hands with simulated robotic arms to generate robot-centric training data. Based on this, we construct and release a family of 1M-scale datasets covering multiple robot embodiments (UR5 with gripper/Leaphand, Franka) and data sources (SSv2, Ego4D). To verify the effectiveness of the augmentation pipeline, we introduce a CLIP-based image-text similarity metric that quantitatively evaluates the semantic fidelity of robot-rendered frames to the original human actions. We validate H2R across three simulation benchmarks: Robomimic, RLBench and PushT and real-world manipulation tasks with a UR5 robot equipped with Gripper and Leaphand end-effectors. H2R consistently improves downstream success rates, yielding gains of 5.0%-10.2% in simulation and 6.7%-23.3% in real-world tasks across various visual encoders and policy learning methods. These results indicate that H2R improves the generalization ability of robotic policies by mitigating the visual discrepancies between human and robot domains.
♻ ☆ Multimodal Classification Network Guided Trajectory Planning for Four-Wheel Independent Steering Autonomous Parking Considering Obstacle Attributes
Four-wheel Independent Steering (4WIS) vehicles have attracted increasing attention for their superior maneuverability. Human drivers typically choose to cross or drive over low-profile obstacles (e.g., plastic bags) to efficiently navigate through narrow spaces, while existing planners neglect obstacle attributes, leading to suboptimal efficiency or planning failures. To address this issue, we propose a novel multimodal trajectory planning framework that employs a neural network for scene perception, combines 4WIS hybrid A* search to generate a warm start, and utilizes an optimal control problem (OCP) for trajectory optimization. Specifically, a multimodal perception network fusing visual information and vehicle states is employed to capture semantic and contextual scene understanding, enabling the planner to adapt the strategy according to scene complexity (hard or easy task). For hard tasks, guided points are introduced to decompose complex tasks into local subtasks, improving the search efficiency. The multiple steering modes of 4WIS vehicles, Ackermann, diagonal, and zero-turn, are also incorporated as kinematically feasible motion primitives. Moreover, a hierarchical obstacle handling strategy, which categorizes obstacles as "non-traversable", "crossable", and "drive-over", is incorporated into the node expansion process, explicitly linking obstacle attributes to planning actions to enable efficient decisions. Furthermore, to address dynamic obstacles with motion uncertainty, we introduce a probabilistic risk field model, constructing risk-aware driving corridors that serve as linear collision constraints in OCP. Experimental results demonstrate the proposed framework's effectiveness in generating safe, efficient, and smooth trajectories for 4WIS vehicles, especially in constrained environments.
♻ ☆ Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some limitations by learning direct mappings from visual inputs to actions, but they remain opaque, sensitive to distribution shifts, and lack structured reasoning or instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work provides a structured characterization of the emerging VLA landscape for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks and organize existing methods into two principal paradigms: End-to-End VLA, which integrates perception, reasoning, and planning within a single model, and Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners). Within these paradigms, we further distinguish subclasses such as textual vs. numerical action generators and explicit vs. implicit guidance mechanisms. We also summarize representative datasets and benchmarks for evaluating VLA-based driving systems and highlight key challenges and open directions, including robustness, interpretability, and instruction fidelity. Overall, this work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.
comment: Survey; 47 pages, 7 figures, 9 tables; GitHub Repo at https://github.com/worldbench/awesome-vla-for-ad
♻ ☆ VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs
In most existing embodied navigation tasks, instructions are well-defined and unambiguous, such as instruction following and object searching. Under this idealized setting, agents are required solely to produce effective navigation outputs conditioned on vision and language inputs. However, real-world navigation instructions are often vague and ambiguous, requiring the agent to resolve uncertainty and infer user intent through active dialog. To address this gap, we propose Interactive Instance Goal Navigation (IIGN), a task that requires agents not only to generate navigation actions but also to produce language outputs via active dialog, thereby aligning more closely with practical settings. IIGN extends Instance Goal Navigation (IGN) by allowing agents to freely consult an oracle in natural language while navigating. Building on this task, we present the Vision Language-Language Navigation (VL-LN) benchmark, which provides a large-scale, automatically generated dataset and a comprehensive evaluation protocol for training and assessing dialog-enabled navigation models. VL-LN comprises over 41k long-horizon dialog-augmented trajectories for training and an automatic evaluation protocol with an oracle capable of responding to agent queries. Using this benchmark, we train a navigation model equipped with dialog capabilities and show that it achieves significant improvements over the baselines. Extensive experiments and analyses further demonstrate the effectiveness and reliability of VL-LN for advancing research on dialog-enabled embodied navigation. Code and dataset: https://0309hws.github.io/VL-LN.github.io/
♻ ☆ Evaluation of Impression Difference of a Domestic Mobile Manipulator with Autonomous and/or Remote Control in Fetch-and-Carry Tasks
A single service robot can present two distinct agencies: its onboard autonomy and an operator-mediated agency, yet users experience them through one physical body. We formalize this dual-agency structure as a User-Robot-Operator triad in an autonomous remote-control setting that integrates teleoperation with autonomous execution and human-in-the-loop remote assistance. Prior to the recent surge of language-based and multimodal interfaces, we developed and evaluated an early-stage prototype in 2020 that combined natural-language text chat with a sketch-based interface enabling freehand on-image annotation over the robot's live camera view to support remote intervention. We evaluated three modes - remote control via teleoperation, autonomous control, and autonomous remote control (a hybrid mode representing different levels of autonomy) - in controlled fetch-and-carry mobile manipulation tasks using a domestic mobile manipulator, the Human Support Robot (HSR), on a World Robot Summit 2020 rule-compliant test field. The results show systematic mode-dependent differences in user-rated affinity and perceived security, indicating that switching or blending agency within one robot measurably shapes human impressions in Human-Robot Interaction (HRI). These findings provide empirical guidance for designing human-in-the-loop mobile manipulation in domestic physical tasks.
comment: Published in Advanced Robotics (2020). v2 updates Abstract/Comments (metadata only); paper content unchanged. Please cite: Advanced Robotics 34(20):1291-1308, 2020. https://doi.org/10.1080/01691864.2020.1780152
♻ ☆ Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive performers capable of reacting to audio.
♻ ☆ RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion
Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.
♻ ☆ ManiBox: Enhancing Embodied Spatial Generalization via Scalable Simulation Data Generations
Embodied agents require robust spatial intelligence to execute precise real-world manipulations. However, this remains a significant challenge, as current methods often struggle to accurately position objects in space. Collecting extensive data can help address this issue by enhancing the agent's spatial understanding. Nonetheless, obtaining such data with real robots is prohibitively expensive, and relying on simulation data frequently leads to visual generalization gaps during real-world deployment. To tackle these challenges, we propose ManiBox, a novel bounding-box-guided framework. By decoupling perception from policy generalization, ManiBox effectively reduces the Sim2Real gap, leverages Internet-scale data, and scales our policy data collection in simulation. Specifically, within ManiBox, the RL teacher policy efficiently generates scalable simulation data. The student policy is distilled from this data and takes bounding boxes as input, which is proven sufficient for determining objects' spatial positions, thus enabling zero-shot transfer to real robots. Comprehensive evaluations in both simulated and real-world environments demonstrate that ManiBox exhibits strong spatial generalization and adaptability across various manipulation tasks and settings. Furthermore, our empirical study provides preliminary verification of spatial scaling laws, i.e., the amount of data required for spatial generalization scales with spatial volume following a power-law relationship. At a given spatial volume level, the success rate of manipulation tasks follows Michaelis-Menten kinetics with respect to data volume, exhibiting a saturation effect as data increases. Our videos and code are available at https://thkkk.github.io/manibox
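The Michaelis-Menten saturation noted at the end of the ManiBox abstract has the standard form (symbols renamed for this context; the paper's exact parameterization may differ):

```latex
s(d) = \frac{s_{\max}\, d}{K + d},
```

where s is the task success rate, d the data volume, s_max the saturation ceiling, and K the data volume at which half of the ceiling is reached, which captures the reported saturation effect as data increases.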
♻ ☆ LSRE: Latent Semantic Rule Encoding for Real-Time Semantic Risk Detection in Autonomous Driving
Real-world autonomous driving must adhere to complex human social rules that extend beyond legally codified traffic regulations. Many of these semantic constraints, such as yielding to emergency vehicles, complying with traffic officers' gestures, or stopping for school buses, are intuitive for humans yet difficult to encode explicitly. Although large vision-language models (VLMs) can interpret such semantics, their inference cost makes them impractical for real-time deployment. This work proposes LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within the latent space of a recurrent world model. By encoding language-defined safety semantics into a lightweight latent classifier, LSRE enables real-time semantic risk assessment at 10 Hz without per-frame VLM queries. Experiments on six semantic-failure scenarios in CARLA demonstrate that LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency. LSRE further generalizes to rarely seen semantic-similar test cases, indicating that language-guided latent classification offers an effective and deployable mechanism for semantic safety monitoring in autonomous driving.
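A minimal sketch of the pattern LSRE describes, distilling sparse VLM judgments into a lightweight classifier over world-model latents that can be queried every frame, might look as follows; the architecture, dimensions, and training setup are assumptions, not the paper's design:

```python
# Sketch: lightweight semantic-risk classifier over world-model latents.
import torch
import torch.nn as nn

class LatentRiskClassifier(nn.Module):
    def __init__(self, latent_dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z):                    # z: recurrent world-model latent
        return torch.sigmoid(self.net(z))    # P(semantic risk)

# Offline: query the VLM on sparsely sampled frames to label their latents,
# then fit the classifier with binary cross-entropy.
# Online (10 Hz): risk = classifier(world_model_latent) -- no VLM call needed.
```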
Robotics 23
☆ PyBatchRender: A Python Library for Batched 3D Rendering at Up to One Million FPS
Reinforcement learning from pixels is often bottlenecked by the performance and complexity of 3D rendered environments. Researchers face a trade-off between high-speed, low-level engines and slower, more accessible Python frameworks. To address this, we introduce PyBatchRender, a Python library for high-throughput, batched 3D rendering that achieves over 1 million FPS on simple scenes. Built on the Panda3D game engine, it utilizes its mature ecosystem while enhancing performance through optimized batched rendering for up to 1000X speedups. Designed as a physics-agnostic renderer for reinforcement learning from pixels, PyBatchRender offers greater flexibility than dedicated libraries, simpler setup than typical game-engine wrappers, and speeds rivaling state-of-the-art C++ engines like Madrona. Users can create custom scenes entirely in Python with tens of lines of code, enabling rapid prototyping for scalable AI training. Open-source and easy to integrate, it serves to democratize high-performance 3D simulation for researchers and developers. The library is available at https://github.com/dolphin-in-a-coma/PyBatchRender.
☆ SAHA: Supervised Autonomous HArvester for selective forest thinning
Forestry plays a vital role in our society, creating significant ecological, economic, and recreational value. Efficient forest management involves labor-intensive and complex operations. One essential task for maintaining forest health and productivity is selective thinning, which requires skilled operators to remove specific trees to create optimal growing conditions for the remaining ones. In this work, we present a solution based on a small-scale robotic harvester (SAHA) designed for executing this task with supervised autonomy. We build on a 4.5-ton harvester platform and implement key hardware modifications for perception and automatic control. We implement learning- and model-based approaches for precise control of hydraulic actuators, accurate navigation through cluttered environments, robust state estimation, and reliable semantic estimation of terrain traversability. Integrating state-of-the-art techniques in perception, planning, and control, our robotic harvester can autonomously navigate forest environments and reach targeted trees for selective thinning. We present experimental results from extensive field trials over kilometer-long autonomous missions in northern European forests, demonstrating the harvester's ability to operate in real forests. We analyze the performance and provide the lessons learned for advancing robotic forest management.
☆ Bridging Language Gaps: Utilizing Interactive Robots to Teach Cantonese in Real-Life Contexts for Newly-Arrived Children
Hong Kong's education system is notably multicultural, including local students, non-Chinese-speaking students, and newly arrived students (NAS) from Mandarin-Chinese-speaking backgrounds. NAS can often guess the meaning of Cantonese vocabulary but cannot speak it aloud, which presents unique challenges, particularly language barriers and cultural differences. These challenges hinder their academic success and social integration, leading to feelings of isolation and demotivation. Current resources often fail to address the emotional well-being of these students and predominantly focus on English language acquisition, leaving a gap in support for learning Cantonese and navigating the local cultural landscape. This study explores the effectiveness of an interactive robot, Boon Boon, in teaching Cantonese through real-life contexts to enhance NAS children's learning engagement and motivation. The research questions are: (1) How does interactive robot-empowered scenario learning influence the learning engagement and motivation of NAS in learning Cantonese? and (2) What is the impact of a robot-empowered scenario learning system on the Cantonese language proficiency of NAS? Fourteen children were invited to participate in a four-day learning program with Boon Boon. Preliminary results indicated that Boon Boon increased students' attention to learning and supported academic achievement. Future research will focus on long-term assessments of robot-empowered learning's effectiveness and explore the scalability of this approach across diverse educational settings and cultural backgrounds.
☆ LiveBo: Empowering Non-Chinese Speaking Students through AI-Driven Real-Life Scenarios in Cantonese
Language learning is a multifaceted process. Insufficient vocabulary can hinder communication and lead to demotivation. For non-Chinese speaking (NCS) students, learning Traditional Chinese (Cantonese) poses distinct challenges, particularly due to the complexity of converting between spoken and written forms. To address this issue, this study examines the effectiveness of real-life scenario simulations integrated with interactive social robots in enhancing NCS students' engagement and language acquisition. The research employs a quasi-experimental design involving NCS students who interact with an AI-driven, robot-assisted language learning system, LiveBo. The study aims to assess the impact of this innovative approach on active participation and motivation. Data are collected through proficiency tests, questionnaires, and semi-structured interviews. Findings indicate that NCS students experience positive improvements in behavioural and emotional engagement, motivation, and learning outcomes, highlighting the potential of integrating novel technologies in language education. We plan to compare with a control group in future work. This study highlights the significance of interactive and immersive learning experiences in promoting motivation and enhancing language acquisition among NCS students.
☆ MotiBo: The Impact of Interactive Digital Storytelling Robots on Student Motivation through Self-Determination Theory
Creativity is increasingly recognized as an important skill in education, and storytelling can enhance motivation and engagement among students. However, conventional storytelling methods often lack the interactive elements necessary to engage students. To this end, this study examines the impact of an interactive digital storytelling system incorporating a human-like robot on student engagement and creativity. The study aims to compare engagement levels across three modalities: paper-based, PowerPoint, and robot-assisted storytelling with MotiBo. Utilizing a quasi-experimental design, this work involves three groups of students who interact with the storytelling system over a five-day learning period. Findings reveal that students using MotiBo exhibit statistically significant improvement in behavioural and cognitive engagement compared to those using traditional methods. These results suggest that the integration of novel technologies can effectively enhance the learning experience, ultimately promoting creativity and self-learning ability in educational settings. Future research will investigate the long-term effects of these technologies on learning outcomes and explore their potential for broader applications in diverse educational contexts.
☆ Real-Time LiDAR Point Cloud Densification for Low-Latency Spatial Data Transmission
To realize a low-latency spatial transmission system for immersive telepresence, there are two major problems: capturing a dynamic 3D scene densely and processing it in real time. LiDAR sensors capture 3D in real time but produce sparse point clouds. Therefore, this paper presents a high-speed LiDAR point cloud densification method to generate dense 3D scenes with minimal latency, addressing the need for on-the-fly depth completion while maintaining real-time performance. Our approach combines multiple LiDAR inputs with high-resolution color images and applies a joint bilateral filtering strategy implemented through a convolutional neural network architecture. Experiments demonstrate that the proposed method produces dense depth maps at full HD resolution in real time (30 fps), which is over 15x faster than a recent training-based depth completion approach. The resulting dense point clouds exhibit accurate geometry without multiview inconsistencies or ghosting artifacts.
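For intuition about the joint bilateral strategy described above, here is a naive (and deliberately slow) NumPy version of color-guided depth densification; the paper realizes this idea inside a CNN, so this sketch is for illustration only:

```python
# Naive joint bilateral filter for guided depth densification.
import numpy as np

def joint_bilateral_depth(sparse_depth, guide, radius=4,
                          sigma_s=3.0, sigma_r=0.1):
    """sparse_depth: HxW float array, zeros at missing pixels;
    guide: HxW intensity image in [0, 1]."""
    H, W = sparse_depth.shape
    out = np.zeros_like(sparse_depth)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    w_s = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))     # spatial kernel
    for y in range(radius, H - radius):
        for x in range(radius, W - radius):
            patch_d = sparse_depth[y - radius:y + radius + 1,
                                   x - radius:x + radius + 1]
            patch_g = guide[y - radius:y + radius + 1,
                            x - radius:x + radius + 1]
            # range kernel from the *color* guide, not the depth itself
            w_r = np.exp(-((patch_g - guide[y, x])**2) / (2 * sigma_r**2))
            w = w_s * w_r * (patch_d > 0)                 # valid samples only
            if w.sum() > 0:
                out[y, x] = (w * patch_d).sum() / w.sum()
    return out
```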
☆ EduSim-LLM: An Educational Platform Integrating Large Language Models and Robotic Simulation for Beginners
In recent years, the rapid development of Large Language Models (LLMs) has significantly enhanced natural language understanding and human-computer interaction, creating new opportunities in the field of robotics. However, integrating natural language understanding into robotic control remains an important challenge amid the rapid development of the human-robot interaction and intelligent automation industries. This challenge hinders intuitive human control over complex robotic systems, limiting their educational and practical accessibility. To address this, we present EduSim-LLM, an educational platform that integrates LLMs with robot simulation and constructs a language-driven control model that translates natural language instructions into executable robot behavior sequences in CoppeliaSim. We design two human-robot interaction modes: direct control and autonomous control, conduct systematic simulations based on multiple language models, and evaluate multi-robot collaboration, motion planning, and manipulation capabilities. Experimental results show that LLMs can reliably convert natural language into structured robot actions; after applying prompt-engineering templates, instruction-parsing accuracy improves significantly; and as task complexity increases, the overall accuracy rate still exceeds 88.9% in the highest-complexity tests.
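A minimal sketch of the general language-to-action pattern the platform describes, constraining the LLM to a structured schema and validating its reply before execution, could look like this; the JSON schema and primitive names are hypothetical, not EduSim-LLM's actual interface:

```python
# Sketch: validate an LLM reply into an executable behavior sequence.
import json

PROMPT_TEMPLATE = """You control a robot in CoppeliaSim.
Reply ONLY with JSON: {{"actions": [{{"skill": "<move|grasp|place>",
"target": "<object name>"}}]}}.
Instruction: {instruction}"""

SKILLS = {"move", "grasp", "place"}        # assumed primitive set

def parse_actions(llm_reply: str):
    """Reject malformed plans before anything reaches the simulator."""
    plan = json.loads(llm_reply)
    actions = []
    for step in plan["actions"]:
        if step["skill"] not in SKILLS:
            raise ValueError(f"unknown skill: {step['skill']}")
        actions.append((step["skill"], step["target"]))
    return actions

# e.g. parse_actions('{"actions": [{"skill": "move", "target": "cube"}]}')
```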
☆ DST-Calib: A Dual-Path, Self-Supervised, Target-Free LiDAR-Camera Extrinsic Calibration Network
LiDAR-camera extrinsic calibration is essential for multi-modal data fusion in robotic perception systems. However, existing approaches typically rely on handcrafted calibration targets (e.g., checkerboards) or specific, static scene types, limiting their adaptability and deployment in real-world autonomous and robotic applications. This article presents the first self-supervised LiDAR-camera extrinsic calibration network that operates in an online fashion and eliminates the need for specific calibration targets. We first identify a significant generalization degradation problem in prior methods, caused by the conventional single-sided data augmentation strategy. To overcome this limitation, we propose a novel double-sided data augmentation technique that generates multi-perspective camera views using estimated depth maps, thereby enhancing robustness and diversity during training. Built upon this augmentation strategy, we design a dual-path, self-supervised calibration framework that reduces the dependence on high-precision ground truth labels and supports fully adaptive online calibration. Furthermore, to improve cross-modal feature association, we replace the traditional dual-branch feature extraction design with a difference map construction process that explicitly correlates LiDAR and camera features. This not only enhances calibration accuracy but also reduces model complexity. Extensive experiments conducted on five public benchmark datasets, as well as our own recorded dataset, demonstrate that the proposed method significantly outperforms existing approaches in terms of generalizability.
☆ ORION: Option-Regularized Deep Reinforcement Learning for Cooperative Multi-Agent Online Navigation
Existing methods for multi-agent navigation typically assume fully known environments, offering limited support for partially known scenarios such as warehouses or factory floors. There, agents may need to plan trajectories that balance their own path optimality with their ability to collect and share information about the environment that can help their teammates reach their own goals. To these ends, we propose ORION, a novel deep reinforcement learning framework for cooperative multi-agent online navigation in partially known environments. Starting from an imperfect prior map, ORION trains agents to make decentralized decisions, coordinate to reach their individual targets, and actively reduce map uncertainty by sharing online observations in a closed perception-action loop. We first design a shared graph encoder that fuses the prior map with online perception into a unified representation, providing robust state embeddings under dynamic map discrepancies. At the core of ORION is an option-critic framework that learns to reason about a set of high-level cooperative modes that translate into sequences of low-level actions, allowing agents to switch between individual navigation and team-level exploration adaptively. We further introduce a dual-stage cooperation strategy that enables agents to assist teammates under map uncertainty, thereby reducing the overall makespan. Across extensive maze-like maps and large-scale warehouse environments, our simulation results show that ORION achieves high-quality, real-time decentralized cooperation over varying team sizes, outperforming state-of-the-art classical and learning-based baselines. Finally, we validate ORION on physical robot teams, demonstrating its robustness and practicality for real-world cooperative navigation.
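For context, the option-critic control flow that ORION builds on alternates between committing to a high-level option and re-selecting when that option's termination function fires; a generic sketch follows, where the networks are stand-ins rather than ORION's modules:

```python
# Generic option-critic decision step (illustrative skeleton).
import numpy as np

def act(state, option, policy_over_options, intra_policies, terminations,
        rng=np.random.default_rng()):
    """One decision step. `terminations[o](s)` -> termination probability;
    `policy_over_options(s)` -> distribution over options (e.g., navigate
    to goal vs. explore for the team); `intra_policies[o](s)` -> action."""
    if rng.random() < terminations[option](state):
        probs = policy_over_options(state)        # re-select a cooperative mode
        option = int(rng.choice(len(probs), p=probs))
    action = intra_policies[option](state)        # low-level action
    return option, action
```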
☆ VISO: Robust Underwater Visual-Inertial-Sonar SLAM with Photometric Rendering for Dense 3D Reconstruction
Visual challenges in underwater environments significantly hinder the accuracy of vision-based localisation and the high-fidelity dense reconstruction. In this paper, we propose VISO, a robust underwater SLAM system that fuses a stereo camera, an inertial measurement unit (IMU), and a 3D sonar to achieve accurate 6-DoF localisation and enable efficient dense 3D reconstruction with high photometric fidelity. We introduce a coarse-to-fine online calibration approach for extrinsic parameters estimation between the 3D sonar and the camera. Additionally, a photometric rendering strategy is proposed for the 3D sonar point cloud to enrich the sonar map with visual information. Extensive experiments in a laboratory tank and an open lake demonstrate that VISO surpasses current state-of-the-art underwater and visual-based SLAM algorithms in terms of localisation robustness and accuracy, while also exhibiting real-time dense 3D reconstruction performance comparable to the offline dense mapping method.
☆ Latent Space Reinforcement Learning for Multi-Robot Exploration
Autonomous mapping of unknown environments is a critical challenge, particularly in scenarios where time is limited. Multi-agent systems can enhance efficiency through collaboration, but the scalability of motion-planning algorithms remains a key limitation. Reinforcement learning has been explored as a solution, but existing approaches are constrained by the limited input size required for effective learning, restricting their applicability to discrete environments. This work addresses that limitation by leveraging autoencoders to perform dimensionality reduction, compressing high-fidelity occupancy maps into latent state vectors while preserving essential spatial information. Additionally, we introduce a novel procedural generation algorithm based on Perlin noise, designed to generate topologically complex training environments that simulate asteroid fields, caves, and forests. These environments are used for training the autoencoder and the navigation algorithm using a hierarchical deep reinforcement learning framework for decentralized coordination. We introduce a weighted consensus mechanism that modulates reliance on shared data via a tuneable trust parameter, ensuring robustness to the accumulation of errors. Experimental results demonstrate that the proposed system scales effectively with the number of agents, generalizes well to unfamiliar, structurally distinct environments, and is resilient in communication-constrained settings.
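A minimal sketch of a trust-weighted consensus merge like the one described above; the combination rule and the semantics of the trust parameter are assumptions, not the paper's exact formulation:

```python
# Sketch: trust-weighted consensus over occupancy estimates.
import numpy as np

def consensus_merge(own_map, peer_maps, trust=0.5):
    """own_map, peer_maps[i]: arrays of occupancy probabilities in [0, 1].
    trust in [0, 1]: 0 -> ignore peers entirely, 1 -> weight each peer's
    contribution equally with the agent's own estimate."""
    if not peer_maps:
        return own_map
    peer_mean = np.mean(peer_maps, axis=0)
    w_self = 1.0 / (1.0 + trust * len(peer_maps))
    return w_self * own_map + (1.0 - w_self) * peer_mean
```

Lowering the trust parameter limits how far a single corrupted or stale shared map can drag the agent's own estimate, which is one way to read the robustness-to-error-accumulation claim.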
☆ Towards Reliable Subsea Object Recovery: A Simulation Study of an AUV with a Suction-Actuated End Effector
Autonomous object recovery in the hadal zone is challenging due to extreme hydrostatic pressure, limited visibility and currents, and the need for precise manipulation at full ocean depth. Field experimentation in such environments is costly, high-risk, and constrained by limited vehicle availability, making early validation of autonomous behaviors difficult. This paper presents a simulation-based study of a complete autonomous subsea object recovery mission using a Hadal Small Vehicle (HSV) equipped with a three-degree-of-freedom robotic arm and a suction-actuated end effector. The Stonefish simulator is used to model realistic vehicle dynamics, hydrodynamic disturbances, sensing, and interaction with a target object under hadal-like conditions. The control framework combines a world-frame PID controller for vehicle navigation and stabilization with an inverse-kinematics-based manipulator controller augmented by acceleration feed-forward, enabling coordinated vehicle-manipulator operation. In simulation, the HSV autonomously descends from the sea surface to 6,000 m, performs structured seafloor coverage, detects a target object, and executes a suction-based recovery. The results demonstrate that high-fidelity simulation provides an effective and low-risk means of evaluating autonomous deep-sea intervention behaviors prior to field deployment.
☆ Scalable Data-Driven Reachability Analysis and Control via Koopman Operators with Conformal Coverage Guarantees
We propose a scalable reachability-based framework for probabilistic, data-driven safety verification of unknown nonlinear dynamics. We use Koopman theory with a neural network (NN) lifting function to learn an approximate linear representation of the dynamics and design linear controllers in this space to enable closed-loop tracking of a reference trajectory distribution. Closed-loop reachable sets are efficiently computed in the lifted space and mapped back to the original state space via NN verification tools. To capture model mismatch between the Koopman dynamics and the true system, we apply conformal prediction to produce statistically-valid error bounds that inflate the reachable sets to ensure the true trajectories are contained with a user-specified probability. These bounds generalize across references, enabling reuse without recomputation. Results on high-dimensional MuJoCo tasks (11D Hopper, 28D Swimmer) and 12D quadcopters show improved reachable set coverage rate, computational efficiency, and conservativeness over existing methods.
comment: Under review, 28 pages, 12 figures
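The conformal step described in the Koopman paper above reduces to computing a finite-sample-valid quantile of held-out model errors; a minimal sketch under the usual split-conformal assumptions (variable names are illustrative):

```python
# Split-conformal error bound for inflating reachable sets.
import numpy as np

def conformal_radius(calibration_errors, alpha=0.05):
    """Given per-trajectory errors between the learned Koopman rollout and
    the true system over n calibration runs, return a radius r such that a
    fresh trajectory stays within r of the rollout with prob. >= 1 - alpha
    (assuming exchangeability of calibration and test runs)."""
    e = np.sort(np.asarray(calibration_errors))
    n = len(e)
    k = int(np.ceil((n + 1) * (1 - alpha)))       # conformal rank
    if k > n:
        raise ValueError("need more calibration data for this alpha")
    return e[k - 1]

# The computed reachable sets are then inflated by r before being mapped
# back to the original state space.
```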
☆ Topological Mapping and Navigation using a Monocular Camera based on AnyLoc
This paper proposes a method for topological mapping and navigation using a monocular camera. Based on AnyLoc, keyframes are converted into descriptors to construct topological relationships, enabling loop detection and map building. Unlike metric maps, topological maps simplify path planning and navigation by representing environments with key nodes instead of precise coordinates. Actions for visual navigation are determined by comparing segmented images with the image associated with target nodes. The system relies solely on a monocular camera, ensuring fast map building and navigation using key nodes. Experiments show effective loop detection and navigation in real and simulation environments without pre-training. Compared to a ResNet-based method, this approach improves success rates by 60.2% on average while reducing time and space costs, offering a lightweight solution for robot and human navigation in various scenarios.
comment: Published in Proc. IEEE CASE 2025. 7 pages, 11 figures
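A minimal sketch of descriptor-based loop detection over a topological graph, matching the pipeline described above; the descriptor is a stand-in for AnyLoc features, and the similarity threshold is an assumed hyperparameter:

```python
# Sketch: cosine-similarity loop detection over keyframe descriptors.
import numpy as np

def detect_loop(query_desc, node_descs, threshold=0.85, skip_recent=10):
    """Return the index of the best-matching earlier node, or None.
    Recent nodes are skipped so the query cannot match its own neighborhood."""
    if len(node_descs) <= skip_recent:
        return None
    candidates = np.stack(node_descs[:-skip_recent])      # (M, D)
    q = query_desc / np.linalg.norm(query_desc)
    sims = candidates @ q / np.linalg.norm(candidates, axis=1)
    best = int(np.argmax(sims))
    return best if sims[best] > threshold else None
```

A detected loop adds an edge between the query keyframe and the matched node, closing cycles in the topological map without any metric alignment.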
♻ ☆ No Minima, No Collisions: Combining Modulation and Control Barrier Function Strategies for Feasible Dynamic Collision Avoidance
Control Barrier Function Quadratic Programs (CBF-QPs) have become a central tool for real-time safety-critical control due to their applicability to general control-affine systems and their ability to enforce constraints through optimization. Yet, they often generate trajectories with undesirable local minima that prevent convergence to goals. On the other hand, Modulation of Dynamical Systems (Mod-DS) methods (including normal, reference, and on-manifold variants) reshape nominal vector fields geometrically and achieve obstacle avoidance with few or even no local minima. However, Mod-DS provides no straightforward mechanism for handling input constraints and remains largely restricted to fully actuated systems. In this paper, we revisit the theoretical foundations of both approaches and show that, despite their seemingly different constructions, the normal Mod-DS is a special case of the CBF-QP, and the reference Mod-DS is linked to the CBF-QP through a single shared equation. These connections motivate our Modulated CBF-QP (MCBF-QP) framework, which introduces reference and on-manifold modulation variants that reduce or fully eliminate the spurious equilibria inherent to CBF-QPs for general control-affine systems operating in dynamic, cluttered environments. We validate the proposed controllers in simulated hospital settings and in real-world experiments on fully actuated Ridgeback robots and underactuated Fetch platforms. Across all evaluations, Modulated CBF-QPs consistently outperform standard CBF-QPs on every performance metric.
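For context, the standard CBF-QP that both this paper and the broader literature start from, for control-affine dynamics x' = f(x) + g(x)u with a barrier h(x) >= 0 and extended class-K function alpha (standard formulation, not specific to this paper), is:

```latex
u^{*}(x) = \arg\min_{u \in \mathcal{U}} \;
\tfrac{1}{2}\,\lVert u - u_{\mathrm{nom}}(x) \rVert^{2}
\quad \text{s.t.} \quad
L_{f} h(x) + L_{g} h(x)\, u \;\geq\; -\alpha\big(h(x)\big).
```

Spurious equilibria arise where the safety constraint deflects u* into exactly cancelling the nominal field; the modulated variants proposed above reshape the vector field so that such stationary points are reduced or, in the on-manifold case, eliminated.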
♻ ☆ Nonlinear Oscillatory Response of Automated Vehicle Car-following: Theoretical Analysis with Traffic State and Control Input Limits
This paper presents a framework grounded in the theory of describing function (DF) and incremental-input DF to theoretically analyze the nonlinear oscillatory response of automated vehicles (AVs) car-following (CF) amidst traffic oscillations, considering the limits of traffic state and control input. While prevailing approaches largely ignore these limits (i.e., saturation of acceleration/deceleration and speed) and focus on linear string stability analysis, this framework establishes a basis for theoretically analyzing the frequency response of AV systems with nonlinearities imposed by these limits. To this end, trajectories of CF pairs are decomposed into nominal and oscillatory trajectories, subsequently, the controlled AV system is repositioned within the oscillatory trajectory coordinates. Built on this base, DFs are employed to approximate the frequency responses of nonlinear saturation components by using their first harmonic output, thereby capturing the associated amplification ratio and phase shift. Considering the closed-loop nature of AV control systems, where system states and control input mutually influence each other, amplification ratios and phase shifts are balanced within the loop to ensure consistency. This balancing process may render multiple solutions, hence the incremental-input DF is further applied to identify the reasonable ones. The proposed method is validated by estimations from Simulink, and further comparisons with prevailing methods are conducted. Results confirm the alignment of our framework with Simulink results and exhibit its superior accuracy in analysis compared to the prevailing methods. Furthermore, the framework proves valuable in string stability analysis, especially when conventional linear methods offer misleading insights.
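As a concrete anchor for the describing-function machinery above, the DF of an ideal unit-slope saturation with limit delta, driven by a sinusoid of amplitude A, is the classical result (standard control-theory material, not taken from the paper):

```latex
N(A) = \frac{2}{\pi}\left[\arcsin\!\frac{\delta}{A}
 + \frac{\delta}{A}\sqrt{1 - \frac{\delta^{2}}{A^{2}}}\,\right]
 \;\; \text{for } A > \delta,
\qquad N(A) = 1 \;\; \text{for } A \le \delta .
```

This amplitude-dependent first-harmonic gain (with zero phase shift, since the nonlinearity is static and odd) is exactly the kind of amplification ratio that the proposed framework balances around the closed loop.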
♻ ☆ Stochastic Online Optimization for Cyber-Physical and Robotic Systems
We propose a novel gradient-based online optimization framework for solving stochastic programming problems that frequently arise in the context of cyber-physical and robotic systems. Our problem formulation accommodates constraints that model the evolution of a cyber-physical system, which has, in general, a continuous state and action space, is nonlinear, and where the state is only partially observed. We also incorporate an approximate model of the dynamics as prior knowledge into the learning process and show that even rough estimates of the dynamics can significantly improve the convergence of our algorithms. Our online optimization framework encompasses both gradient descent and quasi-Newton methods, and we provide a unified convergence analysis of our algorithms in a non-convex setting. We also characterize the impact of modeling errors in the system dynamics on the convergence rate of the algorithms. Finally, we evaluate our algorithms in simulations of a flexible beam, a four-legged walking robot, and in real-world experiments with a ping-pong playing robot.
comment: 46 pages, 16 figures
♻ ☆ SPARC: Spine with Prismatic and Revolute Compliance for Quadruped Robots
We present SPARC, a compact, open-source 3-DoF sagittal-plane spine module that combines revolute (pitch) and prismatic (axial) motion with programmable task-space impedance for quadruped robots. The system integrates three torque-controlled actuators, a 1 kHz control board, and protected power electronics in a 1.26 kg package. A floating-base impedance controller with dynamics compensation renders tunable spring-damper behavior along horizontal, vertical, and rotational axes. Benchtop experiments validate the approach: quasi-static tests demonstrate linear force-displacement characteristics with commanded horizontal stiffness spanning 300-700 N/m (≤1.5% error, R² ≥ 0.992), while dynamic trials confirm predictable damping behavior across multiple settings. Furthermore, simulation studies reveal that adapting spine stiffness based on terrain slope and stride length significantly reduces the cost of transport. SPARC provides an open-source platform for systematic studies of spine compliance in legged locomotion, with complete hardware and firmware resources available for community use. Github repository: https://github.com/YueWang996/sparc.git
♻ ☆ RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics NeurIPS 2025
Spatial referring is a fundamental capability of embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still fall short of accurately understanding complex 3D scenes and dynamically reasoning about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that can first achieve precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark filling the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
comment: Accepted by NeurIPS 2025. Project page: https://zhoues.github.io/RoboRefer/
♻ ☆ P2U-SLAM: A Monocular Wide-FoV SLAM System Based on Point Uncertainty and Pose Uncertainty
This paper presents P2U-SLAM, a visual Simultaneous Localization And Mapping (SLAM) system with a wide Field of View (FoV) camera, which utilizes pose uncertainty and point uncertainty. While the wide FoV enables considerable repetitive observations of historical map points for matching cross-view features, the data properties of the historical map points and the poses of historical keyframes have changed during the optimization process. The neglect of data property changes results in the lack of partial information matrices in optimization, increasing the risk of long-term positioning performance degradation. The purpose of our research is to mitigate the risks posed by wide-FoV visual input to the SLAM system. Based on the conditional probability model, this work reveals the definite impacts of the above data properties changes on the optimization process, concretizes these impacts as point uncertainty and pose uncertainty, and gives their specific mathematical form. P2U-SLAM embeds point uncertainty into the tracking module and pose uncertainty into the local mapping module respectively, and updates these uncertainties after each optimization operation including local mapping, map merging, and loop closing. We present an exhaustive evaluation on 27 sequences from two popular public datasets with wide-FoV visual input. P2U-SLAM shows excellent performance compared with other state-of-the-art methods. The source code will be made publicly available at https://github.com/BambValley/P2U-SLAM.
comment: Accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS). The source code will be made publicly available at https://github.com/BambValley/P2U-SLAM
♻ ☆ Stairway to Success: An Online Floor-Aware Zero-Shot Object-Goal Navigation Framework via LLM-Driven Coarse-to-Fine Exploration
Deployable service and delivery robots struggle to navigate multi-floor buildings to reach object goals, as existing systems fail due to single-floor assumptions and requirements for offline, globally consistent maps. Multi-floor environments pose unique challenges including cross-floor transitions and vertical spatial reasoning, especially when navigating unknown buildings. Object-Goal Navigation benchmarks like HM3D and MP3D also capture this multi-floor reality, yet current methods lack support for online, floor-aware navigation. To bridge this gap, we propose ASCENT, an online framework for Zero-Shot Object-Goal Navigation that enables robots to operate without pre-built maps or retraining on new object categories. It introduces: (1) a Multi-Floor Abstraction module that dynamically constructs hierarchical representations with stair-aware obstacle mapping and cross-floor topology modeling, and (2) a Coarse-to-Fine Reasoning module that combines frontier ranking with LLM-driven contextual analysis for multi-floor navigation decisions. We evaluate on HM3D and MP3D benchmarks, outperforming state-of-the-art zero-shot approaches, and demonstrate real-world deployment on a quadruped robot.
comment: Accepted to IEEE Robotics and Automation Letters (RAL). Project Page at https://zeying-gong.github.io/projects/ascent
♻ ☆ Aerial World Model for Long-horizon Visual Generation and Navigation in 3D Space
Unmanned aerial vehicles (UAVs) have emerged as powerful embodied agents. One of their core abilities is autonomous navigation in large-scale three-dimensional environments. Existing navigation policies, however, are typically optimized for low-level objectives such as obstacle avoidance and trajectory smoothness, lacking the ability to incorporate high-level semantics into planning. To bridge this gap, we propose ANWM, an aerial navigation world model that predicts future visual observations conditioned on past frames and actions, thereby enabling agents to rank candidate trajectories by their semantic plausibility and navigational utility. ANWM is trained on 4-DoF UAV trajectories and introduces a physics-inspired module: Future Frame Projection (FFP), which projects past frames into future viewpoints to provide coarse geometric priors. This module mitigates representational uncertainty in long-distance visual generation and captures the mapping between 3D trajectories and egocentric observations. Empirical results demonstrate that ANWM significantly outperforms existing world models in long-distance visual forecasting and improves UAV navigation success rates in large-scale environments.
♻ ☆ Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation AAAI 2026
In open-vocabulary mobile manipulation (OVMM), task success often hinges on the selection of an appropriate base placement for the robot. Existing approaches typically navigate to proximity-based regions without considering affordances, resulting in frequent manipulation failures. We propose Affordance-Guided Coarse-to-Fine Exploration, a zero-shot framework for base placement that integrates semantic understanding from vision-language models (VLMs) with geometric feasibility through an iterative optimization process. Our method constructs cross-modal representations, namely Affordance RGB and Obstacle Map+, to align semantics with spatial context. This enables reasoning that extends beyond the egocentric limitations of RGB perception. To ensure interaction is guided by task-relevant affordances, we leverage coarse semantic priors from VLMs to guide the search toward task-relevant regions and refine placements with geometric constraints, thereby reducing the risk of convergence to local optima. Evaluated on five diverse open-vocabulary mobile manipulation tasks, our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods. This demonstrates the promise of affordance-aware and multimodal reasoning for generalizable, instruction-conditioned planning in OVMM.
comment: Accepted to AAAI 2026
Robotics 23
☆ Calling for Backup: How Children Navigate Successive Robot Communication Failures
How do children respond to repeated robot errors? While prior research has examined adult reactions to successive robot errors, children's responses remain largely unexplored. In this study, we explore children's reactions to robot social errors and performance errors. For the latter, this study reproduces the successive robot failure paradigm of Liu et al. with child participants (N=59, ages 8-10) to examine how young users respond to repeated robot conversational errors. Participants interacted with a robot that failed to understand their prompts three times in succession, with their behavioral responses video-recorded and analyzed. We found both similarities and differences compared to adult responses from the original study. Like adults, children adjusted their prompts, modified their verbal tone, and exhibited increasingly emotional non-verbal responses throughout successive errors. However, children demonstrated more disengagement behaviors, including temporarily ignoring the robot or actively seeking an adult. Errors did not affect participants' perception of the robot, suggesting more flexible conversational expectations in children. These findings inform the design of more effective and developmentally appropriate human-robot interaction systems for young users.
☆ DefVINS: Visual-Inertial Odometry for Deformable Scenes
Deformable scenes violate the rigidity assumptions underpinning classical visual-inertial odometry (VIO), often leading to over-fitting to local non-rigid motion or severe drift when deformation dominates visual parallax. We introduce DefVINS, a visual-inertial odometry framework that explicitly separates a rigid, IMU-anchored state from a non-rigid warp represented by an embedded deformation graph. The system is initialized using a standard VIO procedure that fixes gravity, velocity, and IMU biases, after which non-rigid degrees of freedom are activated progressively as the estimation becomes well conditioned. An observability analysis is included to characterize how inertial measurements constrain the rigid motion and render otherwise unobservable modes identifiable in the presence of deformation. This analysis motivates the use of IMU anchoring and informs a conditioning-based activation strategy that prevents ill-posed updates under poor excitation. Ablation studies demonstrate the benefits of combining inertial constraints with observability-aware deformation activation, resulting in improved robustness under non-rigid environments.
comment: 4 figures, 3 tables. Submitted to RA-L
☆ Bayesian Inverse Games with High-Dimensional Multi-Modal Observations
Many multi-agent interaction scenarios can be naturally modeled as noncooperative games, where each agent's decisions depend on others' future actions. However, deploying game-theoretic planners for autonomous decision-making requires a specification of all agents' objectives. To circumvent this practical difficulty, recent work develops maximum likelihood techniques for solving inverse games that can identify unknown agent objectives from interaction data. Unfortunately, these methods only infer point estimates and do not quantify estimator uncertainty; correspondingly, downstream planning decisions can overconfidently commit to unsafe actions. We present an approximate Bayesian inference approach for solving the inverse game problem, which can incorporate observation data from multiple modalities and be used to generate samples from the Bayesian posterior over the hidden agent objectives given limited sensor observations in real time. Concretely, the proposed Bayesian inverse game framework trains a structured variational autoencoder with an embedded differentiable Nash game solver on interaction datasets and does not require labels of agents' true objectives. Extensive experiments show that our framework successfully learns prior and posterior distributions, improves inference quality over maximum likelihood estimation-based inverse game approaches, and enables safer downstream decision-making without sacrificing efficiency. When trajectory information is uninformative or unavailable, multimodal inference further reduces uncertainty by exploiting additional observation modalities.
☆ RoboReward: General-Purpose Vision-Language Reward Models for Robotics
A well-designed reward is critical for effective reinforcement learning-based policy improvement. In real-world robotic domains, obtaining such rewards typically requires either labor-intensive human labeling or brittle, handcrafted objectives. Vision-language models (VLMs) have shown promise as automatic reward models, yet their effectiveness on real robot tasks is poorly understood. In this work, we aim to close this gap by introducing (1) \textbf{RoboReward}, a robotics reward dataset and benchmark built on large-scale real-robot corpora from Open X-Embodiment (OXE) and RoboArena, and (2) vision-language reward models trained on this dataset (RoboReward 4B/8B). Because OXE is success-heavy and lacks failure examples, we propose a \emph{negative examples data augmentation} pipeline that generates calibrated \emph{negatives} and \emph{near-misses} via counterfactual relabeling of successful episodes and temporal clipping to create partial-progress outcomes from the same videos. Using this framework, we produce an extensive training and evaluation dataset that spans diverse tasks and embodiments and enables systematic evaluation of whether state-of-the-art VLMs can reliably provide rewards for robotics. Our evaluation of leading open-weight and proprietary VLMs reveals that no model excels across all tasks, underscoring substantial room for improvement. We then train general-purpose 4B- and 8B-parameter models that outperform much larger VLMs in assigning rewards for short-horizon robotic tasks. Finally, we deploy the 8B-parameter reward VLM in real-robot reinforcement learning and find that it improves policy learning over Gemini Robotics-ER 1.5, a frontier physical reasoning VLM trained on robotics data, by a large margin, while substantially narrowing the gap to RL training with human-provided rewards.
☆ From 2D to 3D terrain-following area coverage path planning
An algorithm for 3D terrain-following area coverage path planning is presented. Multiple adjacent paths are generated that are (i) locally apart from each other by a distance equal to the working width of the machinery, while (ii) simultaneously floating at a projection distance equal to a specific working height above the terrain. The complexities of the algorithm in comparison to its 2D equivalent are highlighted. These include uniformly spaced elevation data generation using an Inverse Distance Weighting approach and a local search. Area coverage path planning results for real-world 3D data within an agricultural context are presented to validate the algorithm.
comment: 6 pages, 10 figures, 1 table
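The uniformly spaced elevation data generation step is simple enough to sketch. Below is a minimal Inverse Distance Weighting interpolator over scattered terrain samples; the exponent, the dense pairwise formulation, and all names are illustrative assumptions (the paper additionally uses a local search, omitted here):

```python
import numpy as np

def idw_elevation(grid_xy, sample_xy, sample_z, power=2.0, eps=1e-9):
    """Interpolate elevations at uniformly spaced grid points from
    scattered terrain samples via Inverse Distance Weighting."""
    # Pairwise distances between grid points (G, 2) and samples (S, 2).
    d = np.linalg.norm(grid_xy[:, None, :] - sample_xy[None, :, :], axis=-1)
    w = 1.0 / (d + eps) ** power            # inverse-distance weights (G, S)
    return (w @ sample_z) / w.sum(axis=1)   # weighted average per grid point

# Usage: a 10 m x 10 m field resampled on a 1 m grid.
xs, ys = np.meshgrid(np.arange(0.0, 10.0), np.arange(0.0, 10.0))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
samples = np.random.rand(50, 2) * 10.0          # scattered (x, y) survey points
heights = np.sin(samples[:, 0]) + 0.1 * samples[:, 1]
z_grid = idw_elevation(grid, samples, heights)  # elevations on the grid
```

The gridded elevations then let each 2D coverage path be lifted to a terrain-following 3D path at the specified working height.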
☆ Vision-based Goal-Reaching Control for Mobile Robots Using a Hierarchical Learning Framework
Reinforcement learning (RL) is effective in many robotic applications, but it requires extensive exploration of the state-action space, during which behaviors can be unsafe. This significantly limits its applicability to large robots with complex actuators operating on unstable terrain. Hence, to design a safe goal-reaching control framework for large-scale robots, this paper decomposes the whole system into a set of tightly coupled functional modules. 1) A real-time visual pose estimation approach is employed to provide accurate robot states to 2) an RL motion planner for goal-reaching tasks that explicitly respects robot specifications. The RL module generates real-time smooth motion commands for the actuator system, independent of its underlying dynamic complexity. 3) In the actuation mechanism, a supervised deep learning model is trained to capture the complex dynamics of the robot and provide this model to 4) a model-based robust adaptive controller that guarantees the wheels track the RL motion commands even on slip-prone terrain. 5) Finally, to reduce human intervention, a mathematical safety supervisor monitors the robot, stops it on unsafe faults, and autonomously guides it back to a safe inspection area. The proposed framework guarantees uniform exponential stability of the actuation system and safety of the whole operation. Experiments on a 6,000 kg robot in different scenarios confirm the effectiveness of the proposed framework.
☆ NMPC-Augmented Visual Navigation and Safe Learning Control for Large-Scale Mobile Robots
A large-scale mobile robot (LSMR) is a high-order multibody system that often operates on loose, unconsolidated terrain, which reduces traction. This paper presents a comprehensive navigation and control framework for an LSMR that ensures stability and safety-defined performance, delivering robust operation on slip-prone terrain by jointly leveraging high-performance techniques. The proposed architecture comprises four main modules: (1) a visual pose-estimation module that fuses onboard sensors and stereo cameras to provide an accurate, low-latency robot pose, (2) a high-level nonlinear model predictive control that updates the wheel motion commands to correct robot drift from the robot reference pose on slip-prone terrain, (3) a low-level deep neural network control policy that approximates the complex behavior of the wheel-driven actuation mechanism in LSMRs, augmented with robust adaptive control to handle out-of-distribution disturbances, ensuring that the wheels accurately track the updated commands issued by the high-level control module, and (4) a logarithmic safety module that monitors the entire robot stack and guarantees safe operation. The proposed low-level control framework guarantees uniform exponential stability of the actuation subsystem, while the safety module ensures whole-system-level safety during operation. Comparative experiments were conducted on a 6,000 kg LSMR actuated by two complex electro-hydrostatic drives, with the framework synchronizing modules that operate at different frequencies.
☆ Priority-Aware Multi-Robot Coverage Path Planning
Multi-robot systems are widely used for coverage tasks that require efficient coordination across large environments. In Multi-Robot Coverage Path Planning (MCPP), the objective is typically to minimize the makespan by generating non-overlapping paths for full-area coverage. However, most existing methods assume uniform importance across regions, limiting their effectiveness in scenarios where some zones require faster attention. We introduce the Priority-Aware MCPP (PA-MCPP) problem, where a subset of the environment is designated as prioritized zones with associated weights. The goal is to minimize, in lexicographic order, the total priority-weighted latency of zone coverage and the overall makespan. To address this, we propose a scalable two-phase framework combining (1) greedy zone assignment with local search, spanning-tree-based path planning, and (2) Steiner-tree-guided residual coverage. Experiments across diverse scenarios demonstrate that our method significantly reduces priority-weighted latency compared to standard MCPP baselines, while maintaining competitive makespan. Sensitivity analyses further show that the method scales well with the number of robots and that zone coverage behavior can be effectively controlled by adjusting priority weights.
comment: IEEE Robotics and Automation Letters, 8 pages, 10 figures
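Because the objective is lexicographic, candidate plans can be compared with plain tuple ordering. A minimal sketch, assuming per-zone coverage times, zone weights, and per-robot finish times have already been extracted from candidate paths (all names hypothetical):

```python
def pa_mcpp_objective(zone_cover_times, zone_weights, robot_finish_times):
    """Score a candidate PA-MCPP plan: first total priority-weighted
    latency of zone coverage, then overall makespan."""
    weighted_latency = sum(w * t for w, t in zip(zone_weights, zone_cover_times))
    makespan = max(robot_finish_times)
    return (weighted_latency, makespan)  # tuples compare lexicographically

# Usage: choose the best of several candidate plans.
candidates = [
    ([12.0, 30.0], [2.0, 1.0], [55.0, 60.0]),
    ([20.0, 25.0], [2.0, 1.0], [50.0, 52.0]),
]
best = min(candidates, key=lambda c: pa_mcpp_objective(*c))
```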
☆ LLM-Based Agentic Exploration for Robot Navigation & Manipulation with Skill Orchestration
This paper presents an end-to-end LLM-based agentic exploration system for an indoor shopping task, evaluated in both Gazebo simulation and a corresponding real-world corridor layout. The robot incrementally builds a lightweight semantic map by detecting signboards at junctions and storing direction-to-POI relations together with estimated junction poses, while AprilTags provide repeatable anchors for approach and alignment. Given a natural-language shopping request, an LLM produces a constrained discrete action at each junction (direction and whether to enter a store), and a ROS finite-state main controller executes the decision by gating modular motion primitives, including local-costmap-based obstacle avoidance, AprilTag approaching, store entry, and grasping. Qualitative results show that the integrated stack can perform end-to-end task execution from user instruction to multi-store navigation and object retrieval, while remaining modular and debuggable through its text-based map and logged decision history.
☆ Optimal Transport-Based Decentralized Multi-Agent Distribution Matching
This paper presents a decentralized control framework for distribution matching in multi-agent systems (MAS), where agents collectively achieve a prescribed terminal spatial distribution. The problem is formulated using optimal transport (Wasserstein distance), which provides a principled measure of distributional discrepancy and serves as the basis for the control design. To avoid solving the global optimal transport problem directly, the distribution-matching objective is reformulated into a tractable per-agent decision process, enabling each agent to identify its desired terminal locations using only locally available information. A sequential weight-update rule is introduced to construct feasible local transport plans, and a memory-based correction mechanism is incorporated to maintain reliable operation under intermittent and range-limited communication. Convergence guarantees are established, showing cycle-wise improvement of a surrogate transport cost under both linear and nonlinear agent dynamics. Simulation results demonstrate that the proposed framework achieves effective and scalable distribution matching while operating fully in a decentralized manner.
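To see what the per-agent decision process approximates, consider the centralized discrete problem: matching agents to samples of the prescribed terminal distribution at minimum transport cost. A minimal sketch using SciPy's assignment solver; the paper's decentralized sequential weight-update rule replaces this global solve under communication limits:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_terminal_locations(agent_pos, target_pos):
    """Discrete optimal transport as an assignment problem: match each
    agent to one terminal location, minimizing total squared distance."""
    cost = np.linalg.norm(agent_pos[:, None, :] - target_pos[None, :, :],
                          axis=-1) ** 2
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return cols, cost[rows, cols].sum()

agents = np.random.rand(8, 2)
targets = np.random.rand(8, 2)    # samples from the prescribed distribution
assignment, total_cost = assign_terminal_locations(agents, targets)
```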
☆ Variable Elimination in Hybrid Factor Graphs for Discrete-Continuous Inference & Estimation
Many hybrid problems in robotics involve both continuous and discrete components, and modeling them together for estimation tasks has been a long-standing and difficult problem. Hybrid Factor Graphs give us a mathematical framework to model these types of problems; however, existing approaches for solving them are based on approximations. In this work, we propose an efficient Hybrid Factor Graph framework along with a variable elimination algorithm to produce a hybrid Bayes network, which can then be used for exact Maximum A Posteriori estimation and marginalization over both sets of variables. Our approach first develops a novel hybrid Gaussian factor which can connect to both discrete and continuous variables, and a hybrid conditional which can represent multiple continuous hypotheses conditioned on the discrete variables. Using these representations, we derive the process of hybrid variable elimination under the Conditional Linear Gaussian scheme, giving us exact posteriors as a hybrid Bayes network. To bound the number of discrete hypotheses, we use a tree-structured representation of the factors coupled with a simple pruning and probabilistic assignment scheme, which allows for tractable inference. We demonstrate the applicability of our framework on a SLAM dataset with ambiguous measurements, where discrete choices for the most likely measurement have to be made. Our demonstrated results showcase the accuracy, generality, and simplicity of our hybrid factor graph framework.
☆ Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations ICLR 2026
Diffusion policies have emerged as powerful generative models for offline policy learning, whose sampling process can be rigorously characterized by a score function guiding a Stochastic Differential Equation (SDE). However, the same score-based SDE modeling that grants diffusion policies the flexibility to learn diverse behavior also incurs solver and score-matching errors, large data requirements, and inconsistencies in action generation. While less critical in image generation, these inaccuracies compound and lead to failure in continuous control settings. We introduce Contractive Diffusion Policies (CDPs) to induce contractive behavior in the diffusion sampling dynamics. Contraction pulls nearby flows closer to enhance robustness against solver and score-matching errors while reducing unwanted action variance. We develop an in-depth theoretical analysis along with a practical implementation recipe to incorporate CDPs into existing diffusion policy architectures with minimal modification and computational cost. We evaluate CDPs for offline learning by conducting extensive experiments in simulation and real-world settings. Across benchmarks, CDPs often outperform baseline policies, with pronounced benefits under data scarcity.
comment: Under review at ICLR 2026
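The abstract does not specify how contraction is induced, so the following is only one plausible reading: an Euler-Maruyama step of the reverse-time sampler with an extra penalty term -lam * (x - x_ref) that pulls nearby sample flows toward a shared reference (here, crudely, the batch mean). Everything below is an assumption, not the paper's implementation:

```python
import torch

def contractive_reverse_step(x, t, score_fn, dt, lam=1.0, x_ref=None):
    """One Euler-Maruyama step of a reverse-time diffusion sampler with an
    added contraction term. score_fn(x, t) is the learned score; x_ref is
    a reference flow toward which nearby sample paths are pulled."""
    if x_ref is None:
        x_ref = x.mean(dim=0, keepdim=True)     # crude shared reference
    drift = score_fn(x, t) - lam * (x - x_ref)  # score guidance + contraction
    noise = torch.randn_like(x) * dt ** 0.5
    return x + drift * dt + noise
```

Setting lam = 0 recovers a plain score-guided step; increasing it trades action diversity for robustness to solver and score-matching errors.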
☆ Simulations of MRI Guided and Powered Ferric Applicators for Tetherless Delivery of Therapeutic Interventions
Magnetic Resonance Imaging (MRI) is a well-established modality for pre-operative planning and is also explored for intra-operative guidance of procedures such as intravascular interventions. Among the experimental robot-assisted technologies, the magnetic field gradients of the MRI scanner are used to power and maneuver ferromagnetic applicators for accessing sites in the patient's body via the vascular network. In this work, we propose a computational platform for preoperative planning and modeling of MRI-powered applicators inside blood vessels. This platform was implemented as a two-way data and command pipeline that links the MRI scanner, the computational core, and the operator. The platform first processes multi-slice MR data to extract the vascular bed and then fits a virtual corridor inside the vessel. This corridor serves as a virtual fixture (VF), a forbidden region for the applicators to avoid vessel perforation or collision. The geometric features of the vessel centerline, the VF, and MRI safety compliance (dB/dt, max available gradient) are then used to generate magnetic field gradient waveforms. Different blood flow profiles can be user-selected, and those parameters are used for modeling the applicator's maneuvering. The modeling module further generates cues about whether the selected vascular path can be safely maneuvered. In view of future experimental studies that require real-time operation, the platform was implemented on the Qt framework (C/C++) with software modules performing specific tasks running on dedicated threads: PID controller, generation of VF, generation of MR gradient waveforms.
comment: 9 pages, 8 figures, published in ICBBB 2022
☆ From Perception to Symbolic Task Planning: Vision-Language Guided Human-Robot Collaborative Structured Assembly
Human-robot collaboration (HRC) in structured assembly requires reliable state estimation and adaptive task planning under noisy perception and human interventions. To address these challenges, we introduce a design-grounded human-aware planning framework for human-robot collaborative structured assembly. The framework comprises two coupled modules. Module I, Perception-to-Symbolic State (PSS), employs vision-language models (VLMs) based agents to align RGB-D observations with design specifications and domain knowledge, synthesizing verifiable symbolic assembly states. It outputs validated installed and uninstalled component sets for online state tracking. Module II, Human-Aware Planning and Replanning (HPR), performs task-level multi-robot assignment and updates the plan only when the observed state deviates from the expected execution outcome. It applies a minimal-change replanning rule to selectively revise task assignments and preserve plan stability even under human interventions. We validate the framework on a 27-component timber-frame assembly. The PSS module achieves 97% state synthesis accuracy, and the HPR module maintains feasible task progression across diverse HRC scenarios. Results indicate that integrating VLM-based perception with knowledge-driven planning improves robustness of state estimation and task planning under dynamic conditions.
☆ Value Vision-Language-Action Planning & Search
Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic manipulation, yet they remain fundamentally limited by their reliance on behavior cloning, leading to brittleness under distribution shift. While augmenting pretrained models with test-time search algorithms like Monte Carlo Tree Search (MCTS) can mitigate these failures, existing formulations rely solely on the VLA prior for guidance, lacking a grounded estimate of expected future return. Consequently, when the prior is inaccurate, the planner can only correct action selection via the exploration term, which requires extensive simulation to become effective. To address this limitation, we introduce Value Vision-Language-Action Planning and Search (V-VLAPS), a framework that augments MCTS with a lightweight, learnable value function. By training a simple multilayer perceptron (MLP) on the latent representations of a fixed VLA backbone (Octo), we provide the search with an explicit success signal that biases action selection toward high-value regions. We evaluate V-VLAPS on the LIBERO robotic manipulation suite, demonstrating that our value-guided search improves success rates by over 5 percentage points while reducing the average number of MCTS simulations by 5-15 percent compared to baselines that rely only on the VLA prior.
comment: 10 pages, 3 figures
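The value-guided search can be pictured as PUCT selection in which the exploitation term comes from the MLP value head on the frozen VLA latent rather than from the prior alone. The node fields below (prior, visits, value_sum, init_value) are hypothetical bookkeeping, not the paper's data structures:

```python
import math

def select_action(node, c_puct=1.0):
    """PUCT action selection at an MCTS node. Each child stores:
    prior      - VLA policy probability of the action,
    visits     - visit count,
    value_sum  - accumulated returns/value estimates,
    init_value - the MLP value prediction used before any visit."""
    total = sum(ch.visits for ch in node.children.values())
    def puct(ch):
        q = ch.value_sum / ch.visits if ch.visits else ch.init_value
        u = c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits)
        return q + u
    return max(node.children.items(), key=lambda kv: puct(kv[1]))[0]
```

Biasing the q term with the learned value is what lets the planner correct a misleading prior without burning simulations on the exploration term alone.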
☆ Analyzing the Shopping Journey: Computing Shelf Browsing Visits in a Physical Retail Store
Motivated by recent challenges in the deployment of robots into customer-facing roles within retail, this work introduces a study of customer activity in physical stores as a step toward autonomous understanding of shopper intent. We introduce an algorithm that computes shoppers' ``shelf visits'' -- capturing their browsing behavior in the store. Shelf visits are extracted from trajectories obtained via machine vision-based 3D tracking and overhead cameras. We perform two independent calibrations of the shelf visit algorithm, using distinct sets of trajectories (consisting of 8,138 and 15,129 trajectories), collected in different stores and labeled by human reviewers. The calibrated models are then evaluated on trajectories held out of the calibration process both from the same store on which calibration was performed and from the other store. An analysis of the results shows that the algorithm can recognize customers' browsing activity when evaluated in an environment different from the one on which calibration was performed. We then use the model to analyze the customers' ``browsing patterns'' on a large set of trajectories and their relation to actual purchases in the stores. Finally, we discuss how shelf browsing information could be used for retail planning and in the domain of human-robot interaction scenarios.
♻ ☆ Flattening Hierarchies with Policy Bootstrapping NeurIPS 2025
Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
comment: NeurIPS 2025 (Spotlight, top 3.2%)
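A common way to realize advantage-weighted importance sampling is an exponentiated-advantage weight on a behavior-cloning loss; the sketch below adopts that form as an assumption (the paper's exact estimator may differ):

```python
import torch
import torch.nn.functional as F

def awis_bc_loss(policy, states, goals, actions, advantages, beta=1.0):
    """Behavior cloning on (state, goal, action) tuples, where each pair is
    reweighted by exp(advantage / beta); advantages compare acting toward a
    subgoal against a subgoal-conditioned baseline."""
    with torch.no_grad():
        w = torch.exp(advantages / beta)
        w = w / w.mean()                         # self-normalized weights
    pred = policy(states, goals)                 # predicted actions
    per_sample = F.mse_loss(pred, actions, reduction="none").mean(dim=-1)
    return (w * per_sample).mean()
```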
♻ ☆ Iterative Tuning of Nonlinear Model Predictive Control for Robotic Manufacturing Tasks
Manufacturing processes are often perturbed by drifts in the environment and wear in the system, requiring control re-tuning even in the presence of repetitive operations. This paper presents an iterative learning framework for automatic tuning of Nonlinear Model Predictive Control (NMPC) weighting matrices based on task-level performance feedback. Inspired by norm-optimal Iterative Learning Control (ILC), the proposed method adaptively adjusts NMPC weights Q and R across task repetitions to minimize key performance indicators (KPIs) related to tracking accuracy, control effort, and saturation. Unlike gradient-based approaches that require differentiating through the NMPC solver, we construct an empirical sensitivity matrix, enabling structured weight updates without analytic derivatives. The framework is validated through simulation on a UR10e robot performing carbon fiber winding on a tetrahedral core. Results demonstrate that the proposed approach converges to near-optimal tracking performance (RMSE within 0.3% of offline Bayesian Optimization (BO)) in just 4 online repetitions, compared to 100 offline evaluations required by BO algorithm. The method offers a practical solution for adaptive NMPC tuning in repetitive robotic tasks, combining the precision of carefully optimized controllers with the flexibility of online adaptation.
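The derivative-free loop can be illustrated with finite differences: perturb each NMPC weight, re-run the repetitive task to measure the KPIs, and take a damped least-squares step. run_task, the KPI targets, and the step sizes below are placeholders, not the paper's exact recipe:

```python
import numpy as np

def empirical_sensitivity(run_task, theta, delta=0.05):
    """Estimate dKPI/dtheta by finite differences, where theta stacks the
    diagonal entries of Q and R and run_task(theta) returns a KPI vector
    (e.g., tracking RMSE, control effort, saturation)."""
    k0 = run_task(theta)
    S = np.zeros((len(k0), len(theta)))
    for j in range(len(theta)):
        tp = theta.copy()
        tp[j] *= 1.0 + delta
        S[:, j] = (run_task(tp) - k0) / (theta[j] * delta)
    return S, k0

def ilc_weight_update(theta, S, kpi, kpi_target, lr=0.5, reg=1e-3):
    """Norm-optimal-style update: damped least-squares step on the weights
    to drive the measured KPIs toward their targets across repetitions."""
    err = kpi - kpi_target
    step = np.linalg.solve(S.T @ S + reg * np.eye(len(theta)), S.T @ err)
    return np.maximum(theta - lr * step, 1e-6)  # keep weights positive
```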
♻ ☆ Digital Twin based Automatic Reconfiguration of Robotic Systems in Smart Environments SC2
Robotic systems have become integral to smart environments, enabling applications ranging from urban surveillance and automated agriculture to industrial automation. However, their effective operation in dynamic settings - such as smart cities and precision farming - is challenged by continuously evolving topographies and environmental conditions. Traditional control systems often struggle to adapt quickly, leading to inefficiencies or operational failures. To address this limitation, we propose a novel framework for autonomous and dynamic reconfiguration of robotic controllers using Digital Twin technology. Our approach leverages a virtual replica of the robot's operational environment to simulate and optimize movement trajectories in response to real-world changes. By recalculating paths and control parameters in the Digital Twin and deploying the updated code to the physical robot, our method ensures rapid and reliable adaptation without manual intervention. This work advances the integration of Digital Twins in robotics, offering a scalable solution for enhancing autonomy in smart, dynamic environments.
comment: Accepted for presentation to 11th IEEE International Smart Cities Conference (ISC2 2025)
♻ ☆ Tackling the Kidnapped Robot Problem via Sparse Feasible Hypothesis Sampling and Reliable Batched Multi-Stage Inference
This paper addresses the Kidnapped Robot Problem (KRP), a core localization challenge: relocalizing a robot in a known map without a prior pose estimate, e.g., after localization loss or at SLAM initialization. For this purpose, a passive 2-D global relocalization framework is proposed. It estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, thereby enhancing the long-term autonomy of mobile robots. The proposed framework casts global relocalization as a non-convex problem and solves it via a multi-hypothesis scheme with batched multi-stage inference and early termination, balancing completeness and efficiency. The Rapidly-exploring Random Tree (RRT), under traversability constraints, asymptotically covers the reachable space to generate sparse, uniformly distributed feasible positional hypotheses, fundamentally reducing the sampling space. The hypotheses are preliminarily ordered by the proposed Scan Mean Absolute Difference (SMAD), a coarse beam-error-level metric that facilitates early termination by prioritizing high-likelihood candidates. The SMAD computation is optimized for non-panoramic scans. The Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for reliable orientation selection at hypothesized positions and accurate final pose evaluation, mitigating the degradation of conventional likelihood-field metrics under translational uncertainty induced by sparse hypotheses, as well as under non-panoramic LiDAR scans and environmental changes. Real-world experiments on a resource-constrained mobile robot with non-panoramic LiDAR scans show that the proposed framework achieves competitive performance in both global relocalization success rate and computational efficiency.
comment: 10 pages, 8 figures. This work has been submitted to the IEEE for possible publication
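Taking the SMAD name at face value, one plausible reading is the mean absolute difference between observed beam ranges and ranges raycast in the occupancy map from a hypothesized pose. A minimal sketch under that assumption (the paper's non-panoramic optimization is omitted):

```python
import numpy as np

def smad(scan_ranges, expected_ranges, max_range=30.0):
    """Scan Mean Absolute Difference: coarse beam-error metric between an
    observed LiDAR scan and ranges raycast from a hypothesized pose.
    Beams with no return are clipped to max_range."""
    obs = np.clip(np.nan_to_num(scan_ranges, nan=max_range), 0.0, max_range)
    exp = np.clip(expected_ranges, 0.0, max_range)
    return float(np.mean(np.abs(obs - exp)))

# Hypotheses sorted ascending by SMAD are verified first, so the batched
# multi-stage inference can terminate early on a high-likelihood candidate.
```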
♻ ☆ NeRF-VIO: Map-Based Visual-Inertial Odometry with Initialization Leveraging Neural Radiance Fields
A prior map serves as a foundational reference for localization in context-aware applications such as augmented reality (AR). Providing valuable contextual information about the environment, the prior map is a vital tool for mitigating drift. In this paper, we propose a map-based visual-inertial localization algorithm (NeRF-VIO) with initialization using neural radiance fields (NeRF). Our algorithm utilizes a multilayer perceptron model and redefines the loss function as the geodesic distance on \(SE(3)\), ensuring the invariance of the initialization model under a frame change within \(\mathfrak{se}(3)\). The evaluation demonstrates that our model outperforms existing NeRF-based initialization solutions in both accuracy and efficiency. By integrating a two-stage update mechanism within a multi-state constraint Kalman filter (MSCKF) framework, the state of NeRF-VIO is constrained by both captured images from an onboard camera and rendered images from a pre-trained NeRF model. The proposed algorithm is validated using a real-world AR dataset; the results indicate that our two-stage update pipeline outperforms MSCKF across all data sequences.
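Geodesic pose losses of this kind are commonly assembled from the SO(3) logarithm map, with translation handled separately; the decoupled form below is only a sketch, since the abstract does not give the exact SE(3) metric or weighting:

```python
import numpy as np

def so3_log(R):
    """Logarithm map on SO(3): rotation matrix -> axis-angle vector."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * w / (2.0 * np.sin(theta))

def pose_geodesic_loss(R_pred, t_pred, R_gt, t_gt, w_rot=1.0, w_trans=1.0):
    """Pose loss with rotation error measured along SO(3) geodesics and
    translation error in the Euclidean norm."""
    rot_err = np.linalg.norm(so3_log(R_pred.T @ R_gt))
    return w_rot * rot_err + w_trans * np.linalg.norm(t_pred - t_gt)
```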
♻ ☆ PLK-Calib: Single-shot and Target-less LiDAR-Camera Extrinsic Calibration using Plücker Lines
Accurate LiDAR-Camera (LC) calibration is challenging but crucial for autonomous systems and robotics. In this paper, we propose two single-shot and target-less algorithms to estimate the calibration parameters between LiDAR and camera using line features. The first algorithm constructs line-to-line constraints by defining point-to-line projection errors and minimizes the projection error. The second algorithm (PLK-Calib) utilizes the co-perpendicular and co-parallel geometric properties of lines in Plücker (PLK) coordinates, and decouples the rotation and translation into two constraints, enabling more accurate estimates. Our degenerate analysis and Monte Carlo simulation indicate that three nonparallel line pairs are the minimal requirement to estimate the extrinsic parameters. Furthermore, we collect an LC calibration dataset with varying extrinsics across three different scenarios and use it to evaluate the performance of our proposed algorithms.
♻ ☆ Towards Balanced Behavior Cloning from Imbalanced Datasets
Robots should be able to learn complex behaviors from human demonstrations. In practice, these human-provided datasets are inevitably imbalanced: i.e., the human demonstrates some subtasks more frequently than others. State-of-the-art methods default to treating each element of the human's dataset as equally important. So if -- for instance -- the majority of the human's data focuses on reaching a goal, and only a few state-action pairs move to avoid an obstacle, the learning algorithm will place greater emphasis on goal reaching. More generally, misalignment between the relative amounts of data and the importance of that data causes fundamental problems for imitation learning approaches. In this paper we analyze and develop learning methods that automatically account for mixed datasets. We formally prove that imbalanced data leads to imbalanced policies when each state-action pair is weighted equally; these policies emulate the most represented behaviors, and not the human's complex, multi-task demonstrations. We next explore algorithms that rebalance offline datasets (i.e., reweight the importance of different state-action pairs) without human oversight. Reweighting the dataset can enhance the overall policy performance. However, there is no free lunch: each method for autonomously rebalancing brings its own pros and cons. We formulate these advantages and disadvantages, helping other researchers identify when each type of approach is most appropriate. We conclude by introducing a novel meta-gradient rebalancing algorithm that addresses the primary limitations behind existing approaches. Our experiments show that dataset rebalancing leads to better downstream learning, improving the performance of general imitation learning algorithms without requiring additional data collection. See our project website: https://collab.me.vt.edu/data_curation/.
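As one concrete (and far simpler) rebalancing baseline than the meta-gradient algorithm, demonstrations can be weighted by inverse cluster frequency so that rare subtasks are not drowned out. A minimal sketch, assuming states are flat feature vectors; the cluster count and the clustering choice are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def inverse_frequency_weights(states, n_clusters=10):
    """Weight each demonstrated state-action pair by the inverse size of
    its behavior cluster, then normalize to unit mean; the weights can
    multiply a standard behavior-cloning loss."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(states)
    counts = np.bincount(labels, minlength=n_clusters)
    w = 1.0 / counts[labels]
    return w / w.mean()
```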
Robotics 16
☆ Space Debris Removal using Nano-Satellites controlled by Low-Power Autonomous Agents
Space debris is an ever-increasing problem in space travel. There are already many old, no longer functional spacecraft and debris orbiting the earth, which endanger both the safe operation of satellites and space travel. Small nano-satellite swarms can address this problem by autonomously de-orbiting debris safely into the Earth's atmosphere. This work builds on recent advances in autonomous agents deployed on resource-constrained platforms and presents a first, simplified approach showing how such intelligent, autonomous nano-satellite swarms can be realized. We implement our autonomous agent software on wireless microcontrollers and perform experiments on a specialized test-bed to show the feasibility and overall energy efficiency of our approach.
comment: This is an open-access, author-archived version of a manuscript published in European Conference on Multi-Agent Systems 2024
☆ Efficient Prediction of Dense Visual Embeddings via Distillation and RGB-D Transformers IROS 2025
In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer - an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.
comment: Published in Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
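Text-based querying over dense text-aligned embeddings reduces to a cosine-similarity map between pixel embeddings and a text embedding. A minimal sketch with hypothetical tensor shapes; the actual embedding dimension and encoders follow the Alpha-CLIP teacher:

```python
import torch
import torch.nn.functional as F

def text_query_heatmap(dense_embeddings, text_embedding):
    """Cosine similarity between every pixel embedding (C, H, W) and a
    text embedding (C,), producing a relevance heatmap (H, W)."""
    pix = F.normalize(dense_embeddings, dim=0)
    txt = F.normalize(text_embedding, dim=0)
    return torch.einsum("chw,c->hw", pix, txt)

# Usage sketch with made-up shapes: student output vs. a text query.
emb = torch.randn(512, 120, 160)   # hypothetical DVEFormer output
txt = torch.randn(512)             # hypothetical CLIP text embedding
heat = text_query_heatmap(emb, txt)
```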
☆ Replaceable Bit-based Gripper for Picking Cluttered Food Items
The food packaging industry goes through changes in food items and their weights quite rapidly. These items range from easy-to-pick, single-piece food items to flexible, long and cluttered ones. We propose a replaceable bit-based gripper system to tackle the challenge of weight-based handling of cluttered food items. The gripper features specialized food attachments (bits) that enhance its grasping capabilities, and a belt replacement system allows switching between different food items during packaging operations. It offers a wide range of control options, enabling it to grasp and drop specific weights of granular, cluttered, and entangled foods. We specifically designed bits for two flexible food items that differ in shape: ikura (salmon roe) and spaghetti. They represent the challenging categories of sticky, granular food and long, sticky, cluttered food, respectively. The gripper successfully picked up both spaghetti and ikura and demonstrated weight-specific dropping of these items with accuracies over 80% and 95%, respectively. The gripper system also exhibited quick switching between different bits, enabling the handling of a large range of food items.
☆ Pure Inertial Navigation in Challenging Environments with Wheeled and Chassis Mounted Inertial Sensors
Autonomous vehicles and wheeled robots are widely used in many applications in both indoor and outdoor settings. In practical situations with limited GNSS signals or degraded lighting conditions, the navigation solution may rely only on inertial sensors and, as a result, drift over time due to errors in the inertial measurements. In this work, we propose WiCHINS, a wheeled-and-chassis inertial navigation system that combines wheel-mounted inertial sensors with a chassis-mounted inertial sensor for accurate pure inertial navigation. To that end, we derive a three-stage framework, each stage with a dedicated extended Kalman filter. This framework utilizes the benefits of each mounting location (wheel/body) during the estimation process. To evaluate our proposed approach, we employed a dataset with five inertial measurement units and a total recording time of 228.6 minutes. We compare our approach with four other inertial baselines and demonstrate an average position error of 11.4m, which is $2.4\%$ of the average traveled distance, using two wheel-mounted and one body-mounted inertial measurement unit. As a consequence, our proposed method enables robust navigation in challenging environments and helps bridge the pure-inertial performance gap.
☆ Vehicle Painting Robot Path Planning Using Hierarchical Optimization
In vehicle production factories, the vehicle painting process employs multiple robotic arms to simultaneously apply paint to car bodies advancing along a conveyor line. Designing paint paths for these robotic arms, which involves assigning car body areas to arms and determining paint sequences for each arm, remains a time-consuming manual task for engineers, indicating the demand for automation and design time reduction. The unique constraints of the painting process hinder the direct application of conventional robotic path planning techniques, such as those used in welding. Therefore, this paper formulates the design of paint paths as a hierarchical optimization problem, where the upper-layer subproblem resembles a vehicle routing problem (VRP), and the lower-layer subproblem involves detailed path planning. This approach allows the use of different optimization algorithms at each layer, and permits flexible handling of constraints specific to the vehicle painting process through the design of variable representation, constraints, repair operators, and an initialization process at the upper and lower layers. Experiments with three commercially available vehicle models demonstrated that the proposed method can automatically design paths that satisfy all constraints for vehicle painting with quality comparable to those created manually by engineers.
☆ SLAP: Slapband-based Autonomous Perching Drone with Failure Recovery for Vertical Tree Trunks
Perching allows unmanned aerial vehicles (UAVs) to reduce energy consumption, remain anchored for surface sampling operations, or stably survey their surroundings. Previous efforts for perching on vertical surfaces have predominantly focused on lightweight mechanical design solutions with relatively scant system-level integration. Furthermore, perching strategies for vertical surfaces commonly require high-speed, aggressive landing operations that are dangerous for a surveyor drone with sensitive electronics onboard. This work presents a preliminary investigation of a perching approach suitable for larger drones that both perches gently on vertical tree trunks and reacts to and recovers from perch failures. The system in this work, called SLAP, consists of a vision-based perch-site detector, an IMU (inertial measurement unit)-based perch-failure detector, an attitude controller for soft perching, an optical close-range detection system, and a fast active elastic gripper with microspines made from commercially available slapbands. We validated this approach on a modified 1.2 kg commercial quadrotor with component and system analysis. Initial human-in-the-loop autonomous indoor flight experiments achieved a 75% perch success rate on a real oak tree segment across 20 flights, and 100% perch failure recovery across 2 flights with induced failures.
comment: Paper accepted to IEEE Aerospace Conference 2026. This is a pre-print
☆ Application Research of a Deep Learning Model Integrating CycleGAN and YOLO in PCB Infrared Defect Detection
This paper addresses the critical bottleneck of infrared (IR) data scarcity in Printed Circuit Board (PCB) defect detection by proposing a cross-modal data augmentation framework integrating CycleGAN and YOLOv8. Unlike conventional methods relying on paired supervision, we leverage CycleGAN to perform unpaired image-to-image translation, mapping abundant visible-light PCB images into the infrared domain. This generative process synthesizes high-fidelity pseudo-IR samples that preserve the structural semantics of defects while accurately simulating thermal distribution patterns. Subsequently, we construct a heterogeneous training strategy that fuses generated pseudo-IR data with limited real IR samples to train a lightweight YOLOv8 detector. Experimental results demonstrate that this method effectively enhances feature learning under low-data conditions. The augmented detector significantly outperforms models trained on limited real data alone and approaches the performance benchmarks of fully supervised training, proving the efficacy of pseudo-IR synthesis as a robust augmentation strategy for industrial inspection.
comment: 8 pages, 8 figures
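The augment-then-detect pipeline can be sketched in a few lines, assuming a trained CycleGAN generator (PyTorch) and the standard Ultralytics YOLOv8 API; the dataset YAML and the generator module are placeholders:

```python
import torch
from ultralytics import YOLO

@torch.no_grad()
def synthesize_pseudo_ir(generator, vis_batch):
    """Map visible-light PCB images into the infrared domain with the
    CycleGAN generator; the defect layout is preserved by construction."""
    generator.eval()
    return generator(vis_batch)

# After writing pseudo-IR images alongside the limited real IR set into a
# single YOLO-format dataset, train the lightweight detector as usual.
model = YOLO("yolov8n.pt")
model.train(data="pcb_ir_mixed.yaml", epochs=100, imgsz=640)
```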
☆ CropNeRF: A Neural Radiance Field-Based Framework for Crop Counting
Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image-based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly accurate crop counts. Furthermore, our method eliminates the dependence on crop-specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.
comment: 8 pages, 10 figures, and 2 tables
☆ SLEI3D: Simultaneous Exploration and Inspection via Heterogeneous Fleets under Limited Communication
Robotic fleets such as unmanned aerial and ground vehicles have been widely used for routine inspections of static environments, where the areas of interest are known and planned in advance. However, in many applications, such areas of interest are unknown and should be identified online during exploration. Thus, this paper considers the problem of simultaneous exploration and inspection of unknown environments, followed by real-time communication to a mobile ground control station to report the findings. The heterogeneous robots are equipped with different sensors, e.g., long-range lidars for fast exploration and close-range cameras for detailed inspection. Furthermore, global communication is often unavailable in such environments, where the robots can only communicate with each other via ad-hoc wireless networks when they are in close proximity and free of obstruction. This work proposes a novel planning and coordination framework (SLEI3D) that integrates online strategies for collaborative 3D exploration, adaptive inspection, and timely communication (via intermittent or proactive protocols). To account for uncertainties w.r.t. the number and location of features, a multi-layer and multi-rate planning mechanism is developed for inter- and intra-robot subgroups, to actively meet and coordinate their local plans. The proposed framework is validated extensively via high-fidelity simulations of numerous large-scale missions with up to 48 robots and 384 thousand cubic meters. Hardware experiments with 7 robots are also conducted. Project website is available at https://junfengchen-robotics.github.io/SLEI3D/.
♻ ☆ Symphony: A Heuristic Normalized Calibrated Advantage Actor and Critic Algorithm in application for Humanoid Robots
It is a misconception to think that humans learn fast: learning takes time. Babies start learning to move in the restricted liquid environment of the womb; children are often limited by underdeveloped bodies; even adults are not allowed to enter complex competitions right away. With robots learning from scratch, however, we rarely have the privilege of waiting dozens of millions of steps. Our "swaddling" regularization restrains an agent during rapid but unstable development by penalizing action strength in a specific way that does not affect the actions directly. Symphony, a Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of ideas that makes training humanoid robots from scratch possible with Sample Efficiency, Sample Proximity, and Safety of Actions in mind. It is no secret that continuously increasing Gaussian noise without appropriate smoothing is harmful to motors and gearboxes. Compared with stochastic algorithms, we set limited parametric noise and promote reduced action strength, safely increasing entropy, since the actions are effectively immersed in weaker noise; when actions require more extreme values, they rise above the weak noise. Training becomes empirically much safer both for the surrounding environment and for the robot's mechanisms. We use a Fading Replay Buffer: a fixed formula containing the hyperbolic tangent adjusts the batch sampling probability so that the memory holds both a recent-memory region and a long-term-memory trail. The Fading Replay Buffer enables Temporal Advantage, in which we improve the current critic network's prediction relative to its exponential moving average. Temporal Advantage lets us update the actor and critic in one pass, combine them in one object, and implement their losses in one line.
comment: https://github.com/SuspensionRailway/symphony
♻ ☆ Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training
General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/
♻ ☆ MDE-AgriVLN: Agricultural Vision-and-Language Navigation with Monocular Depth Estimation
Agricultural robots are serving as powerful assistants across a wide range of agricultural tasks; nevertheless, they still rely heavily on manual operation or railway systems for movement. The AgriVLN method and the A2A benchmark pioneered the extension of Vision-and-Language Navigation (VLN) to the agricultural domain, enabling a robot to navigate to a target position following a natural language instruction. Unlike human binocular vision, most agricultural robots are given only a single camera for monocular vision, which results in limited spatial perception. To bridge this gap, we present the method of Agricultural Vision-and-Language Navigation with Monocular Depth Estimation (MDE-AgriVLN), in which we propose the MDE module, generating depth features from RGB images to assist the decision-maker in multimodal reasoning. When evaluated on the A2A benchmark, our MDE-AgriVLN method successfully increases Success Rate from 0.23 to 0.32 and decreases Navigation Error from 4.43m to 4.08m, demonstrating state-of-the-art performance in the agricultural VLN domain. Code: https://github.com/AlexTraveling/MDE-AgriVLN.
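An off-the-shelf way to obtain the kind of depth features an MDE module provides is a pretrained monocular estimator such as MiDaS via torch.hub; the paper's own depth network may differ, so this is purely illustrative:

```python
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

@torch.no_grad()
def depth_features(rgb_np):
    """RGB frame (H, W, 3, uint8) -> relative inverse-depth map, usable as
    an extra input channel for the navigation decision-maker."""
    batch = transform(rgb_np)                  # resize + normalize
    pred = midas(batch)                        # (1, h, w) inverse depth
    return torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=rgb_np.shape[:2],
        mode="bicubic", align_corners=False).squeeze()
```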
♻ ☆ Efficient Multi-Task Scene Analysis with RGB-D Transformers IJCNN 2023
Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given limited computing and battery capabilities on mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated in a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension enabling highly optimized inference for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference with up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
comment: Published in Proc. International Joint Conference on Neural Networks (IJCNN 2023)
♻ ☆ Video-Based Detection and Analysis of Errors in Robotic Surgical Training
Robot-assisted minimally invasive surgeries offer many advantages but require complex motor tasks that take surgeons years to master. There is currently a lack of knowledge on how surgeons acquire these robotic surgical skills. Toward bridging this gap, a previous study followed surgical residents learning complex surgical dry lab tasks on a surgical robot over six months. Errors are an important measure for training and skill evaluation, but unlike in virtual simulations, in dry lab training, errors are difficult to monitor automatically. Here, we analyzed errors in the ring tower transfer task, in which surgical residents moved a ring along a curved wire as quickly and accurately as possible. We developed an image-processing algorithm using color and size thresholds, optical flow, and short-time Fourier transforms to detect collision errors, and achieved a detection accuracy of approximately 95%. Using the detected errors and task completion time, we found that the residents reduced their completion time and number of errors over the six months, while the percentage of task time spent making errors remained relatively constant on average. This analysis sheds light on the learning process of the residents and can serve as a step towards providing error-related feedback to robotic surgeons.
comment: Title change; 9 pages, 4 figures, 1 table. Alex Geftler and Ilana Nisky contributed equally to this work
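The abstract names the ingredients (color and size thresholds, optical flow, short-time Fourier transforms) without the specifics; a toy OpenCV sketch of such a collision cue might look as follows, with every threshold an arbitrary placeholder:

```python
import cv2
import numpy as np

def collision_score(prev_bgr, curr_bgr, lo=(0, 120, 70), hi=(10, 255, 255), min_area=50):
    """Toy collision cue: motion energy inside a color-segmented ring region.

    lo/hi are placeholder HSV bounds for the ring color; min_area drops specks.
    A real system (per the abstract) would also apply short-time Fourier
    transforms to the resulting signal over time.
    """
    hsv = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lo), np.array(hi))
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    keep = np.zeros_like(mask)
    for i in range(1, n):  # size threshold on connected components
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            keep[labels == i] = 255
    prev_g = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_g = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_g, curr_g, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=-1)
    return float(mag[keep > 0].mean()) if keep.any() else 0.0
```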
♻ ☆ World In Your Hands: A Large-Scale and Open-source Ecosystem for Learning Human-centric Manipulation in the Wild
Large-scale pre-training is fundamental for generalization in language and vision models, but data for dexterous hand manipulation remains limited in scale and diversity, hindering policy generalization. Limited scenario diversity, misaligned modalities, and insufficient benchmarking constrain current human manipulation datasets. To address these gaps, we introduce World In Your Hands (WiYH), a large-scale open-source ecosystem for human-centric manipulation learning. WiYH includes (1) the Oracle Suite, a wearable data collection kit with an auto-labeling pipeline for accurate motion capture; (2) the WiYH Dataset, featuring over 1,000 hours of multi-modal manipulation data across hundreds of skills in diverse real-world scenarios; and (3) extensive annotations and benchmarks supporting tasks from perception to action. Furthermore, experiments based on the WiYH ecosystem show that integrating WiYH's human-centric data significantly enhances the generalization and robustness of dexterous hand policies in tabletop manipulation tasks. We believe that World In Your Hands will bring new insights into human-centric data collection and policy learning to the community.
comment: This dataset represents the first large-scale collection of real-world, human-centric multimodal data integrating vision, language, tactile sensing, and action (VLTA)
♻ ☆ AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs -- a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives -- including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs -- an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. We release all code and datasets at https://github.com/taco-group/AutoTrust.
comment: Published at TMLR 2025
Robotics 36
☆ SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
comment: Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
☆ Coordinated Humanoid Manipulation with Choice Policies
Humanoid robots hold great promise for operating in human-centric environments, yet achieving robust whole-body coordination across the head, hands, and legs remains a major challenge. We present a system that combines a modular teleoperation interface with a scalable learning framework to address this problem. Our teleoperation design decomposes humanoid control into intuitive submodules, which include hand-eye coordination, grasp primitives, arm end-effector tracking, and locomotion. This modularity allows us to collect high-quality demonstrations efficiently. Building on this, we introduce Choice Policy, an imitation learning approach that generates multiple candidate actions and learns to score them. This architecture enables both fast inference and effective modeling of multimodal behaviors. We validate our approach on two real-world tasks: dishwasher loading and whole-body loco-manipulation for whiteboard wiping. Experiments show that Choice Policy significantly outperforms diffusion policies and standard behavior cloning. Furthermore, our results indicate that hand-eye coordination is critical for success in long-horizon tasks. Our work demonstrates a practical path toward scalable data collection and learning for coordinated humanoid manipulation in unstructured environments.
comment: Code and Website: https://choice-policy.github.io/
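The abstract describes Choice Policy only at a high level; a minimal sketch of the generate-then-score pattern (hypothetical shapes and layer choices throughout) is:

```python
import torch
import torch.nn as nn

class ChoiceHead(nn.Module):
    """Hypothetical sketch of the 'generate then score' idea: propose K candidate
    action chunks and pick the one a learned critic scores highest."""
    def __init__(self, obs_dim=256, act_dim=14, horizon=16, k=8):
        super().__init__()
        self.k, self.horizon, self.act_dim = k, horizon, act_dim
        self.proposer = nn.Linear(obs_dim, k * horizon * act_dim)  # K candidates at once
        self.scorer = nn.Sequential(
            nn.Linear(obs_dim + horizon * act_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, obs):
        B = obs.shape[0]
        cands = self.proposer(obs).view(B, self.k, self.horizon * self.act_dim)
        obs_rep = obs.unsqueeze(1).expand(-1, self.k, -1)
        scores = self.scorer(torch.cat([obs_rep, cands], dim=-1)).squeeze(-1)  # (B, K)
        best = scores.argmax(dim=1)
        chunk = cands[torch.arange(B), best].view(B, self.horizon, self.act_dim)
        return chunk, scores

chunk, scores = ChoiceHead()(torch.randn(2, 256))
```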
☆ DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
comment: Submitted to IEEE Robotics and Automation Letters (RA-L)
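A toy version of the physics-style degradation the benchmark describes (linearize, dim, add shot and read noise, re-render with a minimal gamma-only ISP), with all constants as placeholders rather than DarkEQA's calibrated values:

```python
import numpy as np

def degrade_lowlight(srgb, gain=0.05, read_std=0.01, full_well=1000.0, seed=0):
    """Toy physics-style low-light model (not DarkEQA's exact pipeline):
    linearize sRGB, scale illumination, add Poisson shot noise and Gaussian
    read noise in the linear domain, then re-apply a simple gamma 'ISP'.
    gain, read_std, and full_well are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    lin = np.clip(srgb, 0, 1) ** 2.2                         # crude inverse gamma -> linear RAW proxy
    lin = lin * gain                                         # illumination drop
    photons = rng.poisson(lin * full_well) / full_well       # shot noise
    noisy = photons + rng.normal(0.0, read_std, lin.shape)   # read noise
    return np.clip(noisy, 0, 1) ** (1 / 2.2)                 # minimal ISP: gamma only

dark = degrade_lowlight(np.random.rand(64, 64, 3))
```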
☆ Hierarchical Deformation Planning and Neural Tracking for DLOs in Constrained Environments
Deformable linear objects (DLOs) manipulation presents significant challenges due to DLOs' inherent high-dimensional state space and complex deformation dynamics. The wide-populated obstacles in realistic workspaces further complicate DLO manipulation, necessitating efficient deformation planning and robust deformation tracking. In this work, we propose a novel framework for DLO manipulation in constrained environments. This framework combines hierarchical deformation planning with neural tracking, ensuring reliable performance in both global deformation synthesis and local deformation tracking. Specifically, the deformation planner begins by generating a spatial path set that inherently satisfies the homotopic constraints associated with DLO keypoint paths. Next, a path-set-guided optimization method is applied to synthesize an optimal temporal deformation sequence for the DLO. In manipulation execution, a neural model predictive control approach, leveraging a data-driven deformation model, is designed to accurately track the planned DLO deformation sequence. The effectiveness of the proposed framework is validated in extensive constrained DLO manipulation tasks.
☆ MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control
Achieving provable stability in model-free reinforcement learning (RL) remains a challenge, particularly in balancing exploration with rigorous safety. This article introduces MSACL, a framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. Unlike methods relying on complex reward engineering, MSACL utilizes off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions. By introducing Exponential Stability Labels (ESL) and a $λ$-weighted aggregation mechanism, the framework effectively balances the bias-variance trade-off in multi-step learning. Policy optimization is guided by a stability-aware advantage function, ensuring the learned policy promotes rapid Lyapunov descent. We evaluate MSACL across six benchmarks, including stabilization and nonlinear tracking tasks, demonstrating its superiority over state-of-the-art Lyapunov-based RL algorithms. MSACL achieves exponential stability and rapid convergence under simple rewards, while exhibiting significant robustness to uncertainties and generalization to unseen trajectories. Sensitivity analysis establishes the multi-step horizon $n=20$ as a robust default across diverse systems. By linking Lyapunov theory with off-policy actor-critic frameworks, MSACL provides a foundation for verifiably safe learning-based control. Source code and benchmark environments will be made publicly available.
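The precise form of the Exponential Stability Labels and the $λ$-weighted aggregation is not given in the abstract; the sketch below is one plausible stand-in that penalizes violations of an exponential-decay condition over multiple horizons with TD($λ$)-style weights:

```python
import numpy as np

def lam_weighted_stability_loss(V, alpha=0.05, lam=0.9, n_max=20):
    """Illustrative stand-in for multi-step Lyapunov certificate learning:
    penalize violations of exponential decay V[t+n] <= exp(-alpha*n) * V[t]
    over horizons n = 1..n_max, aggregated with TD(lambda)-style weights.
    MSACL's exact labels and weighting may differ; alpha and lam are placeholders."""
    T = len(V)
    total, wsum = 0.0, 0.0
    for n in range(1, n_max + 1):
        w = (1 - lam) * lam ** (n - 1)
        viol = np.maximum(0.0, V[n:T] - np.exp(-alpha * n) * V[: T - n])
        if viol.size:
            total += w * viol.mean()
            wsum += w
    return total / max(wsum, 1e-8)

loss = lam_weighted_stability_loss(np.exp(-0.1 * np.arange(100)) + 0.01)
```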
☆ VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.
☆ ArtiSG: Functional 3D Scene Graph Construction via Human-demonstrated Articulated Objects Manipulation
3D scene graphs have empowered robots with semantic understanding for navigation and planning, yet they often lack the functional information required for physical manipulation, particularly regarding articulated objects. Existing approaches for inferring articulation mechanisms from static observations are prone to visual ambiguity, while methods that estimate parameters from state changes typically rely on constrained settings such as fixed cameras and unobstructed views. Furthermore, fine-grained functional elements like small handles are frequently missed by general object detectors. To bridge this gap, we present ArtiSG, a framework that constructs functional 3D scene graphs by encoding human demonstrations into structured robotic memory. Our approach leverages a robust articulation data collection pipeline utilizing a portable setup to accurately estimate 6-DoF articulation trajectories and axes even under camera ego-motion. We integrate these kinematic priors into a hierarchical and open-vocabulary graph while utilizing interaction data to discover inconspicuous functional elements missed by visual perception. Extensive real-world experiments demonstrate that ArtiSG significantly outperforms baselines in functional element recall and articulation estimation precision. Moreover, we show that the constructed graph serves as a reliable functional memory that effectively guides robots to perform language-directed manipulation tasks in real-world environments containing diverse articulated objects.
☆ CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture
Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem, we propose CropTrack, a novel MOT framework that combines appearance and motion information. CropTrack integrates reranking-enhanced appearance association, a one-to-many association with an appearance-based conflict-resolution strategy, and an exponential-moving-average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.
comment: 8 pages, 5 figures, and 3 tables
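A minimal sketch of an exponential-moving-average prototype feature bank of the kind the abstract mentions; the momentum value and cosine matching are illustrative choices, not CropTrack's exact design:

```python
import numpy as np

class EMAPrototypeBank:
    """Sketch of an EMA appearance bank for tracking: each track keeps one
    prototype embedding, updated per matched detection."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.protos = {}  # track_id -> unit-norm embedding

    def update(self, track_id, feat):
        f = feat / (np.linalg.norm(feat) + 1e-8)
        if track_id in self.protos:
            p = self.momentum * self.protos[track_id] + (1 - self.momentum) * f
            self.protos[track_id] = p / (np.linalg.norm(p) + 1e-8)
        else:
            self.protos[track_id] = f

    def similarity(self, track_id, feat):
        f = feat / (np.linalg.norm(feat) + 1e-8)
        return float(self.protos[track_id] @ f)  # cosine similarity in [-1, 1]

bank = EMAPrototypeBank()
bank.update(3, np.random.randn(128))
score = bank.similarity(3, np.random.randn(128))
```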
☆ Explaining Why Things Go Where They Go: Interpretable Constructs of Human Organizational Preferences
Robotic systems for household object rearrangement often rely on latent preference models inferred from human demonstrations. While effective at prediction, these models offer limited insight into the interpretable factors that guide human decisions. We introduce an explicit formulation of object arrangement preferences along four interpretable constructs: spatial practicality (putting items where they naturally fit best in the space), habitual convenience (making frequently used items easy to reach), semantic coherence (placing items together if they are used for the same task or are contextually related), and commonsense appropriateness (putting things where people would usually expect to find them). To capture these constructs, we designed and validated a self-report questionnaire through a 63-participant online study. Results confirm the psychological distinctiveness of these constructs and their explanatory power across two scenarios (kitchen and living room). We demonstrate the utility of these constructs by integrating them into a Monte Carlo Tree Search (MCTS) planner and show that when guided by participant-derived preferences, our planner can generate reasonable arrangements that closely align with those generated by participants. This work contributes a compact, interpretable formulation of object arrangement preferences and a demonstration of how it can be operationalized for robot planning.
comment: Accepted to the 2026 ACM/IEEE International Conference on Human-Robot Interaction (HRI '26)
☆ Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow
Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories, including rigid, articulated, deformable, and granular objects. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at https://dream2flow.github.io/.
comment: Project website: https://dream2flow.github.io/
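Once 3D object flow is reconstructed, "manipulation as object trajectory tracking" reduces, at its simplest, to a tracking objective like the toy cost below (real pipelines would add smoothness and feasibility terms):

```python
import numpy as np

def flow_tracking_cost(pred_points, target_flow):
    """Toy objective for trajectory tracking: pred_points and target_flow are
    (T, N, 3) arrays of tracked object points, under the candidate action
    sequence and from the generated video respectively. This is an
    illustration, not Dream2Flow's exact cost."""
    return float(np.mean(np.sum((pred_points - target_flow) ** 2, axis=-1)))

cost = flow_tracking_cost(np.zeros((10, 50, 3)), np.random.randn(10, 50, 3))
```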
☆ Control of Microrobots with Reinforcement Learning under On-Device Compute Constraints
An important function of autonomous microrobots is the ability to perform robust movement over terrain. This paper explores an edge ML approach to microrobot locomotion, allowing for on-device, lower latency control under compute, memory, and power constraints. This paper explores the locomotion of a sub-centimeter quadrupedal microrobot via reinforcement learning (RL) and deploys the resulting controller on an ultra-small system-on-chip (SoC), SC$μ$M-3C, featuring an ARM Cortex-M0 microcontroller running at 5 MHz. We train a compact FP32 multilayer perceptron (MLP) policy with two hidden layers ($[128, 64]$) in a massively parallel GPU simulation and enhance robustness by utilizing domain randomization over simulation parameters. We then study integer (Int8) quantization (per-tensor and per-feature) to allow for higher inference update rates on our resource-limited hardware, and we connect hardware power budgets to achievable update frequency via a cycles-per-update model for inference on our Cortex-M0. We propose a resource-aware gait scheduling viewpoint: given a device power budget, we can select the gait mode (trot/intermediate/gallop) that maximizes expected RL reward at a corresponding feasible update frequency. Finally, we deploy our MLP policy on a real-world large-scale robot on uneven terrain, qualitatively noting that domain-randomized training can improve out-of-distribution stability. We do not claim real-world large-robot empirical zero-shot transfer in this work.
comment: 9 pages, 10 figures
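Two of the paper's ingredients are easy to sketch: symmetric per-tensor Int8 quantization of the policy weights, and a back-of-envelope cycles-per-update model linking clock rate to achievable control frequency. The constants below are illustrative, not measured:

```python
import numpy as np

def quantize_per_tensor_int8(W):
    """Symmetric per-tensor Int8 quantization of one weight matrix."""
    scale = np.abs(W).max() / 127.0 + 1e-12
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale  # dequantize as q.astype(np.float32) * scale

def max_update_hz(layer_sizes=(32, 128, 64, 8), clk_hz=5e6, cycles_per_mac=2, overhead=5000):
    """Toy cycles-per-update model for an MLP on a Cortex-M0-class core.
    cycles_per_mac and overhead are placeholder values."""
    macs = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    return clk_hz / (macs * cycles_per_mac + overhead)

q, s = quantize_per_tensor_int8(np.random.randn(128, 64))
print(max_update_hz())  # rough feasible policy update frequency at 5 MHz
```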
☆ LSRE: Latent Semantic Rule Encoding for Real-Time Semantic Risk Detection in Autonomous Driving
Real-world autonomous driving must adhere to complex human social rules that extend beyond legally codified traffic regulations. Many of these semantic constraints, such as yielding to emergency vehicles, complying with traffic officers' gestures, or stopping for school buses, are intuitive for humans yet difficult to encode explicitly. Although large vision-language models (VLMs) can interpret such semantics, their inference cost makes them impractical for real-time deployment. This work proposes LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within the latent space of a recurrent world model. By encoding language-defined safety semantics into a lightweight latent classifier, LSRE enables real-time semantic risk assessment at 10 Hz without per-frame VLM queries. Experiments on six semantic-failure scenarios in CARLA demonstrate that LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency. LSRE further generalizes to rarely seen semantically similar test cases, indicating that language-guided latent classification offers an effective and deployable mechanism for semantic safety monitoring in autonomous driving.
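The core recipe, as described, amounts to fitting a lightweight classifier on world-model latents from sparse VLM judgments; a synthetic-data sketch (hypothetical shapes, stand-in labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustration only: query a VLM on a sparse subset of frames, then fit a
# lightweight classifier on the world model's latents so that per-frame risk
# scoring runs in real time. All data below is synthetic.
latents = np.random.randn(5000, 128)                   # world-model latent states
queried = np.random.choice(5000, 200, replace=False)   # sparse VLM queries
labels = (latents[queried, 0] > 0).astype(int)         # stand-in for VLM risk judgments

clf = LogisticRegression(max_iter=1000).fit(latents[queried], labels)
risk = clf.predict_proba(latents)[:, 1]                # cheap per-frame semantic risk
```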
☆ Dynamic Policy Learning for Legged Robot with Simplified Model Pretraining and Model Homotopy Transfer
Generating dynamic motions for legged robots remains a challenging problem. While reinforcement learning has achieved notable success in various legged locomotion tasks, producing highly dynamic behaviors often requires extensive reward tuning or high-quality demonstrations. Leveraging reduced-order models can help mitigate these challenges. However, the model discrepancy poses a significant challenge when transferring policies to full-body dynamics environments. In this work, we introduce a continuation-based learning framework that combines simplified model pretraining and model homotopy transfer to efficiently generate and refine complex dynamic behaviors. First, we pretrain the policy using a single rigid body model to capture core motion patterns in a simplified environment. Next, we employ a continuation strategy to progressively transfer the policy to the full-body environment, minimizing performance loss. To define the continuation path, we introduce a model homotopy from the single rigid body model to the full-body model by gradually redistributing mass and inertia between the trunk and legs. The proposed method not only achieves faster convergence but also demonstrates superior stability during the transfer process compared to baseline methods. Our framework is validated on a range of dynamic tasks, including flips and wall-assisted maneuvers, and is successfully deployed on a real quadrupedal robot.
comment: 8 pages. Submitted to the IEEE for possible publication
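The mass-redistribution homotopy admits a compact sketch: interpolate how much of each leg's mass is lumped into the trunk as the continuation parameter s goes from 0 (single rigid body) to 1 (full model). Values below are placeholders, and inertia tensors would be interpolated analogously:

```python
def homotopy_masses(total_mass, leg_masses_full, s):
    """Mass-redistribution homotopy (illustrative): at s=0 all mass is lumped
    in the trunk (single-rigid-body model); at s=1 the masses match the
    full-body model."""
    legs = [s * m for m in leg_masses_full]
    trunk = total_mass - sum(legs)
    return trunk, legs

# Curriculum over training: gradually move from s=0 to s=1.
for s in [0.0, 0.25, 0.5, 0.75, 1.0]:
    trunk, legs = homotopy_masses(total_mass=15.0, leg_masses_full=[1.0] * 4, s=s)
    print(s, trunk, legs)
```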
☆ CREPES-X: Hierarchical Bearing-Distance-Inertial Direct Cooperative Relative Pose Estimation System
Relative localization is critical for cooperation in autonomous multi-robot systems. Existing approaches either rely on shared environmental features or inertial assumptions, or suffer from non-line-of-sight degradation and outliers in complex environments. Robust and efficient fusion of inter-robot measurements such as bearings, distances, and inertial data for tens of robots remains challenging. We present CREPES-X (Cooperative RElative Pose Estimation System with multiple eXtended features), a hierarchical relative localization framework that enhances speed, accuracy, and robustness under challenging conditions, without requiring any global information. CREPES-X starts with a compact hardware design: InfraRed (IR) LEDs, an IR camera, an ultra-wideband module, and an IMU housed in a cube no larger than 6 cm on each side. CREPES-X then implements a two-stage hierarchical estimator to meet different requirements for speed, accuracy, and robustness. First, we propose a single-frame relative estimator that provides instant relative poses for multi-robot setups through a closed-form solution and robust bearing outlier rejection. Then a multi-frame relative estimator is designed to offer accurate and robust relative states by exploiting IMU pre-integration via robocentric relative kinematics with loosely and tightly coupled optimization. Extensive simulations and real-world experiments validate the effectiveness of CREPES-X, showing robustness to up to 90% bearing outliers, proving resilience in challenging conditions, and achieving RMSE of 0.073 m and 1.817° in real-world datasets.
comment: 21 pages, 23 figures, journal
☆ ReSPIRe: Informative and Reusable Belief Tree Search for Robot Probabilistic Search and Tracking in Unknown Environments
Target search and tracking (SAT) is a fundamental problem for various robotic applications such as search and rescue and environmental exploration. This paper proposes an informative trajectory planning approach, namely ReSPIRe, for SAT in unknown cluttered environments under considerably inaccurate prior target information and limited sensing field of view. We first develop a novel sigma point-based approximation approach to fast and accurately estimate mutual information reward under non-Gaussian belief distributions, utilizing informative sampling in state and observation spaces to mitigate the computational intractability of integral calculation. To tackle significant uncertainty associated with inadequate prior target information, we propose the hierarchical particle structure in ReSPIRe, which not only extracts critical particles for global route guidance, but also adjusts the particle number adaptively for planning efficiency. Building upon the hierarchical structure, we develop the reusable belief tree search approach to build a policy tree for online trajectory planning under uncertainty, which reuses rollout evaluation to improve planning efficiency. Extensive simulations and real-world experiments demonstrate that ReSPIRe outperforms representative benchmark methods with smaller MI approximation error, higher search efficiency, and more stable tracking performance, while maintaining outstanding computational efficiency.
comment: 17 pages, 12 figures, accepted to IEEE Transactions on Systems, Man, and Cybernetics: Systems
☆ VLA-RAIL: A Real-Time Asynchronous Inference Linker for VLA Models and Robots
Vision-Language-Action (VLA) models have achieved remarkable breakthroughs in robotics, with the action chunk playing a dominant role in these advances. Given the real-time and continuous nature of robotic motion control, the strategy for fusing a queue of successive action chunks has a profound impact on the overall performance of VLA models. Existing methods suffer from jitter, stalling, or even pauses in robotic action execution, which not only limits the achievable execution speed but also reduces the overall task success rate. This paper introduces VLA-RAIL (A Real-Time Asynchronous Inference Linker), a novel framework designed to address these issues by running model inference and robot motion control asynchronously while guaranteeing smooth, continuous, and high-speed action execution. The core contributions of the paper are twofold: a Trajectory Smoother that filters out the noise and jitter in the trajectory of one action chunk using polynomial fitting, and a Chunk Fuser that seamlessly aligns the currently executing trajectory with the newly arrived chunk, ensuring position, velocity, and acceleration continuity between two successive action chunks. We validate the effectiveness of VLA-RAIL on a benchmark of dynamic simulation tasks and several real-world manipulation tasks. Experimental results demonstrate that VLA-RAIL significantly reduces motion jitter, enhances execution speed, and improves task success rates, making it a key piece of infrastructure for the large-scale deployment of VLA models.
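A minimal NumPy sketch of the two components as described, with the blending profile an assumption on our part: a quintic weight whose value, slope, and curvature vanish at both ends preserves position, velocity, and acceleration continuity at the seam:

```python
import numpy as np

def smooth_chunk(chunk, deg=5):
    """Trajectory Smoother (illustrative): per-joint polynomial fit over one chunk.
    chunk has shape (T, n_joints); deg is a placeholder polynomial degree."""
    t = np.linspace(0.0, 1.0, len(chunk))
    return np.stack(
        [np.polyval(np.polyfit(t, chunk[:, j], deg), t) for j in range(chunk.shape[1])],
        axis=1)

def fuse_chunks(current_tail, new_chunk):
    """Chunk Fuser (illustrative): blend the still-executing tail into the new
    chunk with a quintic weight (zero value/slope/curvature at both ends), so
    position, velocity, and acceleration stay continuous across the seam."""
    n = min(len(current_tail), len(new_chunk))
    u = np.linspace(0.0, 1.0, n)[:, None]
    w = 10 * u**3 - 15 * u**4 + 6 * u**5
    fused = (1 - w) * current_tail[:n] + w * new_chunk[:n]
    return np.concatenate([fused, new_chunk[n:]], axis=0)

out = fuse_chunks(smooth_chunk(np.random.randn(20, 7)), smooth_chunk(np.random.randn(30, 7)))
```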
☆ Antagonistic Bowden-Cable Actuation of a Lightweight Robotic Hand: Toward Dexterous Manipulation for Payload Constrained Humanoids
Humanoid robots toward human-level dexterity require robotic hands capable of simultaneously providing high grasping force, rapid actuation speeds, multiple degrees of freedom, and lightweight structures within human-like size constraints. Meeting these conflicting requirements remains challenging, as satisfying this combination typically necessitates heavier actuators and bulkier transmission systems, significantly restricting the payload capacity of robot arms. In this letter, we present a lightweight anthropomorphic hand actuated by Bowden cables, which uniquely combines rolling-contact joint optimization with antagonistic cable actuation, enabling single-motor-per-joint control with negligible cable-length deviation. By relocating the actuator module to the torso, the design substantially reduces distal mass while maintaining anthropomorphic scale and dexterity. Additionally, this antagonistic cable actuation eliminates the need for synchronization between motors. Using the proposed methods, the hand assembly with a distal mass of 236 g (excluding remote actuators and Bowden sheaths) demonstrated reliable execution of dexterous tasks, exceeding 18 N fingertip force and lifting payloads over one hundred times its own mass. Furthermore, robustness was validated through Cutkosky taxonomy grasps and trajectory consistency under perturbed actuator-hand transformations.
comment: Preprint
☆ RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence
While data-driven imitation learning has revolutionized robotic manipulation, current approaches remain constrained by the scarcity of large-scale, diverse real-world demonstrations. Consequently, the ability of existing models to generalize across long-horizon bimanual tasks and mobile manipulation in unstructured environments remains limited. To bridge this gap, we present RoboMIND 2.0, a comprehensive real-world dataset comprising over 310K dual-arm manipulation trajectories collected across six distinct robot embodiments and 739 complex tasks. Crucially, to support research in contact-rich and spatially extended tasks, the dataset incorporates 12K tactile-enhanced episodes and 20K mobile manipulation trajectories. Complementing this physical data, we construct high-fidelity digital twins of our real-world environments, releasing an additional 20K-trajectory simulated dataset to facilitate robust sim-to-real transfer. To fully exploit the potential of RoboMIND 2.0, we propose the MIND-2 system, a hierarchical dual-system framework optimized via offline reinforcement learning. MIND-2 integrates a high-level semantic planner (MIND-2-VLM) to decompose abstract natural language instructions into grounded subgoals, coupled with a low-level Vision-Language-Action executor (MIND-2-VLA), which generates precise, proprioception-aware motor actions.
☆ Hybrid Motion Planning with Deep Reinforcement Learning for Mobile Robot Navigation
Autonomous mobile robots operating in complex, dynamic environments face the dual challenge of navigating large-scale, structurally diverse spaces with static obstacles while safely interacting with various moving agents. Traditional graph-based planners excel at long-range pathfinding but lack reactivity, while Deep Reinforcement Learning (DRL) methods demonstrate strong collision avoidance but often fail to reach distant goals due to a lack of global context. We propose Hybrid Motion Planning with Deep Reinforcement Learning (HMP-DRL), a hybrid framework that bridges this gap. Our approach utilizes a graph-based global planner to generate a path, which is integrated into a local DRL policy via a sequence of checkpoints encoded in both the state space and reward function. To ensure social compliance, the local planner employs an entity-aware reward structure that dynamically adjusts safety margins and penalties based on the semantic type of surrounding agents. We validate the proposed method through extensive testing in a realistic simulation environment derived from real-world map data. Comprehensive experiments demonstrate that HMP-DRL consistently outperforms other methods, including state-of-the-art approaches, in terms of key metrics of robot navigation: success rate, collision rate, and time to reach the goal. Overall, these findings confirm that integrating long-term path guidance with semantically-aware local control significantly enhances both the safety and reliability of autonomous navigation in complex human-centric settings.
comment: 22 pages, 4 figures
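The checkpoint-in-the-reward idea admits a small sketch; tolerances and weights below are placeholders, and the paper's entity-aware safety terms are omitted:

```python
import numpy as np

def checkpoint_reward(robot_xy, checkpoints, idx, reach_tol=0.5, bonus=1.0, step_cost=0.01):
    """Sketch of folding a global path into a local DRL reward: the agent is
    rewarded for consuming checkpoints along the planner's path in order.
    Returns the reward and the index of the next unvisited checkpoint."""
    r = -step_cost
    if idx < len(checkpoints) and np.linalg.norm(robot_xy - checkpoints[idx]) < reach_tol:
        r += bonus
        idx += 1
    return r, idx

r, idx = checkpoint_reward(np.array([1.0, 0.4]), [np.array([1.0, 0.5])], idx=0)
```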
☆ Resolving State Ambiguity in Robot Manipulation via Adaptive Working Memory Recoding
State ambiguity is common in robotic manipulation: identical observations may correspond to multiple valid behavior trajectories. A visuomotor policy must extract the appropriate types and levels of information from the history to identify the current task phase. However, naively extending the history window is computationally expensive and may cause severe overfitting. Inspired by the continuous nature of human reasoning and the recoding of working memory, we introduce PAM, a novel visuomotor Policy equipped with Adaptive working Memory. Trained in two stages with minimal additional cost, PAM supports a 300-frame history window while maintaining high inference speed. Specifically, a hierarchical frame feature extractor yields two distinct representations, one for motion primitives and one for temporal disambiguation. For compact representation, a context router with range-specific queries produces compact context features across multiple history lengths, and an auxiliary objective of reconstructing historical information ensures that the context router acts as an effective bottleneck. We meticulously design 7 tasks and verify that PAM can handle multiple scenarios of state ambiguity simultaneously. With a history window of approximately 10 seconds, PAM still supports stable training and maintains inference speeds above 20 Hz. Project website: https://tinda24.github.io/pam/
☆ DISF: Disentangled Iterative Surface Fitting for Contact-stable Grasp Planning with Grasp Pose Alignment to the Object Center of Mass
In this work, we address a limitation of surface-fitting-based grasp planning algorithms, which primarily focus on geometric alignment between the gripper and the object surface while overlooking the stability of the contact point distribution, often resulting in unstable grasps due to inadequate contact configurations. To overcome this limitation, we propose a novel surface fitting algorithm that integrates contact stability while preserving geometric compatibility. Inspired by human grasping behavior, our method disentangles the grasp pose optimization into three sequential steps: (1) rotation optimization to align contact normals, (2) translation refinement to improve the alignment between the gripper frame origin and the object Center of Mass (CoM), and (3) gripper aperture adjustment to optimize the contact point distribution. We validate our approach in simulation across 15 objects under both Known-shape (with a clean CAD-derived dataset) and Observed-shape (with the YCB object dataset) settings, including cross-platform grasp execution on three robot-gripper platforms. We further validate the method in real-world grasp experiments on a UR3e robot. Overall, DISF reduces CoM misalignment while maintaining geometric compatibility, translating into higher grasp success in both simulation and real-world execution compared to baselines. Additional videos and supplementary results are available on our project page: https://tomoya-yamanokuchi.github.io/disf-ras-project-page/
comment: 48 pages
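The three disentangled steps can be sketched as follows; the update rules, step sizes, and aperture limit here are illustrative assumptions, not DISF's exact formulation:

```python
import numpy as np

def rotation_between(a, b):
    """Smallest rotation matrix taking unit vector a onto unit vector b (Rodrigues)."""
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
    v, c = np.cross(a, b), float(a @ b)
    if c < -1 + 1e-9:  # antiparallel: rotate pi about any axis orthogonal to a
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        K = np.array([[0, -axis[2], axis[1]], [axis[2], 0, -axis[0]], [-axis[1], axis[0], 0]])
        return np.eye(3) + 2 * K @ K  # Rodrigues formula with theta = pi
    K = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + K + K @ K / (1 + c)

def disf_step(grasp_R, grasp_t, finger_normal, contact_normal, com, width):
    """One illustrative pass over the abstract's three steps:
    (1) rotate so the gripper's contact normal aligns with the object's,
    (2) translate the gripper-frame origin toward the object CoM,
    (3) set the aperture to the local object width. Inputs are assumptions."""
    grasp_R = rotation_between(grasp_R @ finger_normal, contact_normal) @ grasp_R  # (1)
    grasp_t = grasp_t + 0.5 * (com - grasp_t)                                      # (2) partial step
    aperture = float(np.clip(width, 0.0, 0.08))                                    # (3) placeholder limit
    return grasp_R, grasp_t, aperture
```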
☆ Compositional Diffusion with Guided search for Long-Horizon Planning
Generative models have emerged as powerful tools for planning, with compositional approaches offering particular promise for modeling long-horizon task distributions by composing together local, modular generative models. This compositional paradigm spans diverse domains, from multi-step manipulation planning to panoramic image synthesis to long video generation. However, compositional generative models face a critical challenge: when local distributions are multimodal, existing composition methods average incompatible modes, producing plans that are neither locally feasible nor globally coherent. We propose Compositional Diffusion with Guided Search (CDGS), which addresses this "mode averaging" problem by embedding search directly within the diffusion denoising process. Our method explores diverse combinations of local modes through population-based sampling, prunes infeasible candidates using likelihood-based filtering, and enforces global consistency through iterative resampling between overlapping segments. CDGS matches oracle performance on seven robot manipulation tasks, outperforming baselines that lack compositionality or require long-horizon training data. The approach generalizes across domains, enabling coherent text-guided panoramic images and long videos through effective local-to-global message passing. More details: https://cdgsearch.github.io/
comment: 38 pages, 18 figures
☆ GRL-SNAM: Geometric Reinforcement Learning with Path Differential Hamiltonians for Simultaneous Navigation and Mapping in Unknown Environments
We present GRL-SNAM, a geometric reinforcement learning framework for Simultaneous Navigation and Mapping (SNAM) in unknown environments. SNAM is challenging because it requires designing hierarchical or joint policies for multiple agents that control the movement of a real robot toward the goal in a mapless environment, i.e., an environment whose map is not available a priori and must be acquired through sensors. The sensors are invoked by the path learner (the navigator) through active queries to sensory agents along the motion path. GRL-SNAM differs from preemptive navigation algorithms and other reinforcement learning methods by relying exclusively on local sensory observations without constructing a global map. Our approach formulates path navigation and mapping as a dynamic shortest-path search and discovery process using controlled Hamiltonian optimization: sensory inputs are translated into local energy landscapes that encode reachability, obstacle barriers, and deformation constraints, while policies for sensing, planning, and reconfiguration evolve stagewise via updated Hamiltonians. A reduced Hamiltonian serves as an adaptive score function, updating kinetic and potential terms, embedding barrier constraints, and continuously refining trajectories as new local information arrives. We evaluate GRL-SNAM on two different 2D navigation tasks. Compared against local reactive baselines and global policy learning references under identical stagewise sensing constraints, it preserves clearance, generalizes to unseen layouts, and demonstrates that geometric RL via updated Hamiltonians enables high-quality navigation with minimal exploration through local energy refinement rather than extensive global mapping. The code is publicly available at https://github.com/CVC-Lab/GRL-SNAM.
☆ Reinforcement learning with timed constraints for robotics motion planning
Robotic systems operating in dynamic and uncertain environments increasingly require planners that satisfy complex task sequences while adhering to strict temporal constraints. Metric Interval Temporal Logic (MITL) offers a formal and expressive framework for specifying such time-bounded requirements; however, integrating MITL with reinforcement learning (RL) remains challenging due to stochastic dynamics and partial observability. This paper presents a unified automata-based RL framework for synthesizing policies in both Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) under MITL specifications. MITL formulas are translated into Timed Limit-Deterministic Generalized Büchi Automata (Timed-LDGBA) and synchronized with the underlying decision process to construct product timed models suitable for Q-learning. A simple yet expressive reward structure enforces temporal correctness while allowing additional performance objectives. The approach is validated in three simulation studies: a $5 \times 5$ grid-world formulated as an MDP, a $10 \times 10$ grid-world formulated as a POMDP, and an office-like service-robot scenario. Results demonstrate that the proposed framework consistently learns policies that satisfy strict time-bounded requirements under stochastic transitions, scales to larger state spaces, and remains effective in partially observable environments, highlighting its potential for reliable robotic planning in time-critical and uncertain settings.
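A minimal sketch of Q-learning on the product of environment state, automaton state, and a discretized clock; the toy automaton below stands in for a real Timed-LDGBA compiled from an MITL formula such as F[0,10] goal:

```python
import random
from collections import defaultdict

def auto_step(q, label, clock):
    """Placeholder timed automaton for F[0,10] goal: accept if the goal label
    is seen within 10 time units. A real Timed-LDGBA would be compiled from
    the MITL formula."""
    if q == 0 and label == "goal" and clock <= 10:
        return 1, True  # accepting transition
    return q, False

Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.99, 0.1  # placeholder hyperparameters

def q_learn_step(env_state, q_state, clock, actions, env_step, labeler):
    """One Q-learning update on the product state (env, automaton, clock).
    env_step and labeler are supplied by the environment."""
    s = (env_state, q_state, clock)
    a = random.choice(actions) if random.random() < eps else \
        max(actions, key=lambda b: Q[(s, b)])
    s_env = env_step(env_state, a)
    q_next, accepting = auto_step(q_state, labeler(s_env), clock + 1)
    r = 1.0 if accepting else 0.0  # reward only on accepting transitions
    s2 = (s_env, q_next, clock + 1)
    best = max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
    return s2
```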
☆ Dichotomous Diffusion Policy Optimization
Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large amount of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
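The inference rule, linearly combining the scores of the two dichotomous policies, might look like the following; this extrapolation form is our assumption, not the paper's exact formula:

```python
def combined_score(score_max, score_min, x_t, t, beta=0.5):
    """One plausible reading of the linear score combination (an assumption):
    extrapolate the reward-maximizing score away from the reward-minimizing
    one, with beta controlling the level of greediness. score_max/score_min
    are the two learned score functions; x_t, t are the diffusion state/step."""
    return (1 + beta) * score_max(x_t, t) - beta * score_min(x_t, t)
```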
♻ ☆ CLF-RL: Control Lyapunov Function Guided Reinforcement Learning
Reinforcement learning (RL) has shown promise in generating robust locomotion policies for bipedal robots, but often suffers from tedious reward design and sensitivity to poorly shaped objectives. In this work, we propose a structured reward shaping framework that leverages model-based trajectory generation and control Lyapunov functions (CLFs) to guide policy learning. We explore two model-based planners for generating reference trajectories: a reduced-order linear inverted pendulum (LIP) model for velocity-conditioned motion planning, and a precomputed gait library based on hybrid zero dynamics (HZD) using full-order dynamics. These planners define desired end-effector and joint trajectories, which are used to construct CLF-based rewards that penalize tracking error and encourage rapid convergence. This formulation provides meaningful intermediate rewards, and is straightforward to implement once a reference is available. Both the reference trajectories and CLF shaping are used only during training, resulting in a lightweight policy at deployment. We validate our method both in simulation and through extensive real-world experiments on a Unitree G1 robot. CLF-RL demonstrates significantly improved robustness relative to the baseline RL policy and better performance than a classic tracking reward RL formulation.
comment: 8 pages; 7 figures
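One plausible shape for a CLF-based reward of the kind described, penalizing both the Lyapunov value of the tracking error and violations of its decrease condition (all gains are placeholders):

```python
import numpy as np

def clf_reward(err, err_next, dt=0.02, gamma=2.0, w=1.0):
    """Illustrative CLF-shaped reward: V is a quadratic tracking-error
    Lyapunov candidate, and the penalty grows when V fails the decrease
    condition dV/dt + gamma * V <= 0. Not the paper's exact formulation."""
    V = float(err @ err)
    V_next = float(err_next @ err_next)
    vdot = (V_next - V) / dt
    return -w * V - max(0.0, vdot + gamma * V)

r = clf_reward(np.array([0.1, -0.2]), np.array([0.08, -0.15]))
```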
♻ ☆ Learning Spatial-Aware Manipulation Ordering NeurIPS 2025
Manipulation in cluttered environments is challenging due to spatial dependencies among objects, where an improper manipulation order can cause collisions or blocked access. Existing approaches often overlook these spatial relationships, limiting their flexibility and scalability. To address these limitations, we propose OrderMind, a unified spatial-aware manipulation ordering framework that directly learns object manipulation priorities based on spatial context. Our architecture integrates a spatial context encoder with a temporal priority structuring module. We construct a spatial graph using k-Nearest Neighbors to aggregate geometric information from the local layout and encode both object-object and object-manipulator interactions to support accurate manipulation ordering in real-time. To generate physically and semantically plausible supervision signals, we introduce a spatial prior labeling method that guides a vision-language model to produce reasonable manipulation orders for distillation. We evaluate OrderMind on our Manipulation Ordering Benchmark, comprising 163,222 samples of varying difficulty. Extensive experiments in both simulation and real-world environments demonstrate that our method significantly outperforms prior approaches in effectiveness and efficiency, enabling robust manipulation in cluttered scenes.
comment: Accepted to NeurIPS 2025
♻ ☆ Theory of Mind for Explainable Human-Robot Interaction
Within the context of human-robot interaction (HRI), Theory of Mind (ToM) is intended to serve as a user-friendly backend to the interface of robotic systems, enabling robots to infer and respond to human mental states. When integrated into robots, ToM allows them to adapt their internal models to users' behaviors, enhancing the interpretability and predictability of their actions. Similarly, Explainable Artificial Intelligence (XAI) aims to make AI systems transparent and interpretable, allowing humans to understand and interact with them effectively. Since ToM in HRI serves related purposes, we propose to consider ToM as a form of XAI and evaluate it through the eValuation XAI (VXAI) framework and its seven desiderata. This paper identifies a critical gap in the application of ToM within HRI, as existing methods rarely assess the extent to which explanations correspond to the robot's actual internal reasoning. To address this limitation, we propose to integrate ToM within XAI frameworks. By embedding ToM principles inside XAI, we argue for a shift in perspective, as current XAI research focuses predominantly on the AI system itself and often lacks user-centered explanations. Incorporating ToM would enable a change in focus, prioritizing the user's informational needs and perspective.
♻ ☆ Towards Data-Driven Metrics for Social Robot Navigation Benchmarking
This paper presents a joint effort towards the development of a data-driven Social Robot Navigation metric to facilitate benchmarking and policy optimization for ground robots. We compiled a dataset with 4427 trajectories -- 182 real and 4245 simulated -- and presented it to human raters, yielding a total of 4402 rated trajectories after data quality assurance. Notably, we provide the first all-encompassing learned social robot navigation metric, along with qualitative and quantitative results, including the test loss achieved, a comparison against hand-crafted metrics, and an ablation study. All data, software, and model weights are publicly available.
♻ ☆ AINav: Large Language Model-Based Adaptive Interactive Navigation
Robotic navigation in complex environments remains a critical research challenge. Traditional navigation methods focus on optimal trajectory generation within fixed free workspace, therefore struggling in environments lacking viable paths to the goal, such as disaster zones or cluttered warehouses. To address this problem, we propose AINav, an adaptive interactive navigation approach that proactively interacts with environments to create feasible paths to achieve originally unreachable goals. Specifically, we present a primitive skill tree for task planning with large language models (LLMs), facilitating effective reasoning to determine interaction objects and sequences. To ensure robust subtask execution, we adopt reinforcement learning to pre-train a comprehensive skill library containing versatile locomotion and interaction behaviors for motion planning. Furthermore, we introduce an adaptive replanning approach featuring two LLM-based modules: an advisor serving as a flexible replanning trigger and an arborist for autonomous plan adjustment. Integrated with the tree structure, the replanning mechanism allows for convenient node addition and pruning, enabling rapid plan adaptation in a priori unknown environments. Comprehensive simulations and experiments have demonstrated AINav's effectiveness and adaptivity in diverse scenarios. The supplementary video is available at: https://youtu.be/CjXm5KFx9AI.
comment: 13 pages, 12 figures, accepted to IEEE Robotics & Automation Magazine
♻ ☆ Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities
Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation.
♻ ☆ Passage-traversing optimal path planning with sampling-based algorithms
This paper introduces a new paradigm of optimal path planning, i.e., passage-traversing optimal path planning (PTOPP), that optimizes paths' traversed passages for specified optimization objectives. In particular, PTOPP is utilized to find the path with optimal accessible free space along its entire length, which represents a basic requirement for paths in robotics. As passages are places where free space shrinks and becomes constrained, the core idea is to leverage the path's passage traversal status to characterize its accessible free space comprehensively. To this end, a novel passage detection and free space decomposition method using proximity graphs is proposed, enabling fast detection of sparse but informative passages and environment decompositions. Based on this preprocessing, optimal path planning with accessible free space objectives or constraints is formulated as PTOPP problems compatible with sampling-based optimal planners. Then, sampling-based algorithms for PTOPP, including their dependent primitive procedures, are developed leveraging partitioned environments for fast passage traversal checks. All these methods are implemented and thoroughly tested for effectiveness and efficiency validation. Compared to existing approaches, such as clearance-based methods, PTOPP demonstrates significant advantages in configurability, solution optimality, and efficiency, addressing the limitations of prior approaches. It is believed to provide an efficient and versatile solution to accessible free space optimization beyond conventional avenues and, more generally, to a broad class of path planning problems that can be formulated as PTOPP.
comment: 27 pages, 20 figures, 4 tables, journal paper
♻ ☆ VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs
In most existing embodied navigation tasks, instructions are well-defined and unambiguous, such as instruction following and object searching. Under this idealized setting, agents are required solely to produce effective navigation outputs conditioned on vision and language inputs. However, real-world navigation instructions are often vague and ambiguous, requiring the agent to resolve uncertainty and infer user intent through active dialog. To address this gap, we propose Interactive Instance Object Navigation (IION), a task that requires agents not only to generate navigation actions but also to produce language outputs via active dialog, thereby aligning more closely with practical settings. IION extends Instance Object Navigation (ION) by allowing agents to freely consult an oracle in natural language while navigating. Building on this task, we present the Vision Language-Language Navigation (VL-LN) benchmark, which provides a large-scale, automatically generated dataset and a comprehensive evaluation protocol for training and assessing dialog-enabled navigation models. VL-LN comprises over 41k long-horizon dialog-augmented trajectories for training and an automatic evaluation protocol with an oracle capable of responding to agent queries. Using this benchmark, we train a navigation model equipped with dialog capabilities and show that it achieves significant improvements over the baselines. Extensive experiments and analyses further demonstrate the effectiveness and reliability of VL-LN for advancing research on dialog-enabled embodied navigation. Code and dataset: https://0309hws.github.io/VL-LN.github.io/
♻ ☆ Large Language Model-Driven Closed-Loop UAV Operation with Semantic Observations
Recent advances in large Language Models (LLMs) have revolutionized mobile robots, including unmanned aerial vehicles (UAVs), enabling their intelligent operation within Internet of Things (IoT) ecosystems. However, LLMs still face challenges from logical reasoning and complex decision-making, leading to concerns about the reliability of LLM-driven UAV operations in IoT applications. In this paper, we propose a closed-loop LLM-driven UAV operation code generation framework that enables reliable UAV operations powered by effective feedback and refinement using two LLM modules, i.e., a Code Generator and an Evaluator. Our framework transforms numerical state observations from UAV operations into semantic trajectory descriptions to enhance the evaluator LLM's understanding of UAV dynamics for precise feedback generation. Our framework also enables a simulation-based refinement process, and hence eliminates the risks to physical UAVs caused by incorrect code execution during the refinement. Extensive experiments on UAV control tasks with different complexities are conducted. The experimental results show that our framework can achieve reliable UAV operations using LLMs, which significantly outperforms baseline methods in terms of success rate and completeness with the increase of task complexity.
comment: 13 pages, 10 figures
♻ ☆ Dynamic Gap: Safe Gap-based Navigation in Dynamic Environments ICRA
This paper extends the family of gap-based local planners to unknown dynamic environments through generating provable collision-free properties for hierarchical navigation systems. Existing perception-informed local planners that operate in dynamic environments rely on emergent or empirical robustness for collision avoidance as opposed to performing formal analysis of dynamic obstacles. In addition, the obstacle tracking performed in these existing planners is often achieved with respect to a global inertial frame, subjecting such tracking estimates to transformation errors from odometry drift. The proposed local planner, dynamic gap, shifts the tracking paradigm to modeling how the free space, represented as gaps, evolves over time. Gap crossing and closing conditions are developed to aid in determining the feasibility of passage through gaps, and a breadth of simulation benchmarking is performed against other navigation planners in the literature, where the proposed dynamic gap planner achieves the highest success rate of all planners tested in all environments.
comment: Accepted to 2025 International Conference on Robotics and Automation (ICRA)
♻ ☆ Hybrid A* Path Planning with Multi-Modal Motion Extension for Four-Wheel Steering Mobile Robots
Four-wheel independent steering (4WIS) systems provide mobile robots with a rich set of motion modes, such as Ackermann steering, lateral steering, and parallel movement, offering superior maneuverability in constrained environments. However, existing path planning methods generally assume a single kinematic model and thus fail to fully exploit the multi-modal capabilities of 4WIS platforms. To address this limitation, we propose an extended Hybrid A* framework that operates in a four-dimensional state space incorporating both spatial states and motion modes. Within this framework, we design multi-modal Reeds-Shepp curves tailored to the distinct kinematic constraints of each motion mode, develop an enhanced heuristic function that accounts for mode-switching costs, and introduce a terminal connection strategy with intelligent mode selection to ensure smooth transitions between different steering patterns. The proposed planner enables seamless integration of multiple motion modalities within a single path, significantly improving flexibility and adaptability. Experimental results demonstrate substantially improved planning performance for 4WIS robots in complex environments.
comment: Updated method details, parameters, and experimental scenarios
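To make the 4D search concrete, here is a minimal Python sketch of A* over an (x, y, theta, mode) state with a mode-switching penalty, the core idea of the extended framework. The `successors` and `heuristic` callables, the mode names, and the cost values are illustrative placeholders, not the paper's implementation.

```python
import heapq
import math

# Hypothetical motion modes for a 4WIS platform and per-switch penalty.
MODES = ("ackermann", "lateral", "parallel")
SWITCH_COST = 2.0  # assumed cost added whenever the motion mode changes

def hybrid_a_star_4d(start, goal, successors, heuristic):
    """Sketch of A* over a 4D state (x, y, theta, mode).

    `successors(state)` yields (next_state, step_cost) pairs under the
    kinematic constraints of next_state's mode; `heuristic` should also
    account for mode-switching cost, as the paper describes.
    """
    open_set = [(heuristic(start, goal), 0.0, start)]
    best_g = {start: 0.0}
    while open_set:
        _, g, state = heapq.heappop(open_set)
        if math.hypot(state[0] - goal[0], state[1] - goal[1]) < 0.5:
            return g  # reached goal region (analytic expansion omitted)
        for nxt, step in successors(state):
            # Penalize transitions between motion modes.
            cost = g + step + (SWITCH_COST if nxt[3] != state[3] else 0.0)
            if cost < best_g.get(nxt, float("inf")):
                best_g[nxt] = cost
                heapq.heappush(open_set, (cost + heuristic(nxt, goal), cost, nxt))
    return None
```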
Robotics 42
What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.
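For readers unfamiliar with planning in a learned representation space, the following is a minimal sketch of cross-entropy-method (CEM) planning inside a JEPA-style world model: action sequences are optimized so that predicted latents approach an encoded goal. The `encode` and `world_model` callables and all hyperparameters are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def plan_in_latent_space(encode, world_model, obs, z_goal, horizon=10,
                         pop=256, elites=32, iters=5, act_dim=4):
    """Minimal CEM planner in a world model's latent space.

    `encode(obs)` maps an observation to a latent z; `world_model(z, a)`
    predicts the next latent. Plans are scored by distance to z_goal.
    """
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    z0 = encode(obs)
    for _ in range(iters):
        actions = mu + sigma * np.random.randn(pop, horizon, act_dim)
        scores = np.empty(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = world_model(z, actions[i, t])  # latent rollout
            scores[i] = -np.linalg.norm(z - z_goal)  # closer is better
        elite = actions[np.argsort(scores)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # execute the first action of the refined plan
```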
☆ Energy-Aware Bayesian Control Barrier Functions for Physics-Informed Gaussian Process Dynamics
We study safe control for dynamical systems whose continuous-time dynamics are learned with Gaussian processes (GPs), focusing on mechanical and port-Hamiltonian systems where safety is naturally expressed via energy constraints. The availability of a GP Hamiltonian posterior naturally raises the question of how to systematically exploit this structure to design an energy-aware control barrier function with high-probability safety guarantees. We address this problem by developing a Bayesian-CBF framework and instantiating it with energy-aware Bayesian-CBFs (EB-CBFs) that construct conservative energy-based barriers directly from the Hamiltonian and vector-field posteriors, yielding safety filters that minimally modify a nominal controller while providing probabilistic energy safety guarantees. Numerical simulations on a mass-spring system demonstrate that the proposed EB-CBFs achieve high-probability safety under noisy sampled GP-learned dynamics.
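A minimal sketch of the idea under assumed notation: build a conservative barrier from the GP posterior over the Hamiltonian, then minimally correct the nominal control whenever the barrier condition is violated (the single-constraint QP has a closed form). The confidence scaling `beta` and the Lie-derivative inputs below are placeholders, not the paper's exact construction.

```python
import numpy as np

def energy_barrier(mu_H, sigma_H, E_max, beta=2.0):
    """Conservative energy CBF from a GP Hamiltonian posterior.

    h(x) = E_max - (mu_H(x) + beta * sigma_H(x)): safety holds with high
    probability if the upper confidence bound on the energy stays below
    E_max. beta trades conservatism against permissiveness.
    """
    return lambda x: E_max - (mu_H(x) + beta * sigma_H(x))

def safety_filter(u_nom, Lfh, Lgh, h_x, alpha=1.0):
    """Minimally modify u_nom so that Lfh + Lgh @ u + alpha * h(x) >= 0."""
    residual = Lfh + Lgh @ u_nom + alpha * h_x
    if residual >= 0:
        return u_nom  # nominal control already satisfies the barrier
    # Closed-form correction along Lgh (single-constraint QP solution).
    return u_nom - residual * Lgh / (Lgh @ Lgh + 1e-12)
```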
☆ Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models
The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., a diver-down flag means people are in the water; a nearby fire means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (majority-of-three model voting), short-horizon risk relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning.
comment: 17 pages without bibliography or appendix. The main paper has 16 figures
☆ Subsecond 3D Mesh Generation for Robot Manipulation
3D meshes are a fundamental representation widely used in computer science and engineering. In robotics, they are particularly valuable because they capture objects in a form that aligns directly with how robots interact with the physical world, enabling core capabilities such as predicting stable grasps, detecting collisions, and simulating dynamics. Although automatic 3D mesh generation methods have shown promising progress in recent years, potentially offering a path toward real-time robot perception, two critical challenges remain. First, generating high-fidelity meshes is prohibitively slow for real-time use, often requiring tens of seconds per object. Second, mesh generation by itself is insufficient. In robotics, a mesh must be contextually grounded, i.e., correctly segmented from the scene and registered with the proper scale and pose. Additionally, unless these contextual grounding steps remain efficient, they simply introduce new bottlenecks. In this work, we introduce an end-to-end system that addresses these challenges, producing a high-quality, contextually grounded 3D mesh from a single RGB-D image in under one second. Our pipeline integrates open-vocabulary object segmentation, accelerated diffusion-based mesh generation, and robust point cloud registration, each optimized for both speed and accuracy. We demonstrate its effectiveness in a real-world manipulation task, showing that it enables meshes to be used as a practical, on-demand representation for robotics perception and planning.
comment: In submission
☆ Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning
Recent reasoning-augmented Vision-Language-Action (VLA) models have improved the interpretability of end-to-end autonomous driving by generating intermediate reasoning traces. Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. CF-VLA first generates time-segmented meta-actions that summarize driving intent, and then performs counterfactual reasoning conditioned on both the meta-actions and the visual context. This step simulates potential outcomes, identifies unsafe behaviors, and outputs corrected meta-actions that guide the final trajectory generation. To efficiently obtain such self-reflective capabilities, we propose a rollout-filter-label pipeline that mines high-value scenes from a base (non-counterfactual) VLA's rollouts and labels counterfactual reasoning traces for subsequent training rounds. Experiments on large-scale driving datasets show that CF-VLA improves trajectory accuracy by up to 17.6%, enhances safety metrics by 20.5%, and exhibits adaptive thinking: it only enables counterfactual reasoning in challenging scenarios. By transforming reasoning traces from one-shot descriptions to causal self-correction signals, CF-VLA takes a step toward self-reflective autonomous driving agents that learn to think before they act.
☆ Fast and Realistic Automated Scenario Simulations and Reporting for an Autonomous Racing Stack
In this paper, we describe the automated simulation and reporting pipeline implemented for our autonomous racing stack, ur.autopilot. The backbone of the simulation is a high-fidelity model of the vehicle interfaced as a Functional Mockup Unit (FMU). The pipeline can execute the software stack and the simulation up to three times faster than real time, locally or on GitHub for Continuous Integration/Continuous Delivery (CI/CD). The most important input to the pipeline is a set of running scenarios. Each scenario allows the initialization of the ego vehicle in different initial conditions (position and speed), as well as the initialization of any other configuration of the stack. This functionality is essential for efficiently validating critical modules, such as those responsible for high-speed overtaking maneuvers or localization, which are among the most challenging aspects of autonomous racing. Moreover, we describe how we implemented a fault injection module, capable of introducing sensor delays and perturbations as well as modifying the outputs of any node of the stack. Finally, we describe the design of our automated reporting process, aimed at maximizing the effectiveness of the simulation analysis.
comment: Accepted to the 2026 IEEE/SICE International Symposium on System Integration (SII 2026)
☆ Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems
The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms, ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.
comment: Preprint; 38 pages, 7 figures, 9 tables; GitHub at https://github.com/worldbench/awesome-spatial-intelligence
☆ Geometric Multi-Session Map Merging with Learned Local Descriptors
Multi-session map merging is crucial for extended autonomous operations in large-scale environments. In this paper, we present GMLD, a learning-based local descriptor framework for large-scale multi-session point cloud map merging that systematically aligns maps collected across different sessions with overlapping regions. The proposed framework employs a keypoint-aware encoder and a plane-based geometric transformer to extract discriminative features for loop closure detection and relative pose estimation. To further improve global consistency, we include inter-session scan matching cost factors in the factor-graph optimization stage. We evaluate our framework on public datasets as well as on self-collected data from diverse environments. The results show accurate and robust map merging with low error, and the learned features deliver strong performance in both loop closure detection and relative pose estimation.
☆ New Insights into Cascaded Geometric Flight Control: From Performance Guarantees to Practical Pitfalls
We present a new stability proof for cascaded geometric control used by aerial vehicles tracking time-varying position trajectories. Our approach uses sliding variables and a recently proposed quaternion-based sliding controller to demonstrate that exponentially convergent position trajectory tracking is theoretically possible. Notably, our analysis reveals new aspects of the control strategy, including how tracking error in the attitude loop influences the position loop, how model uncertainties affect the closed-loop system, and the practical pitfalls of the control architecture.
comment: V1
☆ 3D Path-Following Guidance via Nonlinear Model Predictive Control for Fixed-Wing Small UAS
This paper presents the design, implementation, and flight test results of two novel 3D path-following guidance algorithms based on nonlinear model predictive control (MPC), with specific application to fixed-wing small uncrewed aircraft systems. To enable MPC, control-augmented modelling and system identification of the RAAVEN small uncrewed aircraft is presented. Two formulations of MPC are then showcased. The first schedules a static reference path rate over the MPC horizon, incentivizing a constant inertial speed. The second, with inspiration from model predictive contouring control, dynamically optimizes for the reference path rate over the controller horizon as the system operates. This allows for a weighted tradeoff between path progression and distance from path, two competing objectives in path-following guidance. Both controllers are formulated to operate over general smooth 3D arc-length parameterized curves. The MPC guidance algorithms are flown over several high-curvature test paths, with comparison to a baseline lookahead guidance law. The results showcase the real-world feasibility and superior performance of nonlinear MPC for 3D path-following guidance at ground speeds up to 36 meters per second.
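The contouring-style tradeoff in the second formulation can be sketched as a stage cost that penalizes lateral (contouring) and longitudinal (lag) path errors while rewarding progress along the path. The `path`/`path.tangent` interface and the weights below are hypothetical, included only to illustrate the competing objectives, not the paper's controller.

```python
import numpy as np

def contouring_stage_cost(p, s, s_dot, path,
                          w_contour=10.0, w_lag=10.0, w_progress=1.0):
    """Sketch of a model predictive contouring control (MPCC) stage cost.

    p is the vehicle position, s the path-progress variable, and
    path(s) / path.tangent(s) an arc-length parameterized curve with a
    unit tangent (assumed interface). The cost trades distance from the
    path against progress along it.
    """
    err = p - path(s)
    t_hat = path.tangent(s)              # unit tangent at progress s
    e_lag = t_hat @ err                  # error along the path direction
    e_contour = err - e_lag * t_hat      # error orthogonal to the path
    return (w_contour * (e_contour @ e_contour)
            + w_lag * e_lag**2
            - w_progress * s_dot)        # reward path progression
```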
☆ UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots
A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.
comment: Project page: https://jnnan.github.io/uniact/
☆ World In Your Hands: A Large-Scale and Open-source Ecosystem for Learning Human-centric Manipulation in the Wild
Large-scale pre-training is fundamental for generalization in language and vision models, but data for dexterous hand manipulation remains limited in scale and diversity, hindering policy generalization. Limited scenario diversity, misaligned modalities, and insufficient benchmarking constrain current human manipulation datasets. To address these gaps, we introduce World In Your Hands (WiYH), a large-scale open-source ecosystem for human-centric manipulation learning. WiYH includes (1) the Oracle Suite, a wearable data collection kit with an auto-labeling pipeline for accurate motion capture; (2) the WiYH Dataset, featuring over 1,000 hours of multi-modal manipulation data across hundreds of skills in diverse real-world scenarios; and (3) extensive annotations and benchmarks supporting tasks from perception to action. Furthermore, experiments based on the WiYH ecosystem show that integrating WiYH's human-centric data significantly enhances the generalization and robustness of dexterous hand policies in tabletop manipulation tasks. We believe that World In Your Hands will bring new insights into human-centric data collection and policy learning to the community.
☆ Real-world Reinforcement Learning from Suboptimal Interventions
Real-world reinforcement learning (RL) offers a promising approach to training precise and dexterous robotic manipulation policies in an online manner, enabling robots to learn from their own experience while gradually reducing human labor. However, prior real-world RL methods often assume that human interventions are optimal across the entire state space, overlooking the fact that even expert operators cannot consistently provide optimal actions in all states or completely avoid mistakes. Indiscriminately mixing intervention data with robot-collected data inherits the sample inefficiency of RL, while purely imitating intervention data can ultimately degrade the final performance achievable by RL. The question of how to leverage potentially suboptimal and noisy human interventions to accelerate learning without being constrained by them thus remains open. To address this challenge, we propose SiLRI, a state-wise Lagrangian reinforcement learning algorithm for real-world robot manipulation tasks. Specifically, we formulate the online manipulation problem as a constrained RL optimization, where the constraint bound at each state is determined by the uncertainty of human interventions. We then introduce a state-wise Lagrange multiplier and solve the problem via a min-max optimization, jointly optimizing the policy and the Lagrange multiplier to reach a saddle point. Built upon a human-as-copilot teleoperation system, our algorithm is evaluated through real-world experiments on diverse manipulation tasks. Experimental results show that SiLRI effectively exploits human suboptimal interventions, reducing the time required to reach a 90% success rate by at least 50% compared with the state-of-the-art RL method HIL-SERL, and achieving a 100% success rate on long-horizon manipulation tasks where other RL methods struggle to succeed. Project website: https://silri-rl.github.io/.
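A generic sketch of the saddle-point machinery (not SiLRI's exact update): the state-wise multiplier grows by dual ascent where the constraint is violated and enters the policy loss as a penalty, so the policy is pulled toward interventions only where they are trusted. Names and the learning rate are assumptions.

```python
def dual_ascent_update(lmbda, violation, bound, lr=0.05):
    """One dual-ascent step for a state-wise Lagrange multiplier.

    `violation` is the measured constraint value at a state (e.g., how far
    the policy deviates from the human intervention there); `bound` is the
    state-wise threshold derived from intervention uncertainty.
    """
    # Multiplier grows where the constraint is violated, shrinks with slack,
    # and is clipped at zero as dual feasibility requires.
    return max(0.0, lmbda + lr * (violation - bound))

def lagrangian_policy_loss(rl_loss, lmbda, violation, bound):
    """Policy side of the min-max problem: RL objective plus penalty term."""
    return rl_loss + lmbda * (violation - bound)
```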
☆ DRL-TH: Jointly Utilizing Temporal Graph Attention and Hierarchical Fusion for UGV Navigation in Crowded Environments
Deep reinforcement learning (DRL) methods have demonstrated potential for autonomous navigation and obstacle avoidance of unmanned ground vehicles (UGVs) in crowded environments. Most existing approaches rely on single-frame observation and employ simple concatenation for multi-modal fusion, which limits their ability to capture temporal context and hinders dynamic adaptability. To address these challenges, we propose a DRL-based navigation framework, DRL-TH, which leverages temporal graph attention and hierarchical graph pooling to integrate historical observations and adaptively fuse multi-modal information. Specifically, we introduce a temporal-guided graph attention network (TG-GAT) that incorporates temporal weights into attention scores to capture correlations between consecutive frames, thereby enabling the implicit estimation of scene evolution. In addition, we design a graph hierarchical abstraction module (GHAM) that applies hierarchical pooling and learnable weighted fusion to dynamically integrate RGB and LiDAR features, achieving balanced representation across multiple scales. Extensive experiments demonstrate that DRL-TH outperforms existing methods in various crowded environments. We also deployed the DRL-TH control policy on a real UGV and showed that it performs well in real-world scenarios.
☆ Safe Sliding Mode Control for Marine Vessels Using High-Order Control Barrier Functions and Fast Projection
This paper presents a novel safe control framework that integrates Sliding Mode Control (SMC), High-Order Control Barrier Functions (HOCBFs) with state-dependent adaptiveness, and a lightweight projection for collision-free navigation of an over-actuated 3-DOF marine surface vessel subjected to strong environmental disturbances (wind, waves, and current). SMC provides robustness to matched disturbances common in marine operations, while HOCBFs enforce forward invariance of obstacle-avoidance constraints. A fast half-space projection method adjusts the SMC control only when needed, preserving robustness and minimizing chattering. The approach is evaluated on a nonlinear marine platform model that includes added mass, hydrodynamic damping, and full thruster allocation. Simulation results show robust navigation, guaranteed obstacle avoidance, and computational efficiency suitable for real-time embedded use. For small marine robots and surface vessels with limited onboard computational resources, where execution speed and computational efficiency are critical, the SMC-HOCBF framework constitutes a strong candidate for safety-critical control.
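The fast projection admits a closed form when a single half-space constraint is active; a short sketch under generic notation (not the paper's symbols):

```python
import numpy as np

def halfspace_project(u_nom, a, b):
    """Project u_nom onto the half-space {u : a @ u <= b}.

    Closed-form and solver-free, so the SMC command is only adjusted when
    the HOCBF constraint is active, which keeps both chattering and
    computation low.
    """
    slack = a @ u_nom - b
    if slack <= 0.0:
        return u_nom                       # constraint inactive: keep SMC output
    return u_nom - slack * a / (a @ a)     # minimal correction onto the boundary
```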
☆ Local Path Optimization in The Latent Space Using Learned Distance Gradient IROS 2025
Constrained motion planning is a common but challenging problem in robotic manipulation. In recent years, data-driven constrained motion planning algorithms have shown impressive planning speed and success rates. Among them, the latent motion method based on manifold approximation is the most efficient planning algorithm. Due to errors in manifold approximation and the difficulty of accurately identifying collision conflicts within the latent space, time-consuming path validity checks and path replanning are required. In this paper, we propose a method that trains a neural network to predict the minimum distance between the robot and obstacles using latent vectors as inputs. The learned distance gradient is then used to compute the direction of movement in the latent space that moves the robot away from obstacles. Based on this, a local path optimization algorithm in the latent space is proposed and integrated with the path validity checking process to reduce replanning time. The proposed method is compared with state-of-the-art algorithms in multiple planning scenarios, demonstrating the fastest planning speed.
comment: This paper has been published in IROS 2025
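As a sketch of the local optimization step (assuming a learned `dist_net` that maps a latent vector to a scalar minimum robot-obstacle distance), a few gradient-ascent steps on the predicted distance push a latent waypoint away from obstacles. This illustrates the idea, not the paper's exact procedure.

```python
import torch

def push_from_obstacles(z, dist_net, step=0.05, n_steps=5):
    """Move a latent waypoint away from obstacles via a learned distance."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        d = dist_net(z).sum()              # scalar predicted clearance
        g, = torch.autograd.grad(d, z)     # learned distance gradient
        with torch.no_grad():
            z += step * g / (g.norm() + 1e-9)  # ascend the distance field
        z.requires_grad_(True)
    return z.detach()
```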
☆ Heteroscedastic Bayesian Optimization-Based Dynamic PID Tuning for Accurate and Robust UAV Trajectory Tracking IROS 2025
Unmanned Aerial Vehicles (UAVs) play an important role in various applications, where precise trajectory tracking is crucial. However, conventional control algorithms for trajectory tracking often exhibit limited performance due to the underactuated, nonlinear, and highly coupled dynamics of quadrotor systems. To address these challenges, we propose HBO-PID, a novel control algorithm that integrates the Heteroscedastic Bayesian Optimization (HBO) framework with the classical PID controller to achieve accurate and robust trajectory tracking. By explicitly modeling input-dependent noise variance, the proposed method can better adapt to dynamic and complex environments, and therefore improve the accuracy and robustness of trajectory tracking. To accelerate the convergence of optimization, we adopt a two-stage optimization strategy that allows us to more efficiently find the optimal controller parameters. Through experiments in both simulation and real-world scenarios, we demonstrate that the proposed method significantly outperforms state-of-the-art (SOTA) methods, improving position accuracy by 24.7% to 42.9% and angular accuracy by 40.9% to 78.4%.
comment: Accepted by IROS 2025 (2025 IEEE/RSJ International Conference on Intelligent Robots and Systems)
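To situate the method, a vanilla GP-based Bayesian optimization loop over PID gains is sketched below with scikit-optimize; the paper's contributions, heteroscedastic noise modeling and the two-stage schedule, are not reproduced here, and the objective is a synthetic stand-in for a real tracking-error rollout.

```python
from skopt import gp_minimize

def tracking_error(gains):
    """Objective: run one tracking episode with these PID gains and return
    an error metric. Synthetic quadratic stand-in; replace with a rollout
    in simulation or on the quadrotor."""
    kp, ki, kd = gains
    return (kp - 4.0) ** 2 + (ki - 0.5) ** 2 + (kd - 0.2) ** 2

# Standard (homoscedastic) GP Bayesian optimization over the PID gains.
result = gp_minimize(
    tracking_error,
    dimensions=[(0.1, 20.0), (0.0, 5.0), (0.0, 2.0)],  # kp, ki, kd bounds
    n_calls=30,
    random_state=0,
)
print("best gains:", result.x, "best error:", result.fun)
```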
☆ RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation
Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) a lack of in-context learning (ICL) capability, making it difficult to quickly adapt to new environments, for example by leveraging short videos. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability. By simply observing a short video of a new environment, the system can significantly improve task efficiency without requiring architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior ICL adaptability, with no prior 3D mapping of the environment required.
☆ GR-Dexter Technical Report
Vision-language-action (VLA) models have enabled language-conditioned, long-horizon robot manipulation, but most existing systems are limited to grippers. Scaling VLA policies to bimanual robots with high degree-of-freedom (DoF) dexterous hands remains challenging due to the expanded action space, frequent hand-object occlusions, and the cost of collecting real-robot data. We present GR-Dexter, a holistic hardware-model-data framework for VLA-based generalist manipulation on a bimanual dexterous-hand robot. Our approach combines the design of a compact 21-DoF robotic hand, an intuitive bimanual teleoperation system for real-robot data collection, and a training recipe that leverages teleoperated robot trajectories together with large-scale vision-language and carefully curated cross-embodiment datasets. Across real-world evaluations spanning long-horizon everyday manipulation and generalizable pick-and-place, GR-Dexter achieves strong in-domain performance and improved robustness to unseen objects and unseen instructions. We hope GR-Dexter serves as a practical step toward generalist dexterous-hand robotic manipulation.
☆ ROBOPOL: Social Robotics Meets Vehicular Communications for Cooperative Automated Driving
On the way towards full autonomy, sharing roads between automated vehicles and human actors in so-called mixed traffic is unavoidable. Moreover, even if all vehicles on the road were autonomous, pedestrians would still be crossing the streets. We propose social robots as moderators between autonomous vehicles and vulnerable road users (VRUs). To this end, we identify four enablers requiring integration: (1) advanced perception, allowing the robot to see the environment; (2) vehicular communications, allowing connected vehicles to share intentions and the robot to send guiding commands; (3) social human-robot interaction, allowing the robot to effectively communicate with VRUs and drivers; (4) formal specification, allowing the robot to understand traffic and plan accordingly. This paper presents an overview of the key enablers and reports on a first proof-of-concept integration of the first three, envisioning a social robot advising pedestrians in scenarios with a cooperative automated e-bike.
☆ Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training
General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation.
☆ RflyUT-Sim: A Simulation Platform for Development and Testing of Complex Low-Altitude Traffic Control
Significant challenges are posed by simulation and testing in the field of low-altitude unmanned aerial vehicle (UAV) traffic due to the high costs associated with large-scale UAV testing and the complexity of establishing low-altitude traffic test scenarios. Stringent safety requirements make high fidelity one of the key metrics for simulation platforms. Despite advancements in simulation platforms for low-altitude UAVs, there is still a shortage of platforms that feature rich traffic scenarios, high-precision UAV and scenario simulators, and comprehensive testing capabilities for low-altitude traffic. Therefore, this paper introduces an integrated high-fidelity simulation platform for low-altitude UAV traffic. This platform simulates all components of the UAV traffic network, including the control system, the traffic management system, the UAV system, the communication network, the anomaly and fault modules, etc. Furthermore, it integrates RflySim/AirSim and Unreal Engine 5 to develop full-state models of UAVs and 3D maps that model the real world using the oblique photogrammetry technique. Additionally, the platform offers a wide range of interfaces, and all models and scenarios can be customized with a high degree of flexibility. The platform's source code has been released, making it easier to conduct research related to low-altitude traffic.
☆ Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks
Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.
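The Jacobian-vector product at the heart of the guidance mechanism is simply a directional derivative of the denoiser; a minimal sketch with torch.func is below. How the resulting tangent is folded back into the sampling update is specific to the paper and not reproduced here; the denoiser signature is an assumption.

```python
import torch
from torch.func import jvp

def denoiser_directional_derivative(denoiser, x_t, t, direction):
    """JVP of a diffusion denoiser at x_t along an adversarial gradient.

    The JVP measures how the model's prediction moves along `direction`,
    which lets guidance favor adversarial updates the pre-trained model
    responds to smoothly, keeping the object near its learned manifold.
    """
    f = lambda x: denoiser(x, t)            # denoiser assumed (x, t) -> eps
    _, tangent_out = jvp(f, (x_t,), (direction,))
    return tangent_out
```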
☆ Evaluation of Impression Difference of a Domestic Mobile Manipulator with Autonomous and/or Remote Control in Fetch-and-Carry Tasks
A single service robot can present two distinct agencies: its onboard autonomy and an operator-mediated agency, yet users experience them through one physical body. We formalize this dual-agency structure as a User-Robot-Operator triad in an autonomous remote-control setting that combines autonomous execution with remote human support. Prior to the recent surge of language-based and multimodal interfaces, we developed and evaluated an early-stage prototype in 2020 that combined natural-language text chat with freehand sketch annotations over the robot's live camera view to support remote intervention. We evaluated three modes (autonomous, remote, and hybrid) in controlled fetch-and-carry tasks using a domestic mobile manipulator (HSR) on a World Robot Summit 2020 rule-compliant test field. The results show systematic mode-dependent differences in user-rated affinity and additional insights on perceived security, indicating that switching or blending agency within one robot measurably shapes human impressions. These findings provide empirical guidance for designing human-in-the-loop mobile manipulation in domestic physical tasks.
comment: Published in Advanced Robotics (2020). This paper defines an autonomous remote-control setting that makes the User-Robot-Operator triadic interaction explicit in physical tasks, and reports empirical differences in affinity and perceived security across autonomous, remote, and hybrid modes. Please cite: Advanced Robotics 34(20):1291-1308, 2020. DOI:10.1080/01691864.2020.1780152
☆ SHIELD: Spherical-Projection Hybrid-Frontier Integration for Efficient LiDAR-based Drone Exploration
This paper introduces SHIELD, a Spherical-Projection Hybrid-Frontier Integration method for Efficient LiDAR-based Drone exploration. Although laser LiDAR offers the advantage of a wide field of view, its application in UAV exploration still faces several challenges. The observation quality of LiDAR point clouds is generally inferior to that of depth cameras. Traditional frontier methods based on known and unknown regions impose a heavy computational burden, especially when handling the wide field of view of LiDAR. In addition, regions without point-cloud returns are difficult to classify as free space through raycasting. To address these problems, SHIELD maintains an observation-quality occupancy map and performs ray-casting on this map to handle inconsistent point-cloud quality during exploration. A hybrid frontier method tackles both the computational burden and the limitations imposed by point-cloud quality, and an outward spherical-projection ray-casting strategy jointly ensures flight safety and exploration efficiency in open areas. Simulations and flight experiments prove the effectiveness of SHIELD. This work will be open-sourced to contribute to the research community.
♻ ☆ Cross-embodied Co-design for Dexterous Hands
Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website: https://an-axolotl.github.io/co-design-for-dexterity.github.io/
♻ ☆ A New Software Tool for Generating and Visualizing Robot Self-Collision Matrices
In robotics, it is common to check whether a given robot state results in self-intersection (i.e., a self-collision query) or to assess its distance from such an intersection (i.e., a self-proximity query). These checks are typically performed between pairs of shapes attached to different robot links. However, many of these shape pairs can be excluded in advance, as their configurations are known to always or never result in contact. This information is typically encoded in a self-collision matrix, where each entry (i, j) indicates whether a check should be performed between shape i and shape j. While the MoveIt Setup Assistant is widely used to generate such matrices, current tools are limited by static visualization, lack of proximity support, rigid single-geometry assumptions, and tedious refinement workflows, hindering flexibility and reuse in downstream robotics applications. In this work, we introduce an interactive tool that overcomes these limitations by generating and visualizing self-collision matrices across multiple shape representations, enabling dynamic inspection, filtering, and refinement of shape pairs. Outputs are provided in both JSON and YAML for easy integration. The system is implemented in Rust and uses the Bevy game engine to deliver high-quality visualizations. We demonstrate its effectiveness on multiple robot platforms, showing that matrices generated using diverse shape types yield faster and more accurate self-collision and self-proximity queries.
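For illustration, here is a hypothetical excerpt of such a matrix in the JSON output format, with a helper that extracts the pairs still requiring runtime queries; the field names are assumptions, not the tool's actual schema.

```python
import json

# Hypothetical excerpt of a generated self-collision matrix: for each shape
# pair, whether a runtime check is needed ("check") or the pair was filtered
# out because it always or never results in contact.
matrix = json.loads("""
{
  "pairs": [
    {"i": "link1_mesh", "j": "link3_mesh", "status": "check"},
    {"i": "link1_mesh", "j": "link2_mesh", "status": "never"},
    {"i": "gripper_l",  "j": "gripper_r",  "status": "check"}
  ]
}
""")

def pairs_to_check(matrix):
    """Return only the shape pairs that need a runtime collision query."""
    return [(p["i"], p["j"]) for p in matrix["pairs"] if p["status"] == "check"]

print(pairs_to_check(matrix))  # [('link1_mesh', 'link3_mesh'), ('gripper_l', 'gripper_r')]
```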
♻ ☆ Beyond URDF: The Universal Robot Description Directory for Shared, Extensible, and Standardized Robot Models
Robots are typically described in software by specification files (e.g., URDF, SDF, MJCF, USD) that encode only basic kinematic, dynamic, and geometric information. As a result, downstream applications such as simulation, planning, and control must repeatedly re-derive richer data, leading to redundant computations, fragmented implementations, and limited standardization. In this work, we introduce the Universal Robot Description Directory (URDD), a modular representation that organizes derived robot information into structured, easy-to-parse JSON and YAML modules. Our open-source toolkit automatically generates URDDs from URDFs, with a Rust implementation supporting Bevy-based visualization. Additionally, we provide a JavaScript/Three.js viewer for web-based inspection of URDDs. Experiments on multiple robot platforms show that URDDs can be generated efficiently, encapsulate substantially richer information than standard specification files, and directly enable the construction of core robotics subroutines. URDD provides a unified, extensible resource for reducing redundancy and establishing shared standards across robotics frameworks. We conclude with a discussion on the limitations and implications of our work.
♻ ☆ APOLLO Blender: A Robotics Library for Visualization and Animation in Blender
High-quality visualizations are an essential part of robotics research, enabling clear communication of results through figures, animations, and demonstration videos. While Blender is a powerful and freely available 3D graphics platform, its steep learning curve and lack of robotics-focused integrations make it difficult and time-consuming for researchers to use effectively. In this work, we introduce a lightweight software library that bridges this gap by providing simple scripting interfaces for common robotics visualization tasks. The library offers three primary capabilities: (1) importing robots and environments directly from standardized descriptions such as URDF; (2) Python-based scripting tools for keyframing robot states and visual attributes; and (3) convenient generation of primitive 3D shapes for schematic figures and animations. Together, these features allow robotics researchers to rapidly create publication-ready images, animations, and explanatory schematics without needing extensive Blender expertise. We demonstrate the library through a series of proof-of-concept examples and conclude with a discussion of current limitations and opportunities for future extensions.
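The library's own scripting interface is not reproduced here, but the underlying Blender mechanism it wraps is standard `bpy` keyframing; a minimal example run inside Blender (object and joint names hypothetical):

```python
import bpy  # Blender's Python API; only available inside Blender

# Generic keyframing of a joint angle over time; a robotics library can wrap
# patterns like this behind URDF import and per-joint state-keyframe calls.
arm = bpy.data.objects["robot_link_2"]          # hypothetical link object
for frame, angle in [(1, 0.0), (30, 0.8), (60, 0.0)]:
    arm.rotation_euler[2] = angle               # set joint angle (radians)
    arm.keyframe_insert(data_path="rotation_euler", frame=frame)
```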
♻ ☆ Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
comment: Project website: https://egoman-project.github.io
OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation
Visual imitation learning enables robotic agents to acquire skills by observing expert demonstration videos. In the one-shot setting, the agent generates a policy after observing a single expert demonstration without additional fine-tuning. Existing approaches typically train and evaluate on the same set of tasks, varying only object configurations, and struggle to generalize to unseen tasks with different semantic or structural requirements. While some recent methods attempt to address this, they exhibit low success rates on hard test tasks that, despite being visually similar to some training tasks, differ in context and require distinct responses. Additionally, most existing methods lack an explicit model of environment dynamics, limiting their ability to reason about future states. To address these limitations, we propose a novel framework for one-shot visual imitation learning via world-model-guided trajectory generation. Given an expert demonstration video and the agent's initial observation, our method leverages a learned world model to predict a sequence of latent states and actions. This latent trajectory is then decoded into physical waypoints that guide the agent's execution. Our method is evaluated on two simulated benchmarks and three real-world robotic platforms, where it consistently outperforms prior approaches, with over 30% improvement in some cases. The code is available at https://github.com/raktimgg/osvi-wm.
♻ ☆ SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling
Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action descriptions specifically for surgical robots, and built SurgWorld on a state-of-the-art physical AI world model and SATA. SurgWorld is able to generate diverse, generalizable, and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudo-kinematics from synthetic surgical videos, producing synthetic paired video-action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data-efficient surgical robot policies.
♻ ☆ Reproducibility in the Control of Autonomous Mobility-on-Demand Systems
Autonomous Mobility-on-Demand (AMoD) systems, powered by advances in robotics, control, and Machine Learning (ML), offer a promising paradigm for future urban transportation. AMoD offers fast and personalized travel services by leveraging centralized control of autonomous vehicle fleets to optimize operations and enhance service performance. However, the rapid growth of this field has outpaced the development of standardized practices for evaluating and reporting results, leading to significant challenges in reproducibility. As AMoD control algorithms become increasingly complex and data-driven, a lack of transparency in modeling assumptions, experimental setups, and algorithmic implementation hinders scientific progress and undermines confidence in the results. This paper presents a systematic study of reproducibility in AMoD research. We identify key components across the research pipeline, spanning system modeling, control problems, simulation design, algorithm specification, and evaluation, and analyze common sources of irreproducibility. We survey prevalent practices in the literature, highlight gaps, and propose a structured framework to assess and improve reproducibility. Specifically, concrete guidelines are offered, along with a "reproducibility checklist", to support future work in achieving replicable, comparable, and extensible results. While focused on AMoD, the principles and practices we advocate generalize to a broader class of cyber-physical systems that rely on networked autonomy and data-driven control. This work aims to lay the foundation for a more transparent and reproducible research culture in the design and deployment of intelligent mobility systems.
♻ ☆ Contingency Model-based Control (CMC) for Communicationless Cooperative Collision Avoidance in Robot Swarms
Cooperative collision avoidance between robots, or "agents," in swarm operations remains an open challenge. Assuming a decentralized architecture, each agent is responsible for making its own decisions and choosing its control actions. Most existing approaches rely on a (wireless) communication network between (some of) the agents. In reality, however, communication is brittle. It may be affected by latency, further delays and packet losses, and transmission faults. Moreover, it is subject to adversarial attacks, such as jamming or spoofing. This paper proposes Contingency Model-based Control (CMC), a decentralized cooperative approach that does not rely on communication. Instead, the control algorithm is based on consensual rules that are designed for all agents offline, similar to traffic rules. For CMC, this includes the definition of a contingency trajectory for each robot, and perpendicular bisecting planes as collision avoidance constraints. The setup permits a full guarantee of recursive feasibility and collision avoidance between all swarm members in closed-loop operation. CMC naturally satisfies the plug & play paradigm, i.e., new robots may enter the swarm dynamically. The effectiveness of the CMC regime is demonstrated in two numerical examples, showing that the collision avoidance guarantee is intact and the robot swarm operates smoothly in a constrained environment.
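The geometric core, a perpendicular bisecting plane used as a half-space constraint, is easy to sketch: each pair of agents splits space at the midpoint of the segment joining them, so planning within one's own half-space rules out pairwise collisions. The notation below is generic, not the paper's.

```python
import numpy as np

def bisecting_halfspace(p_i, p_j, margin=0.0):
    """Perpendicular bisecting plane between agents i and j as a half-space.

    Agent i must keep its position x on its own side, a @ x <= b. If every
    agent plans against the bisectors with its neighbors, the resulting free
    regions are disjoint, which yields pairwise collision avoidance.
    """
    a = p_j - p_i                       # plane normal points toward agent j
    mid = 0.5 * (p_i + p_j)             # bisector passes through the midpoint
    b = a @ mid - margin * np.linalg.norm(a)
    return a, b

# Example: two agents 2 m apart; agent i's half-space ends at the midpoint.
a, b = bisecting_halfspace(np.array([0.0, 0.0]), np.array([2.0, 0.0]))
print(a, b)  # [2. 0.] 2.0  ->  constraint 2*x <= 2, i.e., x <= 1
```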
♻ ☆ Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase that provides a one-stop and reproducible solution with standardized interfaces, integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.
comment: Codebase: https://github.com/EvolvingLMMs-Lab/EASI/ ; Leaderboard: https://huggingface.co/spaces/lmms-lab-si/EASI-Leaderboard
♻ ☆ Scaling Spatial Intelligence with Multimodal Foundation Models
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
comment: Codebase: https://github.com/OpenSenseNova/SenseNova-SI ; Models: https://huggingface.co/collections/sensenova/sensenova-si
♻ ☆ How Robot Dogs See the Unseeable: Improving Visual Interpretability via Peering for Exploratory Robots
In vegetated environments, such as forests, exploratory robots play a vital role in navigating complex, cluttered environments where human access is limited and traditional equipment struggles. Visual occlusion from obstacles, such as foliage, can severely obstruct a robot's sensors, impairing scene understanding. We show that "peering", a characteristic side-to-side movement used by insects to overcome their visual limitations, can also allow robots to markedly improve visual reasoning under partial occlusion. This is accomplished by applying core signal processing principles, specifically optical synthetic aperture sensing, together with the vision reasoning capabilities of modern large multimodal models. Peering enables real-time, high-resolution, and wavelength-independent perception, which is crucial for vision-based scene understanding across a wide range of applications. The approach is low-cost and immediately deployable on any camera-equipped robot. We investigated different peering motions and occlusion masking strategies, demonstrating that, unlike peering, state-of-the-art multi-view 3D vision techniques fail in these conditions due to their high susceptibility to occlusion. Our experiments were carried out on an industrial-grade quadrupedal robot. However, the ability to peer is not limited to such platforms, but potentially also applicable to bipedal, hexapod, wheeled, or crawling platforms. Robots that can effectively see through partial occlusion will gain superior perception abilities, including enhanced scene understanding, situational awareness, camouflage breaking, and advanced navigation in complex environments.
♻ ☆ OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs), as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen, thereby preserving and utilizing the rich semantic understanding learned from large-scale pre-training and enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
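A simplified reading of text-aware extraction (not OTTER's exact module): score frozen visual patch tokens against the pooled instruction embedding in the shared CLIP-style space and keep only the top-k. Shapes and the selection rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_text_relevant_tokens(visual_tokens, text_embedding, k=16):
    """Keep only the visual patch tokens most aligned with the instruction.

    visual_tokens: (N, D) patch features from a frozen vision encoder;
    text_embedding: (D,) pooled instruction feature from the frozen text
    encoder. Because contrastive pre-training aligns the two spaces, cosine
    similarity is a meaningful relevance score.
    """
    sims = F.cosine_similarity(visual_tokens,
                               text_embedding.unsqueeze(0), dim=-1)
    topk = sims.topk(k).indices
    return visual_tokens[topk]  # (k, D) tokens passed to the policy
```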
♻ ☆ Analysis of Errors in Robotic Surgical Skill Acquisition with Video-Based Detection
Robot-assisted minimally invasive surgeries offer many advantages but require complex motor tasks that take surgeons years to master. There is currently a lack of knowledge on how surgeons acquire these robotic surgical skills. Toward bridging this gap, a previous study followed surgical residents learning complex surgical dry-lab tasks on a surgical robot over six months. Errors are an important measure for training and skill evaluation, but unlike in virtual simulations, in dry-lab training, errors are difficult to monitor automatically. Here, we analyzed errors in the ring tower transfer task, in which surgical residents moved a ring along a curved wire as quickly and accurately as possible. We developed an image-processing algorithm using color and size thresholds, optical flow, and short-time Fourier transforms to detect collision errors, achieving a detection accuracy of approximately 95%. Using the detected errors and task completion time, we found that the residents reduced their completion time and number of errors over the six months, while the percentage of task time spent making errors remained relatively constant on average. This analysis sheds light on the learning process of the residents and can serve as a step towards providing error-related feedback to robotic surgeons.
comment: 9 pages, 4 figures, 1 table. Alex Geftler and Ilana Nisky contributed equally to this work
♻ ☆ CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems
Multi-modal learning has emerged as a key technique for improving performance across domains such as autonomous driving, robotics, and reasoning. However, in certain scenarios, particularly in resource-constrained environments, some modalities available during training may be absent during inference. While existing frameworks effectively utilize multiple data sources during training and enable inference with reduced modalities, they are primarily designed for single-agent settings. This poses a critical limitation in dynamic environments such as connected autonomous vehicles (CAV), where incomplete data coverage can lead to decision-making blind spots. Conversely, some works explore multi-agent collaboration but without addressing missing modalities at test time. To overcome these limitations, we propose Collaborative Auxiliary Modality Learning (CAML), a novel multi-modal multi-agent framework that enables agents to collaborate and share multi-modal data during training, while allowing inference with reduced modalities during testing. Experimental results in collaborative decision-making for CAV in accident-prone scenarios demonstrate that CAML achieves up to a 58.1% improvement in accident detection. Additionally, we validate CAML on real-world aerial-ground robot data for collaborative semantic segmentation, achieving up to a 10.6% improvement in mIoU.
♻ ☆ PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, poor reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge this gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real-world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.
comment: Website: https://polaris-evals.github.io/
Robotics 46
☆ Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation
The primary obstacle to applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently proposed learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these limitations, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a vast 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building upon Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, which enables the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach. GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io
comment: 27 pages, 11 figures
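The abstract does not spell out its Policy-Invariant Reward Shaping construction. The standard way to add dense rewards without changing the optimal policy is potential-based shaping (Ng et al., 1999), sketched below with the reward model's progress estimate assumed as the potential; the paper's exact scheme may differ.

```python
# Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).
# Policy-invariant by construction for any potential Phi; here Phi is
# assumed to be a learned progress estimate from a process reward model.
def shaped_reward(r_env, phi_s, phi_s_next, gamma=0.99):
    return r_env + gamma * phi_s_next - phi_s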
☆ The Bulldozer Technique: Efficient Elimination of Local Minima Traps for APF-Based Robot Navigation
Path planning is a fundamental component in autonomous mobile robotics, enabling a robot to navigate from its current location to a desired goal while avoiding obstacles. Among the various techniques, Artificial Potential Field (APF) methods have gained popularity due to their simplicity, real-time responsiveness, and low computational requirements. However, a major limitation of conventional APF approaches is the local minima trap problem, where the robot becomes stuck in a position with no clear direction toward the goal. This paper proposes a novel path planning technique, termed the Bulldozer, which addresses the local minima issue while preserving the inherent advantages of APF. The Bulldozer technique introduces a backfilling mechanism that systematically identifies and eliminates local minima regions by increasing their potential values, analogous to a bulldozer filling potholes in a road. Additionally, a ramp-based enhancement is incorporated to assist the robot in escaping trap areas when starting within a local minimum. The proposed technique is experimentally validated using a physical mobile robot across various maps with increasing complexity. Comparative analyses are conducted against standard APF, adaptive APF, and well-established planning algorithms such as A*, PRM, and RRT. Results demonstrate that the Bulldozer technique effectively resolves the local minima problem while achieving superior execution speed and competitive path quality. Furthermore, a kinematic tracking controller is employed to assess the smoothness and traceability of the planned paths, confirming their suitability for real-world execution.
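A minimal sketch of the backfilling idea on a discrete potential grid, assuming 4-connected gradient descent and a fixed bump increment (illustrative choices, not the paper's formulation): whenever descent stalls in a local minimum, the cell's potential is raised until a downhill path to the goal reappears.

```python
# Gradient descent on a potential grid with "pothole backfilling":
# stalled cells get their potential raised instead of trapping the robot.
import numpy as np

def descend_with_backfill(U, start, goal, bump=1.0, max_iters=10000):
    H, W = U.shape
    U = U.astype(float).copy()
    pos, path = start, [start]
    for _ in range(max_iters):
        if pos == goal:
            return path, U
        r, c = pos
        nbrs = [(r + dr, c + dc) for dr, dc in
                [(-1, 0), (1, 0), (0, -1), (0, 1)]
                if 0 <= r + dr < H and 0 <= c + dc < W]
        best = min(nbrs, key=lambda p: U[p])
        if U[best] >= U[pos]:      # local minimum: fill the pothole
            U[pos] += bump
        else:                      # downhill step toward the goal
            pos = best
            path.append(pos)
    return path, U

# Toy usage: an artificial trap between start (0, 0) and goal (4, 9).
U = np.fromfunction(lambda r, c: (r - 4) ** 2 + (c - 9) ** 2, (5, 10))
U[2, 4] -= 30                      # create a local minimum
path, _ = descend_with_backfill(U, (0, 0), (4, 9))
```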
☆ Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control
Humans intuitively move to sound, but current humanoid robots lack expressive improvisational capabilities, confined to predefined motions or sparse commands. Generating motion from audio and then retargeting it to robots relies on explicit motion reconstruction, leading to cascaded errors, high latency, and disjointed acoustic-actuation mapping. We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. Guided by the core principle of "motion = content + style", the framework treats audio as implicit style signals and eliminates the need for explicit motion reconstruction. RoboPerform integrates a ResMoE teacher policy for adapting to diverse motion patterns and a diffusion-based student policy for audio style injection. This retargeting-free design ensures low latency and high fidelity. Experimental validation shows that RoboPerform achieves promising results in physical plausibility and audio alignment, successfully transforming robots into responsive performers capable of reacting to audio.
☆ RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion
Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.
☆ The N-5 Scaling Law: Topological Dimensionality Reduction in the Optimal Design of Fully-actuated Multirotors
The geometric design of fully-actuated and omnidirectional N-rotor aerial vehicles is conventionally formulated as a parametric optimization problem, seeking a single optimal set of N orientations within a fixed architectural family. This work departs from that paradigm to investigate the intrinsic topological structure of the optimization landscape itself. We formulate the design problem on the product manifold of projective lines, $(\mathbb{RP}^2)^N$, fixing the rotor positions to the vertices of polyhedral chassis while varying their lines of action. By minimizing a coordinate-invariant Log-Volume isotropy metric, we reveal that the topology of the global optima is governed strictly by the symmetry of the chassis. For generic (irregular) vertex arrangements, the solutions appear as a discrete set of isolated points. However, as the chassis geometry approaches regularity, the solution space undergoes a critical phase transition, collapsing onto an N-dimensional torus of the lines tangent at the vertices to the circumscribing sphere of the chassis, and subsequently reducing to continuous 1-dimensional curves driven by Affine Phase Locking. We synthesize these observations into the N-5 Scaling Law: an empirical relationship holding for all examined regular planar polygons and Platonic solids ($N \le 10$), where the space of optimal configurations consists of $K = N - 5$ disconnected 1D topological branches. We demonstrate that these locking patterns correspond to a sequence of admissible star polygons $\{N/q\}$, allowing for the exact prediction of optimal phases for arbitrary N. Crucially, this topology reveals a design redundancy that enables optimality-preserving morphing: the vehicle can continuously reconfigure along these branches while preserving optimal isotropic control authority.
☆ Interactive Robot Programming for Surface Finishing via Task-Centric Mixed Reality Interfaces
Lengthy setup processes that require robotics expertise remain a major barrier to deploying robots for tasks involving high product variability and small batch sizes. As a result, collaborative robots, despite their advanced sensing and control capabilities, are rarely used for surface finishing in small-scale craft and manufacturing settings. To address this gap, we propose a novel robot programming approach that enables non-experts to intuitively program robots through interactive, task-focused workflows. To that end, we developed a new surface segmentation algorithm that incorporates human input to identify and refine workpiece regions for processing. Throughout the programming process, users receive continuous visual feedback on the robot's learned model, enabling them to iteratively refine the segmentation result. Based on the segmented surface model, a robot trajectory is generated to cover the desired processing area. We evaluated multiple interaction designs across two comprehensive user studies to derive an optimal interface that significantly reduces user workload, improves usability, and enables effective task programming even for users with limited practical experience.
comment: Currently under review at Intelligent Service Robotics
☆ A Kalman Filter-Based Disturbance Observer for Steer-by-Wire Systems
Steer-by-Wire systems replace mechanical linkages, offering benefits such as weight reduction, design flexibility, and compatibility with autonomous driving. However, they are susceptible to high-frequency disturbances from unintentional driver torque, known as driver impedance, which can degrade steering performance. Existing approaches either rely on direct torque sensors, which are costly and impractical, or lack the temporal resolution to capture rapid, high-frequency driver-induced disturbances. We address this limitation by designing a Kalman filter-based disturbance observer that estimates high-frequency driver torque using only motor state measurements. We model the driver's passive torque as an extended state using a PT1-lag approximation and integrate it into both linear and nonlinear Steer-by-Wire system models. In this paper, we present the design, implementation, and simulation of this disturbance observer, together with an evaluation of different Kalman filter variants. Our findings indicate that the proposed disturbance observer accurately reconstructs driver-induced disturbances with only a minimal delay of 14 ms. We show that a nonlinear extended Kalman filter outperforms its linear counterpart in handling frictional nonlinearities, improving estimation during transitions from static to dynamic friction. Given the study's methodology, validation necessarily relied on simulation rather than real-world experimentation. Further studies are needed to investigate the robustness of the observers under real-world driving conditions.
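A minimal sketch of such an extended-state Kalman filter, with the driver torque appended as a PT1-lag state and estimated from motor measurements alone; the plant parameters and noise covariances below are illustrative assumptions, not values from the paper.

```python
# Linear Kalman filter with the driver torque as a PT1-lag extended state.
# State x = [angle, velocity, driver_torque]; only motor angle/velocity
# are measured, and x[2] recovers the disturbance estimate.
import numpy as np

dt, J, b, T = 0.001, 0.05, 0.1, 0.02    # step, inertia, damping, PT1 lag

A = np.eye(3) + dt * np.array([[0.0, 1.0, 0.0],
                               [0.0, -b / J, 1.0 / J],
                               [0.0, 0.0, -1.0 / T]])   # Euler discretization
B = dt * np.array([[0.0], [1.0 / J], [0.0]])
H = np.array([[1.0, 0.0, 0.0],           # encoder angle
              [0.0, 1.0, 0.0]])          # encoder velocity
Q = np.diag([1e-9, 1e-6, 1e-2])          # large Q drives the extended state
R = np.diag([1e-8, 1e-6])

def kf_step(x, P, u, z):
    # Predict with the known motor command u.
    x = A @ x + B @ np.atleast_1d(u)
    P = A @ P @ A.T + Q
    # Update with encoder measurement z = [angle, velocity].
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(3) - K @ H) @ P
    return x, P                          # x[2]: estimated driver torque
```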
☆ Unsupervised Learning for Detection of Rare Driving Scenarios
The detection of rare and hazardous driving scenarios is a critical challenge for ensuring the safety and reliability of autonomous systems. This research explores an unsupervised learning framework for detecting rare and extreme driving scenarios using naturalistic driving data (NDD). We leverage the recently proposed Deep Isolation Forest (DIF), an anomaly detection algorithm that combines neural network-based feature representations with Isolation Forests (IFs), to identify non-linear and complex anomalies. Data from perception modules, capturing vehicle dynamics and environmental conditions, is preprocessed into structured statistical features extracted from sliding windows. The framework incorporates t-distributed stochastic neighbor embedding (t-SNE) for dimensionality reduction and visualization, enabling better interpretability of detected anomalies. Evaluation is conducted using a proxy ground truth, combining quantitative metrics with qualitative video frame inspection. Our results demonstrate that the proposed approach effectively identifies rare and hazardous driving scenarios, providing a scalable solution for anomaly detection in autonomous driving systems. Given the study's methodology, the evaluation necessarily relies on proxy ground truth and manually defined feature combinations, which do not encompass the full range of real-world driving anomalies or their nuanced contextual dependencies.
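As a pipeline-shape illustration, the sketch below computes sliding-window statistics and scores them with scikit-learn's plain IsolationForest, which stands in for the Deep Isolation Forest used in the paper; the feature choices and window sizes are assumptions.

```python
# Sliding-window statistics -> isolation-forest anomaly scores.
import numpy as np
from sklearn.ensemble import IsolationForest

def window_features(signals, win=50, hop=25):
    """signals: (N, D) array of per-frame driving features."""
    feats = []
    for s in range(0, len(signals) - win + 1, hop):
        w = signals[s:s + win]
        feats.append(np.concatenate([w.mean(0), w.std(0),
                                     w.min(0), w.max(0)]))
    return np.asarray(feats)

signals = np.random.randn(5000, 6)       # stand-in for NDD feature streams
X = window_features(signals)
scores = IsolationForest(random_state=0).fit(X).score_samples(X)
rare = np.argsort(scores)[:20]           # lowest scores = rarest windows
```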
☆ Soft Robotic Technological Probe for Speculative Fashion Futures
Emerging wearable robotics demand design approaches that address not only function, but also social meaning. In response, we present Sumbrella, a soft robotic garment developed as a speculative fashion probe. We first detail the design and fabrication of the Sumbrella, including sequenced origami-inspired bistable units, fabric pneumatic actuation chambers, cable-driven shape-morphing mechanisms, computer vision components, and an integrated wearable system comprising a hat and bolero jacket housing power and control electronics. Through a focus group with twelve creative technologists, we then used Sumbrella as a technological probe to explore how people interpreted soft robotic wearables, interacted with them, and imagined future relationships with them. While Sumbrella allowed our participants to engage in rich discussion around speculative futures and expressive potential, it also surfaced concerns about exploitation, surveillance, and the personal risks and societal ethics of embedding biosensing technologies in public life. We contribute key considerations and recommendations to the Human-Robot Interaction (HRI) field for designing soft robotic garments, including the potential for kinesic communication, the impact of such technologies on social dynamics, and the importance of ethical guidelines. Finally, we provide a reflection on our application of speculative design, proposing that it allows HRI researchers to consider not only functionality, but also how wearable robots influence definitions of what is considered acceptable or desirable in public settings.
☆ Act2Goal: From World Model To General Goal-conditioned Policy
Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/
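A minimal sketch of the proximal/distal split behind the multi-scale temporal idea described above, with assumed split sizes and strides (the paper's hashing scheme is not specified in the abstract).

```python
# Split an imagined visual trajectory into dense nearby frames for
# closed-loop control and sparse far frames as global anchors.
def multiscale_frames(imagined, n_proximal=4, distal_stride=8):
    proximal = imagined[:n_proximal]                # every imagined step
    distal = imagined[n_proximal::distal_stride]    # coarse anchors
    return proximal, distal
```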
☆ Robust Deep Learning Control with Guaranteed Performance for Safe and Reliable Robotization in Heavy-Duty Machinery
Today's heavy-duty mobile machines (HDMMs) face two transitions: from diesel-hydraulic actuation to clean electric systems driven by climate goals, and from human supervision toward greater autonomy. Diesel-hydraulic systems have long dominated, so full electrification, via direct replacement or redesign, raises major technical and economic challenges. Although advanced artificial intelligence (AI) could enable higher autonomy, adoption in HDMMs is limited by strict safety requirements, and these machines still rely heavily on human supervision. This dissertation develops a control framework that (1) simplifies control design for electrified HDMMs through a generic modular approach that is energy-source independent and supports future modifications, and (2) defines hierarchical control policies that partially integrate AI while guaranteeing safety-defined performance and stability. Five research questions align with three lines of investigation: a generic robust control strategy for multi-body HDMMs with strong stability across actuation types and energy sources; control solutions that keep strict performance under uncertainty and faults while balancing robustness and responsiveness; and methods to interpret and trust black-box learning strategies so they can be integrated stably and verified against international safety standards. The framework is validated in three case studies spanning different actuators and conditions, covering heavy-duty mobile robots and robotic manipulators. Results appear in five peer-reviewed publications and one unpublished manuscript, advancing nonlinear control and robotics and supporting both transitions.
comment: Doctoral Dissertation, Tampere University
☆ Theory of Mind for Explainable Human-Robot Interaction
Within the context of human-robot interaction (HRI), Theory of Mind (ToM) is intended to serve as a user-friendly backend to the interface of robotic systems, enabling robots to infer and respond to human mental states. When integrated into robots, ToM allows them to adapt their internal models to users' behaviors, enhancing the interpretability and predictability of their actions. Similarly, Explainable Artificial Intelligence (XAI) aims to make AI systems transparent and interpretable, allowing humans to understand and interact with them effectively. Since ToM in HRI serves related purposes, we propose to consider ToM as a form of XAI and evaluate it through the eValuation XAI (VXAI) framework and its seven desiderata. This paper identifies a critical gap in the application of ToM within HRI, as existing methods rarely assess the extent to which explanations correspond to the robot's actual internal reasoning. To address this limitation, we propose to integrate ToM within XAI frameworks. By embedding ToM principles inside XAI, we argue for a shift in perspective, as current XAI research focuses predominantly on the AI system itself and often lacks user-centered explanations. Incorporating ToM would enable a change in focus, prioritizing the user's informational needs and perspective.
☆ Assessing behaviour coverage in a multi-agent system simulation for autonomous vehicle testing
As autonomous vehicle technology advances, ensuring the safety and reliability of these systems becomes paramount. Consequently, comprehensive testing methodologies are essential to evaluate the performance of autonomous vehicles in diverse and complex real-world scenarios. This study focuses on the behaviour coverage analysis of a multi-agent system simulation designed for autonomous vehicle testing, and provides a systematic approach to measure and assess behaviour coverage within the simulation environment. By defining a set of driving scenarios and agent interactions, we evaluate the extent to which the simulation encompasses a broad range of behaviours relevant to autonomous driving. Our findings highlight the importance of behaviour coverage in validating the effectiveness and robustness of autonomous vehicle systems. Through the analysis of behaviour coverage metrics and coverage-based testing, we identify key areas for improvement and optimization in the simulation framework. Thus, a Model Predictive Control (MPC) pedestrian agent is proposed, whose objective function is formulated to encourage "interesting" tests while promoting more realistic behaviour than other previously studied pedestrian agents. This research contributes to advancing the field of autonomous vehicle testing by providing insights into the comprehensive evaluation of system behaviour in simulated environments. The results offer valuable implications for enhancing the safety, reliability, and performance of autonomous vehicles through rigorous testing methodologies.
☆ Optimal Scalability-Aware Allocation of Swarm Robots: From Linear to Retrograde Performance via Marginal Gains
In collective systems, the available agents are a limited resource that must be allocated among tasks to maximize collective performance. Computing the optimal allocation of several agents to numerous tasks through a brute-force approach can be infeasible, especially when each task's performance scales differently with the increase of agents. For example, difficult tasks may require more agents to achieve similar performances compared to simpler tasks, but performance may saturate nonlinearly as the number of allocated agents increases. We propose a computationally efficient algorithm, based on marginal performance gains, for optimally allocating agents to tasks with concave scalability functions, including linear, saturating, and retrograde scaling, to achieve maximum collective performance. We test the algorithm by allocating a simulated robot swarm among collective decision-making tasks, where embodied agents sample their environment and exchange information to reach a consensus on spatially distributed environmental features. We vary task difficulties by different geometrical arrangements of environmental features in space (patchiness). In this scenario, decision performance in each task scales either as a saturating curve (following Condorcet's Jury Theorem in an interference-free setup) or as a retrograde curve (when physical interference among robots restricts their movement). Using simple robot simulations, we show that our algorithm can be useful in allocating robots among tasks. Our approach aims to advance the deployment of future real-world multi-robot systems.
comment: 14 pages, 11 figures, Accepted for publication in IEEE Transactions on Systems, Man, and Cybernetics: Systems
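For concave per-task performance curves, a marginal-gains strategy reduces to a greedy loop: always give the next agent to the task whose next marginal gain is largest, and stop once the best available gain turns non-positive (as it can with retrograde curves). A sketch with illustrative saturating and retrograde curves, not the paper's exact algorithm:

```python
# Greedy marginal-gain allocation; optimal when every f is concave.
import heapq
import math

def allocate(perf_fns, n_agents):
    """perf_fns: list of concave f(k) -> performance with k agents."""
    counts = [0] * len(perf_fns)
    heap = [(-(f(1) - f(0)), i) for i, f in enumerate(perf_fns)]
    heapq.heapify(heap)
    for _ in range(n_agents):
        gain, i = heapq.heappop(heap)    # most negative = largest gain
        if gain >= 0:                    # best remaining gain <= 0: stop
            break
        counts[i] += 1
        f, k = perf_fns[i], counts[i]
        heapq.heappush(heap, (-(f(k + 1) - f(k)), i))
    return counts

# Example: one saturating and one retrograde (still concave) task.
tasks = [lambda k: 1 - math.exp(-0.5 * k),      # saturating
         lambda k: 1.2 * k - 0.15 * k * k]      # retrograde
print(allocate(tasks, 6))                       # -> [2, 4]
```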
☆ PCR-ORB: Enhanced ORB-SLAM3 with Point Cloud Refinement Using Deep Learning-Based Dynamic Object Filtering
Visual Simultaneous Localization and Mapping (vSLAM) systems encounter substantial challenges in dynamic environments where moving objects compromise tracking accuracy and map consistency. This paper introduces PCR-ORB (Point Cloud Refinement ORB), an enhanced ORB-SLAM3 framework that integrates deep learning-based point cloud refinement to mitigate dynamic object interference. Our approach employs YOLOv8 for semantic segmentation combined with CUDA-accelerated processing to achieve real-time performance. The system implements a multi-stage filtering strategy encompassing ground plane estimation, sky region removal, edge filtering, and temporal consistency validation. Comprehensive evaluation on the KITTI dataset (sequences 00-09) demonstrates performance characteristics across different environmental conditions and scene types. Notable improvements are observed in specific sequences, with sequence 04 achieving 25.9% improvement in ATE RMSE and 30.4% improvement in ATE median. However, results show mixed performance across sequences, indicating scenario-dependent effectiveness. The implementation provides insights into dynamic object filtering challenges and opportunities for robust navigation in complex environments.
comment: 17 pages, 2 figures, 1 table
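A sketch of the core dynamic-object filtering step: drop 3D points whose image projection lands on a dynamic-class segmentation mask. The intrinsics and the mask source (e.g., YOLOv8 segmentation output) are assumed inputs; this is a fragment of the idea, far from the full PCR-ORB pipeline.

```python
# Remove map points that project onto dynamic-object segmentation masks.
import numpy as np

def filter_dynamic_points(pts_cam, K, dyn_mask):
    """pts_cam: (N, 3) points with z > 0 in the camera frame.
    K: 3x3 intrinsics. dyn_mask: HxW bool, True on dynamic pixels."""
    uv = (K @ pts_cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    H, W = dyn_mask.shape
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    keep = np.ones(len(pts_cam), dtype=bool)     # keep unprojectable points
    keep[inside] = ~dyn_mask[v[inside], u[inside]]
    return pts_cam[keep]
```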
☆ Explainable Neural Inverse Kinematics for Obstacle-Aware Robotic Manipulation: A Comparative Analysis of IKNet Variants
Deep neural networks have accelerated inverse-kinematics (IK) inference to the point where low-cost manipulators can execute complex trajectories in real time, yet the opaque nature of these models contradicts the transparency and safety requirements emerging in responsible AI regulation. This study proposes an explainability-centered workflow that integrates Shapley-value attribution with physics-based obstacle avoidance evaluation for the ROBOTIS OpenManipulator-X. Building upon the original IKNet, two lightweight variants, Improved IKNet with residual connections and Focused IKNet with position-orientation decoupling, are trained on a large, synthetically generated pose-joint dataset. SHAP is employed to derive both global and local importance rankings, while the InterpretML toolkit visualizes partial-dependence patterns that expose non-linear couplings between Cartesian poses and joint angles. To bridge algorithmic insight and robotic safety, each network is embedded in a simulator that subjects the arm to randomized single- and multi-obstacle scenes; forward kinematics, capsule-based collision checks, and trajectory metrics quantify the relationship between attribution balance and physical clearance. Qualitative heat maps reveal that architectures distributing importance more evenly across pose dimensions tend to maintain wider safety margins without compromising positional accuracy. The combined analysis demonstrates that explainable AI (XAI) techniques can illuminate hidden failure modes, guide architectural refinements, and inform obstacle-aware deployment strategies for learning-based IK. The proposed methodology thus contributes a concrete path toward trustworthy, data-driven manipulation that aligns with emerging responsible-AI standards.
comment: 27 pages, 16 figures
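A hedged sketch of the Shapley-value attribution step, with a small scikit-learn MLP standing in for IKNet and random toy data; shap's model-agnostic KernelExplainer is used for clarity (it is slow, and API details vary across shap versions).

```python
# Shapley attributions for a neural IK model: which pose dimensions
# drive the first joint angle? Shapes and the model are illustrative.
import numpy as np
import shap
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
poses = rng.uniform(-1, 1, (500, 6))    # x, y, z, roll, pitch, yaw
joints = rng.uniform(-1, 1, (500, 4))   # 4-DOF joint targets (toy data)

ik = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(poses, joints)

# Explain the first joint's prediction against a small background set.
explainer = shap.KernelExplainer(lambda X: ik.predict(X)[:, 0], poses[:50])
shap_vals = explainer.shap_values(poses[:10])   # (10, 6) attributions
print(np.abs(shap_vals).mean(axis=0))           # global importance per input
```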
☆ Beyond Coverage Path Planning: Can UAV Swarms Perfect Scattered Regions Inspections?
Unmanned Aerial Vehicles (UAVs) have revolutionized inspection tasks by offering a safer, more efficient, and flexible alternative to traditional methods. However, battery limitations often constrain their effectiveness, necessitating the development of optimized flight paths and data collection techniques. While existing approaches like coverage path planning (CPP) ensure comprehensive data collection, they can be inefficient, especially when inspecting multiple non-connected Regions of Interest (ROIs). This paper introduces the Fast Inspection of Scattered Regions (FISR) problem and proposes a novel solution, the multi-UAV Disjoint Areas Inspection (mUDAI) method. The introduced approach implements a two-fold optimization procedure for calculating the best image-capturing positions and the most efficient UAV trajectories, balancing data resolution and operational time while minimizing redundant data collection and resource consumption. The mUDAI method is designed to enable rapid, efficient inspections of scattered ROIs, making it ideal for applications such as security infrastructure assessments, agricultural inspections, and emergency site evaluations. A combination of simulated evaluations and real-world deployments is used to validate and quantify the method's ability to improve operational efficiency while preserving high-quality data capture, demonstrating its effectiveness in real-world operations. An open-source Python implementation of the mUDAI method can be found on GitHub (https://github.com/soc12/mUDAI), and the collected and processed data from the real-world experiments are hosted on Zenodo (https://zenodo.org/records/13866483). Finally, an online platform (https://sites.google.com/view/mudai-platform/) allows interested readers to interact with the mUDAI method and generate their own multi-UAV FISR missions.
☆ A Human-Oriented Cooperative Driving Approach: Integrating Driving Intention, State, and Conflict
Human-vehicle cooperative driving serves as a vital bridge to fully autonomous driving by improving driving flexibility and gradually building driver trust and acceptance of autonomous technology. To establish more natural and effective human-vehicle interaction, we propose a Human-Oriented Cooperative Driving (HOCD) approach that primarily minimizes human-machine conflict by prioritizing driver intention and state. In implementation, we take both tactical and operational levels into account to ensure seamless human-vehicle cooperation. At the tactical level, we design an intention-aware trajectory planning method, using intention consistency cost as the core metric to evaluate the trajectory and align it with driver intention. At the operational level, we develop a control authority allocation strategy based on reinforcement learning, optimizing the policy through a designed reward function to achieve consistency between driver state and authority allocation. The results of simulation and human-in-the-loop experiments demonstrate that our proposed approach not only aligns with driver intention in trajectory planning but also ensures a reasonable authority allocation. Compared to other cooperative driving approaches, the proposed HOCD approach significantly enhances driving performance and mitigates human-machine conflict. The code is available at https://github.com/i-Qin/HOCD.
☆ SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling
Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action descriptions specifically for surgical robots. We then built SurgWorld based on the most advanced physical AI world model and SATA; it is able to generate diverse, generalizable, and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudo-kinematics from synthetic surgical videos, producing synthetic paired video-action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data-efficient surgical robot policies.
☆ Breaking Symmetry-Induced Degeneracy in Multi-Agent Ergodic Coverage via Stochastic Spectral Control
Multi-agent ergodic coverage via Spectral Multiscale Coverage (SMC) provides a principled framework for driving a team of agents so that their collective time-averaged trajectories match a prescribed spatial distribution. While classical SMC has demonstrated empirical success, it can suffer from gradient cancellation, particularly when agents are initialized near symmetry points of the target distribution, leading to undesirable behaviors such as stalling or motion constrained along symmetry axes. In this work, we rigorously characterize the initial conditions and symmetry-induced invariant manifolds that give rise to such directional degeneracy in first-order agent dynamics. To address this, we introduce a stochastic perturbation combined with a contraction term and prove that the resulting dynamics ensure almost-sure escape from zero-gradient manifolds while maintaining mean-square boundedness of agent trajectories. Simulations on symmetric multi-modal reference distributions demonstrate that the proposed stochastic SMC effectively mitigates transient stalling and axis-constrained motion, while ensuring that all agent trajectories remain bounded within the domain.
☆ A Sequential Hermaphrodite Coupling Mechanism for Lattice-based Modular Robots
Lattice-based modular robot systems are envisioned for large-scale construction in extreme environments, such as space. Coupling mechanisms for heterogeneous structural modules should meet all of the following requirements: single-sided coupling and decoupling, flat surfaces when uncoupled, and the ability to couple both to passive coupling interfaces and to other coupling mechanisms. The design requirements for such a coupling mechanism are complex. We propose a novel shape-matching mechanical coupling mechanism that satisfies these requirements. This mechanism enables controlled, sequential transitions between male and female states. When uncoupled, all mechanisms are in the female state. To enable single-sided coupling, one side of the mechanisms switches to the male state during the coupling process. Single-sided decoupling is possible not only from the male side but also from the female side by forcibly switching the opposite mechanism's male state to the female state. This coupling mechanism can be applied to various modular robot systems and robot arm tool changers.
comment: Author's version of a manuscript accepted at the 2025 IEEE International Conference on Mechatronics (ICM). (c) IEEE. The final published version is available at https://doi.org/10.1109/ICM62621.2025.10934866
☆ Towards the Automation in the Space Station: Feasibility Study and Ground Tests of a Multi-Limbed Intra-Vehicular Robot
This paper presents a feasibility study, including simulations and prototype tests, on the autonomous operation of a multi-limbed intra-vehicular robot (mobile manipulator), MLIVR for short, designed to assist astronauts with logistical tasks on the International Space Station (ISS). Astronauts spend significant time on tasks such as preparation, close-out, and the collection and transportation of goods, reducing the time available for critical mission activities. Our study explores the potential for a mobile manipulator to support these operations, emphasizing the need for autonomous functionality to minimize crew and ground operator effort while enabling real-time task execution. We focused on the robot's transportation capabilities, simulating its motion planning in 3D space. The actual motion execution was tested with a prototype on a 2D table to mimic a microgravity environment. The results demonstrate the feasibility of performing these tasks with minimal human intervention, offering a promising solution to enhance operational efficiency on the ISS.
comment: Author's version of a manuscript accepted at the 2025 IEEE/SICE International Symposium on System Integration (SII). (c) IEEE. The final published version is available at https://doi.org/10.1109/SII59315.2025.10870890
☆ Pole-centric Descriptors for Robust Robot Localization: Evaluation under Pole-at-Distance (PaD) Observations using the Small Pole Landmark (SPL) Dataset
While pole-like structures are widely recognized as stable geometric anchors for long-term robot localization, their identification reliability degrades significantly under Pole-at-Distance (PaD) observations typical of large-scale urban environments. This paper shifts the focus from descriptor design to a systematic investigation of descriptor robustness. Our primary contribution is the establishment of a specialized evaluation framework centered on the Small Pole Landmark (SPL) dataset. This dataset is constructed via an automated tracking-based association pipeline that captures multi-view, multi-distance observations of the same physical landmarks without manual annotation. Using this framework, we present a comparative analysis of Contrastive Learning (CL) and Supervised Learning (SL) paradigms. Our findings reveal that CL induces a more robust feature space for sparse geometry, achieving superior retrieval performance particularly in the 5-10 m range. This work provides an empirical foundation and a scalable methodology for evaluating landmark distinctiveness in challenging real-world scenarios.
comment: 3 pages, technical report
☆ A New Software Tool for Generating and Visualizing Robot Self-Collision Matrices
In robotics, it is common to check whether a given robot state results in self-intersection (i.e., a self-collision query) or to assess its distance from such an intersection (i.e., a self-proximity query). These checks are typically performed between pairs of shapes attached to different robot links. However, many of these shape pairs can be excluded in advance, as their configurations are known to always or never result in contact. This information is typically encoded in a self-collision matrix, where each entry (i, j) indicates whether a check should be performed between shape i and shape j. While the MoveIt Setup Assistant is widely used to generate such matrices, current tools are limited by static visualization, lack of proximity support, rigid single-geometry assumptions, and tedious refinement workflows, hindering flexibility and reuse in downstream robotics applications. In this work, we introduce an interactive tool that overcomes these limitations by generating and visualizing self-collision matrices across multiple shape representations, enabling dynamic inspection, filtering, and refinement of shape pairs. Outputs are provided in both JSON and YAML for easy integration. The system is implemented in Rust and uses the Bevy game engine to deliver high-quality visualizations. We demonstrate its effectiveness on multiple robot platforms, showing that matrices generated using diverse shape types yield faster and more accurate self-collision and self-proximity queries.
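The always/never classification behind such matrices is typically computed by sampling: test many random configurations and record which shape pairs never collide (safe to skip) or always collide (adjacent or default contact, also skipped). A sketch of that recipe, where sample_q and in_collision are hypothetical placeholders for the robot's configuration sampler and pairwise collision checker:

```python
# Classify shape pairs as "never" / "always" / "check" by random sampling.
import itertools

def build_collision_matrix(n_shapes, sample_q, in_collision, n_samples=1000):
    pairs = list(itertools.combinations(range(n_shapes), 2))
    hits = {p: 0 for p in pairs}
    for _ in range(n_samples):
        q = sample_q()                       # random joint configuration
        for p in pairs:
            hits[p] += in_collision(q, *p)   # hypothetical checker
    # Only pairs that sometimes (not always/never) collide need checking.
    return {p: ("never" if h == 0 else
                "always" if h == n_samples else "check")
            for p, h in hits.items()}
```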
☆ Beyond URDF: The Universal Robot Description Directory for Shared, Extensible, and Standardized Robot Models
Robots are typically described in software by specification files (e.g., URDF, SDF, MJCF, USD) that encode only basic kinematic, dynamic, and geometric information. As a result, downstream applications such as simulation, planning, and control must repeatedly re-derive richer data, leading to redundant computations, fragmented implementations, and limited standardization. In this work, we introduce the Universal Robot Description Directory (URDD), a modular representation that organizes derived robot information into structured, easy-to-parse JSON and YAML modules. Our open-source toolkit automatically generates URDDs from URDFs, with a Rust implementation supporting Bevy-based visualization. Additionally, we provide a JavaScript/Three.js viewer for web-based inspection of URDDs. Experiments on multiple robot platforms show that URDDs can be generated efficiently, encapsulate substantially richer information than standard specification files, and directly enable the construction of core robotics subroutines. URDD provides a unified, extensible resource for reducing redundancy and establishing shared standards across robotics frameworks. We conclude with a discussion on the limitations and implications of our work.
☆ Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a hybrid large-scale dataset sourced from both high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success, highlighting the importance of understanding physical contact for robust, touch-aware robotic agents.
☆ Simultaneous Extrinsic Contact and In-Hand Pose Estimation via Distributed Tactile Sensing
Prehensile autonomous manipulation tasks, such as peg insertion, tool use, or assembly, require precise in-hand understanding of the object pose and the extrinsic contacts made during interactions. Providing accurate estimation of pose and contacts is challenging. Tactile sensors can provide local geometry at the sensor and force information about the grasp, but the locality of sensing means resolving poses and contacts from tactile alone is often an ill-posed problem, as multiple configurations can be consistent with the observations. Adding visual feedback can help resolve ambiguities, but can suffer from noise and occlusions. In this work, we propose a method that pairs local observations from sensing with the physical constraints of contact. We propose a set of factors that ensure local consistency with tactile observations as well as enforcing physical plausibility, namely, that the estimated pose and contacts must respect the kinematic and force constraints of quasi-static rigid body interactions. We formalize our problem as a factor graph, allowing for efficient estimation. In our experiments, we demonstrate that our method outperforms existing geometric and contact-informed estimation pipelines, especially when only tactile information is available. Video results can be found at https://tacgraph.github.io/.
comment: 8 pages. IEEE Robotics and Automation Letters, 2026
☆ Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments
Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy (< 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.
♻ ☆ GrOMP: Grasped Object Manifold Projection for Multimodal Imitation Learning of Manipulation
Imitation Learning (IL) holds great potential for learning repetitive manipulation tasks, such as those in industrial assembly. However, its effectiveness is often limited by insufficient trajectory precision due to compounding errors. In this paper, we introduce Grasped Object Manifold Projection (GrOMP), an interactive method that mitigates these errors by constraining a non-rigidly grasped object to a lower-dimensional manifold. GrOMP assumes a precise task in which a manipulator holds an object that may shift within the grasp in an observable manner and must be mated with a grounded part. Crucially, all GrOMP enhancements are learned from the same expert dataset used to train the base IL policy, and are adjusted with an n-armed bandit-based interactive component. We propose a theoretical basis for GrOMP's improvement upon the well-known compounding error bound in IL literature. We demonstrate the framework on four precise assembly tasks using tactile feedback, and note that the approach remains modality-agnostic. Data and videos are available at williamvdb.github.io/GrOMPsite.
comment: 8 pages, 8 figures, 2 tables
♻ ☆ Mechanically Programming the Cross-Sectional Shape of Soft Growing Robotic Structures for Patient Transfer
Pneumatic soft everting robotic structures have the potential to facilitate human transfer tasks due to their ability to grow underneath humans without sliding friction and their utility as a flexible sling when deflated. Tubular structures naturally yield circular cross-sections when inflated, whereas a robotic sling must be both thin enough to grow between them and their resting surface and wide enough to cradle the human. Recent works have achieved flattened cross-sections by including rigid components into the structure, but this reduces conformability to the human. We present a method of mechanically programming the cross-section of soft everting robotic structures using flexible strips that constrain radial expansion between points along the outer membrane. Our method enables simultaneously wide and thin profiles while maintaining the full multi-axis flexibility of traditional slings. We develop and validate a model relating the geometric design specifications to the fabrication parameters, and experimentally characterize their effects on growth rate. Finally, we prototype a soft growing robotic sling system and demonstrate its use for assisting a single caregiver in bed-to-chair patient transfer.
♻ ☆ Gaussian Process Implicit Surfaces as Control Barrier Functions for Safe Robot Navigation
Level set methods underpin modern safety techniques such as control barrier functions (CBFs), while also serving as implicit surface representations for geometric shapes via distance fields. Inspired by these two paradigms, we propose a unified framework where the implicit surface itself acts as a CBF. We leverage Gaussian process (GP) implicit surfaces (GPIS) to represent the safety boundaries, using safety samples derived from sensor measurements to condition the GP. The GP posterior mean defines the implicit safety surface (safety belief), while the posterior variance provides a robust safety margin. Although GPs have favorable properties such as uncertainty estimation and analytical tractability, they scale cubically with the amount of data. To alleviate this issue, we develop a sparse solution called sparse Gaussian CBFs. To the best of our knowledge, GPIS have not been explicitly used to synthesize CBFs. We validate the approach on collision avoidance tasks in two settings: a simulated 7-DOF manipulator operating around the Stanford bunny, and a quadrotor navigating in 3D around a physical chair. In both cases, Gaussian CBFs (with and without sparsity) enable safe interaction and collision-free execution of trajectories that would otherwise intersect the objects.
comment: 8 pages, 7 figures, under review
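A minimal sketch of the core construction on a toy 2D scene: a GP fit to signed-distance samples gives a safety belief h(x) = mean(x) - k * std(x), with h(x) >= 0 meaning safe. The kernel, data, and margin factor are illustrative assumptions, and the sparse variant and downstream CBF-QP controller are omitted.

```python
# GP implicit surface as a barrier value: posterior mean minus a
# multiple of the posterior standard deviation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy scene: signed-distance samples around a disc of radius 0.5.
X = np.random.uniform(-2, 2, (200, 2))
d = np.linalg.norm(X, axis=1) - 0.5

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X, d)

def barrier(x, k=2.0):
    mu, std = gp.predict(np.atleast_2d(x), return_std=True)
    return float(mu[0] - k * std[0])    # uncertainty-robust safety margin

print(barrier([1.5, 0.0]) >= 0)         # point far from the obstacle: safe
```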
♻ ☆ Never too Cocky to Cooperate: An FIM and RL-based USV-AUV Collaborative System for Underwater Tasks in Extreme Sea Conditions
This paper develops a novel unmanned surface vehicle (USV)-autonomous underwater vehicle (AUV) collaborative system designed to enhance underwater task performance in extreme sea conditions. The system integrates a dual strategy: (1) high-precision multi-AUV localization enabled by Fisher information matrix-optimized USV path planning, and (2) reinforcement learning-based cooperative planning and control method for multi-AUV task execution. Extensive experimental evaluations in the underwater data collection task demonstrate the system's operational feasibility, with quantitative results showing significant performance improvements over baseline methods. The proposed system exhibits robust coordination capabilities between USV and AUVs while maintaining stability in extreme sea conditions. To facilitate reproducibility and community advancement, we provide an open-source simulation toolkit available at: https://github.com/360ZMEM/USV-AUV-colab .
comment: This paper has been accepted by IEEE Transactions on Mobile Computing
♻ ☆ Anti-Slip AI-Driven Model-Free Control with Global Exponential Stability in Skid-Steering Robots IROS
Undesired lateral and longitudinal wheel slippage can disrupt a mobile robot's heading angle, traction, and, eventually, desired motion. This issue makes the robotization and accurate modeling of heavy-duty machinery very challenging because the application primarily involves off-road terrains, which are susceptible to uneven motion and severe slippage. As a step toward the robotization of skid-steering heavy-duty robots (SSHDRs), this paper aims to design an innovative robust model-free control system, developed with neural networks, to strongly stabilize the robot dynamics in the presence of a broad range of potential wheel slippages. Before the control design, the dynamics of the SSHDR are first investigated by mathematically incorporating slippage effects, assuming that all functional modeling terms of the system are unknown to the control system. Then, a novel tracking control framework to guarantee global exponential stability of the SSHDR is designed as follows: 1) the unknown modeling of wheel dynamics is approximated using radial basis function neural networks (RBFNNs); and 2) a new adaptive law is proposed to compensate for slippage effects and tune the weights of the RBFNNs online during execution. Simulation and experimental results verify the proposed tracking control performance of a 4,836 kg SSHDR operating on slippery terrain.
comment: This paper has been published in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
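The RBFNN-plus-adaptive-law pattern described here can be sketched on a toy first-order plant: approximate the unknown dynamics as $\hat{f}(x) = W^\top \phi(x)$ and update the weights online from the tracking error, which is the standard Lyapunov-derived update. The gains, basis functions, and true dynamics below are illustrative assumptions, not the paper's design.

```python
# Adaptive RBFNN compensation of unknown dynamics on xdot = f(x) + u.
import numpy as np

centers = np.linspace(-2, 2, 15)
def phi(x, width=0.5):
    return np.exp(-((x - centers) ** 2) / (2 * width ** 2))

f_true = lambda x: 1.5 * np.sin(2 * x) + 0.5 * x    # unknown to controller
dt, k, gamma = 0.001, 5.0, 50.0
x, W = 0.0, np.zeros_like(centers)

for step in range(20000):
    t = step * dt
    xd, xd_dot = np.sin(t), np.cos(t)               # reference trajectory
    e = x - xd
    u = -k * e + xd_dot - W @ phi(x)                # cancel f with f_hat
    W += dt * gamma * phi(x) * e                    # adaptive weight law
    x += dt * (f_true(x) + u)                       # plant step (Euler)

print(abs(x - np.sin(20000 * dt)))                  # tracking error after adaptation
```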
♻ ☆ Contingency Model-based Control (CMC) for Communicationless Cooperative Collision Avoidance in Robot Swarms
Cooperative collision avoidance between robots in swarm operations remains an open challenge. Assuming a decentralized architecture, each robot is responsible for making its own control decisions, including motion planning. To this end, most existing approaches rely on some form of (wireless) communication between the agents of the swarm. In reality, however, communication is brittle. It may be affected by latency, delays, packet losses, and transmission faults, and is subject to adversarial attacks such as jamming or spoofing. This paper proposes Contingency Model-based Control (CMC) as a communicationless alternative. It follows the implicit cooperation paradigm, under which the design of the robots is based on consensual (offline) rules, similar to traffic rules. They include the definition of a contingency trajectory for each robot, and a method for the construction of mutual collision avoidance constraints. The setup is shown to guarantee recursive feasibility and collision avoidance between all swarm members in closed-loop operation. Moreover, CMC naturally satisfies the Plug & Play paradigm, i.e., for new robots entering the swarm. Two numerical examples demonstrate that the collision avoidance guarantee is intact and that the robot swarm operates smoothly under the CMC regime.
♻ ☆ Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models
The integration of Large Language Models (LLMs) into robotics has unlocked unprecedented capabilities in high-level task planning. However, most current systems operate in an open-loop fashion, where LLMs act as one-shot planners, rendering them brittle and unable to adapt to unforeseen circumstances in dynamic physical environments. To overcome this limitation, this paper introduces the "Think, Act, Learn" (T-A-L) framework, a novel architecture that enables an embodied agent to autonomously learn and refine its policies through continuous interaction. Our framework establishes a closed-loop cycle where an LLM first "thinks" by decomposing high-level commands into actionable plans. The robot then "acts" by executing these plans while gathering rich, multimodal sensory feedback. Critically, the "learn" module processes this feedback to facilitate LLM-driven self-reflection, allowing the agent to perform causal analysis on its failures and generate corrective strategies. These insights are stored in an experiential memory to guide future planning cycles. We demonstrate through extensive experiments in both simulation and the real world that our T-A-L agent significantly outperforms baseline methods, including open-loop LLMs, Behavioral Cloning, and traditional Reinforcement Learning. Our framework achieves over a 97% success rate on complex, long-horizon tasks, converges to a stable policy in an average of just 9 trials, and exhibits remarkable generalization to unseen tasks. This work presents a significant step towards developing more robust, adaptive, and truly autonomous robotic agents.
comment: 13 pages, 7 figures
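The closed-loop cycle this abstract describes reduces to a short control loop. The sketch below is structural only: the `llm_*` functions and robot interface are mocked stand-ins, not the paper's prompts or APIs.

```python
# "Think, Act, Learn" as a loop; all three stage functions are mocks.

experiential_memory = []   # corrective strategies distilled from past failures

def llm_think(command, memory):
    """Decompose a command into a plan, conditioned on past lessons (mocked)."""
    return [f"plan for '{command}' using {len(memory)} stored lessons"]

def robot_act(plan):
    """Execute and return multimodal feedback (mocked: succeeds after 2 lessons)."""
    return {"success": len(experiential_memory) >= 2,
            "observation": "gripper slipped on the handle"}

def llm_learn(plan, feedback):
    """Self-reflection: causal failure analysis -> corrective strategy (mocked)."""
    return f"regrasp closer to the hinge (saw: {feedback['observation']})"

def run_episode(command, max_trials=9):
    for trial in range(1, max_trials + 1):
        plan = llm_think(command, experiential_memory)            # Think
        feedback = robot_act(plan)                                # Act
        if feedback["success"]:
            return trial
        experiential_memory.append(llm_learn(plan, feedback))    # Learn
    return None

print("converged after trials:", run_episode("open the cabinet door"))
```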
♻ ☆ ReSemAct: Advancing Fine-Grained Robotic Manipulation via Semantic Structuring and Affordance Refinement
Fine-grained robotic manipulation requires grounding natural language into appropriate affordance targets. However, most existing methods driven by foundation models often compress rich semantics into oversimplified affordances, preventing exploitation of implicit semantic information. To address these challenges, we present ReSemAct, a novel unified manipulation framework that introduces Semantic Structuring and Affordance Refinement (SSAR), powered by the automated synergistic reasoning between Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs). Specifically, the Semantic Structuring module derives a unified semantic affordance description from natural language and RGB observations, organizing affordance regions, implicit functional intent, and coarse affordance anchors into a structured representation for downstream refinement. Building upon this specification, the Affordance Refinement strategy instantiates two complementary flows that separately specialize geometry and position, yielding fine-grained affordance targets. These refined targets are then encoded as real-time joint-space optimization objectives, enabling reactive and robust manipulation in dynamic environments. Extensive simulation and real-world experiments are conducted in semantically rich household and sparse chemical lab environments. The results demonstrate that ReSemAct performs diverse tasks under zero-shot conditions, showcasing the robustness of SSAR with foundation models in fine-grained manipulation. Code and videos at https://github.com/scy-v/ReSemAct and https://resemact.github.io.
comment: Code and videos: https://github.com/scy-v/ReSemAct and https://resemact.github.io
♻ ☆ LidarDM: Generative LiDAR Simulation in a Generated World
We present LidarDM, a novel LiDAR generative model capable of producing realistic, layout-aware, physically plausible, and temporally coherent LiDAR videos. LidarDM stands out with two unprecedented capabilities in LiDAR generative modeling: (i) LiDAR generation guided by driving scenarios, offering significant potential for autonomous driving simulations, and (ii) 4D LiDAR point cloud generation, enabling the creation of realistic and temporally coherent sequences. At the heart of our model is a novel integrated 4D world generation framework. Specifically, we employ latent diffusion models to generate the 3D scene, combine it with dynamic actors to form the underlying 4D world, and subsequently produce realistic sensory observations within this virtual environment. Our experiments indicate that our approach outperforms competing algorithms in realism, temporal coherency, and layout consistency. We additionally show that LidarDM can be used as a generative world model simulator for training and testing perception models.
♻ ☆ UniTacHand: Unified Spatio-Tactile Representation for Human to Robotic Hand Skill Transfer
Tactile sensing is crucial for robotic hands to achieve human-level dexterous manipulation, especially in scenarios with visual occlusion. However, its application is often hindered by the difficulty of collecting large-scale real-world robotic tactile data. In this study, we propose to collect low-cost human manipulation data using haptic gloves for tactile-based robotic policy learning. The misalignment between human and robotic tactile data makes it challenging to transfer policies learned from human data to robots. To bridge this gap, we propose UniTacHand, a unified representation to align robotic tactile information captured by dexterous hands with human hand touch obtained from gloves. First, we project tactile signals from both human hands and robotic hands onto a morphologically consistent 2D surface space of the MANO hand model. This unification standardizes the heterogeneous data structures and inherently embeds the tactile signals with spatial context. Then, we introduce a contrastive learning method to align them into a unified latent space, trained on only 10 minutes of paired data from our data collection system. Our approach enables zero-shot tactile-based policy transfer from humans to a real robot, generalizing to objects unseen in the pre-training data. We also demonstrate that co-training on mixed data, including both human and robotic demonstrations via UniTacHand, yields better performance and data efficiency compared with using only robotic data. UniTacHand paves a path toward general, scalable, and data-efficient learning for tactile-based dexterous hands.
comment: The first two authors contributed equally
♻ ☆ Developing a Fundamental Diagram for Urban Air Mobility Based on Physical Experiments
Urban Air Mobility (UAM) is an emerging application of unmanned aerial vehicles (UAVs) that promises to reduce travel time and alleviate congestion in urban transportation systems. As drone density increases, UAM operations are expected to experience congestion similar to that in ground traffic. However, the fundamental characteristics of UAM traffic flow, particularly under real-world operating conditions, remain poorly understood. This study proposes a general framework for constructing the fundamental diagram (FD) of UAM traffic by integrating theoretical analysis with physical experiments. To the best of our knowledge, this is the first study to derive a UAM FD using real-world physical test data. On the theoretical side, we design two drone control laws for collision avoidance and develop simulation-based traffic generation methods to produce diverse UAM traffic scenarios. Based on Edie's definition, traffic flow theory is then applied to construct the FD and characterize the macroscopic properties of UAM traffic. To account for real-world disturbances and modeling uncertainties, we further conduct physical experiments on a reduced-scale testbed using Bitcraze Crazyflie drones. Both simulation and physical test trajectory data are collected and organized into the UAMTra2Flow dataset, which is analyzed using the proposed framework. Preliminary results indicate that classical FD structures for ground transportation are also applicable to UAM systems. Notably, FD curves obtained from physical experiments exhibit deviations from simulation-based results, highlighting the importance of experimental validation. Finally, results from the reduced-scale testbed are scaled to realistic operating conditions to provide practical insights for future UAM traffic systems. The dataset and code for this paper are publicly available at https://github.com/CATS-Lab/UAM-FD.
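For readers unfamiliar with Edie's generalized definitions referenced above, a minimal sketch (with made-up trajectory numbers) shows how flow, density, and space-mean speed are computed over a time-space region.

```python
# Edie's definitions: over a time-space region A of corridor length L and
# duration T, flow q is total distance traveled divided by |A| = L*T, and
# density k is total time spent divided by |A|. Segment data is illustrative.

L_region, T_region = 100.0, 60.0     # corridor length [m], observation window [s]
area = L_region * T_region

# (distance traveled inside A [m], time spent inside A [s]) per drone
segments = [(80.0, 55.0), (95.0, 60.0), (40.0, 30.0)]

q = sum(d for d, _ in segments) / area   # flow [drones/s]
k = sum(t for _, t in segments) / area   # density [drones/m]
v = q / k                                # space-mean speed [m/s]
print(f"q={q:.4f} /s, k={k:.4f} /m, v={v:.2f} m/s")
```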
♻ ☆ CHARM: Considering Human Attributes for Reinforcement Modeling
Reinforcement Learning from Human Feedback has recently achieved significant success in various fields, and its performance is highly related to feedback quality. While much prior work acknowledged that human teachers' characteristics would affect human feedback patterns, little work has closely investigated the actual effects. In this work, we designed an exploratory study investigating how human feedback patterns are associated with human characteristics. We conducted a public space study with two long-horizon tasks and 46 participants. We found that feedback patterns are not only correlated with task statistics, such as rewards, but also with participants' characteristics, especially robot experience and educational background. Additionally, we demonstrated that human feedback value can be predicted more accurately with human characteristics than with task statistics alone. All human feedback and characteristics we collected, along with the code for our data collection and feedback prediction, are available at https://github.com/AABL-Lab/CHARM
♻ ☆ How Much Progress Did I Make? An Unexplored Human Feedback Signal for Teaching Robots
Enhancing the expressiveness of human teaching is vital both for improving robots' learning from humans and for the human-teaching-robot experience. In this work, we characterize and test a little-used teaching signal: progress, designed to represent the completion percentage of a task. We conducted two online studies with 76 crowd-sourced participants and one public space study with 40 non-expert participants to validate the capability of this progress signal. We find that progress indicates whether the task is successfully performed, reflects the degree of task completion, identifies unproductive but harmless behaviors, and is likely to be more consistent across participants. Furthermore, our results show that giving progress requires no extra workload or time. An additional contribution of our work is a dataset of 40 non-expert demonstrations from the public space study of an ice cream topping-adding task, which we observe to be multi-policy and sub-optimal, with sub-optimality stemming not only from teleoperation errors but also from exploratory actions and attempts. The dataset is available at https://github.com/TeachingwithProgress/Non-Expert_Demonstrations
comment: 8 pages. RO-MAN 2024
♻ ☆ Model-free source seeking of exponentially convergent unicycle: theoretical and robotic experimental results
This paper introduces a novel model-free, real-time unicycle-based source seeking design. The design autonomously steers the unicycle dynamic system toward the extremum point of an objective function or physical/scalar signal whose analytical expression is unknown but which is accessible via measurements. A key contribution of this paper is that the introduced design converges exponentially to the extremum point of objective functions (or scalar signals) that behave locally like a higher-degree power function (e.g., a fourth-degree polynomial), as opposed to the locally quadratic objective functions usual in the literature. We provide theoretical results and design characterization, supported by a variety of simulation results that demonstrate the robustness of the proposed design, including cases with different initial conditions and measurement delays/noise. Also, for the first time in the literature, we provide experimental robotic results that demonstrate the effectiveness of the proposed design and its exponential convergence.
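As background, the classic perturbation-based extremum-seeking loop, which the paper improves upon with its exponentially convergent unicycle design, fits in a few lines. The quartic map and all gains below are illustrative, and this textbook scheme converges only asymptotically, not exponentially.

```python
import numpy as np

# Textbook 1D extremum seeking: dither the input, demodulate the measured
# output with the same sinusoid, and integrate. NOT the paper's design.

J = lambda u: -((u - 3.0) ** 4)        # quartic peak at u = 3 (locally non-quadratic)
a, w, k, dt = 0.2, 50.0, 1.0, 1e-3     # dither amplitude/frequency, gain, time step
u_hat = 2.0                            # initial parameter estimate

for step in range(600000):             # 600 s of simulated time
    t = step * dt
    y = J(u_hat + a * np.sin(w * t))       # sampled objective with dither
    u_hat += dt * k * np.sin(w * t) * y    # demodulation averages to a gradient step

print("estimate:", u_hat)              # slowly approaches the maximizer u = 3
```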
♻ ☆ MambaIO: Global-Coordinate Inertial Odometry for Pedestrians via Multi-Scale Frequency-Decoupled Modeling
Inertial Odometry (IO) enables real-time localization using only acceleration and angular velocity measurements from an Inertial Measurement Unit (IMU), making it a promising solution for localization in consumer-grade applications. Traditionally, researchers have routinely transformed IMU measurements into the global frame to obtain smoother motion representations. However, recent studies in drone scenarios have demonstrated that the body frame can significantly improve localization accuracy, prompting a re-evaluation of the suitability of the global frame for pedestrian IO. To address this issue, this paper systematically evaluates the effectiveness of the global frame in pedestrian IO through theoretical analysis, qualitative inspection, and quantitative experiments. Building upon these findings, we further propose MambaIO, which decomposes IMU measurements into high-frequency and low-frequency components using a Laplacian pyramid. The low-frequency component is processed by a Mamba architecture to extract implicit contextual motion cues, while the high-frequency component is handled by a convolutional structure to capture fine-grained local motion details. Experiments on multiple public datasets show that MambaIO substantially reduces localization error and achieves state-of-the-art (SOTA) performance. To the best of our knowledge, this is the first application of the Mamba architecture to the IO task.
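The frequency split at the heart of MambaIO can be illustrated with a one-level Laplacian-pyramid decomposition of a synthetic IMU channel; the 200 Hz rate and smoothing scale are assumed values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# One-level Laplacian-pyramid split: smooth for the low band, subtract for
# the high band. The split is exactly invertible by construction.

fs = 200.0                                    # assumed IMU rate [Hz]
t = np.arange(0, 5, 1 / fs)
acc = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.sin(2 * np.pi * 25 * t)

low = gaussian_filter1d(acc, sigma=8.0)   # contextual motion trend -> Mamba branch
high = acc - low                          # fine-grained detail -> convolutional branch

assert np.allclose(low + high, acc)
```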
♻ ☆ Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation
Fatigue detection for human operators plays a key role in safety-critical applications such as aviation, mining, and long-haul transport. While numerous studies have demonstrated the effectiveness of high-fidelity sensors in controlled laboratory environments, their performance often degrades when ported to real-world settings due to noise, lighting conditions, and field-of-view constraints, thereby limiting their practicality. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this challenge, we propose leveraging knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Building on this idea, we design a heterogeneous, multi-source fatigue detection framework that adaptively utilizes the available modalities in the target domain while exploiting diverse configurations in the source domains through cross-domain alignment and modality imputation. Our experiments, conducted using a field-deployed sensor setup and two publicly available human fatigue datasets, demonstrate the practicality, robustness, and improved generalization of our approach across subjects and domains. The proposed method achieves consistent gains over strong baselines in sensor-constrained scenarios. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
comment: 14 pages, 4 figures
♻ ☆ Towards autonomous photogrammetric forest inventory using a lightweight under-canopy robotic drone
Drones are increasingly used in forestry to capture high-resolution remote sensing data, supporting enhanced monitoring, assessment, and decision-making processes. While operations above the forest canopy are already highly automated, flying inside forests remains challenging, primarily relying on manual piloting. In dense forests, relying on the Global Navigation Satellite System (GNSS) for localization is not feasible. In addition, the drone must autonomously adjust its flight path to avoid collisions. Recently, advancements in robotics have enabled autonomous drone flights in GNSS-denied obstacle-rich areas. In this article, a step towards autonomous forest data collection is taken by building a prototype of a robotic under-canopy drone utilizing state-of-the-art open source methods and validating its performance for data collection inside forests. Specifically, the study focused on camera-based autonomous flight under the forest canopy and photogrammetric post-processing of the data collected with the low-cost onboard stereo camera. The autonomous flight capability of the prototype was evaluated through multiple test flights in boreal forests. The tree parameter estimation capability was studied by performing diameter at breast height (DBH) estimation. The prototype successfully carried out flights in selected challenging forest environments, and the experiments showed promising performance in forest 3D modeling with a miniaturized stereoscopic photogrammetric system. The DBH estimation achieved a root mean square error (RMSE) of 3.33 - 3.97 cm (10.69 - 12.98 %) across all trees. For trees with a DBH less than 30 cm, the RMSE was 1.16 - 2.56 cm (5.74 - 12.47 %). The results provide valuable insights into autonomous under-canopy forest mapping and highlight the critical next steps for advancing lightweight robotic drone systems for mapping complex forest environments.
comment: This is an Accepted Manuscript of an article published by Taylor & Francis in International Journal of Remote Sensing on 03 Dec 2025. The published final Version of Record is openly available at: https://doi.org/10.1080/01431161.2025.2579803
♻ ☆ Robust Robotic Exploration and Mapping Using Generative Occupancy Map Synthesis
We present a novel approach for enhancing robotic exploration by using generative occupancy mapping. We implement SceneSense, a diffusion model designed and trained for predicting 3D occupancy maps given partial observations. Our proposed approach probabilistically fuses these predictions into a running occupancy map in real-time, resulting in significant improvements in map quality and traversability. We deploy SceneSense on a quadruped robot and validate its performance with real-world experiments to demonstrate the effectiveness of the model. In these experiments we show that occupancy maps enhanced with SceneSense predictions better estimate the distribution of our fully observed ground truth data (24.44% FID improvement around the robot and 75.59% improvement at range). We additionally show that integrating SceneSense-enhanced maps into our robotic exploration stack as a "drop-in" map improvement, utilizing an existing off-the-shelf planner, results in improvements in robustness and traversability time. Finally, we show results of full exploration evaluations with our proposed system in two dissimilar environments and find that locally enhanced maps provide more consistent exploration results than maps constructed only from direct sensor measurements.
comment: arXiv admin note: text overlap with arXiv:2409.10681
Robotics 13
☆ APOLLO Blender: A Robotics Library for Visualization and Animation in Blender
High-quality visualizations are an essential part of robotics research, enabling clear communication of results through figures, animations, and demonstration videos. While Blender is a powerful and freely available 3D graphics platform, its steep learning curve and lack of robotics-focused integrations make it difficult and time-consuming for researchers to use effectively. In this work, we introduce a lightweight software library that bridges this gap by providing simple scripting interfaces for common robotics visualization tasks. The library offers three primary capabilities: (1) importing robots and environments directly from standardized descriptions such as URDF; (2) Python-based scripting tools for keyframing robot states and visual attributes; and (3) convenient generation of primitive 3D shapes for schematic figures and animations. Together, these features allow robotics researchers to rapidly create publication-ready images, animations, and explanatory schematics without needing extensive Blender expertise. We demonstrate the library through a series of proof-of-concept examples and conclude with a discussion of current limitations and opportunities for future extensions.
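Since the abstract does not show the library's interface, the sketch below is purely hypothetical: the package name and every function are invented here to illustrate the three advertised capabilities, and it would run only inside Blender's Python environment with the real library installed.

```python
import apollo_blender as ab   # hypothetical package name, invented for illustration

robot = ab.import_urdf("panda.urdf")             # (1) robot from a URDF description
ab.import_environment("tabletop_scene.glb")

trajectory = [[0.1 * i] * 7 for i in range(60)]  # toy 7-DoF joint trajectory
for frame, q in enumerate(trajectory):           # (2) keyframe robot states
    robot.set_joint_positions(q)
    robot.keyframe(frame=frame * 2)

goal = ab.primitives.sphere(radius=0.03,         # (3) primitive shape for a schematic
                            location=(0.4, 0.1, 0.6),
                            color=(0.9, 0.2, 0.2, 1.0))
goal.keyframe_visibility(start=0, end=120)
```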
☆ Embodied Learning of Reward for Musculoskeletal Control with Vision Language Models
Discovering effective reward functions remains a fundamental challenge in motor control of high-dimensional musculoskeletal systems. While humans can describe movement goals explicitly, such as "walking forward with an upright posture," the underlying control strategies that realize these goals are largely implicit, making it difficult to directly design rewards from high-level goals and natural language descriptions. We introduce Motion from Vision-Language Representation (MoVLR), a framework that leverages vision-language models (VLMs) to bridge the gap between goal specification and movement control. Rather than relying on handcrafted rewards, MoVLR explores the reward space through iterative interaction between control optimization and VLM feedback, aligning control policies with physically coordinated behaviors. Our approach transforms language- and vision-based assessments into structured guidance for embodied learning, enabling the discovery and refinement of reward functions for high-dimensional musculoskeletal locomotion and manipulation. These results suggest that VLMs can effectively ground abstract motion descriptions in the implicit principles governing physiological motor control.
comment: 18 pages, 8 figures
☆ Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives
Recent advances in vision, language, and multimodal learning have substantially accelerated progress in robotic foundation models, with robot manipulation remaining a central and challenging problem. This survey examines robot manipulation from an algorithmic perspective and organizes recent learning-based approaches within a unified abstraction of high-level planning and low-level control. At the high level, we extend the classical notion of task planning to include reasoning over language, code, motion, affordances, and 3D representations, emphasizing their role in structured and long-horizon decision making. At the low level, we propose a training-paradigm-oriented taxonomy for learning-based control, organizing existing methods along input modeling, latent representation learning, and policy learning. Finally, we identify open challenges and prospective research directions related to scalability, data efficiency, multimodal physical interaction, and safety. Together, these analyses aim to clarify the design space of modern foundation models for robotic manipulation.
comment: This work is a re-architected core derived from the full survey (arXiv:2510.10903) , refined to highlight the most central themes and representative studies
☆ PreGME: Prescribed Performance Control of Aerial Manipulators based on Variable-Gain ESO
An aerial manipulator, comprising a multirotor base and a robotic arm, is subject to significant dynamic coupling between these two components. Achieving precise and robust motion control is therefore a challenging yet important objective. Here, we propose a novel prescribed performance motion control framework based on variable-gain extended state observers (ESOs), referred to as PreGME. The method includes variable-gain ESOs for real-time estimation of the dynamic coupling and a prescribed-performance flight controller that incorporates error-trajectory constraints. Compared with existing methods, the proposed approach has two characteristics. First, the adopted variable-gain ESOs can accurately estimate rapidly varying dynamic coupling, enabling the method to handle manipulation tasks that require aggressive motion of the robotic arm. Second, by prescribing the performance, a preset error trajectory is generated to guide the system evolution, ensuring that the tracking error remains within the prescribed performance envelope and thereby achieving high-precision control. Experiments on a real platform, including aerial staff twirling, aerial mixology, and aerial cart-pulling, validate the effectiveness of the proposed method. Experimental results demonstrate that even under the dynamic coupling caused by rapid robotic arm motion (end-effector velocity: 1.02 m/s, acceleration: 5.10 m/s$^2$), the proposed method achieves high tracking performance.
comment: 12 pages, 6 figures
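A fixed-gain, textbook linear ESO conveys the estimation idea; PreGME's contribution is scheduling the observer gains online, which is not reproduced here. The plant, bandwidth, and disturbance below are illustrative.

```python
import numpy as np

# Third-order linear ESO for y'' = f(t) + b*u: z1 ~ y, z2 ~ y', z3 ~ f, where
# f is the "total disturbance" (standing in for the arm-on-base coupling).

dt, b, wo = 1e-3, 1.0, 60.0
b1, b2, b3 = 3 * wo, 3 * wo**2, wo**3      # bandwidth parameterization of the gains

x = np.array([0.0, 0.0])                   # true position and velocity
z = np.zeros(3)                            # observer state [y_hat, ydot_hat, f_hat]
f = lambda t: 2.0 * np.sin(2.0 * t)        # unknown, time-varying coupling
errs = []

for k in range(6000):
    t = k * dt
    u = 0.0                                # open loop: we only test estimation
    x = x + dt * np.array([x[1], f(t) + b * u])   # plant step (Euler)
    e = x[0] - z[0]                               # innovation from y = x1
    z = z + dt * np.array([z[1] + b1 * e,
                           z[2] + b * u + b2 * e,
                           b3 * e])
    if t > 5.0:                            # after the transient has settled
        errs.append(z[2] - f(t))

# Small residual relative to the disturbance amplitude of 2 (slight phase lag)
print("disturbance-estimate RMS error:", np.sqrt(np.mean(np.square(errs))))
```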
☆ P-FABRIK: A General Intuitive and Robust Inverse Kinematics Method for Parallel Mechanisms Using FABRIK Approach
Traditional geometric inverse kinematics methods for parallel mechanisms rely on specific spatial geometry constraints. However, their application to redundant parallel mechanisms is challenging due to the increased constraint complexity. Moreover, such methods output no solution, and can cause unpredictable control problems, when the target pose lies outside the workspace. To tackle these issues, this work proposes P-FABRIK, a general, intuitive, and robust inverse kinematics method that finds a feasible solution for diverse parallel mechanisms based on the FABRIK algorithm. By decomposing a general parallel mechanism into multiple serial sub-chains using a new topological decomposition strategy, the end targets of each sub-chain can be revised to calculate inverse kinematics solutions iteratively. Multiple case studies involving planar, standard, and redundant parallel mechanisms demonstrate the proposed method's generality across diverse parallel mechanisms. Furthermore, numerical simulation studies verify its efficacy and computational efficiency, as well as its robustness in handling out-of-workspace targets.
comment: 7 pages, 8 figures, and 2 tables
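For reference, the serial-chain FABRIK iteration that P-FABRIK builds on looks as follows; the paper's parallel-mechanism decomposition is not reproduced, and the toy 2D chain is an assumption. Note the graceful handling of unreachable targets, the failure mode the abstract highlights.

```python
import numpy as np

# Classic FABRIK: backward pass slides joints toward the target, forward pass
# re-roots the chain at the base; segment lengths are preserved throughout.

def fabrik(joints, target, tol=1e-5, max_iter=100):
    joints = [np.asarray(j, float) for j in joints]
    target = np.asarray(target, float)
    lengths = [np.linalg.norm(joints[i + 1] - joints[i])
               for i in range(len(joints) - 1)]
    base = joints[0].copy()
    if np.linalg.norm(target - base) > sum(lengths):   # unreachable: just stretch
        for i, L in enumerate(lengths):
            d = target - joints[i]
            joints[i + 1] = joints[i] + L * d / np.linalg.norm(d)
        return joints
    for _ in range(max_iter):
        joints[-1] = target.copy()                     # backward pass
        for i in range(len(joints) - 2, -1, -1):
            d = joints[i] - joints[i + 1]
            joints[i] = joints[i + 1] + lengths[i] * d / np.linalg.norm(d)
        joints[0] = base.copy()                        # forward pass
        for i in range(len(joints) - 1):
            d = joints[i + 1] - joints[i]
            joints[i + 1] = joints[i] + lengths[i] * d / np.linalg.norm(d)
        if np.linalg.norm(joints[-1] - target) < tol:
            break
    return joints

chain = fabrik([(0, 0), (1, 0), (2, 0), (3, 0)], (1.5, 1.5))
print("end effector:", chain[-1])
```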
☆ The body is not there to compute: Comment on "Informational embodiment: Computational role of information structure in codes and robots" by Pitti et al
Applying the lens of computation and information has been instrumental in driving the technological progress of our civilization, as well as in empowering our understanding of the world around us. The digital computer was, and for many still is, the leading metaphor for how our mind operates. Information theory (IT) has also been important in our understanding of how nervous systems encode and process information. The target article applies information and computation to bodies: to understand why they have evolved in particular ways (animal bodies) and to design optimal bodies (robots). In this commentary, I argue that the main role of bodies is not to compute.
comment: Comment on Pitti, A., Austin, M., Nakajima, K., & Kuniyoshi, Y. (2025). Informational Embodiment: Computational role of information structure in codes and robots. Physics of Life Reviews 53, 262-276. https://doi.org/10.1016/j.plrev.2025.03.018. Also available as arXiv:2408.12950
☆ MUSON: A Reasoning-oriented Multimodal Dataset for Socially Compliant Navigation in Urban Environments
Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models' ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation annotations. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at https://huggingface.co/datasets/MARSLab/MUSON
☆ Two-Robot Computational Landscape: A Complete Characterization of Model Power in Minimal Mobile Robot Systems
The computational power of autonomous mobile robots under the Look-Compute-Move (LCM) model has been widely studied through an extensive hierarchy of robot models defined by the presence of memory, communication, and synchrony assumptions. While the general n-robot landscape has been largely established, the exact structure for two robots has remained unresolved. This paper presents the first complete characterization of the computational power of two autonomous robots across all major models, namely OBLOT, FSTA, FCOM, and LUMI, under the full spectrum of schedulers (FSYNCH, SSYNCH, ASYNCH, and their atomic variants). Our results reveal a landscape that fundamentally differs from the general case. Most notably, we prove that FSTA^F and LUMI^F coincide under full synchrony, a surprising collapse indicating that perfect synchrony can substitute both memory and communication when only two robots exist. We also show that FSTA and FCOM are orthogonal: there exists a problem solvable in the weakest communication model but impossible even in the strongest finite-state model, completing the bidirectional incomparability. All equivalence and separation results are derived through a novel simulation-free method, providing a unified and constructive view of the two-robot hierarchy. This yields the first complete and exact computational landscape for two robots, highlighting the intrinsic challenges of coordination at the minimal scale.
comment: 23 pages, 3 figures
☆ Active Constraint Learning in High Dimensions from Demonstrations
We present an iterative active constraint learning (ACL) algorithm, within the learning from demonstrations (LfD) paradigm, which intelligently solicits informative demonstration trajectories for inferring an unknown constraint in the demonstrator's environment. Our approach iteratively trains a Gaussian process (GP) on the available demonstration dataset to represent the unknown constraints, uses the resulting GP posterior to query start/goal states, and generates informative demonstrations that are added to the dataset. Across simulation and hardware experiments with high-dimensional nonlinear dynamics and unknown nonlinear constraints, our method outperforms a baseline random-sampling method at accurately performing constraint inference from an iteratively generated set of sparse but informative demonstrations.
comment: Under review, 25 pages, 11 figures
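The iterative loop reduces to a familiar fit-then-query pattern. Below is a minimal sketch with a toy 2D constraint; the paper works in high dimensions and queries start/goal states that elicit full demonstration trajectories, which this sketch collapses into point observations.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

g = lambda X: X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0      # toy unknown constraint g(x) <= 0

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(5, 2))                  # states from initial demos
y = g(X)

for it in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  normalize_y=True).fit(X, y)
    cand = rng.uniform(-2, 2, size=(500, 2))         # candidate query states
    _, std = gp.predict(cand, return_std=True)
    x_new = cand[np.argmax(std)]                     # most uncertain -> most informative
    X = np.vstack([X, x_new])                        # a demonstrator would now provide
    y = np.append(y, g(x_new[None])[0])              # a trajectory through x_new

print("posterior max-std after active queries:", std.max())
```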
☆ Coverage Navigation System for Non-Holonomic Vehicles in Outdoor Environments
In mobile robotics, coverage navigation refers to the deliberate movement of a robot with the purpose of covering a certain area or volume. Performing this task properly is fundamental for several activities, for instance, cleaning a facility with a robotic vacuum cleaner. In the mining industry, coverage is required in several unit processes related to material movement using industrial machinery, for example, in cleaning tasks, in dumps, and in the construction of tailings dam walls. The automation of these processes is fundamental to enhancing the safety associated with their execution. In this work, a coverage navigation system for a non-holonomic robot is presented, intended as a proof of concept for the potential automation of various unit processes that require coverage navigation, such as those mentioned above. The developed system includes the calculation of routes that allow a mobile platform to cover a specific area, and incorporates recovery behaviors in case unforeseen events occur, such as the appearance of dynamic or previously unmapped obstacles in the terrain to be covered (e.g., other machines or pedestrians passing through the area); the system can perform evasive maneuvers and recover afterwards to ensure complete coverage of the terrain. The system was tested in different simulated and real outdoor environments, achieving close to 90% coverage in the majority of experiments. The next step of development is to scale the robot up to a mining machine/vehicle whose operation will be validated in a real environment. The result of one of the tests performed in the real world can be seen in the video available at https://youtu.be/gK7_3bK1P5g.
comment: 13 pages, in Spanish language, 12 figures, accepted at Tercer Congreso Iberoamericano de Minería Subterranea y a Cielo Abierto, UMining 2024
♻ ☆ Effective Game-Theoretic Motion Planning via Nested Search
To facilitate effective, safe deployment in the real world, individual robots must reason about interactions with other agents, which often occur without explicit communication. Recent work has identified game theory, particularly the concept of Nash Equilibrium (NE), as a key enabler for behavior-aware decision-making. Yet, existing work falls short of fully unleashing the power of game-theoretic reasoning. Specifically, popular optimization-based methods require simplified robot dynamics and tend to get trapped in local minima due to convexification. Other works that rely on payoff matrices suffer from poor scalability due to the explicit enumeration of all possible trajectories. To bridge this gap, we introduce Game-Theoretic Nested Search (GTNS), a novel, scalable, and provably correct approach for computing NEs in general dynamical systems. GTNS efficiently searches the action space of all agents involved, while discarding trajectories that violate the NE constraint (no unilateral deviation) through an inner search over a lower-dimensional space. Our algorithm enables explicit selection among equilibria by utilizing a user-specified global objective, thereby capturing a rich set of realistic interactions. We demonstrate the approach on a variety of autonomous driving and racing scenarios where we achieve solutions in mere seconds on commodity hardware.
comment: Updated version. Additional experiment included. Cosmetic/formatting changes made
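The Nash-equilibrium condition GTNS enforces, that no agent gains by unilateral deviation, is easy to state on a small payoff matrix. GTNS itself searches trajectory spaces with a nested lower-dimensional search rather than enumerating matrices; this generic sketch only shows the acceptance test.

```python
import numpy as np

# payoff[i][a1, a2]: reward to agent i when agents play actions (a1, a2)
payoff = [np.array([[3, 0], [5, 1]]),      # agent 1 (row player)
          np.array([[3, 5], [0, 1]])]      # agent 2 (column player)

def is_pure_nash(a1, a2):
    no_dev_1 = payoff[0][a1, a2] >= payoff[0][:, a2].max()   # agent 1 can't gain
    no_dev_2 = payoff[1][a1, a2] >= payoff[1][a1, :].max()   # agent 2 can't gain
    return no_dev_1 and no_dev_2

equilibria = [(a1, a2) for a1 in range(2) for a2 in range(2) if is_pure_nash(a1, a2)]
print(equilibria)   # [(1, 1)] -- the prisoner's-dilemma-style equilibrium
```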
♻ ☆ Relative Localization System Design for SnailBot: A Modular Self-reconfigurable Robot
This paper presents the design and implementation of a relative localization system for SnailBot, a modular self-reconfigurable robot. The system integrates ArUco marker recognition, optical flow analysis, and IMU data processing into a unified fusion framework, enabling robust and accurate relative positioning for collaborative robotic tasks. Experimental validation demonstrates the effectiveness of the system in real-time operation, with a rule-based fusion strategy ensuring reliability across dynamic scenarios. The results highlight the potential for scalable deployment in modular robotic systems.
comment: The design presented in the article does not correspond to the actual situation
♻ ☆ Adaptive Keyframe Selection for Scalable 3D Scene Reconstruction in Dynamic Environments
In this paper, we propose an adaptive keyframe selection method for improved 3D scene reconstruction in dynamic environments. The proposed method integrates two complementary modules: an error-based selection module utilizing photometric and structural similarity (SSIM) errors, and a momentum-based update module that dynamically adjusts keyframe selection thresholds according to scene motion dynamics. By dynamically curating the most informative frames, our approach addresses a key data bottleneck in real-time perception. This allows for the creation of high-quality 3D world representations from a compressed data stream, a critical step towards scalable robot learning and deployment in complex, dynamic environments. Experimental results demonstrate significant improvements over traditional static keyframe selection strategies, such as fixed temporal intervals or uniform frame skipping. These findings highlight a meaningful advancement toward adaptive perception systems that can dynamically respond to complex and evolving visual scenes. We evaluate our proposed adaptive keyframe selection module on two recent state-of-the-art 3D reconstruction networks, Spann3r and CUT3R, and observe consistent improvements in reconstruction quality across both frameworks. Furthermore, an extensive ablation study confirms the effectiveness of each individual component in our method, underlining their contribution to the overall performance gains.
comment: Accepted at ROBOVIS 2026
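The two modules can be sketched directly; the 0.5 error weighting and 0.9 momentum constant below are assumptions, not the paper's tuned values.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Error-based selection (photometric + SSIM error vs. the last keyframe) with
# a momentum-based threshold that drifts toward the recent error level, so the
# selectivity adapts to scene motion dynamics.

class AdaptiveKeyframeSelector:
    def __init__(self, threshold=0.10, momentum=0.9, alpha=0.5):
        self.threshold, self.momentum, self.alpha = threshold, momentum, alpha
        self.keyframes, self.last_kf = [], None

    def error(self, frame):
        photo = np.mean(np.abs(frame - self.last_kf))             # photometric error
        struct = 1.0 - ssim(frame, self.last_kf, data_range=1.0)  # structural error
        return self.alpha * photo + (1 - self.alpha) * struct

    def process(self, frame):
        if self.last_kf is None:
            self.keyframes.append(frame)
            self.last_kf = frame
            return
        err = self.error(frame)
        if err > self.threshold:              # informative enough: keep it
            self.keyframes.append(frame)
            self.last_kf = frame
        self.threshold = self.momentum * self.threshold + (1 - self.momentum) * err

sel = AdaptiveKeyframeSelector()
rng = np.random.default_rng(0)
for _ in range(20):
    sel.process(rng.random((64, 64)))         # stand-in frames in [0, 1]
print("keyframes kept:", len(sel.keyframes))
```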
Robotics 17
☆ Modeling of UAV Tether Aerodynamics for Real-Time Simulation
One of the main limitations of multirotor UAVs is their short flight time due to battery constraints. A practical solution for continuous operation is to power the drone from the ground via a tether. While this approach has been demonstrated for stationary systems, scenarios with a fast-moving base vehicle or strong wind require modeling the tether forces, including aerodynamic effects. In this work, we propose two complementary approaches for real-time quasi-static tether modeling with aerodynamics. The first is an analytical method based on catenary theory with a uniform drag assumption, achieving very fast solve times below 1 ms. The second is a numerical method that discretizes the tether into segments and lumped masses, solving the equilibrium equations using CasADi and IPOPT. By leveraging initialization strategies, such as warm starting and analytical initialization, real-time performance was achieved with a solve time of 5 ms, while allowing for flexible force formulations. Both approaches were validated in real-world tests using a load cell to measure the tether force. The results show that the analytical method provides sufficient accuracy for most tethered UAV applications at minimal computational cost, while the numerical method offers higher flexibility and physical accuracy when required. These approaches form a lightweight and extensible framework for real-time tether simulation, applicable both to offline optimization and to online tasks such as simulation, control, and trajectory planning.
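The drag-free core of the catenary approach can be reproduced in a few lines: fit the catenary parameter from the endpoint geometry and tether length, then read off the tension at the drone. The paper's uniform-drag extension is omitted here, and the span, length, and line weight are assumed values.

```python
import numpy as np
from scipy.optimize import brentq

# Catenary identity: for endpoints separated by (h, v) and arc length L,
# the parameter a satisfies sqrt(L^2 - v^2) = 2*a*sinh(h/(2a)).

h, v, L = 8.0, 5.0, 12.0          # horizontal/vertical separation, tether length [m]
w = 0.05 * 9.81                   # tether weight per unit length [N/m] (50 g/m)

f = lambda a: 2.0 * a * np.sinh(h / (2.0 * a)) - np.sqrt(L**2 - v**2)
a = brentq(f, 0.5, 100.0)         # bracket chosen to contain the root

# Locate the drone-side endpoint in catenary coordinates; tension T = w*a*cosh(x/a)
m = a * np.arcsinh(v / (2.0 * a * np.sinh(h / (2.0 * a))))
xB = m + h / 2.0
T_drone = w * a * np.cosh(xB / a)                  # tension magnitude [N]
angle = np.degrees(np.arctan(np.sinh(xB / a)))     # departure angle from horizontal
print(f"tension at drone: {T_drone:.2f} N at {angle:.1f} deg")
```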
☆ ParaMaP: Parallel Mapping and Collision-free Motion Planning for Reactive Robot Manipulation
Real-time and collision-free motion planning remains challenging for robotic manipulation in unknown environments due to continuous perception updates and the need for frequent online replanning. To address these challenges, we propose a parallel mapping and motion planning framework that tightly integrates Euclidean Distance Transform (EDT)-based environment representation with a sampling-based model predictive control (SMPC) planner. On the mapping side, a dense distance-field-based representation is constructed using a GPU-based EDT and augmented with a robot-masked update mechanism to prevent false self-collision detections during online perception. On the planning side, motion generation is formulated as a stochastic optimization problem with a unified objective function and efficiently solved by evaluating large batches of candidate rollouts in parallel within a SMPC framework, in which a geometrically consistent pose tracking metric defined on SE(3) is incorporated to ensure fast and accurate convergence to the target pose. The entire mapping and planning pipeline is implemented on the GPU to support high-frequency replanning. The effectiveness of the proposed framework is validated through extensive simulations and real-world experiments on a 7-DoF robotic manipulator. More details are available at: https://zxw610.github.io/ParaMaP.
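On the mapping side, the distance-field representation is a standard Euclidean distance transform. A CPU sketch (ParaMaP's GPU implementation and robot-masked update are omitted) shows how a planner can query it as a collision cost for each sampled rollout.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

grid = np.zeros((100, 100), dtype=bool)      # True = occupied
grid[40:60, 50:55] = True                    # a box obstacle
resolution = 0.05                            # meters per cell (assumed)

# Distance from every free cell to the nearest occupied cell, in meters
dist_field = distance_transform_edt(~grid) * resolution

def rollout_collision_cost(path_cells, margin=0.2):
    """Penalize waypoints closer than `margin` meters to any obstacle."""
    d = dist_field[path_cells[:, 0], path_cells[:, 1]]
    return np.sum(np.maximum(0.0, margin - d))

path = np.stack([np.arange(30, 70), np.full(40, 52)], axis=1)  # cuts through the box
print("collision cost:", rollout_collision_cost(path))
```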
☆ VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models
While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at https://vla-arena.github.io.
☆ Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding
Recent Vision-Language-Action (VLA) models have made impressive progress toward general-purpose robotic manipulation by post-training large Vision-Language Models (VLMs) for action prediction. Yet most VLAs entangle perception and control in a monolithic pipeline optimized purely for action, which can erode language-conditioned grounding. In our real-world tabletop tests, policies over-grasp when the target is absent, are distracted by clutter, and overfit to background appearance. To address these issues, we propose OBEYED-VLA (OBject-centric and gEometrY groundED VLA), a framework that explicitly disentangles perceptual grounding from action reasoning. Instead of operating directly on raw RGB, OBEYED-VLA augments VLAs with a perception module that grounds multi-view inputs into task-conditioned, object-centric, and geometry-aware observations. This module includes a VLM-based object-centric grounding stage that selects task-relevant object regions across camera views, along with a complementary geometric grounding stage that emphasizes the 3D structure of these objects over their appearance. The resulting grounded views are then fed to a pretrained VLA policy, which we fine-tune exclusively on single-object demonstrations collected without environmental clutter or non-target objects. On a real-world UR10e tabletop setup, OBEYED-VLA substantially improves robustness over strong VLA baselines across four challenging regimes and multiple difficulty levels: distractor objects, absent-target rejection, background appearance changes, and cluttered manipulation of unseen objects. Ablation studies confirm that both semantic grounding and geometry-aware grounding are critical to these gains. Overall, the results indicate that making perception an explicit, object-centric component is an effective way to strengthen and generalize VLA-based robotic manipulation.
comment: Under review. Project website: https://uark-aicv.github.io/OBEYED_VLA
☆ Topology-Preserving Scalar Field Optimization for Boundary-Conforming Spiral Toolpaths on Multiply Connected Freeform Surfaces
Ball-end milling path planning on multiply connected freeform surfaces is pivotal for high-quality and efficient machining of components in automotive and aerospace manufacturing. Although scalar-field-based optimization provides a unified framework for multi-objective toolpath generation, maintaining boundary conformity while eliminating zero-gradient singularities that cause iso-curve branching or termination and disrupt toolpath continuity remains challenging on multiply connected surfaces. We propose an efficient strategy to robustly enforce these constraints throughout optimization. Conformal slit mapping is employed to construct a feasible, singularity-free initial scalar field. The optimization is reformulated as a topology-preserving mesh deformation governed by boundary-synchronous updates, enabling globally optimized spacing, scallop-height uniformity, and smooth trajectory transitions. Consequently, the toolpaths are continuous, boundary-conforming, and free of self-intersections. Milling experiments demonstrate that, compared with a state-of-the-art conformal slit mapping-based method, the proposed approach increases machining efficiency by 14.24%, improves scallop-height uniformity by 5.70%, and reduces milling impact-induced vibrations by over 10%. The strategy offers broad applicability in high-performance machining scenarios.
comment: 24 pages, 12 figures
☆ Asymmetric Friction in Geometric Locomotion
Geometric mechanics models of locomotion have provided insight into how robots and animals use environmental interactions to convert internal shape changes into displacement through the world, encoding this relationship in a "motility map". A key class of such motility maps arises from (possibly anisotropic) linear drag acting on the system's individual body parts, formally described via Riemannian metrics on the motions of the system's individual body parts. The motility map can then be generated by invoking a sub-Riemannian constraint on the aggregate system motion under which the position velocity induced by a given shape velocity is that which minimizes the power dissipated via friction. The locomotion of such systems is "geometric" in the sense that the final position reached by the system depends only on the sequence of shapes that the system passes through, but not on the rate with which the shape changes are made. In this paper, we consider a far more general class of systems in which the drag may be not only anisotropic (with different coefficients for forward/backward and left/right motions), but also asymmetric (with different coefficients for forward and backward motions). Formally, including asymmetry in the friction replaces the Riemannian metrics on the body parts with Finsler metrics. We demonstrate that the sub-Riemannian approach to constructing the system motility map extends naturally to a sub-Finslerian approach and identify system properties analogous to the constraint curvature of sub-Riemannian systems that allow for the characterization of the system motion capabilities.
comment: 23 pages, 15 figures
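In the symmetric (Riemannian) special case that the paper generalizes, the motility map follows from minimizing a quadratic dissipated power, a standard construction in the geometric-locomotion literature; the block notation below is generic rather than the paper's.

```latex
% Dissipated power as a quadratic form in body velocity \xi and shape
% velocity \dot{\alpha}:
P =
\begin{bmatrix} \xi \\ \dot{\alpha} \end{bmatrix}^{\top}
\begin{bmatrix}
  M_{gg}(\alpha) & M_{g\alpha}(\alpha) \\
  M_{g\alpha}^{\top}(\alpha) & M_{\alpha\alpha}(\alpha)
\end{bmatrix}
\begin{bmatrix} \xi \\ \dot{\alpha} \end{bmatrix}
% Minimizing P over \xi (setting \partial P / \partial \xi = 0) yields the
% motility map as a local connection:
\xi = -\,M_{gg}^{-1}(\alpha)\, M_{g\alpha}(\alpha)\, \dot{\alpha}
    =: -\,\mathbf{A}(\alpha)\, \dot{\alpha}
```

With asymmetric friction, the quadratic form is replaced by a Finsler metric, so the power-minimizing body velocity is in general no longer linear in the shape velocity; characterizing that generalized map is the subject of the paper.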
☆ Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing
Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
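The residual-quantization step can be sketched with random stand-in codebooks; real pose and residual codebooks are learned, and the dimensions are illustrative.

```python
import numpy as np

# Two-stage residual vector quantization (RVQ): a coarse "pose" token first,
# then a fine token quantizing what the first stage failed to capture.

rng = np.random.default_rng(0)
D = 8
pose_codebook = rng.normal(size=(32, D))       # stage 1: coarse pose codes
residual_codebook = rng.normal(size=(64, D))   # stage 2: fine residual codes

def nearest(codebook, x):
    return int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))

z = rng.normal(size=D)                         # a motion latent to tokenize
i1 = nearest(pose_codebook, z)                 # coarse structure token
r = z - pose_codebook[i1]                      # residual left by stage 1
i2 = nearest(residual_codebook, r)             # fine-detail token
z_hat = pose_codebook[i1] + residual_codebook[i2]

print("reconstruction error:", np.linalg.norm(z - z_hat))
# Editing only the pose token while dropping or regenerating the residual
# token keeps the coarse motion, which is what makes pose-level edits
# structure-preserving.
```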
☆ Bugs with Features: Vision-Based Fault-Tolerant Collective Motion Inspired by Nature
In collective motion, perceptually limited individuals move in an ordered manner without centralized control. The perception of each individual is highly localized, as is its ability to interact with others. While natural collective motion is robust, most artificial swarms are brittle; this is particularly true when vision is used as the sensing modality, due to the ambiguities and information loss inherent in visual perception. This paper presents mechanisms for robust collective motion inspired by studies of locusts. First, we develop a robust distance estimation method that combines the visually perceived horizontal and vertical sizes of neighbors. Second, we introduce intermittent locomotion as a mechanism that allows robots to reliably detect peers that fail to keep up and disrupt the motion of the swarm. We show how such faulty robots can be avoided in a manner that is robust to errors in classifying them as faulty. Through extensive physics-based simulation experiments, we show dramatic improvements in swarm resilience when using these techniques. We show these are relevant both to distance-based Avoid-Attract models and to models relying on Alignment, across a wide range of experimental settings.
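A pinhole-model illustration of combining horizontal and vertical perceived sizes follows; the simple averaging fusion rule and all dimensions are assumptions, as the paper develops its own combination.

```python
# Under a pinhole model, a neighbor of known physical width W and height H
# that subtends w x h pixels gives two independent distance estimates.

f_px = 400.0                 # focal length in pixels (assumed camera)
W, H = 0.10, 0.06            # neighbor's physical width/height [m] (assumed)

def estimate_distance(w_px, h_px):
    d_w = f_px * W / w_px    # estimate from the horizontal extent
    d_h = f_px * H / h_px    # estimate from the vertical extent
    return 0.5 * (d_w + d_h) # fuse: per-axis ambiguities partially cancel

print(estimate_distance(w_px=40.0, h_px=25.0))  # ~0.98 m
```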
☆ Emergence of Human to Robot Transfer in Vision-Language-Action Models
Vision-language-action (VLA) models can enable broad open world generalization, but require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs with human videos alone, and establishing a mapping between humans and robots requires manual engineering and presents a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe, and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises because diverse pretraining produces embodiment-agnostic representations for human and robot data. We validate these findings through a series of experiments probing human to robot skill transfer and find that with sufficiently diverse robot pre-training our method can nearly double the performance on generalization settings seen only in human data.
♻ ☆ Forecasting in Offline Reinforcement Learning for Non-stationary Environments NeurIPS 2025
Offline Reinforcement Learning (RL) provides a promising avenue for training policies from pre-collected datasets when gathering additional interaction data is infeasible. However, existing offline RL methods often assume stationarity or only consider synthetic perturbations at test time, assumptions that often fail in real-world scenarios characterized by abrupt, time-varying offsets. These offsets can lead to partial observability, causing agents to misperceive their true state and degrade performance. To overcome this challenge, we introduce Forecasting in Non-stationary Offline RL (FORL), a framework that unifies (i) conditional diffusion-based candidate state generation, trained without presupposing any specific pattern of future non-stationarity, and (ii) zero-shot time-series foundation models. FORL targets environments prone to unexpected, potentially non-Markovian offsets, requiring robust agent performance from the onset of each episode. Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines. By integrating zero-shot forecasting with the agent's experience, we aim to bridge the gap between offline RL and the complexities of real-world, non-stationary environments.
comment: The Thirty-Ninth Annual Conference on Neural Information Processing Systems, NeurIPS 2025
♻ ☆ Symphony: A Heuristic Normalized Calibrated Advantage Actor and Critic Algorithm in application for Humanoid Robots
Our starting point is the observation that it is a misconception to think that humans learn fast: the learning process takes time. Babies start learning to move in a restricted liquid environment, the womb. Children are often limited by an underdeveloped body. Even adults are not allowed to participate in complex competitions right away. With robots, however, when learning from scratch, we often do not have the privilege of waiting for dozens of millions of steps. "Swaddling" regularization restrains an agent during rapid but unstable development by penalizing action strength in a specific way that does not affect the actions directly. Symphony, a Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of ideas that makes it possible to train humanoid robots from scratch with sample efficiency, sample proximity, and safety of actions in mind. It is no secret that a continuous increase in Gaussian noise without appropriate smoothing is harmful to motors and gearboxes. Compared to stochastic algorithms, we set a limited parametric noise and promote reduced action strength, safely increasing entropy, since the actions are in effect immersed in weaker noise; when actions require more extreme values, they rise above the weak noise. Training becomes empirically much safer, both for the surrounding environment and for the robot's mechanisms. We use a Fading Replay Buffer: through a fixed formula containing the hyperbolic tangent, we adjust the batch sampling probability so that the memory holds a recent-memory region and a long-term memory trail. The Fading Replay Buffer allows us to use Temporal Advantage, in which we improve the current Critic network's prediction relative to its exponential moving average. Temporal Advantage allows us to update the Actor and Critic in one pass, combine the Actor and Critic in one object, and implement their losses in one line.
comment: https://github.com/SuspensionRailway/symphony
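The abstract states only that a fixed tanh formula shapes the sampling probability; one plausible form, with a floor that keeps the long-term trail alive, is sketched below. The exact formula, floor, and constants are our assumptions.

```python
import numpy as np

# Tanh-shaped sampling distribution over buffer indices: recent transitions
# dominate, older ones retain a nonzero probability ("long-term trail").

def fading_probs(n, recent_frac=0.2, floor=0.2, sharpness=8.0):
    age = np.arange(n) / max(n - 1, 1)            # 0 = oldest ... 1 = newest
    gate = 0.5 * (1.0 + np.tanh(sharpness * (age - (1.0 - recent_frac))))
    w = floor + (1.0 - floor) * gate              # floor keeps old samples alive
    return w / w.sum()

n = 10_000
batch = np.random.default_rng(0).choice(n, size=256, p=fading_probs(n))
print("batch share from newest 20%:", np.mean(batch >= 0.8 * n))
```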
♻ ☆ RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization
Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces RiemanLine, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere $\mathcal{S}^2$, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For $n$ parallel lines, the proposed representation reduces the parameter space from $4n$ (orthonormal form) to $2n+2$, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment. Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.
♻ ☆ On The Computational Complexity of Minimum Aerial Photographs for Planar Region Coverage
With the popularity of drone technologies, aerial photography has become prevalent in many daily scenarios such as environment monitoring, structure inspection, and law enforcement. A central challenge in this domain is the efficient coverage of a target area with photographs that entirely capture the region, while respecting constraints such as image resolution and a limited number of pictures. This work investigates the computational complexity of covering a simple planar polygon using squares and circles. Specifically, it shows inapproximability gaps of $1.165$ (for squares) and $1.25$ (for restricted square centers) and develops a $2.828$-approximation algorithm, demonstrating that these problems are computationally hard to approximate below the respective factors. The insights of this work extend beyond aerial photography to broader applications such as pesticide spraying and strategic sensor placement.
♻ ☆ Driving Beyond Privilege: Distilling Dense-Reward Knowledge into Sparse-Reward Policies
We study how to exploit dense simulator-defined rewards in vision-based autonomous driving without inheriting their misalignment with deployment metrics. In realistic simulators such as CARLA, privileged state (e.g., lane geometry, infractions, time-to-collision) can be converted into dense rewards that stabilize and accelerate model-based reinforcement learning, but policies trained directly on these signals often overfit and fail to generalize when evaluated on sparse objectives such as route completion and collision-free overtaking. We propose reward-privileged world model distillation, a two-stage framework in which a teacher DreamerV3-style agent is first trained with a dense privileged reward, and only its latent dynamics are distilled into a student trained solely on sparse task rewards. Teacher and student share the same observation space (semantic bird's-eye-view images); privileged information enters only through the teacher's reward, and the student does not imitate the teacher's actions or value estimates. Instead, the student's world model is regularized to match the teacher's latent dynamics while its policy is learned from scratch on sparse success/failure signals. In CARLA lane-following and overtaking benchmarks, sparse-reward students outperform both dense-reward teachers and sparse-from-scratch baselines. On unseen lane-following routes, reward-privileged distillation improves success by about 23 percent relative to the dense teacher while maintaining comparable or better safety. On overtaking, students retain near-perfect performance on training routes and achieve up to a 27x improvement in success on unseen routes, with improved lane keeping. These results show that dense rewards can be leveraged to learn richer dynamics models while keeping the deployed policy optimized strictly for sparse, deployment-aligned objectives.
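A minimal sketch of the latent-matching idea, assuming DreamerV3-style recurrent latent states; `student_rssm` and `teacher_rssm` are hypothetical interfaces, not the authors' code.

```python
import torch
import torch.nn.functional as F

def latent_distillation_loss(student_rssm, teacher_rssm, obs, actions):
    """Regularize the student's latent dynamics toward the teacher's.
    The student's policy (not shown) is trained only on sparse task
    rewards and never imitates teacher actions or value estimates."""
    with torch.no_grad():
        z_teacher = teacher_rssm(obs, actions)  # dense-reward teacher, frozen
    z_student = student_rssm(obs, actions)      # sparse-reward student
    return F.mse_loss(z_student, z_teacher)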
♻ ☆ InDRiVE: Reward-Free World-Model Pretraining for Autonomous Driving via Latent Disagreement
Model-based reinforcement learning (MBRL) can reduce interaction cost for autonomous driving by learning a predictive world model, but it typically still depends on task-specific rewards that are difficult to design and often brittle under distribution shift. This paper presents InDRiVE, a DreamerV3-style MBRL agent that performs reward-free pretraining in CARLA using only intrinsic motivation derived from latent ensemble disagreement. Disagreement acts as a proxy for epistemic uncertainty and drives the agent toward under-explored driving situations, while an imagination-based actor-critic learns a planner-free exploration policy directly from the learned world model. After intrinsic pretraining, we evaluate zero-shot transfer by freezing all parameters and deploying the pretrained exploration policy in unseen towns and routes. We then study few-shot adaptation by training a task policy with limited extrinsic feedback for downstream objectives (lane following and collision avoidance). Experiments in CARLA across towns, routes, and traffic densities show that disagreement-based pretraining yields stronger zero-shot robustness and robust few-shot collision avoidance under town shift and matched interaction budgets, supporting the use of intrinsic disagreement as a practical reward-free pretraining signal for reusable driving world models.
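A minimal sketch of a disagreement-based intrinsic reward, assuming an ensemble of one-step latent predictors; using the variance across predictions as the disagreement signal is a common convention here, not a detail confirmed by the abstract.

```python
import torch

def disagreement_reward(ensemble, z, action):
    """Each ensemble head predicts the next latent state; the variance
    across predictions serves as a proxy for epistemic uncertainty."""
    preds = torch.stack([head(z, action) for head in ensemble])  # (K, B, D)
    return preds.var(dim=0).mean(dim=-1)                         # (B,)
```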
♻ ☆ RefAV: Towards Planning-Centric Scenario Mining
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recently held competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
comment: Project Page: https://cainand.github.io/RefAV/
♻ ☆ EMMA: Scaling Mobile Manipulation via Egocentric Human Data
Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework that trains mobile manipulation policies from human mobile manipulation data combined with static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train on human full-body motion data and static robot data. In our experiments across three real-world tasks, EMMA demonstrates performance comparable to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving equal or higher full-task success. We find that EMMA generalizes to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments. Details of this project can be found at https://ego-moma.github.io/.
Robotics 17
☆ Bab_Sak Robotic Intubation System (BRIS): A Learning-Enabled Control Framework for Safe Fiberoptic Endotracheal Intubation
Endotracheal intubation is a critical yet technically demanding procedure, with failure or improper tube placement leading to severe complications. Existing robotic and teleoperated intubation systems primarily focus on airway navigation and do not provide integrated control of endotracheal tube advancement or objective verification of tube depth relative to the carina. This paper presents the Bab_Sak Robotic Intubation System (BRIS), a compact, human-in-the-loop platform designed to assist fiberoptic-guided intubation while enabling real-time, objective depth awareness. BRIS integrates a four-way steerable fiberoptic bronchoscope, an independent endotracheal tube advancement mechanism, and a camera-augmented mouthpiece compatible with standard clinical workflows. A learning-enabled closed-loop control framework leverages real-time shape sensing to map joystick inputs to distal bronchoscope tip motion in Cartesian space, providing stable and intuitive teleoperation under tendon nonlinearities and airway contact. Monocular endoscopic depth estimation is used to classify airway regions and provide interpretable, anatomy-aware guidance for safe tube positioning relative to the carina. The system is validated on high-fidelity airway mannequins under standard and difficult airway configurations, demonstrating reliable navigation and controlled tube placement. These results highlight BRIS as a step toward safer, more consistent, and clinically compatible robotic airway management.
☆ StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision
Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite their advantage, the adoption of stereo vision in vision-language-action models (VLAs) remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric cues from stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that utilizes vision foundation models to extract and fuse two key features: 1) geometric features from subtle stereo-view differences for spatial perception; 2) semantic-rich features from the monocular view for instruction following. Additionally, we propose an auxiliary Interaction-Region Depth Estimation task to further enhance spatial perception and accelerate model convergence. Extensive experiments show that our approach outperforms baselines by a large margin in diverse tasks under the stereo setting and demonstrates strong robustness to camera pose variations.
☆ Flexible Multitask Learning with Factorized Diffusion Policy
Multitask learning poses significant challenges due to the highly multimodal and diverse nature of robot action distributions. Effectively fitting policies to these complex task distributions is difficult, and existing monolithic models often underfit the action distribution and lack the flexibility required for efficient adaptation. We introduce a novel modular diffusion policy framework that factorizes complex action distributions into a composition of specialized diffusion models, each capturing a distinct sub-mode of the behavior space, yielding a more effective overall policy. In addition, this modular structure enables flexible policy adaptation to new tasks by adding or fine-tuning components, which inherently mitigates catastrophic forgetting. Empirically, across both simulation and real-world robotic manipulation settings, we show that our method consistently outperforms strong modular and monolithic baselines.
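A minimal sketch of sampling from such a factorized policy, assuming a learned component selector and a per-component denoiser; both `selector` and `denoise_step` are hypothetical interfaces, not the paper's API.

```python
import torch

def sample_action(components, selector, obs, n_steps=50):
    """Pick one specialized diffusion component for this observation,
    then run its reverse (denoising) process to produce an action."""
    k = torch.multinomial(selector(obs), num_samples=1).item()
    model = components[k]
    a = torch.randn(model.action_dim)            # start from pure noise
    for t in reversed(range(n_steps)):
        a = model.denoise_step(a, t, obs)        # hypothetical interface
    return a
```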
☆ Aerial World Model for Long-horizon Visual Generation and Navigation in 3D Space
Unmanned aerial vehicles (UAVs) have emerged as powerful embodied agents, and one of their core abilities is autonomous navigation in large-scale three-dimensional environments. Existing navigation policies, however, are typically optimized for low-level objectives such as obstacle avoidance and trajectory smoothness, and lack the ability to incorporate high-level semantics into planning. To bridge this gap, we propose ANWM, an aerial navigation world model that predicts future visual observations conditioned on past frames and actions, thereby enabling agents to rank candidate trajectories by their semantic plausibility and navigational utility. ANWM is trained on 4-DoF UAV trajectories and introduces a physics-inspired module, Future Frame Projection (FFP), which projects past frames into future viewpoints to provide coarse geometric priors. This module mitigates representational uncertainty in long-distance visual generation and captures the mapping between 3D trajectories and egocentric observations. Empirical results demonstrate that ANWM significantly outperforms existing world models in long-distance visual forecasting and improves UAV navigation success rates in large-scale environments.
☆ Online Inertia Parameter Estimation for Unknown Objects Grasped by a Manipulator Towards Space Applications
Knowing the inertia parameters of a grasped object is crucial for dynamics-aware manipulation, especially in space robotics with free-floating bases. This work addresses the problem of estimating the inertia parameters of an unknown target object during manipulation. We apply and extend an existing online identification method by incorporating momentum conservation, enabling its use with floating-base robots. The proposed method is validated through numerical simulations, and the estimated parameters are compared with ground-truth values. Results demonstrate accurate identification across the tested scenarios, highlighting the method's applicability to on-orbit servicing and other space missions.
comment: Author's version of a manuscript accepted at the International Conference on Space Robotics 2025 (iSpaRo 2025). (c) IEEE
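For intuition: identification problems of this kind are typically linear in the unknown inertia parameters, so stacked samples can be solved in one least-squares pass. The regressor form below is a generic assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def estimate_inertia_params(Y_stack, h_stack):
    """Solve Y(q, qdot) @ phi = h for the unknown parameter vector phi
    (mass, first moments, inertia tensor entries) stacked over samples,
    where h collects the conserved momentum terms."""
    phi, *_ = np.linalg.lstsq(Y_stack, h_stack, rcond=None)
    return phi
```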
☆ Optimal Trajectory Planning for Orbital Robot Rendezvous and Docking
Approaching a tumbling target safely is a critical challenge in space debris removal missions utilizing robotic manipulators onboard servicing satellites. In this work, we propose a trajectory planning method based on nonlinear optimization for a close-range rendezvous to bring a free-floating, rotating debris object in a two-dimensional plane into the manipulator's workspace, as a preliminary step for its capture. The proposed method introduces a dynamic keep-out sphere that adapts depending on the approach conditions, allowing for closer and safer access to the target. Furthermore, a control strategy is developed to reproduce the optimized trajectory using discrete ON/OFF thrusters, considering practical implementation constraints.
comment: Author's version of a manuscript accepted at the International Conference on Space Robotics 2025 (iSpaRo 2025). (c) IEEE
☆ MoonBot: Modular and On-Demand Reconfigurable Robot Toward Moon Base Construction
The allure of lunar surface exploration and development has recently captured widespread global attention. Robots have proved indispensable for exploring uncharted terrains, uncovering and leveraging local resources, and facilitating the construction of future human habitats. In this article, we introduce MoonBot, a modular and on-demand reconfigurable robotic system engineered to maximize functionality while operating within the stringent mass constraints of lunar payloads and adapting to varying environmental conditions and task requirements. This article details the design and development of MoonBot and presents a preliminary field demonstration that validates the proof of concept through the execution of milestone tasks simulating the establishment of lunar infrastructure. These tasks include essential civil engineering operations, transportation and deployment of infrastructural components, and assistive operations with inflatable modules. Furthermore, we systematically summarize the lessons learned during testing, focusing on the connector design and providing valuable insights for the advancement of modular robotic systems in future lunar missions.
comment: This is the authors' version of a paper accepted for publication in IEEE Transactions on Field Robotics, (c) IEEE. The final published version is available at https://doi.org/10.1109/TFR.2025.3624346
☆ A Unified AI, Embedded, Simulation, and Mechanical Design Approach to an Autonomous Delivery Robot
This paper presents the development of a fully autonomous delivery robot integrating mechanical engineering, embedded systems, and artificial intelligence. The platform employs a heterogeneous computing architecture, with a Raspberry Pi 5 and ROS 2 handling AI-based perception and path planning, while an ESP32 running FreeRTOS ensures real-time motor control. The mechanical design was optimized for payload capacity and mobility through precise motor selection and material engineering. Key technical challenges addressed include optimizing computationally intensive AI algorithms on a resource-constrained platform and implementing a low-latency, reliable communication link between the ROS 2 host and the embedded controller. Results demonstrate deterministic, PID-based motor control through rigorous memory and task management, and enhanced system reliability via AWS IoT monitoring and a firmware-level motor shutdown failsafe. This work highlights a unified, multi-disciplinary methodology, resulting in a robust and operational autonomous delivery system capable of real-world deployment.
☆ VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs
In most existing embodied navigation tasks, instructions are well-defined and unambiguous, such as instruction following and object searching. Under this idealized setting, agents are required solely to produce effective navigation outputs conditioned on vision and language inputs. However, real-world navigation instructions are often vague and ambiguous, requiring the agent to resolve uncertainty and infer user intent through active dialog. To address this gap, we propose Interactive Instance Object Navigation (IION), a task that requires agents not only to generate navigation actions but also to produce language outputs via active dialog, thereby aligning more closely with practical settings. IION extends Instance Object Navigation (ION) by allowing agents to freely consult an oracle in natural language while navigating. Building on this task, we present the Vision Language-Language Navigation (VL-LN) benchmark, which provides a large-scale, automatically generated dataset and a comprehensive evaluation protocol for training and assessing dialog-enabled navigation models. VL-LN comprises over 41k long-horizon dialog-augmented trajectories for training and an automatic evaluation protocol with an oracle capable of responding to agent queries. Using this benchmark, we train a navigation model equipped with dialog capabilities and show that it achieves significant improvements over the baselines. Extensive experiments and analyses further demonstrate the effectiveness and reliability of VL-LN for advancing research on dialog-enabled embodied navigation. Code and dataset: https://0309hws.github.io/VL-LN.github.io/
♻ ☆ RoboCade: Gamifying Robot Data Collection
Imitation learning from human demonstrations has become a dominant approach for training autonomous robot policies. However, collecting demonstration datasets is costly: it often requires access to robots and needs sustained effort in a tedious, long process. These factors limit the scale of data available for training policies. We aim to address this scalability challenge by involving a broader audience in a gamified data collection experience that is both accessible and motivating. Specifically, we develop a gamified remote teleoperation platform, RoboCade, to engage general users in collecting data that is beneficial for downstream policy training. To do this, we embed gamification strategies into the design of the system interface and data collection tasks. In the system interface, we include components such as visual feedback, sound effects, goal visualizations, progress bars, leaderboards, and badges. We additionally propose principles for constructing gamified tasks that have overlapping structure with useful downstream target tasks. We instantiate RoboCade on three manipulation tasks -- including spatial arrangement, scanning, and insertion. To illustrate the viability of gamified robot data collection, we collect a demonstration dataset through our platform, and show that co-training robot policies with this data can improve success rate on non-gamified target tasks (+16-56%). Further, we conduct a user study to validate that novice users find the gamified platform significantly more enjoyable than a standard non-gamified platform (+24%). These results highlight the promise of gamified data collection as a scalable, accessible, and engaging method for collecting demonstration data.
comment: 10 pages, 9 figures
♻ ☆ LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
Vision-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: object layout, camera viewpoints, robot initial states, language instructions, lighting conditions, background textures, and sensor noise. We comprehensively analyze multiple state-of-the-art models and reveal consistent brittleness beneath their apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors such as camera viewpoints and robot initial states, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that they tend to ignore language instructions completely. Our findings challenge the assumption that high benchmark scores equate to true competency and highlight the need for evaluation practices that assess reliability under realistic variation.
♻ ☆ Transformer Driven Visual Servoing for Fabric Texture Matching Using Dual-Arm Manipulator
In this paper, we propose a method to align and place a fabric piece on top of another using a dual-arm manipulator and a grayscale camera, so that their surface textures are accurately matched. We propose a novel control scheme that combines Transformer-driven visual servoing with dual-arm impedance control. This approach enables the system to simultaneously control the pose of the fabric piece and place it onto the underlying one while applying tension to keep the fabric flat. Our Transformer-based network incorporates pretrained backbones and a newly introduced Difference Extraction Attention Module (DEAM), which significantly enhances the accuracy of pose difference prediction. Trained entirely on synthetic images generated using rendering software, the network enables zero-shot deployment in real-world scenarios without requiring prior training on specific fabric textures. Real-world experiments demonstrate that the proposed system accurately aligns fabric pieces with different textures.
comment: 8 pages, 11 figures. Accepted to IEEE Robotics and Automation Letters (RA-L)
♻ ☆ Super-LIO: A Robust and Efficient LiDAR-Inertial Odometry System with a Compact Mapping Strategy
LiDAR-Inertial Odometry (LIO) is a foundational technique for autonomous systems, yet its deployment on resource-constrained platforms remains challenging due to computational and memory limitations. We propose Super-LIO, a robust LIO system for applications that demand both high performance and accuracy, such as aerial robots and mobile autonomous systems. At the core of Super-LIO is a compact octo-voxel-based map structure, termed OctVox, which limits each voxel to eight fused subvoxels, enabling strict point-density control and incremental denoising during map updates. This design yields a simple yet efficient and accurate map structure that can be easily integrated into existing LIO frameworks. Additionally, Super-LIO employs a heuristic-guided KNN strategy (HKNN) that accelerates the correspondence search by exploiting spatial locality, further reducing runtime overhead. We evaluated the proposed system on four publicly available datasets and several self-collected datasets, totaling more than 30 sequences. Extensive testing on both x86 and ARM platforms confirms that Super-LIO offers superior efficiency and robustness while maintaining competitive accuracy. Super-LIO processes each frame approximately 73% faster than the state of the art, while consuming fewer CPU resources. The system is fully open-source and plug-and-play compatible with a wide range of LiDAR sensors and platforms. The implementation is available at: https://github.com/Liansheng-Wang/Super-LIO.git
comment: 8 pages, 5 figures
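A minimal sketch of the octo-voxel idea as described in the abstract; the running-mean fusion rule below is an assumption (the actual update is in the linked repository).

```python
import numpy as np

class OctVox:
    """Each voxel holds at most eight fused points, one per octant, so
    point density is strictly capped and fusion denoises incrementally."""

    def __init__(self, center, size):
        self.center = np.asarray(center, dtype=float)
        self.size = float(size)
        self.sub_means = [None] * 8   # one fused point per subvoxel
        self.sub_counts = [0] * 8

    def insert(self, point):
        p = np.asarray(point, dtype=float)
        o = p >= self.center          # octant index from the sign pattern
        idx = int(o[0]) * 4 + int(o[1]) * 2 + int(o[2])
        c = self.sub_counts[idx]
        if c == 0:
            self.sub_means[idx] = p.copy()
        else:
            self.sub_means[idx] = (self.sub_means[idx] * c + p) / (c + 1)
        self.sub_counts[idx] = c + 1
```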
♻ ☆ Learning collision risk proactively from naturalistic driving data at scale
Accurately and proactively alerting drivers or automated systems to emerging collisions is crucial for road safety, particularly in highly interactive and complex urban environments. Existing methods either require labour-intensive annotation of sparse risk, struggle to consider varying contextual factors, or are tailored to limited scenarios. Here we present the Generalised Surrogate Safety Measure (GSSM), a data-driven approach that learns collision risk from naturalistic driving without the need for crash or risk labels. Trained over multiple datasets and evaluated on 2,591 real-world crashes and near-crashes, a basic GSSM using only instantaneous motion kinematics achieves an area under the precision-recall curve of 0.9, and secures a median time advance of 2.6 seconds to prevent potential collisions. Incorporating additional interaction patterns and contextual factors provides further performance gains. Across interaction scenarios such as rear-end, merging, and turning, GSSM consistently outperforms existing baselines in accuracy and timeliness. These results establish GSSM as a scalable, context-aware, and generalisable foundation to identify risky interactions before they become unavoidable, supporting proactive safety in autonomous driving systems and traffic incident management. Code and experiment data are openly accessible at https://github.com/Yiru-Jiao/GSSM.
comment: Equation (15) in the previous versions was wrong, which has been corrected since v4
♻ ☆ Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models
The integration of Large Language Models (LLMs) into robotics has unlocked unprecedented capabilities in high-level task planning. However, most current systems operate in an open-loop fashion, where LLMs act as one-shot planners, rendering them brittle and unable to adapt to unforeseen circumstances in dynamic physical environments. To overcome this limitation, this paper introduces the "Think, Act, Learn" (T-A-L) framework, a novel architecture that enables an embodied agent to autonomously learn and refine its policies through continuous interaction. Our framework establishes a closed-loop cycle where an LLM first "thinks" by decomposing high-level commands into actionable plans. The robot then "acts" by executing these plans while gathering rich, multimodal sensory feedback. Critically, the "learn" module processes this feedback to facilitate LLM-driven self-reflection, allowing the agent to perform causal analysis on its failures and generate corrective strategies. These insights are stored in an experiential memory to guide future planning cycles. We demonstrate through extensive experiments in both simulation and the real world that our T-A-L agent significantly outperforms baseline methods, including open-loop LLMs, Behavioral Cloning, and traditional Reinforcement Learning. Our framework achieves over a 97% success rate on complex, long-horizon tasks, converges to a stable policy in an average of just 9 trials, and exhibits remarkable generalization to unseen tasks. This work presents a significant step towards developing more robust, adaptive, and truly autonomous robotic agents.
comment: 13 pages, 7 figures
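A minimal sketch of the closed-loop cycle; every interface here (`llm.plan`, `robot.execute`, `memory.retrieve`, and so on) is hypothetical, standing in for the framework's components rather than its actual API.

```python
def think_act_learn(llm, robot, memory, command, max_trials=10):
    """One task episode: plan, execute, and on failure reflect on the
    feedback and store the lesson for future planning cycles."""
    for _ in range(max_trials):
        plan = llm.plan(command, context=memory.retrieve(command))  # think
        feedback = robot.execute(plan)                              # act
        if feedback.success:
            return plan
        lesson = llm.reflect(plan, feedback)                        # learn
        memory.store(command, plan, lesson)
    return None
```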
♻ ☆ RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
comment: 11 pages, 6 figures
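A minimal sketch of what executable safety predicates might look like at runtime; the predicate interface (`p.holds`, `p.name`) is an assumption for illustration, not the authors' API.

```python
def check_action(action, state, predicates):
    """Evaluate every active safety predicate against the planned action;
    any violation vetoes execution and should trigger replanning."""
    violations = [p.name for p in predicates if not p.holds(action, state)]
    return len(violations) == 0, violations
```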
♻ ☆ Never-Ending Behavior-Cloning Agent for Robotic Manipulation
Relying on multi-modal observations, embodied robots (e.g., humanoid robots) can perform multiple manipulation tasks in unstructured real-world environments. However, most language-conditioned behavior-cloning agents still face long-standing challenges, namely 3D scene representation and human-level task learning, when adapting to a series of new tasks in practical scenarios. We investigate these challenges with NBAgent, a pioneering language-conditioned Never-ending Behavior-cloning Agent for embodied robots, which can continually learn observation knowledge of novel 3D scene semantics and robot manipulation skills from skill-shared and skill-specific attributes, respectively. Specifically, we propose a skill-shared semantic rendering module and a skill-shared representation distillation module to effectively learn 3D scene semantics from skill-shared attributes, tackling the common neglect of 3D scene representation. Meanwhile, we establish a skill-specific evolving planner that decouples manipulation knowledge, continually embedding novel skill-specific knowledge, as humans do, from a latent low-rank space. Finally, we design a never-ending embodied robot manipulation benchmark, and extensive experiments demonstrate the strong performance of our method.
comment: 17 pages, 6 figures, 9 tables