MyArxiv
Robotics 1
♻ ☆ Estimating Spatially-Dependent GPS Errors Using a Swarm of Robots
External factors, including urban canyons and adversarial interference, can lead to Global Positioning System (GPS) inaccuracies that vary as a function of the position in the environment. This study addresses the challenge of estimating a static, spatially-varying error function using a team of robots. We introduce a State Bias Estimation Algorithm (SBE) whose purpose is to estimate the GPS biases. The central idea is to use sensed estimates of the range and bearing to the other robots in the team to estimate changes in bias across the environment. A set of drones moves in a 2D environment, each sampling data from GPS, range, and bearing sensors. The biases calculated by the SBE at estimated positions are used to train a Gaussian Process Regression (GPR) model. We use a Sparse Gaussian process-based Informative Path Planning (IPP) algorithm that identifies high-value regions of the environment for data collection. The swarm plans paths that maximize information gain in each iteration, further refining their understanding of the environment's positional bias landscape. We evaluated SBE and IPP in simulation and compared the IPP methodology to an open-loop strategy.
comment: 6 pages, 7 figures, 2025 IEEE 21st International Conference on Automation Science and Engineering
Robotics 41
Whole-Body Conditioned Egocentric Video Prediction
We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
comment: Project Page: https://dannytran123.github.io/PEVA
☆ SAM4D: Segment Anything in Camera and LiDAR Streams ICCV2025
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.
comment: Accepted by ICCV2025, Project Page: https://SAM4D-Project.github.io
☆ WorldVLA: Towards Autoregressive Action World Model
We present WorldVLA, an autoregressive action world model that unifies action and image understanding and generation. Our WorldVLA intergrates Vision-Language-Action (VLA) model and world model in one single framework. The world model predicts future images by leveraging both action and image understanding, with the purpose of learning the underlying physics of the environment to improve action generation. Meanwhile, the action model generates the subsequent actions based on image observations, aiding in visual understanding and in turn helps visual generation of the world model. We demonstrate that WorldVLA outperforms standalone action and world models, highlighting the mutual enhancement between the world model and the action model. In addition, we find that the performance of the action model deteriorates when generating sequences of actions in an autoregressive manner. This phenomenon can be attributed to the model's limited generalization capability for action prediction, leading to the propagation of errors from earlier actions to subsequent ones. To address this issue, we propose an attention mask strategy that selectively masks prior actions during the generation of the current action, which shows significant performance improvement in the action chunk generation task.
comment: Code: https://github.com/alibaba-damo-academy/WorldVLA
☆ Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning
Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the \textit{Single-Step Completion Policy} (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.
☆ EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting
Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods only rely on the appearance constraints for optimizing both 3DGS and camera poses. However, in endoscopic scenarios, the challenges include photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing affects the performance of SLAM systems. To address these issues, we additionally introduce optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to Keyframes with suboptimal rendering quality frames, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and dynamic surgical scenes. The source code will be publicly available upon paper acceptance.
☆ ToosiCubix: Monocular 3D Cuboid Labeling via Vehicle Part Annotations
Many existing methods for 3D cuboid annotation of vehicles rely on expensive and carefully calibrated camera-LiDAR or stereo setups, limiting their accessibility for large-scale data collection. We introduce ToosiCubix, a simple yet powerful approach for annotating ground-truth cuboids using only monocular images and intrinsic camera parameters. Our method requires only about 10 user clicks per vehicle, making it highly practical for adding 3D annotations to existing datasets originally collected without specialized equipment. By annotating specific features (e.g., wheels, car badge, symmetries) across different vehicle parts, we accurately estimate each vehicle's position, orientation, and dimensions up to a scale ambiguity (8 DoF). The geometric constraints are formulated as an optimization problem, which we solve using a coordinate descent strategy, alternating between Perspective-n-Points (PnP) and least-squares subproblems. To handle common ambiguities such as scale and unobserved dimensions, we incorporate probabilistic size priors, enabling 9 DoF cuboid placements. We validate our annotations against the KITTI and Cityscapes3D datasets, demonstrating that our method offers a cost-effective and scalable solution for high-quality 3D cuboid annotation.
☆ Real-time Terrain Analysis for Off-road Autonomous Vehicles
This research addresses critical autonomous vehicle control challenges arising from road roughness variation, which induces course deviations and potential loss of road contact during steering operations. We present a novel real-time road roughness estimation system employing Bayesian calibration methodology that processes axle accelerations to predict terrain roughness with quantifiable confidence measures. The technical framework integrates a Gaussian process surrogate model with a simulated half-vehicle model, systematically processing vehicle velocity and road surface roughness parameters to generate corresponding axle acceleration responses. The Bayesian calibration routine performs inverse estimation of road roughness from observed accelerations and velocities, yielding posterior distributions that quantify prediction uncertainty for adaptive risk management. Training data generation utilizes Latin Hypercube sampling across comprehensive velocity and roughness parameter spaces, while the calibrated model integrates seamlessly with a Simplex controller architecture to dynamically adjust velocity limits based on real-time roughness predictions. Experimental validation on stochastically generated surfaces featuring varying roughness regions demonstrates robust real-time characterization capabilities, with the integrated Simplex control strategy effectively enhancing autonomous vehicle operational safety through proactive surface condition response. This innovative Bayesian framework establishes a comprehensive foundation for mitigating roughness-related operational risks while simultaneously improving efficiency and safety margins in autonomous vehicle systems.
☆ "Who Should I Believe?": User Interpretation and Decision-Making When a Family Healthcare Robot Contradicts Human Memory
Advancements in robotic capabilities for providing physical assistance, psychological support, and daily health management are making the deployment of intelligent healthcare robots in home environments increasingly feasible in the near future. However, challenges arise when the information provided by these robots contradicts users' memory, raising concerns about user trust and decision-making. This paper presents a study that examines how varying a robot's level of transparency and sociability influences user interpretation, decision-making and perceived trust when faced with conflicting information from a robot. In a 2 x 2 between-subjects online study, 176 participants watched videos of a Furhat robot acting as a family healthcare assistant and suggesting a fictional user to take medication at a different time from that remembered by the user. Results indicate that robot transparency influenced users' interpretation of information discrepancies: with a low transparency robot, the most frequent assumption was that the user had not correctly remembered the time, while with the high transparency robot, participants were more likely to attribute the discrepancy to external factors, such as a partner or another household member modifying the robot's information. Additionally, participants exhibited a tendency toward overtrust, often prioritizing the robot's recommendations over the user's memory, even when suspecting system malfunctions or third-party interference. These findings highlight the impact of transparency mechanisms in robotic systems, the complexity and importance associated with system access control for multi-user robots deployed in home environments, and the potential risks of users' over reliance on robots in sensitive domains such as healthcare.
comment: 8 pages
☆ Active Disturbance Rejection Control for Trajectory Tracking of a Seagoing USV: Design, Simulation, and Field Experiments IROS 2025
Unmanned Surface Vessels (USVs) face significant control challenges due to uncertain environmental disturbances like waves and currents. This paper proposes a trajectory tracking controller based on Active Disturbance Rejection Control (ADRC) implemented on the DUS V2500. A custom simulation incorporating realistic waves and current disturbances is developed to validate the controller's performance, supported by further validation through field tests in the harbour of Scheveningen, the Netherlands, and at sea. Simulation results demonstrate that ADRC significantly reduces cross-track error across all tested conditions compared to a baseline PID controller but increases control effort and energy consumption. Field trials confirm these findings while revealing a further increase in energy consumption during sea trials compared to the baseline.
comment: Accepted for presentation at IROS 2025. Submitted version
☆ ACTLLM: Action Consistency Tuned Large Language Model
This paper introduces ACTLLM (Action Consistency Tuned Large Language Model), a novel approach for robot manipulation in dynamic environments. Traditional vision-based systems often struggle to learn visual representations that excel in both task execution and spatial reasoning, thereby limiting their adaptability in dynamic environments. ACTLLM addresses these challenges by harnessing language to craft structured scene descriptors, providing a uniform interface for both spatial understanding and task performance through flexible language instructions. Moreover, we introduce a novel action consistency constraint that aligns visual perception with corresponding actions, thereby enhancing the learning of actionable visual representations. Additionally, we have reformulated the Markov decision process for manipulation tasks into a multi-turn visual dialogue framework. This approach enables the modeling of long-term task execution with enhanced contextual relevance derived from the history of task execution. During our evaluation, ACTLLM excels in diverse scenarios, proving its effectiveness on challenging vision-based robot manipulation tasks.
☆ Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping
This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM-a sequence-to-sequence Transformer with self-attention-combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM's uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm's polar workspace, preserving wrist orientation.
☆ World-aware Planning Narratives Enhance Large Vision-Language Model Planner
Large Vision-Language Models (LVLMs) show promise for embodied planning tasks but struggle with complex scenarios involving unfamiliar environments and multi-step goals. Current approaches rely on environment-agnostic imitation learning that disconnects instructions from environmental contexts, causing models to struggle with context-sensitive instructions and rely on supplementary cues rather than visual reasoning during long-horizon interactions. In this work, we propose World-Aware Planning Narrative Enhancement (WAP), a framework that infuses LVLMs with comprehensive environmental understanding through four cognitive capabilities (visual appearance modeling, spatial reasoning, functional abstraction, and syntactic grounding) while developing and evaluating models using only raw visual observations through curriculum learning. Evaluations on the EB-ALFRED benchmark demonstrate substantial improvements, with Qwen2.5-VL achieving a 60.7 absolute improvement in task success rates, particularly in commonsense reasoning (+60.0) and long-horizon planning (+70.0). Notably, our enhanced open-source models outperform proprietary systems like GPT-4o and Claude-3.5-Sonnet by a large margin.
☆ Dynamic Risk-Aware MPPI for Mobile Robots in Crowds via Efficient Monte Carlo Approximations IROS 2025
Deploying mobile robots safely among humans requires the motion planner to account for the uncertainty in the other agents' predicted trajectories. This remains challenging in traditional approaches, especially with arbitrarily shaped predictions and real-time constraints. To address these challenges, we propose a Dynamic Risk-Aware Model Predictive Path Integral control (DRA-MPPI), a motion planner that incorporates uncertain future motions modelled with potentially non-Gaussian stochastic predictions. By leveraging MPPI's gradient-free nature, we propose a method that efficiently approximates the joint Collision Probability (CP) among multiple dynamic obstacles for several hundred sampled trajectories in real-time via a Monte Carlo (MC) approach. This enables the rejection of samples exceeding a predefined CP threshold or the integration of CP as a weighted objective within the navigation cost function. Consequently, DRA-MPPI mitigates the freezing robot problem while enhancing safety. Real-world and simulated experiments with multiple dynamic obstacles demonstrate DRA-MPPI's superior performance compared to state-of-the-art approaches, including Scenario-based Model Predictive Control (S-MPC), Frenet planner, and vanilla MPPI.
comment: Accepted for presentation at IROS 2025. Submitted Version
☆ Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation ICCV 2025
Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360{\deg} viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK.
comment: Accepted to ICCV 2025. All data and code will be made publicly available at https://github.com/yihong-97/UNLOCK
☆ Out-of-Distribution Semantic Occupancy Prediction
3D Semantic Occupancy Prediction is crucial for autonomous driving, providing a dense, semantically rich environmental representation. However, existing methods focus on in-distribution scenes, making them susceptible to Out-of-Distribution (OoD) objects and long-tail distributions, which increases the risk of undetected anomalies and misinterpretations, posing safety hazards. To address these challenges, we introduce Out-of-Distribution Semantic Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that injects synthetic anomalies while preserving realistic spatial and occlusion patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360. We introduce OccOoD, a novel framework integrating OoD detection into 3D semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF) leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m region, while maintaining competitive occupancy prediction performance. The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD.
comment: The established datasets and source code will be made publicly available at https://github.com/7uHeng/OccOoD
☆ UAIbot: Beginner-friendly web-based simulator for interactive robotics learning and research
This paper presents UAIbot, a free and open-source web-based robotics simulator designed to address the educational and research challenges conventional simulation platforms generally face. The Python and JavaScript interfaces of UAIbot enable accessible hands-on learning experiences without cumbersome installations. By allowing users to explore fundamental mathematical and physical principles interactively, ranging from manipulator kinematics to pedestrian flow dynamics, UAIbot provides an effective tool for deepening student understanding, facilitating rapid experimentation, and enhancing research dissemination.
comment: 12 pages, 8 figures, submitted to Springer proceedings
☆ GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction ICML 2025
Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase our approach not only achieves state-of-the-art performance on the large-scale Argoverse & nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.
comment: Accepted by ICML 2025
☆ CURL-SLAM: Continuous and Compact LiDAR Mapping
This paper studies 3D LiDAR mapping with a focus on developing an updatable and localizable map representation that enables continuity, compactness and consistency in 3D maps. Traditional LiDAR Simultaneous Localization and Mapping (SLAM) systems often rely on 3D point cloud maps, which typically require extensive storage to preserve structural details in large-scale environments. In this paper, we propose a novel paradigm for LiDAR SLAM by leveraging the Continuous and Ultra-compact Representation of LiDAR (CURL) introduced in [1]. Our proposed LiDAR mapping approach, CURL-SLAM, produces compact 3D maps capable of continuous reconstruction at variable densities using CURL's spherical harmonics implicit encoding, and achieves global map consistency after loop closure. Unlike popular Iterative Closest Point (ICP)-based LiDAR odometry techniques, CURL-SLAM formulates LiDAR pose estimation as a unique optimization problem tailored for CURL and extends it to local Bundle Adjustment (BA), enabling simultaneous pose refinement and map correction. Experimental results demonstrate that CURL-SLAM achieves state-of-the-art 3D mapping quality and competitive LiDAR trajectory accuracy, delivering sensor-rate real-time performance (10 Hz) on a CPU. We will release the CURL-SLAM implementation to the community.
☆ Control of Marine Robots in the Era of Data-Driven Intelligence
The control of marine robots has long relied on model-based methods grounded in classical and modern control theory. However, the nonlinearity and uncertainties inherent in robot dynamics, coupled with the complexity of marine environments, have revealed the limitations of conventional control methods. The rapid evolution of machine learning has opened new avenues for incorporating data-driven intelligence into control strategies, prompting a paradigm shift in the control of marine robots. This paper provides a review of recent progress in marine robot control through the lens of this emerging paradigm. The review covers both individual and cooperative marine robotic systems, highlighting notable achievements in data-driven control of marine robots and summarizing open-source resources that support the development and validation of advanced control methods. Finally, several future perspectives are outlined to guide research toward achieving high-level autonomy for marine robots in real-world applications. This paper aims to serve as a roadmap toward the next-generation control framework of marine robots in the era of data-driven intelligence.
☆ Knowledge-Driven Imitation Learning: Enabling Generalization Across Diverse Conditions IROS 2025
Imitation learning has emerged as a powerful paradigm in robot manipulation, yet its generalization capability remains constrained by object-specific dependencies in limited expert demonstrations. To address this challenge, we propose knowledge-driven imitation learning, a framework that leverages external structural semantic knowledge to abstract object representations within the same category. We introduce a novel semantic keypoint graph as a knowledge template and develop a coarse-to-fine template-matching algorithm that optimizes both structural consistency and semantic similarity. Evaluated on three real-world robotic manipulation tasks, our method achieves superior performance, surpassing image-based diffusion policies with only one-quarter of the expert demonstrations. Extensive experiments further demonstrate its robustness across novel objects, backgrounds, and lighting conditions. This work pioneers a knowledge-driven approach to data-efficient robotic learning in real-world settings. Code and more materials are available on https://knowledge-driven.github.io/.
comment: IROS 2025
☆ V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling
Ensuring robust planning and decision-making under rare, diverse, and visually degraded long-tail scenarios remains a fundamental challenge for autonomous driving in urban environments. This issue becomes more critical in cooperative settings, where vehicles and infrastructure jointly perceive and reason across complex environments. To address this challenge, we propose V2X-REALM, a vision-language model (VLM)-based framework with adaptive multimodal learning for robust cooperative autonomous driving under long-tail scenarios. V2X-REALM introduces three core innovations: (i) a prompt-driven long-tail scenario generation and evaluation pipeline that leverages foundation models to synthesize realistic long-tail conditions such as snow and fog across vehicle- and infrastructure-side views, enriching training diversity efficiently; (ii) a gated multi-scenario adaptive attention module that modulates the visual stream using scenario priors to recalibrate ambiguous or corrupted features; and (iii) a multi-task scenario-aware contrastive learning objective that improves multimodal alignment and promotes cross-scenario feature separability. Extensive experiments demonstrate that V2X-REALM significantly outperforms existing baselines in robustness, semantic reasoning, safety, and planning accuracy under complex, challenging driving conditions, advancing the scalability of end-to-end cooperative autonomous driving.
☆ STEP Planner: Constructing cross-hierarchical subgoal tree as an embodied long-horizon task planner
The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. In the STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. Within this framework, we develop a hierarchical tree structure that spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby spanning the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to terminate the tree spanning and ensuring each leaf node can be directly converted into a primitive action. Experiments conducted in both the VirtualHome WAH-NL benchmark and on real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates up to 34% (WAH-NL) and 25% (real robot) outperforming SOTA methods.
☆ Fault-Tolerant Spacecraft Attitude Determination using State Estimation Techniques
The extended and unscented Kalman filter, and the particle filter provide a robust framework for fault-tolerant attitude estimation on spacecraft. This paper explores how each filter performs for a large satellite in a low earth orbit. Additionally, various techniques, built on these filters, for fault detection, isolation and recovery from erroneous sensor measurements, are analyzed. Key results from this analysis include filter performance for various fault modes.
comment: 8 pages, 19 figures
☆ Our Coding Adventure: Using LLMs to Personalise the Narrative of a Tangible Programming Robot for Preschoolers
Finding balanced ways to employ Large Language Models (LLMs) in education is a challenge due to inherent risks of poor understanding of the technology and of a susceptible audience. This is particularly so with younger children, who are known to have difficulties with pervasive screen time. Working with a tangible programming robot called Cubetto, we propose an approach to benefit from the capabilities of LLMs by employing such models in the preparation of personalised storytelling, necessary for preschool children to get accustomed to the practice of commanding the robot. We engage in action research to develop an early version of a formalised process to rapidly prototype game stories for Cubetto. Our approach has both reproducible results, because it employs open weight models, and is model-agnostic, because we test it with 5 different LLMs. We document on one hand the process, the used materials and prompts, and on the other the learning experience and outcomes. We deem the generation successful for the intended purposes of using the results as a teacher aid. Testing the models on 4 different task scenarios, we encounter issues of consistency and hallucinations and document the corresponding evaluation process and attempts (some successful and some not) to overcome these issues. Importantly, the process does not expose children to LLMs directly. Rather, the technology is used to help teachers easily develop personalised narratives on children's preferred topics. We believe our method is adequate for preschool classes and we are planning to further experiment in real-world educational settings.
comment: accepted at D-SAIL Workshop - Transformative Curriculum Design: Digitalization, Sustainability, and AI Literacy for 21st Century Learning
☆ ThermalDiffusion: Visual-to-Thermal Image-to-Image Translation for Autonomous Navigation ICRA 2025
Autonomous systems rely on sensors to estimate the environment around them. However, cameras, LiDARs, and RADARs have their own limitations. In nighttime or degraded environments such as fog, mist, or dust, thermal cameras can provide valuable information regarding the presence of objects of interest due to their heat signature. They make it easy to identify humans and vehicles that are usually at higher temperatures compared to their surroundings. In this paper, we focus on the adaptation of thermal cameras for robotics and automation, where the biggest hurdle is the lack of data. Several multi-modal datasets are available for driving robotics research in tasks such as scene segmentation, object detection, and depth estimation, which are the cornerstone of autonomous systems. However, they are found to be lacking in thermal imagery. Our paper proposes a solution to augment these datasets with synthetic thermal data to enable widespread and rapid adaptation of thermal cameras. We explore the use of conditional diffusion models to convert existing RGB images to thermal images using self-attention to learn the thermal properties of real-world objects.
comment: Accepted at Thermal Infrared in Robotics (TIRO) Workshop, ICRA 2025
☆ Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends
Vision-language-action (VLA) models extend vision-language models (VLM) by integrating action generation modules for robotic manipulation. Leveraging strengths of VLM in vision perception and instruction understanding, VLA models exhibit promising generalization across diverse manipulation tasks. However, applications demanding high precision and accuracy reveal performance gaps without further adaptation. Evidence from multiple domains highlights the critical role of post-training to align foundational models with downstream applications, spurring extensive research on post-training VLA models. VLA model post-training aims to address the challenge of improving an embodiment's ability to interact with the environment for the given tasks, analogous to the process of humans motor skills acquisition. Accordingly, this paper reviews post-training strategies for VLA models through the lens of human motor learning, focusing on three dimensions: environments, embodiments, and tasks. A structured taxonomy is introduced aligned with human learning mechanisms: (1) enhancing environmental perception, (2) improving embodiment awareness, (3) deepening task comprehension, and (4) multi-component integration. Finally, key challenges and trends in post-training VLA models are identified, establishing a conceptual framework to guide future research. This work delivers both a comprehensive overview of current VLA model post-training methods from a human motor learning perspective and practical insights for VLA model development. (Project website: https://github.com/AoqunJin/Awesome-VLA-Post-Training)
☆ Cooperative Circumnavigation for Multi-Quadrotor Systems via Onboard Sensing
A cooperative circumnavigation framework is proposed for multi-quadrotor systems to enclose and track a moving target without reliance on external localization systems. The distinct relationships between quadrotor-quadrotor and quadrotor-target interactions are evaluated using a heterogeneous perception strategy and corresponding state estimation algorithms. A modified Kalman filter is developed to fuse visual-inertial odometry with range measurements to enhance the accuracy of inter-quadrotor relative localization. An event-triggered distributed Kalman filter is designed to achieve robust target state estimation under visual occlusion by incorporating neighbor measurements and estimated inter-quadrotor relative positions. Using the estimation results, a cooperative circumnavigation controller is constructed, leveraging an oscillator-based autonomous formation flight strategy. We conduct extensive indoor and outdoor experiments to validate the efficiency of the proposed circumnavigation framework in occluded environments. Furthermore, a quadrotor failure experiment highlights the inherent fault tolerance property of the proposed framework, underscoring its potential for deployment in search-and-rescue operations.
comment: 8 Pages, 7 figures. Accepted by RA-L
☆ Effect of Haptic Feedback on Avoidance Behavior and Visual Exploration in Dynamic VR Pedestrian Environment
Human crowd simulation in virtual reality (VR) is a powerful tool with potential applications including emergency evacuation training and assessment of building layout. While haptic feedback in VR enhances immersive experience, its effect on walking behavior in dense and dynamic pedestrian flows is unknown. Through a user study, we investigated how haptic feedback changes user walking motion in crowded pedestrian flows in VR. The results indicate that haptic feedback changed users' collision avoidance movements, as measured by increased walking trajectory length and change in pelvis angle. The displacements of users' lateral position and pelvis angle were also increased in the instantaneous response to a collision with a non-player character (NPC), even when the NPC was inside the field of view. Haptic feedback also enhanced users' awareness and visual exploration when an NPC approached from the side and back. Furthermore, variation in walking speed was increased by the haptic feedback. These results suggested that the haptic feedback enhanced users' sensitivity to a collision in VR environment.
♻ ☆ Finding the Easy Way Through -- the Probabilistic Gap Planner for Social Robot Navigation
In Social Robot Navigation, autonomous agents need to resolve many sequential interactions with other agents. State-of-the art planners can efficiently resolve the next, imminent interaction cooperatively and do not focus on longer planning horizons. This makes it hard to maneuver scenarios where the agent needs to select a good strategy to find gaps or channels in the crowd. We propose to decompose trajectory planning into two separate steps: Conflict avoidance for finding good, macroscopic trajectories, and cooperative collision avoidance (CCA) for resolving the next interaction optimally. We propose the Probabilistic Gap Planner (PGP) as a conflict avoidance planner. PGP modifies an established probabilistic collision risk model to include a general assumption of cooperativity. PGP biases the short-term CCA planner to head towards gaps in the crowd. In extensive simulations with crowds of varying density, we show that using PGP in addition to state-of-the-art CCA planners improves the agents' performance: On average, agents keep more space to others, create less tension, and cause fewer collisions. This typically comes at the expense of slightly longer paths. PGP runs in real-time on WaPOCHI mobile robot by Honda R&D.
♻ ☆ ReactEMG: Zero-Shot, Low-Latency Intent Detection via sEMG
Surface electromyography (sEMG) signals show promise for effective human-computer interfaces, particularly in rehabilitation and prosthetics. However, challenges remain in developing systems that respond quickly and reliably to user intent, across different subjects and without requiring time-consuming calibration. In this work, we propose a framework for EMG-based intent detection that addresses these challenges. Unlike traditional gesture recognition models that wait until a gesture is completed before classifying it, our approach uses a segmentation strategy to assign intent labels at every timestep as the gesture unfolds. We introduce a novel masked modeling strategy that aligns muscle activations with their corresponding user intents, enabling rapid onset detection and stable tracking of ongoing gestures. In evaluations against baseline methods, considering both accuracy and stability for device control, our approach surpasses state-of-the-art performance in zero-shot transfer conditions, demonstrating its potential for wearable robotics and next-generation prosthetic systems. Our project page is available at: https://reactemg.github.io
♻ ☆ Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception IROS 2025
Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the downstream task of robotic grasping. We propose a method for training lightweight, deep networks to predict whether a grasp guided by an image-based pose estimate will succeed before that grasp is attempted. We generate training data for our networks via object pose estimation on real images and simulated grasping. We also find that, despite high object variability in grasping trials, networks benefit from training on all objects jointly, suggesting that a diverse variety of objects can nevertheless contribute to the same goal.
comment: Accepted to IROS 2025
♻ ☆ Learning Efficient and Robust Language-conditioned Manipulation using Textual-Visual Relevancy and Equivariant Language Mapping
Controlling robots through natural language is pivotal for enhancing human-robot collaboration and synthesizing complex robot behaviors. Recent works that are trained on large robot datasets show impressive generalization abilities. However, such pretrained methods are (1) often fragile to unseen scenarios, and (2) expensive to adapt to new tasks. This paper introduces Grounded Equivariant Manipulation (GEM), a robust yet efficient approach that leverages pretrained vision-language models with equivariant language mapping for language-conditioned manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and generalization ability across diverse tasks in both simulation and the real world. GEM achieves similar or higher performance with orders of magnitude fewer robot data compared with major data-efficient baselines such as CLIPort and VIMA. Finally, our approach demonstrates greater robustness compared to large VLA model, e.g, OpenVLA, at correctly interpreting natural language commands on unseen objects and poses. Code, data, and training details are available https://saulbatman.github.io/gem_page/
♻ ☆ 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors IROS 2025
Crop yield estimation is a relevant problem in agriculture, because an accurate yield estimate can support farmers' decisions on harvesting or precision intervention. Robots can help to automate this process. To do so, they need to be able to perceive the surrounding environment to identify target objects such as trees and plants. In this paper, we introduce a novel approach to address the problem of hierarchical panoptic segmentation of apple orchards on 3D data from different sensors. Our approach is able to simultaneously provide semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of trees (a trunk with its fruits). This allows us to identify relevant information such as individual plants, fruits, and trunks, and capture the relationship among them, such as precisely estimate the number of fruits associated to each tree in an orchard. To efficiently evaluate our approach for hierarchical panoptic segmentation, we provide a dataset designed specifically for this task. Our dataset is recorded in Bonn, Germany, in a real apple orchard with a variety of sensors, spanning from a terrestrial laser scanner to a RGB-D camera mounted on different robots platforms. The experiments show that our approach surpasses state-of-the-art approaches in 3D panoptic segmentation in the agricultural domain, while also providing full hierarchical panoptic segmentation. Our dataset is publicly available at https://www.ipb.uni-bonn.de/data/hops/. The open-source implementation of our approach is available at https://github.com/PRBonn/hapt3D.
comment: Accepted to IROS 2025
♻ ☆ Rapid Gyroscope Calibration: A Deep Learning Approach
Low-cost gyroscope calibration is essential for ensuring the accuracy and reliability of gyroscope measurements. Stationary calibration estimates the deterministic parts of measurement errors. To this end, a common practice is to average the gyroscope readings during a predefined period and estimate the gyroscope bias. Calibration duration plays a crucial role in performance, therefore, longer periods are preferred. However, some applications require quick startup times and calibration is therefore allowed only for a short time. In this work, we focus on reducing low-cost gyroscope calibration time using deep learning methods. We propose an end-to-end convolutional neural network for the application of gyroscope calibration. We explore the possibilities of using multiple real and virtual gyroscopes to improve the calibration performance of single gyroscopes. To train and validate our approach, we recorded a dataset consisting of 186.6 hours of gyroscope readings, using 36 gyroscopes of four different brands. We also created a virtual dataset consisting of simulated gyroscope readings. The six datasets were used to evaluate our proposed approach. One of our key achievements in this work is reducing gyroscope calibration time by up to 89% using three low-cost gyroscopes. Our dataset is publicly available to allow reproducibility of our work and to increase research in the field.
comment: 10 Pages, 14 Figures
♻ ☆ PCF-Grasp: Converting Point Completion to Geometry Feature to Enhance 6-DoF Grasp
The 6-Degree of Freedom (DoF) grasp method based on point clouds has shown significant potential in enabling robots to grasp target objects. However, most existing methods are based on the point clouds (2.5D points) generated from single-view depth images. These point clouds only have one surface side of the object providing incomplete geometry information, which mislead the grasping algorithm to judge the shape of the target object, resulting in low grasping accuracy. Humans can accurately grasp objects from a single view by leveraging their geometry experience to estimate object shapes. Inspired by humans, we propose a novel 6-DoF grasping framework that converts the point completion results as object shape features to train the 6-DoF grasp network. Here, point completion can generate approximate complete points from the 2.5D points similar to the human geometry experience, and converting it as shape features is the way to utilize it to improve grasp efficiency. Furthermore, due to the gap between the network generation and actual execution, we integrate a score filter into our framework to select more executable grasp proposals for the real robot. This enables our method to maintain a high grasp quality in any camera viewpoint. Extensive experiments demonstrate that utilizing complete point features enables the generation of significantly more accurate grasp proposals and the inclusion of a score filter greatly enhances the credibility of real-world robot grasping. Our method achieves a 17.8\% success rate higher than the state-of-the-art method in real-world experiments.
♻ ☆ RAMBO: RL-augmented Model-based Whole-body Control for Loco-manipulation
Loco-manipulation, physical interaction of various objects that is concurrently coordinated with locomotion, remains a major challenge for legged robots due to the need for both precise end-effector control and robustness to unmodeled dynamics. While model-based controllers provide precise planning via online optimization, they are limited by model inaccuracies. In contrast, learning-based methods offer robustness, but they struggle with precise modulation of interaction forces. We introduce RAMBO, a hybrid framework that integrates model-based whole-body control within a feedback policy trained with reinforcement learning. The model-based module generates feedforward torques by solving a quadratic program, while the policy provides feedback corrective terms to enhance robustness. We validate our framework on a quadruped robot across a diverse set of real-world loco-manipulation tasks, such as pushing a shopping cart, balancing a plate, and holding soft objects, in both quadrupedal and bipedal walking. Our experiments demonstrate that RAMBO enables precise manipulation capabilities while achieving robust and dynamic locomotion.
♻ ☆ The Starlink Robot: A Platform and Dataset for Mobile Satellite Communication
The integration of satellite communication into mobile devices represents a paradigm shift in connectivity, yet the performance characteristics under motion and environmental occlusion remain poorly understood. We present the Starlink Robot, the first mobile robotic platform equipped with Starlink satellite internet, comprehensive sensor suite including upward-facing camera, LiDAR, and IMU, designed to systematically study satellite communication performance during movement. Our multi-modal dataset captures synchronized communication metrics, motion dynamics, sky visibility, and 3D environmental context across diverse scenarios including steady-state motion, variable speeds, and different occlusion conditions. This platform and dataset enable researchers to develop motion-aware communication protocols, predict connectivity disruptions, and optimize satellite communication for emerging mobile applications from smartphones to autonomous vehicles. The project is available at https://github.com/StarlinkRobot.
♻ ☆ CREStE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance
We introduce CREStE, a scalable learning-based mapless navigation framework to address the open-world generalization and robustness challenges of outdoor urban navigation. Key to achieving this is learning perceptual representations that generalize to open-set factors (e.g. novel semantic classes, terrains, dynamic entities) and inferring expert-aligned navigation costs from limited demonstrations. CREStE addresses both these issues, introducing 1) a visual foundation model (VFM) distillation objective for learning open-set structured bird's-eye-view perceptual representations, and 2) counterfactual inverse reinforcement learning (IRL), a novel active learning formulation that uses counterfactual trajectory demonstrations to reason about the most important cues when inferring navigation costs. We evaluate CREStE on the task of kilometer-scale mapless navigation in a variety of city, offroad, and residential environments and find that it outperforms all state-of-the-art approaches with 70% fewer human interventions, including a 2-kilometer mission in an unseen environment with just 1 intervention; showcasing its robustness and effectiveness for long-horizon mapless navigation. Videos and additional materials can be found on the project page: https://amrl.cs.utexas.edu/creste
comment: 18 pages, 10 figures, 5 tables
♻ ☆ IMPACT: Behavioral Intention-aware Multimodal Trajectory Prediction with Adaptive Context Trimming
While most prior research has focused on improving the precision of multimodal trajectory predictions, the explicit modeling of multimodal behavioral intentions (e.g., yielding, overtaking) remains relatively underexplored. This paper proposes a unified framework that jointly predicts both behavioral intentions and trajectories to enhance prediction accuracy, interpretability, and efficiency. Specifically, we employ a shared context encoder for both intention and trajectory predictions, thereby reducing structural redundancy and information loss. Moreover, we address the lack of ground-truth behavioral intention labels in mainstream datasets (Waymo, Argoverse) by auto-labeling these datasets, thus advancing the community's efforts in this direction. We further introduce a vectorized occupancy prediction module that infers the probability of each map polyline being occupied by the target vehicle's future trajectory. By leveraging these intention and occupancy prediction priors, our method conducts dynamic, modality-dependent pruning of irrelevant agents and map polylines in the decoding stage, effectively reducing computational overhead and mitigating noise from non-critical elements. Our approach ranks first among LiDAR-free methods on the Waymo Motion Dataset and achieves first place on the Waymo Interactive Prediction Dataset. Remarkably, even without model ensembling, our single-model framework improves the soft mean average precision (softmAP) by 10 percent compared to the second-best method in the Waymo Interactive Prediction Leaderboard. Furthermore, the proposed framework has been successfully deployed on real vehicles, demonstrating its practical effectiveness in real-world applications.
comment: under review
♻ ☆ EFEAR-4D: Ego-Velocity Filtering for Efficient and Accurate 4D radar Odometry
Odometry is a crucial component for successfully implementing autonomous navigation, relying on sensors such as cameras, LiDARs and IMUs. However, these sensors may encounter challenges in extreme weather conditions, such as snowfall and fog. The emergence of FMCW radar technology offers the potential for robust perception in adverse conditions. As the latest generation of FWCW radars, the 4D mmWave radar provides point cloud with range, azimuth, elevation, and Doppler velocity information, despite inherent sparsity and noises in the point cloud. In this paper, we propose EFEAR-4D, an accurate, highly efficient, and learning-free method for large-scale 4D radar odometry estimation. EFEAR-4D exploits Doppler velocity information delicately for robust ego-velocity estimation, resulting in a highly accurate prior guess. EFEAR-4D maintains robustness against point-cloud sparsity and noises across diverse environments through dynamic object removal and effective region-wise feature extraction. Extensive experiments on two publicly available 4D radar datasets demonstrate state-of-the-art reliability and localization accuracy of EFEAR-4D under various conditions. Furthermore, we have collected a dataset following the same route but varying installation heights of the 4D radar, emphasizing the significant impact of radar height on point cloud quality - a crucial consideration for real-world deployments. Our algorithm and dataset will be available soon at https://github.com/CLASS-Lab/EFEAR-4D.
♻ ☆ What Foundation Models can Bring for Robot Learning in Manipulation : A Survey
The realization of universal robots is an ultimate goal of researchers. However, a key hurdle in achieving this goal lies in the robots' ability to manipulate objects in their unstructured surrounding environments according to different tasks. The learning-based approach is considered an effective way to address generalization. The impressive performance of foundation models in the fields of computer vision and natural language suggests the potential of embedding foundation models into manipulation tasks as a viable path toward achieving general manipulation capability. However, we believe achieving general manipulation capability requires an overarching framework akin to auto driving. This framework should encompass multiple functional modules, with different foundation models assuming distinct roles in facilitating general manipulation capability. This survey focuses on the contributions of foundation models to robot learning for manipulation. We propose a comprehensive framework and detail how foundation models can address challenges in each module of the framework. What's more, we examine current approaches, outline challenges, suggest future research directions, and identify potential risks associated with integrating foundation models into this domain.
Robotics 46
☆ DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
We propose DemoDiffusion, a simple and scalable method for enabling robots to perform manipulation tasks in natural environments by imitating a single human demonstration. Our approach is based on two key insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Our approach avoids the need for online reinforcement learning or paired human-robot data, enabling robust adaptation to new tasks and scenes with minimal manual effort. Experiments in both simulation and real-world settings show that DemoDiffusion outperforms both the base policy and the retargeted trajectory, enabling the robot to succeed even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/
comment: Preprint(17 pages). Under Review
☆ A Computationally Aware Multi Objective Framework for Camera LiDAR Calibration
Accurate extrinsic calibration between LiDAR and camera sensors is important for reliable perception in autonomous systems. In this paper, we present a novel multi-objective optimization framework that jointly minimizes the geometric alignment error and computational cost associated with camera-LiDAR calibration. We optimize two objectives: (1) error between projected LiDAR points and ground-truth image edges, and (2) a composite metric for computational cost reflecting runtime and resource usage. Using the NSGA-II \cite{deb2002nsga2} evolutionary algorithm, we explore the parameter space defined by 6-DoF transformations and point sampling rates, yielding a well-characterized Pareto frontier that exposes trade-offs between calibration fidelity and resource efficiency. Evaluations are conducted on the KITTI dataset using its ground-truth extrinsic parameters for validation, with results verified through both multi-objective and constrained single-objective baselines. Compared to existing gradient-based and learned calibration methods, our approach demonstrates interpretable, tunable performance with lower deployment overhead. Pareto-optimal configurations are further analyzed for parameter sensitivity and innovation insights. A preference-based decision-making strategy selects solutions from the Pareto knee region to suit the constraints of the embedded system. The robustness of calibration is tested across variable edge-intensity weighting schemes, highlighting optimal balance points. Although real-time deployment on embedded platforms is deferred to future work, this framework establishes a scalable and transparent method for calibration under realistic misalignment and resource-limited conditions, critical for long-term autonomy, particularly in SAE L3+ vehicles receiving OTA updates.
comment: 16 pages, 10 figures
☆ Task Allocation of UAVs for Monitoring Missions via Hardware-in-the-Loop Simulation and Experimental Validation
This study addresses the optimisation of task allocation for Unmanned Aerial Vehicles (UAVs) within industrial monitoring missions. The proposed methodology integrates a Genetic Algorithms (GA) with a 2-Opt local search technique to obtain a high-quality solution. Our approach was experimentally validated in an industrial zone to demonstrate its efficacy in real-world scenarios. Also, a Hardware-in-the-loop (HIL) simulator for the UAVs team is introduced. Moreover, insights about the correlation between the theoretical cost function and the actual battery consumption and time of flight are deeply analysed. Results show that the considered costs for the optimisation part of the problem closely correlate with real-world data, confirming the practicality of the proposed approach.
☆ Learning-Based Distance Estimation for 360° Single-Sensor Setups
Accurate distance estimation is a fundamental challenge in robotic perception, particularly in omnidirectional imaging, where traditional geometric methods struggle with lens distortions and environmental variability. In this work, we propose a neural network-based approach for monocular distance estimation using a single 360{\deg} fisheye lens camera. Unlike classical trigonometric techniques that rely on precise lens calibration, our method directly learns and infers the distance of objects from raw omnidirectional inputs, offering greater robustness and adaptability across diverse conditions. We evaluate our approach on three 360{\deg} datasets (LOAF, ULM360, and a newly captured dataset Boat360), each representing distinct environmental and sensor setups. Our experimental results demonstrate that the proposed learning-based model outperforms traditional geometry-based methods and other learning baselines in both accuracy and robustness. These findings highlight the potential of deep learning for real-time omnidirectional distance estimation, making our approach particularly well-suited for low-cost applications in robotics, autonomous navigation, and surveillance.
comment: Submitted to ECMR 2025
☆ Communication-Aware Map Compression for Online Path-Planning: A Rate-Distortion Approach
This paper addresses the problem of collaborative navigation in an unknown environment, where two robots, referred to in the sequel as the Seeker and the Supporter, traverse the space simultaneously. The Supporter assists the Seeker by transmitting a compressed representation of its local map under bandwidth constraints to support the Seeker's path-planning task. We introduce a bit-rate metric based on the expected binary codeword length to quantify communication cost. Using this metric, we formulate the compression design problem as a rate-distortion optimization problem that determines when to communicate, which regions of the map should be included in the compressed representation, and at what resolution (i.e., quantization level) they should be encoded. Our formulation allows different map regions to be encoded at varying quantization levels based on their relevance to the Seeker's path-planning task. We demonstrate that the resulting optimization problem is convex, and admits a closed-form solution known in the information theory literature as reverse water-filling, enabling efficient, low-computation, and real-time implementation. Additionally, we show that the Seeker can infer the compression decisions of the Supporter independently, requiring only the encoded map content and not the encoding policy itself to be transmitted, thereby reducing communication overhead. Simulation results indicate that our method effectively constructs compressed, task-relevant map representations, both in content and resolution, that guide the Seeker's planning decisions even under tight bandwidth limitations.
☆ HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction
Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this Github repository: https://github.com/interaction-lab/HRIBench.
comment: Accepted to the 19th International Symposium on Experimental Robotics (ISER 2025)
☆ Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation
Learning-based robotic systems demand rigorous validation to assure reliable performance, but extensive real-world testing is often prohibitively expensive, and if conducted may still yield insufficient data for high-confidence guarantees. In this work, we introduce a general estimation framework that leverages paired data across test platforms, e.g., paired simulation and real-world observations, to achieve better estimates of real-world metrics via the method of control variates. By incorporating cheap and abundant auxiliary measurements (for example, simulator outputs) as control variates for costly real-world samples, our method provably reduces the variance of Monte Carlo estimates and thus requires significantly fewer real-world samples to attain a specified confidence bound on the mean performance. We provide theoretical analysis characterizing the variance and sample-efficiency improvement, and demonstrate empirically in autonomous driving and quadruped robotics settings that our approach achieves high-probability bounds with markedly improved sample efficiency. Our technique can lower the real-world testing burden for validating the performance of the stack, thereby enabling more efficient and cost-effective experimental evaluation of robotic systems.
☆ Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos
Modern image-based object detection models, such as YOLOv7, primarily process individual frames independently, thus ignoring valuable temporal context naturally present in videos. Meanwhile, existing video-based detection methods often introduce complex temporal modules, significantly increasing model size and computational complexity. In practical applications such as surveillance and autonomous driving, transient challenges including motion blur, occlusions, and abrupt appearance changes can severely degrade single-frame detection performance. To address these issues, we propose a straightforward yet highly effective strategy: stacking multiple consecutive frames as input to a YOLO-based detector while supervising only the output corresponding to a single target frame. This approach leverages temporal information with minimal modifications to existing architectures, preserving simplicity, computational efficiency, and real-time inference capability. Extensive experiments on the challenging MOT20Det and our BOAT360 datasets demonstrate that our method improves detection robustness, especially for lightweight models, effectively narrowing the gap between compact and heavy detection networks. Additionally, we contribute the BOAT360 benchmark dataset, comprising annotated fisheye video sequences captured from a boat, to support future research in multi-frame video object detection in challenging real-world scenarios.
comment: Submitted to ECMR 2025
☆ Critical Anatomy-Preserving & Terrain-Augmenting Navigation (CAPTAiN): Application to Laminectomy Surgical Education
Surgical training remains a crucial milestone in modern medicine, with procedures such as laminectomy exemplifying the high risks involved. Laminectomy drilling requires precise manual control to mill bony tissue while preserving spinal segment integrity and avoiding breaches in the dura: the protective membrane surrounding the spinal cord. Despite unintended tears occurring in up to 11.3% of cases, no assistive tools are currently utilized to reduce this risk. Variability in patient anatomy further complicates learning for novice surgeons. This study introduces CAPTAiN, a critical anatomy-preserving and terrain-augmenting navigation system that provides layered, color-coded voxel guidance to enhance anatomical awareness during spinal drilling. CAPTAiN was evaluated against a standard non-navigated approach through 110 virtual laminectomies performed by 11 orthopedic residents and medical students. CAPTAiN significantly improved surgical completion rates of target anatomy (87.99% vs. 74.42%) and reduced cognitive load across multiple NASA-TLX domains. It also minimized performance gaps across experience levels, enabling novices to perform on par with advanced trainees. These findings highlight CAPTAiN's potential to optimize surgical execution and support skill development across experience levels. Beyond laminectomy, it demonstrates potential for broader applications across various surgical and drilling procedures, including those in neurosurgery, otolaryngology, and other medical fields.
☆ Behavior Foundation Model: Towards Next-Generation Whole-Body Control System of Humanoid Robots
Humanoid robots are drawing significant attention as versatile platforms for complex motor control, human-robot interaction, and general-purpose physical intelligence. However, achieving efficient whole-body control (WBC) in humanoids remains a fundamental challenge due to sophisticated dynamics, underactuation, and diverse task requirements. While learning-based controllers have shown promise for complex tasks, their reliance on labor-intensive and costly retraining for new scenarios limits real-world applicability. To address these limitations, behavior(al) foundation models (BFMs) have emerged as a new paradigm that leverages large-scale pretraining to learn reusable primitive skills and behavioral priors, enabling zero-shot or rapid adaptation to a wide range of downstream tasks. In this paper, we present a comprehensive overview of BFMs for humanoid WBC, tracing their development across diverse pre-training pipelines. Furthermore, we discuss real-world applications, current limitations, urgent challenges, and future opportunities, positioning BFMs as a key approach toward scalable and general-purpose humanoid intelligence. Finally, we provide a curated and long-term list of BFM papers and projects to facilitate more subsequent research, which is available at https://github.com/yuanmingqi/awesome-bfm-papers.
comment: 19 pages, 8 figures
☆ EANS: Reducing Energy Consumption for UAV with an Environmental Adaptive Navigation Strategy
Unmanned Aerial Vehicles (UAVS) are limited by the onboard energy. Refinement of the navigation strategy directly affects both the flight velocity and the trajectory based on the adjustment of key parameters in the UAVS pipeline, thus reducing energy consumption. However, existing techniques tend to adopt static and conservative strategies in dynamic scenarios, leading to inefficient energy reduction. Dynamically adjusting the navigation strategy requires overcoming the challenges including the task pipeline interdependencies, the environmental-strategy correlations, and the selecting parameters. To solve the aforementioned problems, this paper proposes a method to dynamically adjust the navigation strategy of the UAVS by analyzing its dynamic characteristics and the temporal characteristics of the autonomous navigation pipeline, thereby reducing UAVS energy consumption in response to environmental changes. We compare our method with the baseline through hardware-in-the-loop (HIL) simulation and real-world experiments, showing our method 3.2X and 2.6X improvements in mission time, 2.4X and 1.6X improvements in energy, respectively.
☆ A Review of Personalisation in Human-Robot Collaboration and Future Perspectives Towards Industry 5.0
The shift in research focus from Industry 4.0 to Industry 5.0 (I5.0) promises a human-centric workplace, with social and well-being values at the centre of technological implementation. Human-Robot Collaboration (HRC) is a core aspect of I5.0 development, with an increase in adaptive and personalised interactions and behaviours. This review investigates recent advancements towards personalised HRC, where user-centric adaption is key. There is a growing trend for adaptable HRC research, however there lacks a consistent and unified approach. The review highlights key research trends on which personal factors are considered, workcell and interaction design, and adaptive task completion. This raises various key considerations for future developments, particularly around the ethical and regulatory development of personalised systems, which are discussed in detail.
comment: Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
☆ Learn to Position -- A Novel Meta Method for Robotic Positioning
Absolute positioning accuracy is a vital specification for robots. Achieving high position precision can be challenging due to the presence of various sources of errors. Meanwhile, accurately depicting these errors is difficult due to their stochastic nature. Vision-based methods are commonly integrated to guide robotic positioning, but their performance can be highly impacted by inevitable occlusions or adverse lighting conditions. Drawing on the aforementioned considerations, a vision-free, model-agnostic meta-method for compensating robotic position errors is proposed, which maximizes the probability of accurate robotic position via interactive feedback. Meanwhile, the proposed method endows the robot with the capability to learn and adapt to various position errors, which is inspired by the human's instinct for grasping under uncertainties. Furthermore, it is a self-learning and self-adaptive method able to accelerate the robotic positioning process as more examples are incorporated and learned. Empirical studies validate the effectiveness of the proposed method. As of the writing of this paper, the proposed meta search method has already been implemented in a robotic-based assembly line for odd-form electronic components.
☆ Multimodal Behaviour Trees for Robotic Laboratory Task Automation ICRA 2025
Laboratory robotics offer the capability to conduct experiments with a high degree of precision and reproducibility, with the potential to transform scientific research. Trivial and repeatable tasks; e.g., sample transportation for analysis and vial capping are well-suited for robots; if done successfully and reliably, chemists could contribute their efforts towards more critical research activities. Currently, robots can perform these tasks faster than chemists, but how reliable are they? Improper capping could result in human exposure to toxic chemicals which could be fatal. To ensure that robots perform these tasks as accurately as humans, sensory feedback is required to assess the progress of task execution. To address this, we propose a novel methodology based on behaviour trees with multimodal perception. Along with automating robotic tasks, this methodology also verifies the successful execution of the task, a fundamental requirement in safety-critical environments. The experimental evaluation was conducted on two lab tasks: sample vial capping and laboratory rack insertion. The results show high success rate, i.e., 88% for capping and 92% for insertion, along with strong error detection capabilities. This ultimately proves the robustness and reliability of our approach and that using multimodal behaviour trees should pave the way towards the next generation of robotic chemists.
comment: 7 pages, 5 figures, accepted and presented in ICRA 2025
☆ SPARK: Graph-Based Online Semantic Integration System for Robot Task Planning
The ability to update information acquired through various means online during task execution is crucial for a general-purpose service robot. This information includes geometric and semantic data. While SLAM handles geometric updates on 2D maps or 3D point clouds, online updates of semantic information remain unexplored. We attribute the challenge to the online scene graph representation, for its utility and scalability. Building on previous works regarding offline scene graph representations, we study online graph representations of semantic information in this work. We introduce SPARK: Spatial Perception and Robot Knowledge Integration. This framework extracts semantic information from environment-embedded cues and updates the scene graph accordingly, which is then used for subsequent task planning. We demonstrate that graph representations of spatial relationships enhance the robot system's ability to perform tasks in dynamic environments and adapt to unconventional spatial cues, like gestures.
☆ Enhanced Robotic Navigation in Deformable Environments using Learning from Demonstration and Dynamic Modulation IROS 2025
This paper presents a novel approach for robot navigation in environments containing deformable obstacles. By integrating Learning from Demonstration (LfD) with Dynamical Systems (DS), we enable adaptive and efficient navigation in complex environments where obstacles consist of both soft and hard regions. We introduce a dynamic modulation matrix within the DS framework, allowing the system to distinguish between traversable soft regions and impassable hard areas in real-time, ensuring safe and flexible trajectory planning. We validate our method through extensive simulations and robot experiments, demonstrating its ability to navigate deformable environments. Additionally, the approach provides control over both trajectory and velocity when interacting with deformable objects, including at intersections, while maintaining adherence to the original DS trajectory and dynamically adapting to obstacles for smooth and reliable navigation.
comment: Accepted to IROS 2025
☆ CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition
We introduce CARMA, a system for situational grounding in human-robot group interactions. Effective collaboration in such group settings requires situational awareness based on a consistent representation of present persons and objects coupled with an episodic abstraction of events regarding actors and manipulated objects. This calls for a clear and consistent assignment of instances, ensuring that robots correctly recognize and track actors, objects, and their interactions over time. To achieve this, CARMA uniquely identifies physical instances of such entities in the real world and organizes them into grounded triplets of actors, objects, and actions. To validate our approach, we conducted three experiments, where multiple humans and a robot interact: collaborative pouring, handovers, and sorting. These scenarios allow the assessment of the system's capabilities as to role distinction, multi-actor awareness, and consistent instance identification. Our experiments demonstrate that the system can reliably generate accurate actor-action-object triplets, providing a structured and robust foundation for applications requiring spatiotemporal reasoning and situated decision-making in collaborative settings.
☆ PIMBS: Efficient Body Schema Learning for Musculoskeletal Humanoids with Physics-Informed Neural Networks
Musculoskeletal humanoids are robots that closely mimic the human musculoskeletal system, offering various advantages such as variable stiffness control, redundancy, and flexibility. However, their body structure is complex, and muscle paths often significantly deviate from geometric models. To address this, numerous studies have been conducted to learn body schema, particularly the relationships among joint angles, muscle tension, and muscle length. These studies typically rely solely on data collected from the actual robot, but this data collection process is labor-intensive, and learning becomes difficult when the amount of data is limited. Therefore, in this study, we propose a method that applies the concept of Physics-Informed Neural Networks (PINNs) to the learning of body schema in musculoskeletal humanoids, enabling high-accuracy learning even with a small amount of data. By utilizing not only data obtained from the actual robot but also the physical laws governing the relationship between torque and muscle tension under the assumption of correct joint structure, more efficient learning becomes possible. We apply the proposed method to both simulation and an actual musculoskeletal humanoid and discuss its effectiveness and characteristics.
comment: Accepted at IEEE Robotics and Automation Letters
☆ Building Forest Inventories with Autonomous Legged Robots -- System, Lessons, and Challenges Ahead
Legged robots are increasingly being adopted in industries such as oil, gas, mining, nuclear, and agriculture. However, new challenges exist when moving into natural, less-structured environments, such as forestry applications. This paper presents a prototype system for autonomous, under-canopy forest inventory with legged platforms. Motivated by the robustness and mobility of modern legged robots, we introduce a system architecture which enabled a quadruped platform to autonomously navigate and map forest plots. Our solution involves a complete navigation stack for state estimation, mission planning, and tree detection and trait estimation. We report the performance of the system from trials executed over one and a half years in forests in three European countries. Our results with the ANYmal robot demonstrate that we can survey plots up to 1 ha plot under 30 min, while also identifying trees with typical DBH accuracy of 2cm. The findings of this project are presented as five lessons and challenges. Particularly, we discuss the maturity of hardware development, state estimation limitations, open problems in forest navigation, future avenues for robotic forest inventory, and more general challenges to assess autonomous systems. By sharing these lessons and challenges, we offer insight and new directions for future research on legged robots, navigation systems, and applications in natural environments. Additional videos can be found in https://dynamic.robots.ox.ac.uk/projects/legged-robots
comment: 20 pages, 13 figures. Pre-print version of the accepted paper for IEEE Transactions on Field Robotics (T-FR)
☆ Near Time-Optimal Hybrid Motion Planning for Timber Cranes ICRA 2025
Efficient, collision-free motion planning is essential for automating large-scale manipulators like timber cranes. They come with unique challenges such as hydraulic actuation constraints and passive joints-factors that are seldom addressed by current motion planning methods. This paper introduces a novel approach for time-optimal, collision-free hybrid motion planning for a hydraulically actuated timber crane with passive joints. We enhance the via-point-based stochastic trajectory optimization (VP-STO) algorithm to include pump flow rate constraints and develop a novel collision cost formulation to improve robustness. The effectiveness of the enhanced VP-STO as an optimal single-query global planner is validated by comparison with an informed RRT* algorithm using a time-optimal path parameterization (TOPP). The overall hybrid motion planning is formed by combination with a gradient-based local planner that is designed to follow the global planner's reference and to systematically consider the passive joint dynamics for both collision avoidance and sway damping.
comment: Accepted at ICRA 2025
☆ Real-Time Obstacle Avoidance Algorithms for Unmanned Aerial and Ground Vehicles
The growing use of mobile robots in sectors such as automotive, agriculture, and rescue operations reflects progress in robotics and autonomy. In unmanned aerial vehicles (UAVs), most research emphasizes visual SLAM, sensor fusion, and path planning. However, applying UAVs to search and rescue missions in disaster zones remains underexplored, especially for autonomous navigation. This report develops methods for real-time and secure UAV maneuvering in complex 3D environments, crucial during forest fires. Building upon past research, it focuses on designing navigation algorithms for unfamiliar and hazardous environments, aiming to improve rescue efficiency and safety through UAV-based early warning and rapid response. The work unfolds in phases. First, a 2D fusion navigation strategy is explored, initially for mobile robots, enabling safe movement in dynamic settings. This sets the stage for advanced features such as adaptive obstacle handling and decision-making enhancements. Next, a novel 3D reactive navigation strategy is introduced for collision-free movement in forest fire simulations, addressing the unique challenges of UAV operations in such scenarios. Finally, the report proposes a unified control approach that integrates UAVs and unmanned ground vehicles (UGVs) for coordinated rescue missions in forest environments. Each phase presents challenges, proposes control models, and validates them with mathematical and simulation-based evidence. The study offers practical value and academic insights for improving the role of UAVs in natural disaster rescue operations.
☆ Why Robots Are Bad at Detecting Their Mistakes: Limitations of Miscommunication Detection in Human-Robot Dialogue
Detecting miscommunication in human-robot interaction is a critical function for maintaining user engagement and trust. While humans effortlessly detect communication errors in conversations through both verbal and non-verbal cues, robots face significant challenges in interpreting non-verbal feedback, despite advances in computer vision for recognizing affective expressions. This research evaluates the effectiveness of machine learning models in detecting miscommunications in robot dialogue. Using a multi-modal dataset of 240 human-robot conversations, where four distinct types of conversational failures were systematically introduced, we assess the performance of state-of-the-art computer vision models. After each conversational turn, users provided feedback on whether they perceived an error, enabling an analysis of the models' ability to accurately detect robot mistakes. Despite using state-of-the-art models, the performance barely exceeds random chance in identifying miscommunication, while on a dataset with more expressive emotional content, they successfully identified confused states. To explore the underlying cause, we asked human raters to do the same. They could also only identify around half of the induced miscommunications, similarly to our model. These results uncover a fundamental limitation in identifying robot miscommunications in dialogue: even when users perceive the induced miscommunication as such, they often do not communicate this to their robotic conversation partner. This knowledge can shape expectations of the performance of computer vision models and can help researchers to design better human-robot conversations by deliberately eliciting feedback where needed.
comment: Accepted at the 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2025)
☆ Generating and Customizing Robotic Arm Trajectories using Neural Networks
We introduce a neural network approach for generating and customizing the trajectory of a robotic arm, that guarantees precision and repeatability. To highlight the potential of this novel method, we describe the design and implementation of the technique and show its application in an experimental setting of cognitive robotics. In this scenario, the NICO robot was characterized by the ability to point to specific points in space with precise linear movements, increasing the predictability of the robotic action during its interaction with humans. To achieve this goal, the neural network computes the forward kinematics of the robot arm. By integrating it with a generator of joint angles, another neural network was developed and trained on an artificial dataset created from suitable start and end poses of the robotic arm. Through the computation of angular velocities, the robot was characterized by its ability to perform the movement, and the quality of its action was evaluated in terms of shape and accuracy. Thanks to its broad applicability, our approach successfully generates precise trajectories that could be customized in their shape and adapted to different settings.
comment: The code is released at https://github.com/andylucny/nico2/tree/main/generate
☆ Personalized Mental State Evaluation in Human-Robot Interaction using Federated Learning
With the advent of Industry 5.0, manufacturers are increasingly prioritizing worker well-being alongside mass customization. Stress-aware Human-Robot Collaboration (HRC) plays a crucial role in this paradigm, where robots must adapt their behavior to human mental states to improve collaboration fluency and safety. This paper presents a novel framework that integrates Federated Learning (FL) to enable personalized mental state evaluation while preserving user privacy. By leveraging physiological signals, including EEG, ECG, EDA, EMG, and respiration, a multimodal model predicts an operator's stress level, facilitating real-time robot adaptation. The FL-based approach allows distributed on-device training, ensuring data confidentiality while improving model generalization and individual customization. Results demonstrate that the deployment of an FL approach results in a global model with performance in stress prediction accuracy comparable to a centralized training approach. Moreover, FL allows for enhancing personalization, thereby optimizing human-robot interaction in industrial settings, while preserving data privacy. The proposed framework advances privacy-preserving, adaptive robotics to enhance workforce well-being in smart manufacturing.
☆ PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
We propose PSALM-V, the first autonomous neuro-symbolic learning system able to induce symbolic action semantics (i.e., pre- and post-conditions) in visual environments through interaction. PSALM-V bootstraps reliable symbolic planning without expert action definitions, using LLMs to generate heuristic plans and candidate symbolic semantics. Previous work has explored using large language models to generate action semantics for Planning Domain Definition Language (PDDL)-based symbolic planners. However, these approaches have primarily focused on text-based domains or relied on unrealistic assumptions, such as access to a predefined problem file, full observability, or explicit error messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations. The system iteratively generates and executes plans while maintaining a tree-structured belief over possible action semantics for each action, iteratively refining these beliefs until a goal state is reached. Simulated experiments of task completion in ALFRED demonstrate that PSALM-V increases the plan success rate from 37% (Claude-3.7) to 74% in partially observed setups. Results on two 2D game environments, RTFM and Overcooked-AI, show that PSALM-V improves step efficiency and succeeds in domain induction in multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions for real-world robot BlocksWorld tasks, despite low-level manipulation failures from the robot.
☆ Model-Based Real-Time Pose and Sag Estimation of Overhead Power Lines Using LiDAR for Drone Inspection
Drones can inspect overhead power lines while they remain energized, significantly simplifying the inspection process. However, localizing a drone relative to all conductors using an onboard LiDAR sensor presents several challenges: (1) conductors provide minimal surface for LiDAR beams limiting the number of conductor points in a scan, (2) not all conductors are consistently detected, and (3) distinguishing LiDAR points corresponding to conductors from other objects, such as trees and pylons, is difficult. This paper proposes an estimation approach that minimizes the error between LiDAR measurements and a single geometric model representing the entire conductor array, rather than tracking individual conductors separately. Experimental results, using data from a power line drone inspection, demonstrate that this method achieves accurate tracking, with a solver converging under 50 ms per frame, even in the presence of partial observations, noise, and outliers. A sensitivity analysis shows that the estimation approach can tolerate up to twice as many outlier points as valid conductors measurements.
comment: Submitted to IEEE case 2025
☆ Online Planning for Cooperative Air-Ground Robot Systems with Unknown Fuel Requirements
We consider an online variant of the fuel-constrained UAV routing problem with a ground-based mobile refueling station (FCURP-MRS), where targets incur unknown fuel costs. We develop a two-phase solution: an offline heuristic-based planner computes initial UAV and UGV paths, and a novel online planning algorithm that dynamically adjusts rendezvous points based on real-time fuel consumption during target processing. Preliminary Gazebo simulations demonstrate the feasibility of our approach in maintaining UAV-UGV path validity, ensuring mission completion. Link to video: https://youtu.be/EmpVj-fjqNY
comment: Submitted to RSS (MRS Workshop)
☆ IMA-Catcher: An IMpact-Aware Nonprehensile Catching Framework based on Combined Optimization and Learning
Robotic catching of flying objects typically generates high impact forces that might lead to task failure and potential hardware damages. This is accentuated when the object mass to robot payload ratio increases, given the strong inertial components characterizing this task. This paper aims to address this problem by proposing an implicitly impact-aware framework that accomplishes the catching task in both pre- and post-catching phases. In the first phase, a motion planner generates optimal trajectories that minimize catching forces, while in the second, the object's energy is dissipated smoothly, minimizing bouncing. In particular, in the pre-catching phase, a real-time optimal planner is responsible for generating trajectories of the end-effector that minimize the velocity difference between the robot and the object to reduce impact forces during catching. In the post-catching phase, the robot's position, velocity, and stiffness trajectories are generated based on human demonstrations when catching a series of free-falling objects with unknown masses. A hierarchical quadratic programming-based controller is used to enforce the robot's constraints (i.e., joint and torque limits) and create a stack of tasks that minimizes the reflected mass at the end-effector as a secondary objective. The initial experiments isolate the problem along one dimension to accurately study the effects of each contribution on the metrics proposed. We show how the same task, without velocity matching, would be infeasible due to excessive joint torques resulting from the impact. The addition of reflected mass minimization is then investigated, and the catching height is increased to evaluate the method's robustness. Finally, the setup is extended to catching along multiple Cartesian axes, to prove its generalization in space.
comment: 25 pages, 17 figures, accepted by International Journal of Robotics Research (IJRR)
☆ How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?
Gestures enable non-verbal human-robot communication, especially in noisy environments like agile production. Traditional deep learning-based gesture recognition relies on task-specific architectures using images, videos, or skeletal pose estimates as input. Meanwhile, Vision Foundation Models (VFMs) and Vision Language Models (VLMs) with their strong generalization abilities offer potential to reduce system complexity by replacing dedicated task-specific modules. This study investigates adapting such models for dynamic, full-body gesture recognition, comparing V-JEPA (a state-of-the-art VFM), Gemini Flash 2.0 (a multimodal VLM), and HD-GCN (a top-performing skeleton-based approach). We introduce NUGGET, a dataset tailored for human-robot communication in intralogistics environments, to evaluate the different gesture recognition approaches. In our experiments, HD-GCN achieves best performance, but V-JEPA comes close with a simple, task-specific classification head - thus paving a possible way towards reducing system complexity, by using it as a shared multi-task model. In contrast, Gemini struggles to differentiate gestures based solely on textual descriptions in the zero-shot setting, highlighting the need of further research on suitable input representations for gestures.
☆ ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations
Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tend to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in real world over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.
♻ ☆ FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation
Handling delicate and fragile objects remains a major challenge for robotic manipulation, especially for rigid parallel grippers. While the simplicity and versatility of parallel grippers have led to widespread adoption, these grippers are limited by their heavy reliance on visual feedback. Tactile sensing and soft robotics can add responsiveness and compliance. However, existing methods typically involve high integration complexity or suffer from slow response times. In this work, we introduce FORTE, a tactile sensing system embedded in compliant gripper fingers. FORTE uses 3D-printed fin-ray grippers with internal air channels to provide low-latency force and slip feedback. FORTE applies just enough force to grasp objects without damaging them, while remaining easy to fabricate and integrate. We find that FORTE can accurately estimate grasping forces from 0-8 N with an average error of 0.2 N, and detect slip events within 100 ms of occurring. We demonstrate FORTE's ability to grasp a wide range of slippery, fragile, and deformable objects. In particular, FORTE grasps fragile objects like raspberries and potato chips with a 98.6% success rate, and achieves 93% accuracy in detecting slip events. These results highlight FORTE's potential as a robust and practical solution for enabling delicate robotic manipulation. Project page: https://merge-lab.github.io/FORTE
♻ ☆ BEVPlace++: Fast, Robust, and Lightweight LiDAR Global Localization for Unmanned Ground Vehicles
This article introduces BEVPlace++, a novel, fast, and robust LiDAR global localization method for unmanned ground vehicles. It uses lightweight convolutional neural networks (CNNs) on Bird's Eye View (BEV) image-like representations of LiDAR data to achieve accurate global localization through place recognition, followed by 3-DoF pose estimation. Our detailed analyses reveal an interesting fact that CNNs are inherently effective at extracting distinctive features from LiDAR BEV images. Remarkably, keypoints of two BEV images with large translations can be effectively matched using CNN-extracted features. Building on this insight, we design a Rotation Equivariant Module (REM) to obtain distinctive features while enhancing robustness to rotational changes. A Rotation Equivariant and Invariant Network (REIN) is then developed by cascading REM and a descriptor generator, NetVLAD, to sequentially generate rotation equivariant local features and rotation invariant global descriptors. The global descriptors are used first to achieve robust place recognition, and then local features are used for accurate pose estimation. \revise{Experimental results on seven public datasets and our UGV platform demonstrate that BEVPlace++, even when trained on a small dataset (3000 frames of KITTI) only with place labels, generalizes well to unseen environments, performs consistently across different days and years, and adapts to various types of LiDAR scanners.} BEVPlace++ achieves state-of-the-art performance in multiple tasks, including place recognition, loop closure detection, and global localization. Additionally, BEVPlace++ is lightweight, runs in real-time, and does not require accurate pose supervision, making it highly convenient for deployment. \revise{The source codes are publicly available at https://github.com/zjuluolun/BEVPlace2.
comment: Accepted to IEEE Transactions on Robotics
♻ ☆ A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Robotic manipulation faces critical challenges in understanding spatial affordances--the "where" and "how" of object interactions--essential for complex manipulation tasks like wiping a board or stacking objects. Existing methods, including modular-based and end-to-end approaches, often lack robust spatial reasoning capabilities. Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution. A0 leverages the Embodiment-Agnostic Affordance Representation, which captures object-centric spatial affordances by predicting contact points and post-contact trajectories. A0 is pre-trained on 1 million contact points data and fine-tuned on annotated trajectories, enabling generalization across platforms. Key components include Position Offset Attention for motion-aware feature extraction and a Spatial Information Aggregation Layer for precise coordinate mapping. The model's output is executed by the action execution module. Experiments on multiple robotic systems (Franka, Kinova, Realman, and Dobot) demonstrate A0's superior performance in complex tasks, showcasing its efficiency, flexibility, and real-world applicability.
♻ ☆ Proximal Control of UAVs with Federated Learning for Human-Robot Collaborative Domains
The human-robot interaction (HRI) is a growing area of research. In HRI, complex command (action) classification is still an open problem that usually prevents the real applicability of such a technique. The literature presents some works that use neural networks to detect these actions. However, occlusion is still a major issue in HRI, especially when using uncrewed aerial vehicles (UAVs), since, during the robot's movement, the human operator is often out of the robot's field of view. Furthermore, in multi-robot scenarios, distributed training is also an open problem. In this sense, this work proposes an action recognition and control approach based on Long Short-Term Memory (LSTM) Deep Neural Networks with two layers in association with three densely connected layers and Federated Learning (FL) embedded in multiple drones. The FL enabled our approach to be trained in a distributed fashion, i.e., access to data without the need for cloud or other repositories, which facilitates the multi-robot system's learning. Furthermore, our multi-robot approach results also prevented occlusion situations, with experiments with real robots achieving an accuracy greater than 96%.
comment: version 2
♻ ☆ Physics-informed Imitative Reinforcement Learning for Real-world Driving
Recent advances in imitative reinforcement learning (IRL) have considerably enhanced the ability of autonomous agents to assimilate expert demonstrations, leading to rapid skill acquisition in a range of demanding tasks. However, such learning-based agents face significant challenges when transferring knowledge to highly dynamic closed-loop environments. Their performance is significantly impacted by the conflicting optimization objectives of imitation learning (IL) and reinforcement learning (RL), sample inefficiency, and the complexity of uncovering the hidden world model and physics. To address this challenge, we propose a physics-informed IRL that is entirely data-driven. It leverages both expert demonstration data and exploratory data with a joint optimization objective, allowing the underlying physical principles of vehicle dynamics to emerge naturally from the training process. The performance is evaluated through empirical experiments and results exceed popular IL, RL and IRL algorithms in closed-loop settings on Waymax benchmark. Our approach exhibits 37.8% reduction in collision rate and 22.2% reduction in off-road rate compared to the baseline method.
♻ ☆ Neural Graph Map: Dense Mapping with Efficient Loop Closure Integration WACV 2025
Neural field-based SLAM methods typically employ a single, monolithic field as their scene representation. This prevents efficient incorporation of loop closure constraints and limits scalability. To address these shortcomings, we propose a novel RGB-D neural mapping framework in which the scene is represented by a collection of lightweight neural fields which are dynamically anchored to the pose graph of a sparse visual SLAM system. Our approach shows the ability to integrate large-scale loop closures, while requiring only minimal reintegration. Furthermore, we verify the scalability of our approach by demonstrating successful building-scale mapping taking multiple loop closures into account during the optimization, and show that our method outperforms existing state-of-the-art approaches on large scenes in terms of quality and runtime. Our code is available open-source at https://github.com/KTH-RPL/neural_graph_mapping.
comment: WACV 2025, Project page: https://kth-rpl.github.io/neural_graph_mapping/
♻ ☆ Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning ICML 2025
Existing offline hierarchical reinforcement learning methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: https://github.com/qortmdgh4141/GAS.
comment: ICML 2025
♻ ☆ Mamba Policy: Towards Efficient 3D Diffusion Policy with Hybrid Selective State Models IROS 2025
Diffusion models have been widely employed in the field of 3D manipulation due to their efficient capability to learn distributions, allowing for precise prediction of action trajectories. However, diffusion models typically rely on large parameter UNet backbones as policy networks, which can be challenging to deploy on resource-constrained devices. Recently, the Mamba model has emerged as a promising solution for efficient modeling, offering low computational complexity and strong performance in sequence modeling. In this work, we propose the Mamba Policy, a lighter but stronger policy that reduces the parameter count by over 80% compared to the original policy network while achieving superior performance. Specifically, we introduce the XMamba Block, which effectively integrates input information with conditional features and leverages a combination of Mamba and Attention mechanisms for deep feature extraction. Extensive experiments demonstrate that the Mamba Policy excels on the Adroit, Dexart, and MetaWorld datasets, requiring significantly fewer computational resources. Additionally, we highlight the Mamba Policy's enhanced robustness in long-horizon scenarios compared to baseline methods and explore the performance of various Mamba variants within the Mamba Policy framework. Real-world experiments are also conducted to further validate its effectiveness. Our open-source project page can be found at https://andycao1125.github.io/mamba_policy/.
comment: Accepted to IROS 2025
♻ ☆ Teacher Motion Priors: Enhancing Robot Locomotion over Challenging Terrain IROS 2025
Achieving robust locomotion on complex terrains remains a challenge due to high dimensional control and environmental uncertainties. This paper introduces a teacher prior framework based on the teacher student paradigm, integrating imitation and auxiliary task learning to improve learning efficiency and generalization. Unlike traditional paradigms that strongly rely on encoder-based state embeddings, our framework decouples the network design, simplifying the policy network and deployment. A high performance teacher policy is first trained using privileged information to acquire generalizable motion skills. The teacher's motion distribution is transferred to the student policy, which relies only on noisy proprioceptive data, via a generative adversarial mechanism to mitigate performance degradation caused by distributional shifts. Additionally, auxiliary task learning enhances the student policy's feature representation, speeding up convergence and improving adaptability to varying terrains. The framework is validated on a humanoid robot, showing a great improvement in locomotion stability on dynamic terrains and significant reductions in development costs. This work provides a practical solution for deploying robust locomotion strategies in humanoid robots.
comment: 8 pages, 6 figures, 6 tables, IROS 2025
♻ ☆ FGS-SLAM: Fourier-based Gaussian Splatting for Real-time SLAM with Sparse and Dense Map Fusion
3D gaussian splatting has advanced simultaneous localization and mapping (SLAM) technology by enabling real-time positioning and the construction of high-fidelity maps. However, the uncertainty in gaussian position and initialization parameters introduces challenges, often requiring extensive iterative convergence and resulting in redundant or insufficient gaussian representations. To address this, we introduce a novel adaptive densification method based on Fourier frequency domain analysis to establish gaussian priors for rapid convergence. Additionally, we propose constructing independent and unified sparse and dense maps, where a sparse map supports efficient tracking via Generalized Iterative Closest Point (GICP) and a dense map creates high-fidelity visual representations. This is the first SLAM system leveraging frequency domain analysis to achieve high-quality gaussian mapping in real-time. Experimental results demonstrate an average frame rate of 36 FPS on Replica and TUM RGB-D datasets, achieving competitive accuracy in both localization and mapping.
♻ ☆ IKDiffuser: A Generative Inverse Kinematics Solver for Multi-arm Robots via Diffusion Model
Solving Inverse Kinematics (IK) problems is fundamental to robotics, but has primarily been successful with single serial manipulators. For multi-arm robotic systems, IK remains challenging due to complex self-collisions, coupled joints, and high-dimensional redundancy. These complexities make traditional IK solvers slow, prone to failure, and lacking in solution diversity. In this paper, we present IKDiffuser, a diffusion-based model designed for fast and diverse IK solution generation for multi-arm robotic systems. IKDiffuser learns the joint distribution over the configuration space, capturing complex dependencies and enabling seamless generalization to multi-arm robotic systems of different structures. In addition, IKDiffuser can incorporate additional objectives during inference without retraining, offering versatility and adaptability for task-specific requirements. In experiments on 6 different multi-arm systems, the proposed IKDiffuser achieves superior solution accuracy, precision, diversity, and computational efficiency compared to existing solvers. The proposed IKDiffuser framework offers a scalable, unified approach to solving multi-arm IK problems, facilitating the potential of multi-arm robotic systems in real-time manipulation tasks.
comment: under review
♻ ☆ EvDetMAV: Generalized MAV Detection from Moving Event Cameras
Existing micro aerial vehicle (MAV) detection methods mainly rely on the target's appearance features in RGB images, whose diversity makes it difficult to achieve generalized MAV detection. We notice that different types of MAVs share the same distinctive features in event streams due to their high-speed rotating propellers, which are hard to see in RGB images. This paper studies how to detect different types of MAVs from an event camera by fully exploiting the features of propellers in the original event stream. The proposed method consists of three modules to extract the salient and spatio-temporal features of the propellers while filtering out noise from background objects and camera motion. Since there are no existing event-based MAV datasets, we introduce a novel MAV dataset for the community. This is the first event-based MAV dataset comprising multiple scenarios and different types of MAVs. Without training, our method significantly outperforms state-of-the-art methods and can deal with challenging scenarios, achieving a precision rate of 83.0\% (+30.3\%) and a recall rate of 81.5\% (+36.4\%) on the proposed testing dataset. The dataset and code are available at: https://github.com/WindyLab/EvDetMAV.
comment: 8 pages, 7 figures. This paper is accepted by IEEE Robotics and Automation Letters
♻ ☆ AnchorDP3: 3D Affordance Guided Sparse Diffusion Policy for Robotic Manipulation
We present AnchorDP3, a diffusion policy framework for dual-arm robotic manipulation that achieves state-of-the-art performance in highly randomized environments. AnchorDP3 integrates three key innovations: (1) Simulator-Supervised Semantic Segmentation, using rendered ground truth to explicitly segment task-critical objects within the point cloud, which provides strong affordance priors; (2) Task-Conditioned Feature Encoders, lightweight modules processing augmented point clouds per task, enabling efficient multi-task learning through a shared diffusion-based action expert; (3) Affordance-Anchored Keypose Diffusion with Full State Supervision, replacing dense trajectory prediction with sparse, geometrically meaningful action anchors, i.e., keyposes such as pre-grasp pose, grasp pose directly anchored to affordances, drastically simplifying the prediction space; the action expert is forced to predict both robot joint angles and end-effector poses simultaneously, which exploits geometric consistency to accelerate convergence and boost accuracy. Trained on large-scale, procedurally generated simulation data, AnchorDP3 achieves a 98.7% average success rate in the RoboTwin benchmark across diverse tasks under extreme randomization of objects, clutter, table height, lighting, and backgrounds. This framework, when integrated with the RoboTwin real-to-sim pipeline, has the potential to enable fully autonomous generation of deployable visuomotor policies from only scene and instruction, totally eliminating human demonstrations from learning manipulation skills.
♻ ☆ ReLink: Computational Circular Design of Planar Linkage Mechanisms Using Available Standard Parts
The Circular Economy framework emphasizes sustainability by reducing resource consumption and waste through the reuse of components and materials. This paper presents ReLink, a computational framework for the circular design of planar linkage mechanisms using available standard parts. Unlike most mechanism design methods, which assume the ability to create custom parts and infinite part availability, ReLink prioritizes the reuse of discrete, standardized components, thus minimizing the need for new parts. The framework consists of two main components: design generation, where a generative design algorithm generates mechanisms from an inventory of available parts, and inverse design, which uses optimization methods to identify designs that match a user-defined trajectory curve. The paper also examines the trade-offs between kinematic performance and CO2 footprint when incorporating new parts. Challenges such as the combinatorial nature of the design problem and the enforcement of valid solutions are addressed. By combining sustainability principles with kinematic synthesis, ReLink lays the groundwork for further research into computational circular design to support the development of systems that integrate reused components into mechanical products.
comment: 29 pages, 18 figures, Submitted
♻ ☆ Using Explainable AI and Hierarchical Planning for Outreach with Robots
Understanding how robots plan and execute tasks is crucial in today's world, where they are becoming more prevalent in our daily lives. However, teaching non-experts, such as K-12 students, the complexities of robot planning can be challenging. This work presents an open-source platform, JEDAI.Ed, that simplifies the process using a visual interface that abstracts the details of various planning processes that robots use for performing complex mobile manipulation tasks. Using principles developed in the field of explainable AI, this intuitive platform enables students to use a high-level intuitive instruction set to perform complex tasks, visualize them on an in-built simulator, and to obtain helpful hints and natural language explanations for errors. Finally, JEDAI.Ed, includes an adaptive curriculum generation method that provides students with customized learning ramps. This platform's efficacy was tested through a user study with university students who had little to no computer science background. Our results show that JEDAI.Ed is highly effective in increasing student engagement, teaching robotics programming, and decreasing the time need to solve tasks as compared to baselines.
♻ ☆ Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
Robotics 52
☆ Unified Vision-Language-Action Model
Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal tasks learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning--especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves 95.5% average success rate on LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.
comment: technical report
☆ ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model
Multi-task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual-arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks pose challenges to understanding the multi-body spatiotemporal dynamics. An existing method ManiGaussian pioneers encoding the spatiotemporal dynamics into the visual representation via Gaussian world model for single-arm settings, which ignores the interaction of multiple embodiments for dual-arm systems with significant performance drop. In this paper, we propose ManiGaussian++, an extension of ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. To be specific, we first generate task-oriented Gaussian Splatting from intermediate visual features, which aims to differentiate acting and stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with the leader-follower architecture, where the multi-body spatiotemporal dynamics is mined for intermediate visual representation via future scene prediction. The leader predicts Gaussian Splatting deformation caused by motions of the stabilizing arm, through which the follower generates the physical consequences resulted from the movement of the acting arm. As a result, our method significantly outperforms the current state-of-the-art bimanual manipulation techniques by an improvement of 20.2% in 10 simulated tasks, and achieves 60% success rate on average in 9 challenging real-world tasks. Our code is available at https://github.com/April-Yz/ManiGaussian_Bimanual.
☆ Look to Locate: Vision-Based Multisensory Navigation with 3-D Digital Maps for GNSS-Challenged Environments
In Global Navigation Satellite System (GNSS)-denied environments such as indoor parking structures or dense urban canyons, achieving accurate and robust vehicle positioning remains a significant challenge. This paper proposes a cost-effective, vision-based multi-sensor navigation system that integrates monocular depth estimation, semantic filtering, and visual map registration (VMR) with 3-D digital maps. Extensive testing in real-world indoor and outdoor driving scenarios demonstrates the effectiveness of the proposed system, achieving sub-meter accuracy of 92% indoors and more than 80% outdoors, with consistent horizontal positioning and heading average root mean-square errors of approximately 0.98 m and 1.25 {\deg}, respectively. Compared to the baselines examined, the proposed solution significantly reduced drift and improved robustness under various conditions, achieving positioning accuracy improvements of approximately 88% on average. This work highlights the potential of cost-effective monocular vision systems combined with 3D maps for scalable, GNSS-independent navigation in land vehicles.
☆ CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong generalization across manipulation tasks. However, they remain constrained by a single-frame observation paradigm and cannot fully benefit from the motion information offered by aggregated multi-frame historical observations, as the large vision-language backbone introduces substantial computational cost and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA comprises three key components: (1) single-frame pretraining on large-scale embodied datasets with autoregressive action tokens prediction, which establishes an embodied vision-language foundation; (2) multi-frame encoding, adapting the prediction of vision-language backbones from discrete action tokens to motion features during post-training, and aggregating motion features from historical frames into a feature chunking; (3) cross-frame decoding, which maps the feature chunking to accurate actions via a shared decoder with cross-attention. By reducing redundant token computation and caching past motion features, CronusVLA achieves efficient inference. As an application of motion features, we further propose an action adaptation mechanism based on feature-action retrieval to improve model performance during finetuning. CronusVLA achieves state-of-the-art performance on SimplerEnv with 70.9% success rate, and 12.7% improvement over OpenVLA on LIBERO. Real-world Franka experiments also show the strong performance and robustness.
comment: 36 pages, 21 figures
☆ ReactEMG: Zero-Shot, Low-Latency Intent Detection via sEMG
Surface electromyography (sEMG) signals show promise for effective human-computer interfaces, particularly in rehabilitation and prosthetics. However, challenges remain in developing systems that respond quickly and reliably to user intent, across different subjects and without requiring time-consuming calibration. In this work, we propose a framework for EMG-based intent detection that addresses these challenges. Unlike traditional gesture recognition models that wait until a gesture is completed before classifying it, our approach uses a segmentation strategy to assign intent labels at every timestep as the gesture unfolds. We introduce a novel masked modeling strategy that aligns muscle activations with their corresponding user intents, enabling rapid onset detection and stable tracking of ongoing gestures. In evaluations against baseline methods, considering both accuracy and stability for device control, our approach surpasses state-of-the-art performance in zero-shot transfer conditions, demonstrating its potential for wearable robotics and next-generation prosthetic systems. Our project page is available at: https://reactemg.github.io
☆ The Starlink Robot: A Platform and Dataset for Mobile Satellite Communication
The integration of satellite communication into mobile devices represents a paradigm shift in connectivity, yet the performance characteristics under motion and environmental occlusion remain poorly understood. We present the Starlink Robot, the first mobile robotic platform equipped with Starlink satellite internet, comprehensive sensor suite including upward-facing camera, LiDAR, and IMU, designed to systematically study satellite communication performance during movement. Our multi-modal dataset captures synchronized communication metrics, motion dynamics, sky visibility, and 3D environmental context across diverse scenarios including steady-state motion, variable speeds, and different occlusion conditions. This platform and dataset enable researchers to develop motion-aware communication protocols, predict connectivity disruptions, and optimize satellite communication for emerging mobile applications from smartphones to autonomous vehicles. The project is available at https://github.com/StarlinkRobot.
☆ Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images
Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV has not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate on our novel dataset FISHnCHIPS, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses, available at: https://www.vision.rwth-aachen.de/fishnchips
comment: Presented at IEEE International Conference on Robotics and Automation 2025
☆ Estimating Spatially-Dependent GPS Errors Using a Swarm of Robots
External factors, including urban canyons and adversarial interference, can lead to Global Positioning System (GPS) inaccuracies that vary as a function of the position in the environment. This study addresses the challenge of estimating a static, spatially-varying error function using a team of robots. We introduce a State Bias Estimation Algorithm (SBE) whose purpose is to estimate the GPS biases. The central idea is to use sensed estimates of the range and bearing to the other robots in the team to estimate changes in bias across the environment. A set of drones moves in a 2D environment, each sampling data from GPS, range, and bearing sensors. The biases calculated by the SBE at estimated positions are used to train a Gaussian Process Regression (GPR) model. We use a Sparse Gaussian process-based Informative Path Planning (IPP) algorithm that identifies high-value regions of the environment for data collection. The swarm plans paths that maximize information gain in each iteration, further refining their understanding of the environment's positional bias landscape. We evaluated SBE and IPP in simulation and compared the IPP methodology to an open-loop strategy.
comment: 6 pages, 7 figures, 2025 IEEE 21st International Conference on Automation Science and Engineering
☆ UniTac-NV: A Unified Tactile Representation For Non-Vision-Based Tactile Sensors IROS
Generalizable algorithms for tactile sensing remain underexplored, primarily due to the diversity of sensor modalities. Recently, many methods for cross-sensor transfer between optical (vision-based) tactile sensors have been investigated, yet little work focus on non-optical tactile sensors. To address this gap, we propose an encoder-decoder architecture to unify tactile data across non-vision-based sensors. By leveraging sensor-specific encoders, the framework creates a latent space that is sensor-agnostic, enabling cross-sensor data transfer with low errors and direct use in downstream applications. We leverage this network to unify tactile data from two commercial tactile sensors: the Xela uSkin uSPa 46 and the Contactile PapillArray. Both were mounted on a UR5e robotic arm, performing force-controlled pressing sequences against distinct object shapes (circular, square, and hexagonal prisms) and two materials (rigid PLA and flexible TPU). Another more complex unseen object was also included to investigate the model's generalization capabilities. We show that alignment in latent space can be implicitly learned from joint autoencoder training with matching contacts collected via different sensors. We further demonstrate the practical utility of our approach through contact geometry estimation, where downstream models trained on one sensor's latent representation can be directly applied to another without retraining.
comment: 7 pages, 8 figures. Accepted version to appear in: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
☆ ReLink: Computational Circular Design of Planar Linkage Mechanisms Using Available Standard Parts
The Circular Economy framework emphasizes sustainability by reducing resource consumption and waste through the reuse of components and materials. This paper presents ReLink, a computational framework for the circular design of planar linkage mechanisms using available standard parts. Unlike most mechanism design methods, which assume the ability to create custom parts and infinite part availability, ReLink prioritizes the reuse of discrete, standardized components, thus minimizing the need for new parts. The framework consists of two main components: design generation, where a generative design algorithm generates mechanisms from an inventory of available parts, and inverse design, which uses optimization methods to identify designs that match a user-defined trajectory curve. The paper also examines the trade-offs between kinematic performance and CO2 footprint when incorporating new parts. Challenges such as the combinatorial nature of the design problem and the enforcement of valid solutions are addressed. By combining sustainability principles with kinematic synthesis, ReLink lays the groundwork for further research into computational circular design to support the development of systems that integrate reused components into mechanical products.
comment: 29 pages, 18 figures, submitted to the Journal of Cleaner Production
☆ A Verification Methodology for Safety Assurance of Robotic Autonomous Systems
Autonomous robots deployed in shared human environments, such as agricultural settings, require rigorous safety assurance to meet both functional reliability and regulatory compliance. These systems must operate in dynamic, unstructured environments, interact safely with humans, and respond effectively to a wide range of potential hazards. This paper presents a verification workflow for the safety assurance of an autonomous agricultural robot, covering the entire development life-cycle, from concept study and design to runtime verification. The outlined methodology begins with a systematic hazard analysis and risk assessment to identify potential risks and derive corresponding safety requirements. A formal model of the safety controller is then developed to capture its behaviour and verify that the controller satisfies the specified safety properties with respect to these requirements. The proposed approach is demonstrated on a field robot operating in an agricultural setting. The results show that the methodology can be effectively used to verify safety-critical properties and facilitate the early identification of design issues, contributing to the development of safer robots and autonomous systems.
comment: In Proc. of the 26th TAROS (Towards Autonomous Robotic Systems) Conference, York, UK, August, 2025
☆ Probabilistic modelling and safety assurance of an agriculture robot providing light-treatment
Continued adoption of agricultural robots postulates the farmer's trust in the reliability, robustness and safety of the new technology. This motivates our work on safety assurance of agricultural robots, particularly their ability to detect, track and avoid obstacles and humans. This paper considers a probabilistic modelling and risk analysis framework for use in the early development phases. Starting off with hazard identification and a risk assessment matrix, the behaviour of the mobile robot platform, sensor and perception system, and any humans present are captured using three state machines. An auto-generated probabilistic model is then solved and analysed using the probabilistic model checker PRISM. The result provides unique insight into fundamental development and engineering aspects by quantifying the effect of the risk mitigation actions and risk reduction associated with distinct design concepts. These include implications of adopting a higher performance and more expensive Object Detection System or opting for a more elaborate warning system to increase human awareness. Although this paper mainly focuses on the initial concept-development phase, the proposed safety assurance framework can also be used during implementation, and subsequent deployment and operation phases.
☆ Soft Robotic Delivery of Coiled Anchors for Cardiac Interventions
Trans-catheter cardiac intervention has become an increasingly available option for high-risk patients without the complications of open heart surgery. However, current catheterbased platforms suffer from a lack of dexterity, force application, and compliance required to perform complex intracardiac procedures. An exemplary task that would significantly ease minimally invasive intracardiac procedures is the implantation of anchor coils, which can be used to fix and implant various devices in the beating heart. We introduce a robotic platform capable of delivering anchor coils. We develop a kineto-statics model of the robotic platform and demonstrate low positional error. We leverage the passive compliance and high force output of the actuator in a multi-anchor delivery procedure against a motile in-vitro simulator with millimeter level accuracy.
comment: This work has been submitted to the IEEE for possible publication
☆ Robotics Under Construction: Challenges on Job Sites ICRA
As labor shortages and productivity stagnation increasingly challenge the construction industry, automation has become essential for sustainable infrastructure development. This paper presents an autonomous payload transportation system as an initial step toward fully unmanned construction sites. Our system, based on the CD110R-3 crawler carrier, integrates autonomous navigation, fleet management, and GNSS-based localization to facilitate material transport in construction site environments. While the current system does not yet incorporate dynamic environment adaptation algorithms, we have begun fundamental investigations into external-sensor based perception and mapping system. Preliminary results highlight the potential challenges, including navigation in evolving terrain, environmental perception under construction-specific conditions, and sensor placement optimization for improving autonomy and efficiency. Looking forward, we envision a construction ecosystem where collaborative autonomous agents dynamically adapt to site conditions, optimizing workflow and reducing human intervention. This paper provides foundational insights into the future of robotics-driven construction automation and identifies critical areas for further technological development.
comment: Workshop on Field Robotics, ICRA
☆ Robust Robotic Exploration and Mapping Using Generative Occupancy Map Synthesis
We present a novel approach for enhancing robotic exploration by using generative occupancy mapping. We introduce SceneSense, a diffusion model designed and trained for predicting 3D occupancy maps given partial observations. Our proposed approach probabilistically fuses these predictions into a running occupancy map in real-time, resulting in significant improvements in map quality and traversability. We implement SceneSense onboard a quadruped robot and validate its performance with real-world experiments to demonstrate the effectiveness of the model. In these experiments, we show that occupancy maps enhanced with SceneSense predictions better represent our fully observed ground truth data (24.44% FID improvement around the robot and 75.59% improvement at range). We additionally show that integrating SceneSense-enhanced maps into our robotic exploration stack as a "drop-in" map improvement, utilizing an existing off-the-shelf planner, results in improvements in robustness and traversability time. Finally we show results of full exploration evaluations with our proposed system in two dissimilar environments and find that locally enhanced maps provide more consistent exploration results than maps constructed only from direct sensor measurements.
comment: arXiv admin note: text overlap with arXiv:2409.10681
☆ Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception
Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the downstream task of robotic grasping. We propose a method for training lightweight, deep networks to predict whether a grasp guided by an image-based pose estimate will succeed before that grasp is attempted. We generate training data for our networks via object pose estimation on real images and simulated grasping. We also find that, despite high object variability in grasping trials, networks benefit from training on all objects jointly, suggesting that a diverse variety of objects can nevertheless contribute to the same goal.
☆ Hierarchical Reinforcement Learning and Value Optimization for Challenging Quadruped Locomotion
We propose a novel hierarchical reinforcement learning framework for quadruped locomotion over challenging terrain. Our approach incorporates a two-layer hierarchy in which a high-level policy (HLP) selects optimal goals for a low-level policy (LLP). The LLP is trained using an on-policy actor-critic RL algorithm and is given footstep placements as goals. We propose an HLP that does not require any additional training or environment samples and instead operates via an online optimization process over the learned value function of the LLP. We demonstrate the benefits of this framework by comparing it with an end-to-end reinforcement learning (RL) approach. We observe improvements in its ability to achieve higher rewards with fewer collisions across an array of different terrains, including terrains more difficult than any encountered during training.
☆ Robust Embodied Self-Identification of Morphology in Damaged Multi-Legged Robots
Multi-legged robots (MLRs) are vulnerable to leg damage during complex missions, which can impair their performance. This paper presents a self-modeling and damage identification algorithm that enables autonomous adaptation to partial or complete leg loss using only data from a low-cost IMU. A novel FFT-based filter is introduced to address time-inconsistent signals, improving damage detection by comparing body orientation between the robot and its model. The proposed method identifies damaged legs and updates the robot's model for integration into its control system. Experiments on uneven terrain validate its robustness and computational efficiency.
☆ Evolutionary Gait Reconfiguration in Damaged Legged Robots
Multi-legged robots deployed in complex missions are susceptible to physical damage in their legs, impairing task performance and potentially compromising mission success. This letter presents a rapid, training-free damage recovery algorithm for legged robots subject to partial or complete loss of functional legs. The proposed method first stabilizes locomotion by generating a new gait sequence and subsequently optimally reconfigures leg gaits via a developed differential evolution algorithm to maximize forward progression while minimizing body rotation and lateral drift. The algorithm successfully restores locomotion in a 24-degree-of-freedom hexapod within one hour, demonstrating both high efficiency and robustness to structural damage.
☆ Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning
We introduce TAPAS (Task-based Adaptation and Planning using AgentS), a multi-agent framework that integrates Large Language Models (LLMs) with symbolic planning to solve complex tasks without the need for manually defined environment models. TAPAS employs specialized LLM-based agents that collaboratively generate and adapt domain models, initial states, and goal specifications as needed using structured tool-calling mechanisms. Through this tool-based interaction, downstream agents can request modifications from upstream agents, enabling adaptation to novel attributes and constraints without manual domain redefinition. A ReAct (Reason+Act)-style execution agent, coupled with natural language plan translation, bridges the gap between dynamically generated plans and real-world robot capabilities. TAPAS demonstrates strong performance in benchmark planning domains and in the VirtualHome simulated real-world environment.
☆ Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects
Robotic scene understanding increasingly relies on vision-language models (VLMs) to generate natural language descriptions of the environment. In this work, we present a comparative study of captioning strategies for tabletop scenes captured by a robotic arm equipped with an RGB camera. The robot collects images of objects from multiple viewpoints, and we evaluate several models that generate scene descriptions. We compare the performance of various captioning models, like BLIP and VLMs. Our experiments examine the trade-offs between single-view and multi-view captioning, and difference between recognising real-world and 3D printed objects. We quantitatively evaluate object identification accuracy, completeness, and naturalness of the generated captions. Results show that VLMs can be used in robotic settings where common objects need to be recognised, but fail to generalise to novel representations. Our findings provide practical insights into deploying foundation models for embodied agents in real-world settings.
☆ T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models NeurIPS 2025
Building a general robotic manipulation system capable of performing a wide variety of tasks in real-world settings is a challenging task. Vision-Language Models (VLMs) have demonstrated remarkable potential in robotic manipulation tasks, primarily due to the extensive world knowledge they gain from large-scale datasets. In this process, Spatial Representations (such as points representing object positions or vectors representing object orientations) act as a bridge between VLMs and real-world scene, effectively grounding the reasoning abilities of VLMs and applying them to specific task scenarios. However, existing VLM-based robotic approaches often adopt a fixed spatial representation extraction scheme for various tasks, resulting in insufficient representational capability or excessive extraction time. In this work, we introduce T-Rex, a Task-Adaptive Framework for Spatial Representation Extraction, which dynamically selects the most appropriate spatial representation extraction scheme for each entity based on specific task requirements. Our key insight is that task complexity determines the types and granularity of spatial representations, and Stronger representational capabilities are typically associated with Higher overall system operation costs. Through comprehensive experiments in real-world robotic environments, we show that our approach delivers significant advantages in spatial understanding, efficiency, and stability without additional training.
comment: submitted to NeurIPS 2025
☆ Ground-Effect-Aware Modeling and Control for Multicopters
The ground effect on multicopters introduces several challenges, such as control errors caused by additional lift, oscillations that may occur during near-ground flight due to external torques, and the influence of ground airflow on models such as the rotor drag and the mixing matrix. This article collects and analyzes the dynamics data of near-ground multicopter flight through various methods, including force measurement platforms and real-world flights. For the first time, we summarize the mathematical model of the external torque of multicopters under ground effect. The influence of ground airflow on rotor drag and the mixing matrix is also verified through adequate experimentation and analysis. Through simplification and derivation, the differential flatness of the multicopter's dynamic model under ground effect is confirmed. To mitigate the influence of these disturbance models on control, we propose a control method that combines dynamic inverse and disturbance models, ensuring consistent control effectiveness at both high and low altitudes. In this method, the additional thrust and variations in rotor drag under ground effect are both considered and compensated through feedforward models. The leveling torque of ground effect can be equivalently represented as variations in the center of gravity and the moment of inertia. In this way, the leveling torque does not explicitly appear in the dynamic model. The final experimental results show that the method proposed in this paper reduces the control error (RMSE) by \textbf{45.3\%}. Please check the supplementary material at: https://github.com/ZJU-FAST-Lab/Ground-effect-controller.
☆ Is an object-centric representation beneficial for robotic manipulation ?
Object-centric representation (OCR) has recently become a subject of interest in the computer vision community for learning a structured representation of images and videos. It has been several times presented as a potential way to improve data-efficiency and generalization capabilities to learn an agent on downstream tasks. However, most existing work only evaluates such models on scene decomposition, without any notion of reasoning over the learned representation. Robotic manipulation tasks generally involve multi-object environments with potential inter-object interaction. We thus argue that they are a very interesting playground to really evaluate the potential of existing object-centric work. To do so, we create several robotic manipulation tasks in simulated environments involving multiple objects (several distractors, the robot, etc.) and a high-level of randomization (object positions, colors, shapes, background, initial positions, etc.). We then evaluate one classical object-centric method across several generalization scenarios and compare its results against several state-of-the-art hollistic representations. Our results exhibit that existing methods are prone to failure in difficult scenarios involving complex scene structures, whereas object-centric methods help overcome these challenges.
☆ A Survey on Soft Robot Adaptability: Implementations, Applications, and Prospects
Soft robots, compared to rigid robots, possess inherent advantages, including higher degrees of freedom, compliance, and enhanced safety, which have contributed to their increasing application across various fields. Among these benefits, adaptability is particularly noteworthy. In this paper, adaptability in soft robots is categorized into external and internal adaptability. External adaptability refers to the robot's ability to adjust, either passively or actively, to variations in environments, object properties, geometries, and task dynamics. Internal adaptability refers to the robot's ability to cope with internal variations, such as manufacturing tolerances or material aging, and to generalize control strategies across different robots. As the field of soft robotics continues to evolve, the significance of adaptability has become increasingly pronounced. In this review, we summarize various approaches to enhancing the adaptability of soft robots, including design, sensing, and control strategies. Additionally, we assess the impact of adaptability on applications such as surgery, wearable devices, locomotion, and manipulation. We also discuss the limitations of soft robotics adaptability and prospective directions for future research. By analyzing adaptability through the lenses of implementation, application, and challenges, this paper aims to provide a comprehensive understanding of this essential characteristic in soft robotics and its implications for diverse applications.
comment: 12 pages, 4 figures, accepted by IEEE Robotics & Automation Magazine
☆ Zero-Shot Parameter Learning of Robot Dynamics Using Bayesian Statistics and Prior Knowledge
Inertial parameter identification of industrial robots is an established process, but standard methods using Least Squares or Machine Learning do not consider prior information about the robot and require extensive measurements. Inspired by Bayesian statistics, this paper presents an identification method with improved generalization that incorporates prior knowledge and is able to learn with only a few or without additional measurements (Zero-Shot Learning). Furthermore, our method is able to correctly learn not only the inertial but also the mechanical and base parameters of the MABI Max 100 robot while ensuring physical feasibility and specifying the confidence intervals of the results. We also provide different types of priors for serial robots with 6 degrees of freedom, where datasheets or CAD models are not available.
comment: Carsten Reiners and Minh Trinh contributed equally to this work
☆ Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference
Inferring physical properties can significantly enhance robotic manipulation by enabling robots to handle objects safely and efficiently through adaptive grasping strategies. Previous approaches have typically relied on either tactile or visual data, limiting their ability to fully capture properties. We introduce a novel cross-modal perception framework that integrates visual observations with tactile representations within a multimodal vision-language model. Our physical reasoning framework, which employs a hierarchical feature alignment mechanism and a refined prompting strategy, enables our model to make property-specific predictions that strongly correlate with ground-truth measurements. Evaluated on 35 diverse objects, our approach outperforms existing baselines and demonstrates strong zero-shot generalization. Keywords: tactile perception, visual-tactile fusion, physical property inference, multimodal integration, robot perception
comment: This paper has been accepted by the 2025 International Conference on Climbing and Walking Robots (CLAWAR). These authors contributed equally to this work: Zexiang Guo, Hengxiang Chen, Xinheng Mai
☆ Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding
Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Exactly, it includes 20.2k image-text pair data with 1.8 million vocabulary size. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, where we propose a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.
comment: 14 pages, 13 figures
☆ AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration
While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.
☆ Ontology Neural Network and ORTSF: A Framework for Topological Reasoning and Delay-Robust Control
The advancement of autonomous robotic systems has led to impressive capabilities in perception, localization, mapping, and control. Yet, a fundamental gap remains: existing frameworks excel at geometric reasoning and dynamic stability but fall short in representing and preserving relational semantics, contextual reasoning, and cognitive transparency essential for collaboration in dynamic, human-centric environments. This paper introduces a unified architecture comprising the Ontology Neural Network (ONN) and the Ontological Real-Time Semantic Fabric (ORTSF) to address this gap. The ONN formalizes relational semantic reasoning as a dynamic topological process. By embedding Forman-Ricci curvature, persistent homology, and semantic tensor structures within a unified loss formulation, ONN ensures that relational integrity and topological coherence are preserved as scenes evolve over time. The ORTSF transforms reasoning traces into actionable control commands while compensating for system delays. It integrates predictive and delay-aware operators that ensure phase margin preservation and continuity of control signals, even under significant latency conditions. Empirical studies demonstrate the ONN + ORTSF framework's ability to unify semantic cognition and robust control, providing a mathematically principled and practically viable solution for cognitive robotics.
comment: 12 pages, 5 figures, includes theoretical proofs and simulation results
☆ Scaffolding Dexterous Manipulation with Vision-Language Models
Dexterous robotic hands are essential for performing complex manipulation tasks, yet remain difficult to train due to the challenges of demonstration collection and high-dimensional control. While reinforcement learning (RL) can alleviate the data bottleneck by generating experience in simulation, it typically relies on carefully designed, task-specific reward functions, which hinder scalability and generalization. Thus, contemporary works in dexterous manipulation have often bootstrapped from reference trajectories. These trajectories specify target hand poses that guide the exploration of RL policies and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories - particularly for dexterous hands - remains a significant challenge. Yet, the precise details in explicit reference trajectories are often unnecessary, as RL ultimately refines the motion. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., "open the cabinet") and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand motion and object motion. Subsequently, we train a low-level residual RL policy in simulation to track these coarse trajectories or "scaffolds" with high fidelity. Across a number of simulated tasks involving articulated objects and semantic understanding, we demonstrate that our method is able to learn robust dexterous manipulation policies. Moreover, we showcase that our method transfers to real-world robotic hands without any human demonstrations or handcrafted rewards.
☆ Preserving Sense of Agency: User Preferences for Robot Autonomy and User Control across Household Tasks
Roboticists often design with the assumption that assistive robots should be fully autonomous. However, it remains unclear whether users prefer highly autonomous robots, as prior work in assistive robotics suggests otherwise. High robot autonomy can reduce the user's sense of agency, which represents feeling in control of one's environment. How much control do users, in fact, want over the actions of robots used for in-home assistance? We investigate how robot autonomy levels affect users' sense of agency and the autonomy level they prefer in contexts with varying risks. Our study asked participants to rate their sense of agency as robot users across four distinct autonomy levels and ranked their robot preferences with respect to various household tasks. Our findings revealed that participants' sense of agency was primarily influenced by two factors: (1) whether the robot acts autonomously, and (2) whether a third party is involved in the robot's programming or operation. Notably, an end-user programmed robot highly preserved users' sense of agency, even though it acts autonomously. However, in high-risk settings, e.g., preparing a snack for a child with allergies, they preferred robots that prioritized their control significantly more. Additional contextual factors, such as trust in a third party operator, also shaped their preferences.
comment: Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
☆ The MOTIF Hand: A Robotic Hand for Multimodal Observations with Thermal, Inertial, and Force Sensors
Advancing dexterous manipulation with multi-fingered robotic hands requires rich sensory capabilities, while existing designs lack onboard thermal and torque sensing. In this work, we propose the MOTIF hand, a novel multimodal and versatile robotic hand that extends the LEAP hand by integrating: (i) dense tactile information across the fingers, (ii) a depth sensor, (iii) a thermal camera, (iv), IMU sensors, and (v) a visual sensor. The MOTIF hand is designed to be relatively low-cost (under 4000 USD) and easily reproducible. We validate our hand design through experiments that leverage its multimodal sensing for two representative tasks. First, we integrate thermal sensing into 3D reconstruction to guide temperature-aware, safe grasping. Second, we show how our hand can distinguish objects with identical appearance but different masses - a capability beyond methods that use vision only.
♻ ☆ Learning Accurate Whole-body Throwing with High-frequency Residual Policy and Pullback Tube Acceleration IROS 2025
Throwing is a fundamental skill that enables robots to manipulate objects in ways that extend beyond the reach of their arms. We present a control framework that combines learning and model-based control for prehensile whole-body throwing with legged mobile manipulators. Our framework consists of three components: a nominal tracking policy for the end-effector, a high-frequency residual policy to enhance tracking accuracy, and an optimization-based module to improve end-effector acceleration control. The proposed controller achieved the average of 0.28 m landing error when throwing at targets located 6 m away. Furthermore, in a comparative study with university students, the system achieved a velocity tracking error of 0.398 m/s and a success rate of 56.8%, hitting small targets randomly placed at distances of 3-5 m while throwing at a specified speed of 6 m/s. In contrast, humans have a success rate of only 15.2%. This work provides an early demonstration of prehensile throwing with quantified accuracy on hardware, contributing to progress in dynamic whole-body manipulation.
comment: 8 pages, IROS 2025
♻ ☆ Toward Teach and Repeat Across Seasonal Deep Snow Accumulation
Teach and repeat is a rapid way to achieve autonomy in challenging terrain and off-road environments. A human operator pilots the vehicles to create a network of paths that are mapped and associated with odometry. Immediately after teaching, the system can drive autonomously within its tracks. This precision lets operators remain confident that the robot will follow a traversable route. However, this operational paradigm has rarely been explored in off-road environments that change significantly through seasonal variation. This paper presents preliminary field trials using lidar and radar implementations of teach and repeat. Using a subset of the data from the upcoming FoMo dataset, we attempted to repeat routes that were 4 days, 44 days, and 113 days old. Lidar teach and repeat demonstrated a stronger ability to localize when the ground points were removed. FMCW radar was often able to localize on older maps, but only with small deviations from the taught path. Additionally, we highlight specific cases where radar localization failed with recent maps due to the high pitch or roll of the vehicle. We highlight lessons learned during the field deployment and highlight areas to improve to achieve reliable teach and repeat with seasonal changes in the environment. Please follow the dataset at https://norlab-ulaval.github.io/FoMo-website for updates and information on the data release.
♻ ☆ Energy-Efficient Motion Planner for Legged Robots IROS 2025
We propose an online motion planner for legged robot locomotion with the primary objective of achieving energy efficiency. The conceptual idea is to leverage a placement set of footstep positions based on the robot's body position to determine when and how to execute steps. In particular, the proposed planner uses virtual placement sets beneath the hip joints of the legs and executes a step when the foot is outside of such placement set. Furthermore, we propose a parameter design framework that considers both energy-efficiency and robustness measures to optimize the gait by changing the shape of the placement set along with other parameters, such as step height and swing time, as a function of walking speed. We show that the planner produces trajectories that have a low Cost of Transport (CoT) and high robustness measure, and evaluate our approach against model-free Reinforcement Learning (RL) and motion imitation using biological dog motion priors as the reference. Overall, within low to medium velocity range, we show a 50.4% improvement in CoT and improved robustness over model-free RL, our best performing baseline. Finally, we show ability to handle slippery surfaces, gait transitions, and disturbances in simulation and hardware with the Unitree A1 robot.
comment: This paper has been accepted for publication at the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025). 8 pages, 8 figures
♻ ☆ Robustness Assessment of Assemblies in Frictional Contact
This work establishes a solution to the problem of assessing the capacity of multi-object assemblies to withstand external forces without becoming unstable. Our physically-grounded approach handles arbitrary structures made from rigid objects of any shape and mass distribution without relying on heuristics or approximations. The result is a method that provides a foundation for autonomous robot decision-making when interacting with objects in frictional contact. Our strategy relies on a contact interface graph representation to reason about instabilities and makes use of object shape information to decouple sub-problems and improve efficiency. Our algorithm can be used by motion planners to produce safe assembly transportation plans, and by object placement planners to select better poses. Compared to prior work, our approach is more generally applicable than commonly used heuristics and more efficient than dynamics simulations.
comment: Submitted to IEEE Transactions on Automation Science and Engineering. Contains 14 pages, 16 figures, and 3 tables
♻ ☆ FusionForce: End-to-end Differentiable Neural-Symbolic Layer for Trajectory Prediction
We propose end-to-end differentiable model that predicts robot trajectories on rough offroad terrain from camera images and/or lidar point clouds. The model integrates a learnable component that predicts robot-terrain interaction forces with a neural-symbolic layer that enforces the laws of classical mechanics and consequently improves generalization on out-of-distribution data. The neural-symbolic layer includes a differentiable physics engine that computes the robot's trajectory by querying these forces at the points of contact with the terrain. As the proposed architecture comprises substantial geometrical and physics priors, the resulting model can also be seen as a learnable physics engine conditioned on real sensor data that delivers $10^4$ trajectories per second. We argue and empirically demonstrate that this architecture reduces the sim-to-real gap and mitigates out-of-distribution sensitivity. The differentiability, in conjunction with the rapid simulation speed, makes the model well-suited for various applications including model predictive control, trajectory shooting, supervised and reinforcement learning, or SLAM.
comment: Code: https://github.com/ctu-vras/fusionforce
♻ ☆ ros2 fanuc interface: Design and Evaluation of a Fanuc CRX Hardware Interface in ROS2
This paper introduces the ROS2 control and the Hardware Interface (HW) integration for the Fanuc CRX- robot family. It explains basic implementation details and communication protocols, and its integration with the Moveit2 motion planning library. We conducted a series of experiments to evaluate relevant performances in the robotics field. We tested the developed ros2_fanuc_interface for four relevant robotics cases: step response, trajectory tracking, collision avoidance integrated with Moveit2, and dynamic velocity scaling, respectively. Results show that, despite a non-negligible delay between command and feedback, the robot can track the defined path with negligible errors (if it complies with joint velocity limits), ensuring collision avoidance. Full code is open source and available at https://github.com/paolofrance/ros2_fanuc_interface.
♻ ☆ DroneDiffusion: Robust Quadrotor Dynamics Learning with Diffusion Models ICRA
An inherent fragility of quadrotor systems stems from model inaccuracies and external disturbances. These factors hinder performance and compromise the stability of the system, making precise control challenging. Existing model-based approaches either make deterministic assumptions, utilize Gaussian-based representations of uncertainty, or rely on nominal models, all of which often fall short in capturing the complex, multimodal nature of real-world dynamics. This work introduces DroneDiffusion, a novel framework that leverages conditional diffusion models to learn quadrotor dynamics, formulated as a sequence generation task. DroneDiffusion achieves superior generalization to unseen, complex scenarios by capturing the temporal nature of uncertainties and mitigating error propagation. We integrate the learned dynamics with an adaptive controller for trajectory tracking with stability guarantees. Extensive experiments in both simulation and real-world flights demonstrate the robustness of the framework across a range of scenarios, including unfamiliar flight paths and varying payloads, velocities, and wind disturbances.
comment: Accepted to the International Conference on Robotics and Automation (ICRA) 2025
♻ ☆ COBRA-PPM: A Causal Bayesian Reasoning Architecture Using Probabilistic Programming for Robot Manipulation Under Uncertainty
Manipulation tasks require robots to reason about cause and effect when interacting with objects. Yet, many data-driven approaches lack causal semantics and thus only consider correlations. We introduce COBRA-PPM, a novel causal Bayesian reasoning architecture that combines causal Bayesian networks and probabilistic programming to perform interventional inference for robot manipulation under uncertainty. We demonstrate its capabilities through high-fidelity Gazebo-based experiments on an exemplar block stacking task, where it predicts manipulation outcomes with high accuracy (Pred Acc: 88.6%) and performs greedy next-best action selection with a 94.2% task success rate. We further demonstrate sim2real transfer on a domestic robot, showing effectiveness in handling real-world uncertainty from sensor noise and stochastic actions. Our generalised and extensible framework supports a wide range of manipulation scenarios and lays a foundation for future work at the intersection of robotics and causality.
comment: 8 pages, 7 figures, accepted to the 2025 IEEE European Conference on Mobile Robots (ECMR 2025)
♻ ☆ SemGauss-SLAM: Dense Semantic Gaussian Splatting SLAM IROS 2025
We propose SemGauss-SLAM, a dense semantic SLAM system utilizing 3D Gaussian representation, that enables accurate 3D semantic mapping, robust camera tracking, and high-quality rendering simultaneously. In this system, we incorporate semantic feature embedding into 3D Gaussian representation, which effectively encodes semantic information within the spatial layout of the environment for precise semantic scene representation. Furthermore, we propose feature-level loss for updating 3D Gaussian representation, enabling higher-level guidance for 3D Gaussian optimization. In addition, to reduce cumulative drift in tracking and improve semantic reconstruction accuracy, we introduce semantic-informed bundle adjustment. By leveraging multi-frame semantic associations, this strategy enables joint optimization of 3D Gaussian representation and camera poses, resulting in low-drift tracking and accurate semantic mapping. Our SemGauss-SLAM demonstrates superior performance over existing radiance field-based SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in high-precision semantic segmentation and dense semantic mapping.
comment: IROS 2025
♻ ☆ Fully distributed and resilient source seeking for robot swarms
We propose a self-contained, resilient and fully distributed solution for locating the maximum of an unknown scalar field using a swarm of robots that travel at a constant speed. Unlike conventional reactive methods relying on gradient information, our methodology enables the swarm to determine an ascending direction so that it approaches the source with an arbitrary precision. Our source-seeking solution consists of three distributed algorithms running simultaneously in a slow-fast closed-loop system. The fastest algorithm provides the centroid-relative coordinates of the robots and the next slower one provides the ascending direction to be tracked. The tracking of the ascending direction by single integrators is instantaneous; howeverin this paper we will also focus on 2D unicycle-like robots with a constant speed. The third algorithm, the slowest one since the speed of the robots can be chosen arbitrarily slow, is the individual control law for the unicycle to track the estimated ascending direction.We will show that the three distributed algorithms converge exponentially fast to their objectives, allowing for a feasible slow-fast closed-loop system. The robots are not constrained to any particular geometric formation, and we study both discrete and continuous distributions of robots.The swarm shape analysis reveals the resiliency of our approach as expected in robot swarms, i.e., by amassing robots we ensure the source-seeking functionality in the event of missing or misplaced individuals or even if the robot network splits in two or more disconnected subnetworks.We exploit such an analysis so that the swarm can adapt to unknown environments by morphing its shape and maneuvering while still following an ascending direction. We analyze our solution with robots as kinematic points in n-dimensional Euclidean spaces and extend the analysis to 2D unicycle-like robots with constant speeds.
comment: 16 pages, submitted version to T-TAC. Jesus Bautista and Antonio Acuaviva contributed equally to this work. arXiv admin note: text overlap with arXiv:2309.02937
♻ ☆ AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making NeurIPS 2025
Vision-Language Models (VLMs) encode knowledge and reasoning capabilities for robotic manipulation within high-dimensional representation spaces. However, current approaches often project them into compressed intermediate representations, discarding important task-specific information such as fine-grained spatial or semantic details. To address this, we propose AntiGrounding, a new framework that reverses the instruction grounding process. It lifts candidate actions directly into the VLM representation space, renders trajectories from multiple views, and uses structured visual question answering for instruction-based decision making. This enables zero-shot synthesis of optimal closed-loop robot trajectories for new tasks. We also propose an offline policy refinement module that leverages past experience to enhance long-term performance. Experiments in both simulation and real-world environments show that our method outperforms baselines across diverse robotic manipulation tasks.
comment: submitted to NeurIPS 2025
♻ ☆ ContactDexNet: Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic Mapping
The deep learning models has significantly advanced dexterous manipulation techniques for multi-fingered hand grasping. However, the contact information-guided grasping in cluttered environments remains largely underexplored. To address this gap, we have developed a method for generating multi-fingered hand grasp samples in cluttered settings through contact semantic map. We introduce a contact semantic conditional variational autoencoder network (CoSe-CVAE) for creating comprehensive contact semantic map from object point cloud. We utilize grasp detection method to estimate hand grasp poses from the contact semantic map. Finally, an unified grasp evaluation model PointNetGPD++ is designed to assess grasp quality and collision probability, substantially improving the reliability of identifying optimal grasps in cluttered scenarios. Our grasp generation method has demonstrated remarkable success, outperforming state-of-the-art methods by at least 4.65% with 81.0% average grasping success rate in real-world single-object environment and 75.3% grasping success rate in cluttered scenes. We also proposed the multi-modal multi-fingered grasping dataset generation method. Our multi-fingered hand grasping dataset outperforms previous datasets in scene diversity, modality diversity. The dataset, code and supplementary materials can be found at https://sites.google.com/view/contact-dexnet.
comment: 8 pages
♻ ☆ Perspective-Shifted Neuro-Symbolic World Models: A Framework for Socially-Aware Robot Navigation
Navigating in environments alongside humans requires agents to reason under uncertainty and account for the beliefs and intentions of those around them. Under a sequential decision-making framework, egocentric navigation can naturally be represented as a Markov Decision Process (MDP). However, social navigation additionally requires reasoning about the hidden beliefs of others, inherently leading to a Partially Observable Markov Decision Process (POMDP), where agents lack direct access to others' mental states. Inspired by Theory of Mind and Epistemic Planning, we propose (1) a neuro-symbolic model-based reinforcement learning architecture for social navigation, addressing the challenge of belief tracking in partially observable environments; and (2) a perspective-shift operator for belief estimation, leveraging recent work on Influence-based Abstractions (IBA) in structured multi-agent settings.
comment: Accepted as a regular paper at the 2025 IEEE International Conference on Robot & Human Interactive Communication (RO-MAN). \c{opyright} 2025 IEEE. The final version will appear in IEEE Xplore (DOI TBD)
♻ ☆ Pseudo-Kinematic Trajectory Control and Planning of Tracked Vehicles
Tracked vehicles distribute their weight continuously over a large surface area (the tracks). This distinctive feature makes them the preferred choice for vehicles required to traverse soft and uneven terrain. From a robotics perspective, however, this flexibility comes at a cost: the complexity of modelling the system and the resulting difficulty in designing theoretically sound navigation solutions. In this paper, we aim to bridge this gap by proposing a framework for the navigation of tracked vehicles, built upon three key pillars. The first pillar comprises two models: a simulation model and a control-oriented model. The simulation model captures the intricate terramechanics dynamics arising from soil-track interaction and is employed to develop faithful digital twins of the system across a wide range of operating conditions. The control-oriented model is pseudo-kinematic and mathematically tractable, enabling the design of efficient and theoretically robust control schemes. The second pillar is a Lyapunov-based feedback trajectory controller that provides certifiable tracking guarantees. The third pillar is a portfolio of motion planning solutions, each offering different complexity-accuracy trade-offs. The various components of the proposed approach are validated through an extensive set of simulation and experimental data.
♻ ☆ Help or Hindrance: Understanding the Impact of Robot Communication in Action Teams
The human-robot interaction (HRI) field has recognized the importance of enabling robots to interact with teams. Human teams rely on effective communication for successful collaboration in time-sensitive environments. Robots can play a role in enhancing team coordination through real-time assistance. Despite significant progress in human-robot teaming research, there remains an essential gap in how robots can effectively communicate with action teams using multimodal interaction cues in time-sensitive environments. This study addresses this knowledge gap in an experimental in-lab study to investigate how multimodal robot communication in action teams affects workload and human perception of robots. We explore team collaboration in a medical training scenario where a robotic crash cart (RCC) provides verbal and non-verbal cues to help users remember to perform iterative tasks and search for supplies. Our findings show that verbal cues for object search tasks and visual cues for task reminders reduce team workload and increase perceived ease of use and perceived usefulness more effectively than a robot with no feedback. Our work contributes to multimodal interaction research in the HRI field, highlighting the need for more human-robot teaming research to understand best practices for integrating collaborative robots in time-sensitive environments such as in hospitals, search and rescue, and manufacturing applications.
comment: This is the author's original submitted version of the paper accepted to the 2025 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). \c{opyright} 2025 IEEE. Personal use of this material is permitted. For any other use, please contact IEEE
♻ ☆ Human-Robot Teaming Field Deployments: A Comparison Between Verbal and Non-verbal Communication
Healthcare workers (HCWs) encounter challenges in hospitals, such as retrieving medical supplies quickly from crash carts, which could potentially result in medical errors and delays in patient care. Robotic crash carts (RCCs) have shown promise in assisting healthcare teams during medical tasks through guided object searches and task reminders. Limited exploration has been done to determine what communication modalities are most effective and least disruptive to patient care in real-world settings. To address this gap, we conducted a between-subjects experiment comparing the RCC's verbal and non-verbal communication of object search with a standard crash cart in resuscitation scenarios to understand the impact of robot communication on workload and attitudes toward using robots in the workplace. Our findings indicate that verbal communication significantly reduced mental demand and effort compared to visual cues and with a traditional crash cart. Although frustration levels were slightly higher during collaborations with the robot compared to a traditional cart, these research insights provide valuable implications for human-robot teamwork in high-stakes environments.
comment: This is the author's original submitted version of the paper accepted to the 2025 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). \c{opyright} 2025 IEEE. Personal use of this material is permitted. For any other use, please contact IEEE
♻ ☆ TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning
Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineering with Vision-Language Models (VLMs) have shown promise, their sparse reward nature significantly limits sample efficiency. This paper introduces TeViR, a novel method that leverages a pre-trained text-to-video diffusion model to generate dense rewards by comparing the predicted image sequence with current observations. Experimental results across 11 complex robotic tasks demonstrate that TeViR outperforms traditional methods leveraging sparse rewards and other state-of-the-art (SOTA) methods, achieving better sample efficiency and performance without ground truth environmental rewards. TeViR's ability to efficiently guide agents in complex environments highlights its potential to advance reinforcement learning applications in robotic manipulation.
♻ ☆ DynNPC: Finding More Violations Induced by ADS in Simulation Testing through Dynamic NPC Behavior Generation
Recently, a number of simulation testing approaches have been proposed to generate diverse driving scenarios for autonomous driving systems (ADSs) testing. However, the behaviors of NPC vehicles in these scenarios generated by previous approaches are predefined and mutated before simulation execution, ignoring traffic signals and the behaviors of the Ego vehicle. Thus, a large number of the violations they found are induced by unrealistic behaviors of NPC vehicles, revealing no bugs of ADSs. Besides, the vast scenario search space of NPC behaviors during the iterative mutations limits the efficiency of previous approaches. To address these limitations, we propose a novel scenario-based testing framework, DynNPC, to generate more violation scenarios induced by the ADS. Specifically, DynNPC allows NPC vehicles to dynamically generate behaviors using different driving strategies during simulation execution based on traffic signals and the real-time behavior of the Ego vehicle. We compare DynNPC with five state-of-the-art scenario-based testing approaches. Our evaluation has demonstrated the effectiveness and efficiency of DynNPC in finding more violation scenarios induced by the ADS.
♻ ☆ Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation IROS 2025
3D point cloud semantic segmentation (PCSS) is a cornerstone for environmental perception in robotic systems and autonomous driving, enabling precise scene understanding through point-wise classification. While unsupervised domain adaptation (UDA) mitigates label scarcity in PCSS, existing methods critically overlook the inherent vulnerability to real-world perturbations (e.g., snow, fog, rain) and adversarial distortions. This work first identifies two intrinsic limitations that undermine current PCSS-UDA robustness: (a) unsupervised features overlap from unaligned boundaries in shared-class regions and (b) feature structure erosion caused by domain-invariant learning that suppresses target-specific patterns. To address the proposed problems, we propose a tripartite framework consisting of: 1) a robustness evaluation model quantifying resilience against adversarial attack/corruption types through robustness metrics; 2) an invertible attention alignment module (IAAM) enabling bidirectional domain mapping while preserving discriminative structure via attention-guided overlap suppression; and 3) a contrastive memory bank with quality-aware contrastive learning that progressively refines pseudo-labels with feature quality for more discriminative representations. Extensive experiments on SynLiDAR-to-SemanticPOSS adaptation demonstrate a maximum mIoU improvement of 14.3\% under adversarial attack.
comment: This paper has been accepted to the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
Robotics 55
☆ MinD: Unified Visual Imagination and Control via Hierarchical World Models
Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics by integrating simulation, prediction, and manipulation. However, their practical application remains limited due to (1) slowgeneration speed, which limits real-time interaction, and (2) poor consistency between imagined videos and executable actions. To address these challenges, we propose Manipulate in Dream (MinD), a hierarchical diffusion-based world model framework that employs a dual-system design for vision-language manipulation. MinD executes VGM at low frequencies to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction. This architecture enables low-latency, closed-loop control in manipulation with coherent visual guidance. To better coordinate the two systems, we introduce a video-action diffusion matching module (DiffMatcher), with a novel co-training strategy that uses separate schedulers for each diffusion model. Specifically, we introduce a diffusion-forcing mechanism to DiffMatcher that aligns their intermediate representations during training, helping the fast action model better understand video-based predictions. Beyond manipulation, MinD also functions as a world simulator, reliably predicting task success or failure in latent space before execution. Trustworthy analysis further shows that VGMs can preemptively evaluate task feasibility and mitigate risks. Extensive experiments across multiple benchmarks demonstrate that MinD achieves state-of-the-art manipulation (63%+) in RL-Bench, advancing the frontier of unified world modeling in robotics.
☆ GRAND-SLAM: Local Optimization for Globally Consistent Large-Scale Multi-Agent Gaussian SLAM
3D Gaussian splatting has emerged as an expressive scene representation for RGB-D visual SLAM, but its application to large-scale, multi-agent outdoor environments remains unexplored. Multi-agent Gaussian SLAM is a promising approach to rapid exploration and reconstruction of environments, offering scalable environment representations, but existing approaches are limited to small-scale, indoor environments. To that end, we propose Gaussian Reconstruction via Multi-Agent Dense SLAM, or GRAND-SLAM, a collaborative Gaussian splatting SLAM method that integrates i) an implicit tracking module based on local optimization over submaps and ii) an approach to inter- and intra-robot loop closure integrated into a pose-graph optimization framework. Experiments show that GRAND-SLAM provides state-of-the-art tracking performance and 28% higher PSNR than existing methods on the Replica indoor dataset, as well as 91% lower multi-agent tracking error and improved rendering over existing multi-agent methods on the large-scale, outdoor Kimera-Multi dataset.
☆ SViP: Sequencing Bimanual Visuomotor Policies with Object-Centric Motion Primitives
Imitation learning (IL), particularly when leveraging high-dimensional visual inputs for policy training, has proven intuitive and effective in complex bimanual manipulation tasks. Nonetheless, the generalization capability of visuomotor policies remains limited, especially when small demonstration datasets are available. Accumulated errors in visuomotor policies significantly hinder their ability to complete long-horizon tasks. To address these limitations, we propose SViP, a framework that seamlessly integrates visuomotor policies into task and motion planning (TAMP). SViP partitions human demonstrations into bimanual and unimanual operations using a semantic scene graph monitor. Continuous decision variables from the key scene graph are employed to train a switching condition generator. This generator produces parameterized scripted primitives that ensure reliable performance even when encountering out-of-the-distribution observations. Using only 20 real-world demonstrations, we show that SViP enables visuomotor policies to generalize across out-of-distribution initial conditions without requiring object pose estimators. For previously unseen tasks, SViP automatically discovers effective solutions to achieve the goal, leveraging constraint modeling in TAMP formulism. In real-world experiments, SViP outperforms state-of-the-art generative IL methods, indicating wider applicability for more complex tasks. Project website: https://sites.google.com/view/svip-bimanual
comment: Project website: https://sites.google.com/view/svip-bimanual
☆ Learning Physical Systems: Symplectification via Gauge Fixing in Dirac Structures
Physics-informed deep learning has achieved remarkable progress by embedding geometric priors, such as Hamiltonian symmetries and variational principles, into neural networks, enabling structure-preserving models that extrapolate with high accuracy. However, in systems with dissipation and holonomic constraints, ubiquitous in legged locomotion and multibody robotics, the canonical symplectic form becomes degenerate, undermining the very invariants that guarantee stability and long-term prediction. In this work, we tackle this foundational limitation by introducing Presymplectification Networks (PSNs), the first framework to learn the symplectification lift via Dirac structures, restoring a non-degenerate symplectic geometry by embedding constrained systems into a higher-dimensional manifold. Our architecture combines a recurrent encoder with a flow-matching objective to learn the augmented phase-space dynamics end-to-end. We then attach a lightweight Symplectic Network (SympNet) to forecast constrained trajectories while preserving energy, momentum, and constraint satisfaction. We demonstrate our method on the dynamics of the ANYmal quadruped robot, a challenging contact-rich, multibody system. To the best of our knowledge, this is the first framework that effectively bridges the gap between constrained, dissipative mechanical systems and symplectic learning, unlocking a whole new class of geometric machine learning models, grounded in first principles yet adaptable from data.
comment: Presented at Equivariant Systems: Theory and Applications in State Estimation, Artificial Intelligence and Control, Robotics: Science and Systems (RSS) 2025 Workshop, 6 Pages, 3 Figures
☆ OC-SOP: Enhancing Vision-Based 3D Semantic Occupancy Prediction by Object-Centric Awareness
Autonomous driving perception faces significant challenges due to occlusions and incomplete scene data in the environment. To overcome these issues, the task of semantic occupancy prediction (SOP) is proposed, which aims to jointly infer both the geometry and semantic labels of a scene from images. However, conventional camera-based methods typically treat all categories equally and primarily rely on local features, leading to suboptimal predictions, especially for dynamic foreground objects. To address this, we propose Object-Centric SOP (OC-SOP), a framework that integrates high-level object-centric cues extracted via a detection branch into the semantic occupancy prediction pipeline. This object-centric integration significantly enhances the prediction accuracy for foreground objects and achieves state-of-the-art performance among all categories on SemanticKITTI.
comment: under review
☆ SWA-SOP: Spatially-aware Window Attention for Semantic Occupancy Prediction in Autonomous Driving
Perception systems in autonomous driving rely on sensors such as LiDAR and cameras to perceive the 3D environment. However, due to occlusions and data sparsity, these sensors often fail to capture complete information. Semantic Occupancy Prediction (SOP) addresses this challenge by inferring both occupancy and semantics of unobserved regions. Existing transformer-based SOP methods lack explicit modeling of spatial structure in attention computation, resulting in limited geometric awareness and poor performance in sparse or occluded areas. To this end, we propose Spatially-aware Window Attention (SWA), a novel mechanism that incorporates local spatial context into attention. SWA significantly improves scene completion and achieves state-of-the-art results on LiDAR-based SOP benchmarks. We further validate its generality by integrating SWA into a camera-based SOP pipeline, where it also yields consistent gains across modalities.
comment: under reviewed
☆ DefFusionNet: Learning Multimodal Goal Shapes for Deformable Object Manipulation via a Diffusion-based Probabilistic Model
Deformable object manipulation is critical to many real-world robotic applications, ranging from surgical robotics and soft material handling in manufacturing to household tasks like laundry folding. At the core of this important robotic field is shape servoing, a task focused on controlling deformable objects into desired shapes. The shape servoing formulation requires the specification of a goal shape. However, most prior works in shape servoing rely on impractical goal shape acquisition methods, such as laborious domain-knowledge engineering or manual manipulation. DefGoalNet previously posed the current state-of-the-art solution to this problem, which learns deformable object goal shapes directly from a small number of human demonstrations. However, it significantly struggles in multi-modal settings, where multiple distinct goal shapes can all lead to successful task completion. As a deterministic model, DefGoalNet collapses these possibilities into a single averaged solution, often resulting in an unusable goal. In this paper, we address this problem by developing DefFusionNet, a novel neural network that leverages the diffusion probabilistic model to learn a distribution over all valid goal shapes rather than predicting a single deterministic outcome. This enables the generation of diverse goal shapes and avoids the averaging artifacts. We demonstrate our method's effectiveness on robotic tasks inspired by both manufacturing and surgical applications, both in simulation and on a physical robot. Our work is the first generative model capable of producing a diverse, multi-modal set of deformable object goals for real-world robotic applications.
☆ USVTrack: USV-Based 4D Radar-Camera Tracking Dataset for Autonomous Driving in Inland Waterways IROS
Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensors, our USV collected comprehensive object tracking data, which we present as USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in new generation waterborne transportation systems. Our USVTrack dataset presents rich scenarios, featuring diverse various waterways, varying times of day, and multiple weather and lighting conditions. Moreover, we present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers. Experimental results utilizing RCM demonstrate the effectiveness of the radar-camera matching in improving object tracking accuracy and reliability for autonomous driving in waterborne environments. The USVTrack dataset is public on https://usvtrack.github.io.
comment: Accepted by IROS
☆ TDACloud: Point Cloud Recognition Using Topological Data Analysis
Point cloud-based object/place recognition remains a problem of interest in applications such as autonomous driving, scene reconstruction, and localization. Extracting meaningful local descriptors from a query point cloud that can be matched with the descriptors of the collected point clouds is a challenging problem. Furthermore, when the query point cloud is noisy or has been transformed (e.g., rotated), it adds to the complexity. To this end, we propose a novel methodology, named TDACloud, using Topological Data Analysis (TDA) for local descriptor extraction from a point cloud, which does not need resource-intensive GPU-based machine learning training. More specifically, we used the ATOL vectorization method to generate vectors for point clouds. Unlike voxelization, our proposed technique can take raw point clouds as inputs and outputs a fixed-size TDA-descriptor vector. To test the quality of the proposed TDACloud technique, we have implemented it on multiple real-world (e.g., Oxford RobotCar, KITTI-360) and realistic (e.g., ShapeNet) point cloud datasets for object and place recognition. We have also tested TDACloud on noisy and transformed test cases where the query point cloud has been scaled, translated, or rotated. Our results demonstrate high recognition accuracies in noisy conditions and large-scale real-world place recognition while outperforming the baselines by up to approximately 14%.
☆ Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition IJCNN
Effective human action recognition is widely used for cobots in Industry 4.0 to assist in assembly tasks. However, conventional skeleton-based methods often lose keypoint semantics, limiting their effectiveness in complex interactions. In this work, we introduce a novel approach to skeleton-based action recognition that enriches input representations by leveraging word embeddings to encode semantic information. Our method replaces one-hot encodings with semantic volumes, enabling the model to capture meaningful relationships between joints and objects. Through extensive experiments on multiple assembly datasets, we demonstrate that our approach significantly improves classification performance, and enhances generalization capabilities by simultaneously supporting different skeleton types and object classes. Our findings highlight the potential of incorporating semantic information to enhance skeleton-based action recognition in dynamic and diverse environments.
comment: IEEE International Joint Conference on Neural Networks (IJCNN) 2025
☆ Safety-Aware Optimal Scheduling for Autonomous Masonry Construction using Collaborative Heterogeneous Aerial Robots IROS 2025
This paper presents a novel high-level task planning and optimal coordination framework for autonomous masonry construction, using a team of heterogeneous aerial robotic workers, consisting of agents with separate skills for brick placement and mortar application. This introduces new challenges in scheduling and coordination, particularly due to the mortar curing deadline required for structural bonding and ensuring the safety constraints among UAVs operating in parallel. To address this, an automated pipeline generates the wall construction plan based on the available bricks while identifying static structural dependencies and potential conflicts for safe operation. The proposed framework optimizes UAV task allocation and execution timing by incorporating dynamically coupled precedence deadline constraints that account for the curing process and static structural dependency constraints, while enforcing spatio-temporal constraints to prevent collisions and ensure safety. The primary objective of the scheduler is to minimize the overall construction makespan while minimizing logistics, traveling time between tasks, and the curing time to maintain both adhesion quality and safe workspace separation. The effectiveness of the proposed method in achieving coordinated and time-efficient aerial masonry construction is extensively validated through Gazebo simulated missions. The results demonstrate the framework's capability to streamline UAV operations, ensuring both structural integrity and safety during the construction process.
comment: This paper has been accepted for publication at the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
☆ NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments
Autonomous aerial target tracking in unstructured and GPS-denied environments remains a fundamental challenge in robotics. Many existing methods rely on motion capture systems, pre-mapped scenes, or feature-based localization to ensure safety and control, limiting their deployment in real-world conditions. We introduce NOVA, a fully onboard, object-centric framework that enables robust target tracking and collision-aware navigation using only a stereo camera and an IMU. Rather than constructing a global map or relying on absolute localization, NOVA formulates perception, estimation, and control entirely in the target's reference frame. A tightly integrated stack combines a lightweight object detector with stereo depth completion, followed by histogram-based filtering to infer robust target distances under occlusion and noise. These measurements feed a visual-inertial state estimator that recovers the full 6-DoF pose of the robot relative to the target. A nonlinear model predictive controller (NMPC) plans dynamically feasible trajectories in the target frame. To ensure safety, high-order control barrier functions are constructed online from a compact set of high-risk collision points extracted from depth, enabling real-time obstacle avoidance without maps or dense representations. We validate NOVA across challenging real-world scenarios, including urban mazes, forest trails, and repeated transitions through buildings with intermittent GPS loss and severe lighting changes that disrupt feature-based localization. Each experiment is repeated multiple times under similar conditions to assess resilience, showing consistent and reliable performance. NOVA achieves agile target following at speeds exceeding 50 km/h. These results show that high-speed vision-based tracking is possible in the wild using only onboard sensing, with no reliance on external localization or environment assumptions.
☆ MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation
Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios, and fall difficulties in large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectories groundtruth and high-accuracy 3D meshes groundtruth. To this end, we propose the first real-world Dense slam (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance the development of the research in both SLAM, 3D reconstruction, and visual foundation model. Experiments on various datasets demonstrate the superiority of the proposed method in both mapping, tracking, and communication. The dataset and code will open-source on https://github.com/dtc111111/mcnslam.
☆ PG-LIO: Photometric-Geometric fusion for Robust LiDAR-Inertial Odometry
LiDAR-Inertial Odometry (LIO) is widely used for accurate state estimation and mapping which is an essential requirement for autonomous robots. Conventional LIO methods typically rely on formulating constraints from the geometric structure sampled by the LiDAR. Hence, in the lack of geometric structure, these tend to become ill-conditioned (degenerate) and fail. Robustness of LIO to such conditions is a necessity for its broader deployment. To address this, we propose PG-LIO, a real-time LIO method that fuses photometric and geometric information sampled by the LiDAR along with inertial constraints from an Inertial Measurement Unit (IMU). This multi-modal information is integrated into a factor graph optimized over a sliding window for real-time operation. We evaluate PG-LIO on multiple datasets that include both geometrically well-conditioned as well as self-similar scenarios. Our method achieves accuracy on par with state-of-the-art LIO in geometrically well-structured settings while significantly improving accuracy in degenerate cases including against methods that also fuse intensity. Notably, we demonstrate only 1 m drift over a 1 km manually piloted aerial trajectory through a geometrically self-similar tunnel at an average speed of 7.5m/s (max speed 10.8 m/s). For the benefit of the community, we shall also release our source code https://github.com/ntnu-arl/mimosa
comment: 8 pages, 6 figures
☆ Learning Point Correspondences In Radar 3D Point Clouds For Radar-Inertial Odometry
Using 3D point clouds in odometry estimation in robotics often requires finding a set of correspondences between points in subsequent scans. While there are established methods for point clouds of sufficient quality, state-of-the-art still struggles when this quality drops. Thus, this paper presents a novel learning-based framework for predicting robust point correspondences between pairs of noisy, sparse and unstructured 3D point clouds from a light-weight, low-power, inexpensive, consumer-grade System-on-Chip (SoC) Frequency Modulated Continuous Wave (FMCW) radar sensor. Our network is based on the transformer architecture which allows leveraging the attention mechanism to discover pairs of points in consecutive scans with the greatest mutual affinity. The proposed network is trained in a self-supervised way using set-based multi-label classification cross-entropy loss, where the ground-truth set of matches is found by solving the Linear Sum Assignment (LSA) optimization problem, which avoids tedious hand annotation of the training data. Additionally, posing the loss calculation as multi-label classification permits supervising on point correspondences directly instead of on odometry error, which is not feasible for sparse and noisy data from the SoC radar we use. We evaluate our method with an open-source state-of-the-art Radar-Inertial Odometry (RIO) framework in real-world Unmanned Aerial Vehicle (UAV) flights and with the widely used public Coloradar dataset. Evaluation shows that the proposed method improves the position estimation accuracy by over 14 % and 19 % on average, respectively. The open source code and datasets can be found here: https://github.com/aau-cns/radar_transformer.
☆ Design, fabrication and control of a cable-driven parallel robot
In cable driven parallel robots (CDPRs), the payload is suspended using a network of cables whose length can be controlled to maneuver the payload within the workspace. Compared to rigid link robots, CDPRs provide better maneuverability due to the flexibility of the cables and consume lesser power due to the high strength-to-weight ratio of the cables. However, amongst other things, the flexibility of the cables and the fact that they can only pull (and not push) render the dynamics of CDPRs complex. Hence advanced modelling paradigms and control algorithms must be developed to fully utilize the potential of CDPRs. Furthermore, given the complex dynamics of CDPRs, the models and control algorithms proposed for them must be validated on experimental setups to ascertain their efficacy in practice. We have recently developed an elaborate experimental setup for a CDPR with three cables and validated elementary open-loop motion planning algorithms on it. In this paper, we describe several aspects of the design and fabrication of our setup, including component selection and assembly, and present our experimental results. Our setup can reproduce complex phenomenon such as the transverse vibration of the cables seen in large CDPRs and will in the future be used to model and control such phenomenon and also to validate more sophisticated motion planning algorithms.
comment: 4 pages, 8 fugures
☆ Mirror Eyes: Explainable Human-Robot Interaction at a Glance
The gaze of a person tends to reflect their interest. This work explores what happens when this statement is taken literally and applied to robots. Here we present a robot system that employs a moving robot head with a screen-based eye model that can direct the robot's gaze to points in physical space and present a reflection-like mirror image of the attended region on top of each eye. We conducted a user study with 33 participants, who were asked to instruct the robot to perform pick-and-place tasks, monitor the robot's task execution, and interrupt it in case of erroneous actions. Despite a deliberate lack of instructions about the role of the eyes and a very brief system exposure, participants felt more aware about the robot's information processing, detected erroneous actions earlier, and rated the user experience higher when eye-based mirroring was enabled compared to non-reflective eyes. These results suggest a beneficial and intuitive utilization of the introduced method in cooperative human-robot interaction.
comment: Accepted to the 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
☆ A Motivational Architecture for Open-Ended Learning Challenges in Robots
Developing agents capable of autonomously interacting with complex and dynamic environments, where task structures may change over time and prior knowledge cannot be relied upon, is a key prerequisite for deploying artificial systems in real-world settings. The open-ended learning framework identifies the core challenges for creating such agents, including the ability to autonomously generate new goals, acquire the necessary skills (or curricula of skills) to achieve them, and adapt to non-stationary environments. While many existing works tackles various aspects of these challenges in isolation, few propose integrated solutions that address them simultaneously. In this paper, we introduce H-GRAIL, a hierarchical architecture that, through the use of different typologies of intrinsic motivations and interconnected learning mechanisms, autonomously discovers new goals, learns the required skills for their achievement, generates skill sequences for tackling interdependent tasks, and adapts to non-stationary environments. We tested H-GRAIL in a real robotic scenario, demonstrating how the proposed solutions effectively address the various challenges of open-ended learning.
comment: Accepted to RLDM 2025
☆ GraspMAS: Zero-Shot Language-driven Grasp Detection with Multi-Agent System IROS 2025
Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or finetuning step to adapt to new domains, limiting their generation in real-world applications. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which evaluates the outcomes and provides feedback. Intensive experiments on two large-scale datasets demonstrate that our GraspMAS significantly outperforms existing baselines. Additionally, robot experiments conducted in both simulation and real-world settings further validate the effectiveness of our approach.
comment: 8 pages, accepted to IROS 2025
☆ Radar and Event Camera Fusion for Agile Robot Ego-Motion Estimation
Achieving reliable ego motion estimation for agile robots, e.g., aerobatic aircraft, remains challenging because most robot sensors fail to respond timely and clearly to highly dynamic robot motions, often resulting in measurement blurring, distortion, and delays. In this paper, we propose an IMU-free and feature-association-free framework to achieve aggressive ego-motion velocity estimation of a robot platform in highly dynamic scenarios by combining two types of exteroceptive sensors, an event camera and a millimeter wave radar, First, we used instantaneous raw events and Doppler measurements to derive rotational and translational velocities directly. Without a sophisticated association process between measurement frames, the proposed method is more robust in texture-less and structureless environments and is more computationally efficient for edge computing devices. Then, in the back-end, we propose a continuous-time state-space model to fuse the hybrid time-based and event-based measurements to estimate the ego-motion velocity in a fixed-lagged smoother fashion. In the end, we validate our velometer framework extensively in self-collected experiment datasets. The results indicate that our IMU-free and association-free ego motion estimation framework can achieve reliable and efficient velocity output in challenging environments. The source code, illustrative video and dataset are available at https://github.com/ZzhYgwh/TwistEstimator.
☆ Integrating Maneuverable Planning and Adaptive Control for Robot Cart-Pushing under Disturbances
Precise and flexible cart-pushing is a challenging task for mobile robots. The motion constraints during cart-pushing and the robot's redundancy lead to complex motion planning problems, while variable payloads and disturbances present complicated dynamics. In this work, we propose a novel planning and control framework for flexible whole-body coordination and robust adaptive control. Our motion planning method employs a local coordinate representation and a novel kinematic model to solve a nonlinear optimization problem, thereby enhancing motion maneuverability by generating feasible and flexible push poses. Furthermore, we present a disturbance rejection control method to resist disturbances and reduce control errors for the complex control problem without requiring an accurate dynamic model. We validate our method through extensive experiments in simulation and real-world settings, demonstrating its superiority over existing approaches. To the best of our knowledge, this is the first work to systematically evaluate the flexibility and robustness of cart-pushing methods in experiments. The video supplement is available at https://sites.google.com/view/mpac-pushing/.
comment: 11 pages, 11 figures
☆ Robots and Children that Learn Together : Improving Knowledge Retention by Teaching Peer-Like Interactive Robots
Despite growing interest in Learning-by-Teaching (LbT), few studies have explored how this paradigm can be implemented with autonomous, peer-like social robots in real classrooms. Most prior work has relied on scripted or Wizard-of-Oz behaviors, limiting our understanding of how real-time, interactive learning can be supported by artificial agents. This study addresses this gap by introducing Interactive Reinforcement Learning (RL) as a cognitive model for teachable social robots. We conducted two between-subject experiments with 58 primary school children, who either taught a robot or practiced independently on a tablet while learning French vocabulary (memorization) and grammatical rules (inference). The robot, powered by Interactive RL, learned from the child's evaluative feedback. Children in the LbT condition achieved significantly higher retention gains compared to those in the self-practice condition, especially on the grammar task. Learners with lower prior knowledge benefited most from teaching the robot. Behavioural metrics revealed that children adapted their teaching strategies over time and engaged more deeply during inference tasks. This work makes two contributions: (1) it introduces Interactive RL as a pedagogically effective and scalable model for peer-robot learning, and (2) it demonstrates, for the first time, the feasibility of deploying multiple autonomous robots simultaneously in real classrooms. These findings extend theoretical understanding of LbT by showing that social robots can function not only as passive tutees but as adaptive partners that enhance meta-cognitive engagement and long-term learning outcomes.
☆ Robotic Manipulation of a Rotating Chain with Bottom End Fixed
This paper studies the problem of using a robot arm to manipulate a uniformly rotating chain with its bottom end fixed. Existing studies have investigated ideal rotational shapes for practical applications, yet they do not discuss how these shapes can be consistently achieved through manipulation planning. Our work presents a manipulation strategy for stable and consistent shape transitions. We find that the configuration space of such a chain is homeomorphic to a three-dimensional cube. Using this property, we suggest a strategy to manipulate the chain into different configurations, specifically from one rotation mode to another, while taking stability and feasibility into consideration. We demonstrate the effectiveness of our strategy in physical experiments by successfully transitioning from rest to the first two rotation modes. The concepts explored in our work has critical applications in ensuring safety and efficiency of drill string and yarn spinning operations.
comment: 6 pages, 5 figures
☆ TritonZ: A Remotely Operated Underwater Rover with Manipulator Arm for Exploration and Rescue Operations
The increasing demand for underwater exploration and rescue operations enforces the development of advanced wireless or semi-wireless underwater vessels equipped with manipulator arms. This paper presents the implementation of a semi-wireless underwater vehicle, "TritonZ" equipped with a manipulator arm, tailored for effective underwater exploration and rescue operations. The vehicle's compact design enables deployment in different submarine surroundings, addressing the need for wireless systems capable of navigating challenging underwater terrains. The manipulator arm can interact with the environment, allowing the robot to perform sophisticated tasks during exploration and rescue missions in emergency situations. TritonZ is equipped with various sensors such as Pi-Camera, Humidity, and Temperature sensors to send real-time environmental data. Our underwater vehicle controlled using a customized remote controller can navigate efficiently in the water where Pi-Camera enables live streaming of the surroundings. Motion control and video capture are performed simultaneously using this camera. The manipulator arm is designed to perform various tasks, similar to grasping, manipulating, and collecting underwater objects. Experimental results shows the efficacy of the proposed remotely operated vehicle in performing a variety of underwater exploration and rescue tasks. Additionally, the results show that TritonZ can maintain an average of 13.5cm/s with a minimal delay of 2-3 seconds. Furthermore, the vehicle can sustain waves underwater by maintaining its position as well as average velocity. The full project details and source code can be accessed at this link: https://github.com/kawser-ahmed-byte/TritonZ
comment: 6 pages, 5 figures
☆ Crowdsourcing Ubiquitous Indoor Localization with Non-Cooperative Wi-Fi Ranging
Indoor localization opens the path to potentially transformative applications. Although many indoor localization methods have been proposed over the years, they remain too impractical for widespread deployment in the real world. In this paper, we introduce PeepLoc, a deployable and scalable Wi-Fi-based solution for indoor localization that relies only on pre-existing devices and infrastructure. Specifically, PeepLoc works on any mobile device with an unmodified Wi-Fi transceiver and in any indoor environment with a sufficient number of Wi-Fi access points (APs) and pedestrian traffic. At the core of PeepLoc is (a) a mechanism which allows any Wi-Fi device to obtain non-cooperative time-of-flight (ToF) to any Wi-Fi AP and (b) a novel bootstrapping mechanism that relies on pedestrian dead reckoning (PDR) and crowdsourcing to opportunistically initialize pre-existing APs as anchor points within an environment. We implement PeepLoc using commodity hardware and evaluate it extensively across 4 campus buildings. We show PeepLoc leads to a mean and median positional error of 3.41 m and 3.06 m respectively, which is superior to existing deployed indoor localization systems and is competitive with commodity GPS in outdoor environments.
☆ Improvement on LiDAR-Camera Calibration Using Square Targets
Precise sensor calibration is critical for autonomous vehicles as a prerequisite for perception algorithms to function properly. Rotation error of one degree can translate to position error of meters in target object detection at large distance, leading to improper reaction of the system or even safety related issues. Many methods for multi-sensor calibration have been proposed. However, there are very few work that comprehensively consider the challenges of the calibration procedure when applied to factory manufacturing pipeline or after-sales service scenarios. In this work, we introduce a fully automatic LiDAR-camera extrinsic calibration algorithm based on targets that is fast, easy to deploy and robust to sensor noises such as missing data. The core of the method include: (1) an automatic multi-stage LiDAR board detection pipeline using only geometry information with no specific material requirement; (2) a fast coarse extrinsic parameter search mechanism that is robust to initial extrinsic errors; (3) a direct optimization algorithm that is robust to sensor noises. We validate the effectiveness of our methods through experiments on data captured in real world scenarios.
☆ Learning Approach to Efficient Vision-based Active Tracking of a Flying Target by an Unmanned Aerial Vehicle
Autonomous tracking of flying aerial objects has important civilian and defense applications, ranging from search and rescue to counter-unmanned aerial systems (counter-UAS). Ground based tracking requires setting up infrastructure, could be range limited, and may not be feasible in remote areas, crowded cities or in dense vegetation areas. Vision based active tracking of aerial objects from another airborne vehicle, e.g., a chaser unmanned aerial vehicle (UAV), promises to fill this important gap, along with serving aerial coordination use cases. Vision-based active tracking by a UAV entails solving two coupled problems: 1) compute-efficient and accurate (target) object detection and target state estimation; and 2) maneuver decisions to ensure that the target remains in the field of view in the future time-steps and favorably positioned for continued detection. As a solution to the first problem, this paper presents a novel integration of standard deep learning based architectures with Kernelized Correlation Filter (KCF) to achieve compute-efficient object detection without compromising accuracy, unlike standalone learning or filtering approaches. The proposed perception framework is validated using a lab-scale setup. For the second problem, to obviate the linearity assumptions and background variations limiting effectiveness of the traditional controllers, we present the use of reinforcement learning to train a neuro-controller for fast computation of velocity maneuvers. New state space, action space and reward formulations are developed for this purpose, and training is performed in simulation using AirSim. The trained model is also tested in AirSim with respect to complex target maneuvers, and is found to outperform a baseline PID control in terms of tracking up-time and average distance maintained (from the target) during tracking.
comment: AIAA Aviation 2025
☆ Robot Tactile Gesture Recognition Based on Full-body Modular E-skin
With the development of robot electronic skin technology, various tactile sensors, enhanced by AI, are unlocking a new dimension of perception for robots. In this work, we explore how robots equipped with electronic skin can recognize tactile gestures and interpret them as human commands. We developed a modular robot E-skin, composed of multiple irregularly shaped skin patches, which can be assembled to cover the robot's body while capturing real-time pressure and pose data from thousands of sensing points. To process this information, we propose an equivariant graph neural network-based recognizer that efficiently and accurately classifies diverse tactile gestures, including poke, grab, stroke, and double-pat. By mapping the recognized gestures to predefined robot actions, we enable intuitive human-robot interaction purely through tactile input.
☆ Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning
Large vision-language models (VLMs) for autonomous driving (AD) are evolving beyond perception and cognition tasks toward motion planning. However, we identify two critical challenges in this direction: (1) VLMs tend to learn shortcuts by relying heavily on history input information, achieving seemingly strong planning results without genuinely understanding the visual inputs; and (2) the chain-ofthought (COT) reasoning processes are always misaligned with the motion planning outcomes, and how to effectively leverage the complex reasoning capability to enhance planning remains largely underexplored. In this paper, we start from a small-scale domain-specific VLM and propose Drive-R1 designed to bridges the scenario reasoning and motion planning for AD. Drive-R1 first undergoes the supervised finetuning on a elaborate dataset containing both long and short COT data. Drive-R1 is encouraged to reason step-by-step from visual input to final planning decisions. Subsequently, Drive-R1 is trained within a reinforcement learning framework that incentivizes the discovery of reasoning paths that are more informative for planning, guided by rewards based on predicted trajectories and meta actions. Experimental evaluations on the nuScenes and DriveLM-nuScenes benchmarks demonstrate that Drive-R1 achieves superior performance compared to existing state-of-the-art VLMs. We believe that Drive-R1 presents a promising direction for bridging reasoning and planning in AD, offering methodological insights for future research and applications.
☆ Haptic-ACT -- Pseudo Oocyte Manipulation by a Robot Using Multimodal Information and Action Chunking with Transformers IROS2025
In this paper we introduce Haptic-ACT, an advanced robotic system for pseudo oocyte manipulation, integrating multimodal information and Action Chunking with Transformers (ACT). Traditional automation methods for oocyte transfer rely heavily on visual perception, often requiring human supervision due to biological variability and environmental disturbances. Haptic-ACT enhances ACT by incorporating haptic feedback, enabling real-time grasp failure detection and adaptive correction. Additionally, we introduce a 3D-printed TPU soft gripper to facilitate delicate manipulations. Experimental results demonstrate that Haptic-ACT improves the task success rate, robustness, and adaptability compared to conventional ACT, particularly in dynamic environments. These findings highlight the potential of multimodal learning in robotics for biomedical automation.
comment: Accepted at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2025) Project website https://upedrou.github.io/haptic-act_IROS2025
☆ Low-Cost Infrastructure-Free 3D Relative Localization with Sub-Meter Accuracy in Near Field
Relative localization in the near-field scenario is critically important for unmanned vehicle (UxV) applications. Although related works addressing 2D relative localization problem have been widely studied for unmanned ground vehicles (UGVs), the problem in 3D scenarios for unmanned aerial vehicles (UAVs) involves more uncertainties and remains to be investigated. Inspired by the phenomenon that animals can achieve swarm behaviors solely based on individual perception of relative information, this study proposes an infrastructure-free 3D relative localization framework that relies exclusively on onboard ultra-wideband (UWB) sensors. Leveraging 2D relative positioning research, we conducted feasibility analysis, system modeling, simulations, performance evaluation, and field tests using UWB sensors. The key contributions of this work include: derivation of the Cram\'er-Rao lower bound (CRLB) and geometric dilution of precision (GDOP) for near-field scenarios; development of two localization algorithms -- one based on Euclidean distance matrix (EDM) and another employing maximum likelihood estimation (MLE); comprehensive performance comparison and computational complexity analysis against state-of-the-art methods; simulation studies and field experiments; a novel sensor deployment strategy inspired by animal behavior, enabling single-sensor implementation within the proposed framework for UxV applications. The theoretical, simulation, and experimental results demonstrate strong generalizability to other 3D near-field localization tasks, with significant potential for a cost-effective cross-platform UxV collaborative system.
☆ Situated Haptic Interaction: Exploring the Role of Context in Affective Perception of Robotic Touch
Affective interaction is not merely about recognizing emotions; it is an embodied, situated process shaped by context and co-created through interaction. In affective computing, the role of haptic feedback within dynamic emotional exchanges remains underexplored. This study investigates how situational emotional cues influence the perception and interpretation of haptic signals given by a robot. In a controlled experiment, 32 participants watched video scenarios in which a robot experienced either positive actions (such as being kissed), negative actions (such as being slapped) or neutral actions. After each video, the robot conveyed its emotional response through haptic communication, delivered via a wearable vibration sleeve worn by the participant. Participants rated the robot's emotional state-its valence (positive or negative) and arousal (intensity)-based on the video, the haptic feedback, and the combination of the two. The study reveals a dynamic interplay between visual context and touch. Participants' interpretation of haptic feedback was strongly shaped by the emotional context of the video, with visual context often overriding the perceived valence of the haptic signal. Negative haptic cues amplified the perceived valence of the interaction, while positive cues softened it. Furthermore, haptics override the participants' perception of arousal of the video. Together, these results offer insights into how situated haptic feedback can enrich affective human-robot interaction, pointing toward more nuanced and embodied approaches to emotional communication with machines.
☆ CUPID: Curating Data your Robot Loves with Influence Functions
In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes - such as closed-loop task success or failure - remains a persistent challenge. We propose CUPID, a robot data curation method based on a novel influence function-theoretic formulation for imitation learning policies. Given a set of evaluation rollouts, CUPID estimates the influence of each training demonstration on the policy's expected return. This enables ranking and selection of demonstrations according to their impact on the policy's closed-loop performance. We use CUPID to curate data by 1) filtering out training demonstrations that harm policy performance and 2) subselecting newly collected trajectories that will most improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can yield state-of-the-art diffusion policies on the simulated RoboMimic benchmark, with similar gains observed in hardware. Furthermore, hardware experiments show that our method can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance the post-training of generalist robot policies. Additional materials are made available at: https://cupid-curation.github.io.
comment: Project page: https://cupid-curation.github.io. 28 pages, 15 figures
☆ Analysis and experiments of the dissipative Twistcar: direction reversal and asymptotic approximations
Underactuated wheeled vehicles are commonly studied as nonholonomic systems with periodic actuation. Twistcar is a classical example inspired by a riding toy, which has been analyzed using a planar model of a dynamical system with nonholonomic constraints. Most of the previous analyses did not account for energy dissipation due to friction. In this work, we study a theoretical two-link model of the Twistcar while incorporating dissipation due to rolling resistance. We obtain asymptotic expressions for the system's small-amplitude steady-state periodic dynamics, which reveals the possibility of reversing the direction of motion upon varying the geometric and mass properties of the vehicle. Next, we design and construct a robotic prototype of the Twistcar whose center-of-mass position can be shifted by adding and removing a massive block, enabling demonstration of the Twistcar's direction reversal phenomenon. We also conduct parameter fitting for the frictional resistance in order to improve agreement with experiments.
☆ Multimodal Anomaly Detection with a Mixture-of-Experts IROS 2025
With a growing number of robots being deployed across diverse applications, robust multimodal anomaly detection becomes increasingly important. In robotic manipulation, failures typically arise from (1) robot-driven anomalies due to an insufficient task model or hardware limitations, and (2) environment-driven anomalies caused by dynamic environmental changes or external interferences. Conventional anomaly detection methods focus either on the first by low-level statistical modeling of proprioceptive signals or the second by deep learning-based visual environment observation, each with different computational and training data requirements. To effectively capture anomalies from both sources, we propose a mixture-of-experts framework that integrates the complementary detection mechanisms with a visual-language model for environment monitoring and a Gaussian-mixture regression-based detector for tracking deviations in interaction forces and robot motions. We introduce a confidence-based fusion mechanism that dynamically selects the most reliable detector for each situation. We evaluate our approach on both household and industrial tasks using two robotic systems, demonstrating a 60% reduction in detection delay while improving frame-wise anomaly detection performance compared to individual detectors.
comment: 8 pages, 5 figures, 1 table, the paper has been accepted for publication in the Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
☆ Faster Motion Planning via Restarts
Randomized methods such as PRM and RRT are widely used in motion planning. However, in some cases, their running-time suffers from inherent instability, leading to ``catastrophic'' performance even for relatively simple instances. We apply stochastic restart techniques, some of them new, for speeding up Las Vegas algorithms, that provide dramatic speedups in practice (a factor of $3$ [or larger] in many cases). Our experiments demonstrate that the new algorithms have faster runtimes, shorter paths, and greater gains from multi-threading (when compared with straightforward parallel implementation). We prove the optimality of the new variants. Our implementation is open source, available on github, and is easy to deploy and use.
comment: arXiv admin note: text overlap with arXiv:2503.04633
♻ ☆ Agile, Autonomous Spacecraft Constellations with Disruption Tolerant Networking to Monitor Precipitation and Urban Floods
Fully re-orientable small spacecraft are now supported by commercial technologies, allowing them to point their instruments in any direction and capture images, with short notice. When combined with improved onboard processing, and implemented on a constellation of inter-communicable satellites, this intelligent agility can significantly increase responsiveness to transient or evolving phenomena. We demonstrate a ground-based and onboard algorithmic framework that combines orbital mechanics, attitude control, inter-satellite communication, intelligent prediction and planning to schedule the time-varying, re-orientation of agile, small satellites in a constellation. Planner intelligence is improved by updating the predictive value of future space-time observations based on shared observations of evolving episodic precipitation and urban flood forecasts. Reliable inter-satellite communication within a fast, dynamic constellation topology is modeled in the physical, access control and network layer. We apply the framework on a representative 24-satellite constellation observing 5 global regions. Results show appropriately low latency in information exchange (average within 1/3rd available time for implicit consensus), enabling the onboard scheduler to observe ~7% more flood magnitude than a ground-based implementation. Both onboard and offline versions performed ~98% better than constellations without agility.
♻ ☆ Learning to Insert for Constructive Neural Vehicle Routing Solver
Neural Combinatorial Optimisation (NCO) is a promising learning-based approach for solving Vehicle Routing Problems (VRPs) without extensive manual design. While existing constructive NCO methods typically follow an appending-based paradigm that sequentially adds unvisited nodes to partial solutions, this rigid approach often leads to suboptimal results. To overcome this limitation, we explore the idea of insertion-based paradigm and propose Learning to Construct with Insertion-based Paradigm (L2C-Insert), a novel learning-based method for constructive NCO. Unlike traditional approaches, L2C-Insert builds solutions by strategically inserting unvisited nodes at any valid position in the current partial solution, which can significantly enhance the flexibility and solution quality. The proposed framework introduces three key components: a novel model architecture for precise insertion position prediction, an efficient training scheme for model optimization, and an advanced inference technique that fully exploits the insertion paradigm's flexibility. Extensive experiments on both synthetic and real-world instances of the Travelling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) demonstrate that L2C-Insert consistently achieves superior performance across various problem sizes.
♻ ☆ Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights
Predicting human behavior in shared environments is crucial for safe and efficient human-robot interaction. Traditional data-driven methods to that end are pre-trained on domain-specific datasets, activity types, and prediction horizons. In contrast, the recent breakthroughs in Large Language Models (LLMs) promise open-ended cross-domain generalization to describe various human activities and make predictions in any context. In particular, Multimodal LLMs (MLLMs) are able to integrate information from various sources, achieving more contextual awareness and improved scene understanding. The difficulty in applying general-purpose MLLMs directly for prediction stems from their limited capacity for processing large input sequences, sensitivity to prompt design, and expensive fine-tuning. In this paper, we present a systematic analysis of applying pre-trained MLLMs for context-aware human behavior prediction. To this end, we introduce a modular multimodal human activity prediction framework that allows us to benchmark various MLLMs, input variations, In-Context Learning (ICL), and autoregressive techniques. Our evaluation indicates that the best-performing framework configuration is able to reach 92.8% semantic similarity and 66.1% exact label accuracy in predicting human behaviors in the target frame.
comment: Accepted at IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2025
♻ ☆ Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks IROS
Datasets for object detection often do not account for enough variety of glasses, due to their transparent and reflective properties. Specifically, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue to robotic applications that suffer from accumulating errors between detection, planning, and action execution. The paper introduces a novel method for the acquisition of real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real-world glass object dataset that was collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The data set consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach to the NICOL platform, on which it achieves a success rate of 81% in a human-robot bartending scenario.
comment: Submitted and Accepted for Presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
♻ ☆ Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable subgoals. In this work, we propose DEMO3, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
comment: Project page can be found at https://adrialopezescoriza.github.io/demo3/
♻ ☆ Diffusion-based learning of contact plans for agile locomotion
Legged robots have become capable of performing highly dynamic maneuvers in the past few years. However, agile locomotion in highly constrained environments such as stepping stones is still a challenge. In this paper, we propose a combination of model-based control, search, and learning to design efficient control policies for agile locomotion on stepping stones. In our framework, we use nonlinear model predictive control (NMPC) to generate whole-body motions for a given contact plan. To efficiently search for an optimal contact plan, we propose to use Monte Carlo tree search (MCTS). While the combination of MCTS and NMPC can quickly find a feasible plan for a given environment (a few seconds), it is not yet suitable to be used as a reactive policy. Hence, we generate a dataset for optimal goal-conditioned policy for a given scene and learn it through supervised learning. In particular, we leverage the power of diffusion models in handling multi-modality in the dataset. We test our proposed framework on a scenario where our quadruped robot Solo12 successfully jumps to different goals in a highly constrained environment.
♻ ☆ HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
A fundamental objective of manipulation policy design is to endow robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. However, these methods quantize actions into discrete bins, which disrupts the continuity required for precise control. In contrast, existing diffusion-based VLA methods incorporate an additional diffusion head to predict continuous actions solely conditioned on feature representations extracted by the VLM, without fully leveraging the VLM's pretrained reasoning capabilities through token-level generation. To address these limitations, we introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression within a single large language model. To mitigate interference between the two generation paradigms, we propose a collaborative training recipe that seamlessly incorporates diffusion denoising into the next-token prediction process. With this recipe, we find these two action prediction methods not only reinforce each other but also exhibit varying strength across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses both predictions, leading to more robust control. HybridVLA outperforms previous state-of-the-art VLA methods by 14\% and 19\% in mean success rate on simulation and real-world tasks, respectively, while demonstrating stable manipulation in unseen configurations.
♻ ☆ LoopSR: Looping Sim-and-Real for Lifelong Policy Adaptation of Legged Robots IROS 2025
Reinforcement Learning (RL) has shown its remarkable and generalizable capability in legged locomotion through sim-to-real transfer. However, while adaptive methods like domain randomization are expected to enhance policy robustness across diverse environments, they potentially compromise the policy's performance in any specific environment, leading to suboptimal real-world deployment due to the No Free Lunch theorem. To address this, we propose LoopSR, a lifelong policy adaptation framework that continuously refines RL policies in the post-deployment stage. LoopSR employs a transformer-based encoder to map real-world trajectories into a latent space and reconstruct a digital twin of the real world for further improvement. Autoencoder architecture and contrastive learning methods are adopted to enhance feature extraction of real-world dynamics. Simulation parameters for continual training are derived by combining predicted values from the decoder with retrieved parameters from a pre-collected simulation trajectory dataset. By leveraging simulated continual training, LoopSR achieves superior data efficiency compared with strong baselines, yielding eminent performance with limited data in both sim-to-sim and sim-to-real experiments.
comment: IROS 2025
♻ ☆ SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency
We propose a flexible Semi-Automatic Labeling Tool (SALT) for general LiDAR point clouds with cross-scene adaptability and 4D consistency. Unlike recent approaches that rely on camera distillation, SALT operates directly on raw LiDAR data, automatically generating pre-segmentation results. To achieve this, we propose a novel zero-shot learning paradigm, termed data alignment, which transforms LiDAR data into pseudo-images by aligning with the training distribution of vision foundation models. Additionally, we design a 4D-consistent prompting strategy and 4D non-maximum suppression module to enhance SAM2, ensuring high-quality, temporally consistent presegmentation. SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and achieves nearly 40-50% of human annotator performance on our newly collected low-resolution LiDAR data and on combined data from three LiDAR types, significantly boosting annotation efficiency. We anticipate that SALT's open-sourcing will catalyze substantial expansion of current LiDAR datasets and lay the groundwork for the future development of LiDAR foundation models. Code is available at https://github.com/Cavendish518/SALT.
♻ ☆ Accurate Simulation and Parameter Identification of Deformable Linear Objects using Discrete Elastic Rods in Generalized Coordinates
This paper presents a fast and accurate model of a deformable linear object (DLO) -- e.g., a rope, wire, or cable -- integrated into an established robot physics simulator, MuJoCo. Most accurate DLO models with low computational times exist in standalone numerical simulators, which are unable or require tedious work to handle external objects. Based on an existing state-of-the-art DLO model -- Discrete Elastic Rods (DER) -- our implementation provides an improvement in accuracy over MuJoCo's own native cable model. To minimize computational load, our model utilizes force-lever analysis to adapt the Cartesian stiffness forces of the DER into its generalized coordinates. As a key contribution, we introduce a novel parameter identification pipeline designed for both simplicity and accuracy, which we utilize to determine the bending and twisting stiffness of three distinct DLOs. We then evaluate the performance of each model by simulating the DLOs and comparing them to their real-world counterparts and against theoretically proven validation tests.
comment: 7 pages, 6 figures
♻ ☆ GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation
Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we propose a Vision-to-4D-to-Action (V-4D-A) framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing simultaneous modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF supports three key query types: reconstruction of the current scene, prediction of future frames, and estimation of initial action via robot motion. Furthermore, the high-quality current and future frames generated by GAF facilitate manipulation action refinement through a GAF-guided diffusion model. Extensive experiments demonstrate significant improvements, with GAF achieving +11.5385 dB PSNR and -0.5574 LPIPS improvements in reconstruction quality, while boosting the average success rate in robotic manipulation tasks by 10.33% over state-of-the-art methods. Project page: http://chaiying1.github.io/GAF.github.io/project_page/
comment: http://chaiying1.github.io/GAF.github.io/project_page/
♻ ☆ IDCAIS: Inter-Defender Collision-Aware Interception Strategy against Multiple Attackers
In the prior literature on multi-agent area defense games, the assignments of the defenders to the attackers are done based on a cost metric associated only with the interception of the attackers. In contrast to that, this paper presents an Inter-Defender Collision-Aware Interception Strategy (IDCAIS) for defenders to intercept attackers in order to defend a protected area, such that the defender-to-attacker assignment protocol not only takes into account an interception-related cost but also takes into account any possible future collisions among the defenders on their optimal interception trajectories. In particular, in this paper, the defenders are assigned to intercept attackers using a mixed-integer quadratic program (MIQP) that: 1) minimizes the sum of times taken by defenders to capture the attackers under time-optimal control, as well as 2) helps eliminate or delay possible future collisions among the defenders on the optimal trajectories. To prevent inevitable collisions on optimal trajectories or collisions arising due to time-sub-optimal behavior by the attackers, a minimally augmented control using exponential control barrier function (ECBF) is also provided. Simulations show the efficacy of the approach.
comment: 14 pages, 12 figures
♻ ☆ Experimental Setup and Software Pipeline to Evaluate Optimization based Autonomous Multi-Robot Search Algorithms
Signal source localization has been a problem of interest in the multi-robot systems domain given its applications in search & rescue and hazard localization in various industrial and outdoor settings. A variety of multi-robot search algorithms exist that usually formulate and solve the associated autonomous motion planning problem as a heuristic model-free or belief model-based optimization process. Most of these algorithms however remains tested only in simulation, thereby losing the opportunity to generate knowledge about how such algorithms would compare/contrast in a real physical setting in terms of search performance and real-time computing performance. To address this gap, this paper presents a new lab-scale physical setup and associated open-source software pipeline to evaluate and benchmark multi-robot search algorithms. The presented physical setup innovatively uses an acoustic source (that is safe and inexpensive) and small ground robots (e-pucks) operating in a standard motion-capture environment. This setup can be easily recreated and used by most robotics researchers. The acoustic source also presents interesting uncertainty in terms of its noise-to-signal ratio, which is useful to assess sim-to-real gaps. The overall software pipeline is designed to readily interface with any multi-robot search algorithm with minimal effort and is executable in parallel asynchronous form. This pipeline includes a framework for distributed implementation of multi-robot or swarm search algorithms, integrated with a ROS (Robotics Operating System)-based software stack for motion capture supported localization. The utility of this novel setup is demonstrated by using it to evaluate two state-of-the-art multi-robot search algorithms, based on swarm optimization and batch-Bayesian Optimization (called Bayes-Swarm), as well as a random walk baseline.
comment: IDETC 2025
♻ ☆ Stochastic Motion Planning as Gaussian Variational Inference: Theory and Algorithms
We present a novel formulation for motion planning under uncertainties based on variational inference where the optimal motion plan is modeled as a posterior distribution. We propose a Gaussian variational inference-based framework, termed Gaussian Variational Inference Motion Planning (GVI-MP), to approximate this posterior by a Gaussian distribution over the trajectories. We show that the GVI-MP framework is dual to a special class of stochastic control problems and brings robustness into the decision-making in motion planning. We develop two algorithms to numerically solve this variational inference and the equivalent control formulations for motion planning. The first algorithm uses a natural gradient paradigm to iteratively update a Gaussian proposal distribution on the sparse motion planning factor graph. We propose a second algorithm, the Proximal Covariance Steering Motion Planner (PCS-MP), to solve the same inference problem in its stochastic control form with an additional terminal constraint. We leverage a proximal gradient paradigm where, at each iteration, we quadratically approximate nonlinear state costs and solve a linear covariance steering problem in closed form. The efficacy of the proposed algorithms is demonstrated through extensive experiments on various robot models. An implementation is provided in https://github.com/hzyu17/VIMP.
comment: 20 pages
♻ ☆ Learning Realistic Joint Space Boundaries for Range of Motion Analysis of Healthy and Impaired Human Arms
A realistic human kinematic model that satisfies anatomical constraints is essential for human-robot interaction, biomechanics and robot-assisted rehabilitation. Modeling realistic joint constraints, however, is challenging as human arm motion is constrained by joint limits, inter- and intra-joint dependencies, self-collisions, individual capabilities and muscular or neurological constraints which are difficult to represent. Hence, physicians and researchers have relied on simple box-constraints, ignoring important anatomical factors. In this paper, we propose a data-driven method to learn realistic anatomically constrained upper-limb range of motion (RoM) boundaries from motion capture data. This is achieved by fitting a one-class support vector machine to a dataset of upper-limb joint space exploration motions with an efficient hyper-parameter tuning scheme. Our approach outperforms similar works focused on valid RoM learning. Further, we propose an impairment index (II) metric that offers a quantitative assessment of capability/impairment when comparing healthy and impaired arms. We validate the metric on healthy subjects physically constrained to emulate hemiplegia and different disability levels as stroke patients. [https://sites.google.com/seas.upenn.edu/learning-rom]
♻ ☆ cuVSLAM: CUDA accelerated visual odometry and mapping
Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with as few as one RGB camera to as many as 32 cameras, in arbitrary geometric configurations, thus supporting a wide range of robotic setups. cuVSLAM is specifically optimized using CUDA to deploy in real-time applications with minimal computational overhead on edge-computing devices such as the NVIDIA Jetson. We present the design and implementation of cuVSLAM, example use cases, and empirical results on several state-of-the-art benchmarks demonstrating the best-in-class performance of cuVSLAM.
♻ ☆ Terrain-aware Low Altitude Path Planning
In this paper, we study the problem of generating low-altitude path plans for nap-of-the-earth (NOE) flight in real time with only RGB images from onboard cameras and the vehicle pose. We propose a novel training method that combines behavior cloning and self-supervised learning, where the self-supervision component allows the learned policy to refine the paths generated by the expert planner. Simulation studies show 24.7% reduction in average path elevation compared to the standard behavior cloning approach.
♻ ☆ Why Sample Space Matters: Keyframe Sampling Optimization for LiDAR-based Place Recognition
Recent advances in robotics are driving real-world autonomy for long-term and large-scale missions, where loop closures via place recognition are vital for mitigating pose estimation drift. However, achieving real-time performance remains challenging for resource-constrained mobile robots and multi-robot systems due to the computational burden of high-density sampling, which increases the complexity of comparing and verifying query samples against a growing map database. Conventional methods often retain redundant information or miss critical data by relying on fixed sampling intervals or operating in 3-D space instead of the descriptor feature space. To address these challenges, we introduce the concept of sample space and propose a novel keyframe sampling approach for LiDAR-based place recognition. Our method minimizes redundancy while preserving essential information in the hyper-dimensional descriptor space, supporting both learning-based and handcrafted descriptors. The proposed approach incorporates a sliding window optimization strategy to ensure efficient keyframe selection and real-time performance, enabling seamless integration into robotic pipelines. In sum, our approach demonstrates robust performance across diverse datasets, with the ability to adapt seamlessly from indoor to outdoor scenarios without parameter tuning, reducing loop closure detection times and memory requirements.
comment: The work is no longer intended for consideration in its current form. Readers are instead encouraged to refer to our related and more complete study, arXiv:2501.01791, which should be considered as a stand-alone contribution
♻ ☆ Employing Laban Shape for Generating Emotionally and Functionally Expressive Trajectories in Robotic Manipulators
Successful human-robot collaboration depends on cohesive communication and a precise understanding of the robot's abilities, goals, and constraints. While robotic manipulators offer high precision, versatility, and productivity, they exhibit expressionless and monotonous motions that conceal the robot's intention, resulting in a lack of efficiency and transparency with humans. In this work, we use Laban notation, a dance annotation language, to enable robotic manipulators to generate trajectories with functional expressivity, where the robot uses nonverbal cues to communicate its abilities and the likelihood of succeeding at its task. We achieve this by introducing two novel variants of Hesitant expressive motion (Spoke-Like and Arc-Like). We also enhance the emotional expressivity of four existing emotive trajectories (Happy, Sad, Shy, and Angry) by augmenting Laban Effort usage with Laban Shape. The functionally expressive motions are validated via a human-subjects study, where participants equate both variants of Hesitant motion with reduced robot competency. The enhanced emotive trajectories are shown to be viewed as distinct emotions using the Valence-Arousal-Dominance (VAD) spectrum, corroborating the usage of Laban Shape.
comment: Accepted for presentation at the 2025 IEEE RO-MAN Conference
Robotics 25
☆ Integrating LLMs and Digital Twins for Adaptive Multi-Robot Task Allocation in Construction
Multi-robot systems are emerging as a promising solution to the growing demand for productivity, safety, and adaptability across industrial sectors. However, effectively coordinating multiple robots in dynamic and uncertain environments, such as construction sites, remains a challenge, particularly due to unpredictable factors like material delays, unexpected site conditions, and weather-induced disruptions. To address these challenges, this study proposes an adaptive task allocation framework that strategically leverages the synergistic potential of Digital Twins, Integer Programming (IP), and Large Language Models (LLMs). The multi-robot task allocation problem is formally defined and solved using an IP model that accounts for task dependencies, robot heterogeneity, scheduling constraints, and re-planning requirements. A mechanism for narrative-driven schedule adaptation is introduced, in which unstructured natural language inputs are interpreted by an LLM, and optimization constraints are autonomously updated, enabling human-in-the-loop flexibility without manual coding. A digital twin-based system has been developed to enable real-time synchronization between physical operations and their digital representations. This closed-loop feedback framework ensures that the system remains dynamic and responsive to ongoing changes on site. A case study demonstrates both the computational efficiency of the optimization algorithm and the reasoning performance of several LLMs, with top-performing models achieving over 97% accuracy in constraint and parameter extraction. The results confirm the practicality, adaptability, and cross-domain applicability of the proposed methods.
☆ Automated Plan Refinement for Improving Efficiency of Robotic Layup of Composite Sheets
The automation of composite sheet layup is essential to meet the increasing demand for composite materials in various industries. However, draping plans for the robotic layup of composite sheets are not robust. A plan that works well under a certain condition does not work well in a different condition. Changes in operating conditions due to either changes in material properties or working environment may lead a draping plan to exhibit suboptimal performance. In this paper, we present a comprehensive framework aimed at refining plans based on the observed execution performance. Our framework prioritizes the minimization of uncompacted regions while simultaneously improving time efficiency. To achieve this, we integrate human expertise with data-driven decision-making to refine expert-crafted plans for diverse production environments. We conduct experiments to validate the effectiveness of our approach, revealing significant reductions in the number of corrective paths required compared to initial expert-crafted plans. Through a combination of empirical data analysis, action-effectiveness modeling, and search-based refinement, our system achieves superior time efficiency in robotic layup. Experimental results demonstrate the efficacy of our approach in optimizing the layup process, thereby advancing the state-of-the-art in composite manufacturing automation.
☆ RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies
Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized ''robot challenges'', and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.
comment: Website: https://robo-arena.github.io/
☆ RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two challenges: (1) the lack of an efficient, scalable data generation method for novel tasks, and (2) oversimplified simulation environments that fail to capture real-world complexity. We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. We first construct RoboTwin-OD, a large-scale object library comprising 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels. Building on this foundation, we develop an expert data synthesis pipeline that combines multimodal large language models (MLLMs) with simulation-in-the-loop refinement to generate task-level execution code automatically. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height and language instructions, thereby enhancing data diversity and policy robustness. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories. Empirical results show a 10.9% gain in code generation success and improved generalization to novel real-world scenarios. A VLA model fine-tuned on our dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on unseen scene real-world tasks, while zero-shot models trained solely on our synthetic data achieve a 228% relative gain, highlighting strong generalization without real-world supervision. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation.
comment: Project Page: https://robotwin-platform.github.io/
☆ StereoTacTip: Vision-based Tactile Sensing with Biomimetic Skin-Marker Arrangements
Vision-Based Tactile Sensors (VBTSs) stand out for their superior performance due to their high-information content output. Recently, marker-based VBTSs have been shown to give accurate geometry reconstruction when using stereo cameras. \uhl{However, many marker-based VBTSs use complex biomimetic skin-marker arrangements, which presents issues for the geometric reconstruction of the skin surface from the markers}. Here we investigate how the marker-based skin morphology affects stereo vision-based tactile sensing, using a novel VBTS called the StereoTacTip. To achieve accurate geometry reconstruction, we introduce: (i) stereo marker matching and tracking using a novel Delaunay-Triangulation-Ring-Coding algorithm; (ii) a refractive depth correction model that corrects the depth distortion caused by refraction in the internal media; (iii) a skin surface correction model from the marker positions, relying on an inverse calculation of normals to the skin surface; and (iv)~methods for geometry reconstruction over multiple contacts. To demonstrate these findings, we reconstruct topographic terrains on a large 3D map. Even though contributions (i) and (ii) were developed for biomimetic markers, they should improve the performance of all marker-based VBTSs. Overall, this work illustrates that a thorough understanding and evaluation of the morphologically-complex skin and marker-based tactile sensor principles are crucial for obtaining accurate geometric information.
comment: 11 pages, 13 figures
☆ Leveraging Cloud-Fog Automation for Autonomous Collision Detection and Classification in Intelligent Unmanned Surface Vehicles
Industrial Cyber-Physical Systems (ICPS) technologies are foundational in driving maritime autonomy, particularly for Unmanned Surface Vehicles (USVs). However, onboard computational constraints and communication latency significantly restrict real-time data processing, analysis, and predictive modeling, hence limiting the scalability and responsiveness of maritime ICPS. To overcome these challenges, we propose a distributed Cloud-Edge-IoT architecture tailored for maritime ICPS by leveraging design principles from the recently proposed Cloud-Fog Automation paradigm. Our proposed architecture comprises three hierarchical layers: a Cloud Layer for centralized and decentralized data aggregation, advanced analytics, and future model refinement; an Edge Layer that executes localized AI-driven processing and decision-making; and an IoT Layer responsible for low-latency sensor data acquisition. Our experimental results demonstrated improvements in computational efficiency, responsiveness, and scalability. When compared with our conventional approaches, we achieved a classification accuracy of 86\%, with an improved latency performance. By adopting Cloud-Fog Automation, we address the low-latency processing constraints and scalability challenges in maritime ICPS applications. Our work offers a practical, modular, and scalable framework to advance robust autonomy and AI-driven decision-making and autonomy for intelligent USVs in future maritime ICPS.
comment: 6 pages, 5 figures, accepted paper on the 23rd IEEE International Conference on Industrial Informatics (INDIN), July 12-15, 2025, Kunming, China
☆ ADA-DPM: A Neural Descriptors-based Adaptive Noise Point Filtering Strategy for SLAM
LiDAR SLAM has demonstrated significant application value in various fields, including mobile robot navigation and high-precision map construction. However, existing methods often need to make a trade-off between positioning accuracy and system robustness when faced with dynamic object interference, point cloud noise, and unstructured environments. To address this challenge, we propose an adaptive noise filtering SLAM strategy-ADA-DPM, achieving excellent preference in both aspects. We design the Dynamic Segmentation Head to predict the category of feature points belonging to dynamic points, to eliminate dynamic feature points; design the Global Importance Scoring Head to adaptively select feature points with higher contribution and features while suppressing noise interference; and construct the Cross Layer Intra-Graph Convolution Module (GLI-GCN) to fuse multi-scale neighborhood structures, thereby enhancing the discriminative ability of overlapping features. Finally, to further validate the effectiveness of our method, we tested it on several publicly available datasets and achieved outstanding results.
☆ Newtonian and Lagrangian Neural Networks: A Comparison Towards Efficient Inverse Dynamics Identification
Accurate inverse dynamics models are essential tools for controlling industrial robots. Recent research combines neural network regression with inverse dynamics formulations of the Newton-Euler and the Euler-Lagrange equations of motion, resulting in so-called Newtonian neural networks and Lagrangian neural networks, respectively. These physics-informed models seek to identify unknowns in the analytical equations from data. Despite their potential, current literature lacks guidance on choosing between Lagrangian and Newtonian networks. In this study, we show that when motor torques are estimated instead of directly measuring joint torques, Lagrangian networks prove less effective compared to Newtonian networks as they do not explicitly model dissipative torques. The performance of these models is compared to neural network regression on data of a MABI MAX 100 industrial robot.
comment: Paper accepted for publication in 14th IFAC Symposium on Robotics
☆ CFTel: A Practical Architecture for Robust and Scalable Telerobotics with Cloud-Fog Automation
Telerobotics is a key foundation in autonomous Industrial Cyber-Physical Systems (ICPS), enabling remote operations across various domains. However, conventional cloud-based telerobotics suffers from latency, reliability, scalability, and resilience issues, hindering real-time performance in critical applications. Cloud-Fog Telerobotics (CFTel) builds on the Cloud-Fog Automation (CFA) paradigm to address these limitations by leveraging a distributed Cloud-Edge-Robotics computing architecture, enabling deterministic connectivity, deterministic connected intelligence, and deterministic networked computing. This paper synthesizes recent advancements in CFTel, aiming to highlight its role in facilitating scalable, low-latency, autonomous, and AI-driven telerobotics. We analyze architectural frameworks and technologies that enable them, including 5G Ultra-Reliable Low-Latency Communication, Edge Intelligence, Embodied AI, and Digital Twins. The study demonstrates that CFTel has the potential to enhance real-time control, scalability, and autonomy while supporting service-oriented solutions. We also discuss practical challenges, including latency constraints, cybersecurity risks, interoperability issues, and standardization efforts. This work serves as a foundational reference for researchers, stakeholders, and industry practitioners in future telerobotics research.
comment: 6 pages, 1 figure, accepted paper on the 23rd IEEE International Conference on Industrial Informatics (INDIN), July 12-15, 2025, Kunming, China
☆ GeNIE: A Generalizable Navigation System for In-the-Wild Environments
Reliable navigation in unstructured, real-world environments remains a significant challenge for embodied agents, especially when operating across diverse terrains, weather conditions, and sensor configurations. In this paper, we introduce GeNIE (Generalizable Navigation System for In-the-Wild Environments), a robust navigation framework designed for global deployment. GeNIE integrates a generalizable traversability prediction model built on SAM2 with a novel path fusion strategy that enhances planning stability in noisy and ambiguous settings. We deployed GeNIE in the Earth Rover Challenge (ERC) at ICRA 2025, where it was evaluated across six countries spanning three continents. GeNIE took first place and achieved 79% of the maximum possible score, outperforming the second-best team by 17%, and completed the entire competition without a single human intervention. These results set a new benchmark for robust, generalizable outdoor robot navigation. We will release the codebase, pretrained model weights, and newly curated datasets to support future research in real-world navigation.
comment: 8 pages, 5 figures. Jiaming Wang, Diwen Liu, and Jizhuo Chen contributed equally
☆ Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective ICML 2025
We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent "gibberish" can remarkably improve performance across diverse tasks. Notably, the "gibberish" always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial, as existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. In terms of this, we propose a self-discover prompt optimization framework, PromptQuine, an evolutionary search framework that automatically searches for the pruning strategy by itself using only low-data regimes. Much like the emergent complexity in nature--such as symbiosis and self-organization--arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context. We demonstrate its effectiveness across classification, multi-choice question answering, generation and math reasoning tasks across LLMs, while achieving decent runtime efficiency. We hope our findings can guide mechanistic studies on in-context learning, and provide a call to action, to pave the way for more open-ended search algorithms for more effective LLM prompting.
comment: ICML 2025, and Code will be released at: https://github.com/jianyu-cs/PromptQuine/
☆ Embedded Flexible Circumferential Sensing for Real-Time Intraoperative Environmental Perception in Continuum Robots
Continuum robots have been widely adopted in robot-assisted minimally invasive surgery (RMIS) because of their compact size and high flexibility. However, their proprioceptive capabilities remain limited, particularly in narrow lumens, where lack of environmental awareness can lead to unintended tissue contact and surgical risks. To address this challenge, this work proposes a flexible annular sensor structure integrated around the vertebral disks of continuum robots. The proposed design enables real-time environmental mapping by estimating the distance between the robotic disks and the surrounding tissue, thereby facilitating safer operation through advanced control strategies. The experiment has proven that its accuracy in obstacle detection can reach 0.19 mm. Fabricated using flexible printed circuit (FPC) technology, the sensor demonstrates a modular and cost-effective design with compact dimensions and low noise interference. Its adaptable parameters allow compatibility with various continuum robot architectures, offering a promising solution for enhancing intraoperative perception and control in surgical robotics.
☆ Cross-modal State Space Modeling for Real-time RGB-thermal Wild Scene Semantic Segmentation
The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g. Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduced CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduced a cross-modal 2D-selective-scan (CM-SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross-modal visual sequences and derives hidden state representations of one modality from the other. Second, we developed a cross-modal state space association (CM-SSA) module that effectively integrates global associations from CM-SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer-based approaches, CM-SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Codes are available at https://github.com/xiaodonguo/CMSSM.
☆ Geometric Contact Flows: Contactomorphisms for Dynamics and Control ICML 2025
Accurately modeling and predicting complex dynamical systems, particularly those involving force exchange and dissipation, is crucial for applications ranging from fluid dynamics to robotics, but presents significant challenges due to the intricate interplay of geometric constraints and energy transfer. This paper introduces Geometric Contact Flows (GFC), a novel framework leveraging Riemannian and Contact geometry as inductive biases to learn such systems. GCF constructs a latent contact Hamiltonian model encoding desirable properties like stability or energy conservation. An ensemble of contactomorphisms then adapts this model to the target dynamics while preserving these properties. This ensemble allows for uncertainty-aware geodesics that attract the system's behavior toward the data support, enabling robust generalization and adaptation to unseen scenarios. Experiments on learning dynamics for physical systems and for controlling robots on interaction tasks demonstrate the effectiveness of our approach.
comment: Accepted at ICML 2025
♻ ☆ Structured Pneumatic Fingerpads for Actively Tunable Grip Friction
Grip surfaces with tunable friction can actively modify contact conditions, enabling transitions between higher- and lower-friction states for grasp adjustment. Friction can be increased to grip securely and then decreased to gently release (e.g., for handovers) or manipulate in-hand. Recent friction-tuning surface designs using soft pneumatic chambers show good control over grip friction; however, most require complex fabrication processes and/or custom gripper hardware. We present a practical structured fingerpad design for friction tuning that uses less than $1 USD of materials, takes only seconds to repair, and is easily adapted to existing grippers. Our design uses surface morphology changes to tune friction. The fingerpad is actuated by pressurizing its internal chambers, thereby deflecting its flexible grip surface out from or into these chambers. We characterize the friction-tuning capabilities of our design by measuring the shear force required to pull an object from a gripper equipped with two independently actuated fingerpads. Our results show that varying actuation pressure and timing changes the magnitude of friction forces on a gripped object by up to a factor of 2.8. We demonstrate additional features including macro-scale interlocking behaviour and pressure-based object detection.
comment: In Proceedings of the IEEE/RAS International Conference on Soft Robotics (RoboSoft'25), Lausanne, Switzerland, Apr. 22-26, 2025
♻ ☆ Learning to Adapt through Bio-Inspired Gait Strategies for Versatile Quadruped Locomotion
Legged robots must adapt their gait to navigate unpredictable environments, a challenge that animals master with ease. However, most deep reinforcement learning (DRL) approaches to quadruped locomotion rely on a fixed gait, limiting adaptability to changes in terrain and dynamic state. Here we show that integrating three core principles of animal locomotion-gait transition strategies, gait memory and real-time motion adjustments enables a DRL control framework to fluidly switch among multiple gaits and recover from instability, all without external sensing. Our framework is guided by biomechanics-inspired metrics that capture efficiency, stability and system limits, which are unified to inform optimal gait selection. The resulting framework achieves blind zero-shot deployment across diverse, real-world terrains and substantially significantly outperforms baseline controllers. By embedding biological principles into data-driven control, this work marks a step towards robust, efficient and versatile robotic locomotion, highlighting how animal motor intelligence can shape the next generation of adaptive machines.
comment: 19 pages, 8 figures, journal paper
♻ ☆ Physics-Informed Multi-Agent Reinforcement Learning for Distributed Multi-Robot Problems
The networked nature of multi-robot systems presents challenges in the context of multi-agent reinforcement learning. Centralized control policies do not scale with increasing numbers of robots, whereas independent control policies do not exploit the information provided by other robots, exhibiting poor performance in cooperative-competitive tasks. In this work we propose a physics-informed reinforcement learning approach able to learn distributed multi-robot control policies that are both scalable and make use of all the available information to each robot. Our approach has three key characteristics. First, it imposes a port-Hamiltonian structure on the policy representation, respecting energy conservation properties of physical robot systems and the networked nature of robot team interactions. Second, it uses self-attention to ensure a sparse policy representation able to handle time-varying information at each robot from the interaction graph. Third, we present a soft actor-critic reinforcement learning algorithm parameterized by our self-attention port-Hamiltonian control policy, which accounts for the correlation among robots during training while overcoming the need of value function factorization. Extensive simulations in different multi-robot scenarios demonstrate the success of the proposed approach, surpassing previous multi-robot reinforcement learning solutions in scalability, while achieving similar or superior performance (with averaged cumulative reward up to x2 greater than the state-of-the-art with robot teams x6 larger than the number of robots at training time). We also validate our approach on multiple real robots in the Georgia Tech Robotarium under imperfect communication, demonstrating zero-shot sim-to-real transfer and scalability across number of robots.
comment: Paper accepted and published at IEEE T-RO
♻ ☆ Active Fine-Tuning of Multi-Task Policies
Pre-trained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks. This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning. However, as soon as several tasks need to be learned, we must decide which tasks should be demonstrated and how often? We study this multi-task problem and explore an interactive framework in which the agent adaptively selects the tasks to be demonstrated. We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy. We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness to efficiently fine-tune neural policies in complex and high-dimensional environments.
♻ ☆ SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation MICCAI 2025
Surgical video generation can enhance medical education and research, but existing methods lack fine-grained motion control and realism. We introduce SurgSora, a framework that generates high-fidelity, motion-controllable surgical videos from a single input frame and user-specified motion cues. Unlike prior approaches that treat objects indiscriminately or rely on ground-truth segmentation masks, SurgSora leverages self-predicted object features and depth information to refine RGB appearance and optical flow for precise video synthesis. It consists of three key modules: (1) the Dual Semantic Injector, which extracts object-specific RGB-D features and segmentation cues to enhance spatial representations; (2) the Decoupled Flow Mapper, which fuses multi-scale optical flow with semantic features for realistic motion dynamics; and (3) the Trajectory Controller, which estimates sparse optical flow and enables user-guided object movement. By conditioning these enriched features within the Stable Video Diffusion, SurgSora achieves state-of-the-art visual authenticity and controllability in advancing surgical video synthesis, as demonstrated by extensive quantitative and qualitative comparisons. Our human evaluation in collaboration with expert surgeons further demonstrates the high realism of SurgSora-generated videos, highlighting the potential of our method for surgical training and education. Our project is available at https://surgsora.github.io/surgsora.github.io.
comment: MICCAI 2025
♻ ☆ Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via \textit{language} form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is \href{https://github.com/zhangpingrui/Adaptive-Text-Dreamer}{here}.
♻ ☆ POPGym Arcade: Parallel Pixelated POMDPs
We present the POPGym Arcade, a collection of hardware-accelerated, pixel-based environments with shared observation and action spaces. Each environment includes fully and partially observable variants, enabling counterfactual studies on partial observability. We also introduce mathematical tools for analyzing policies under partial observability, which reveal how agents recall past information to make decisions. Our analysis shows (1) that controlling for partial observability is critical and (2) that agents with long-term memory learn brittle policies that struggle to generalize. Finally, we demonstrate that recurrent policies can be "poisoned" by old, out-of-distribution observations, with implications for sim-to-real transfer, imitation learning, and offline reinforcement learning.
♻ ☆ An Efficient Method for Extracting the Shortest Path from the Dubins Set for Short Distances Between Initial and Final Positions
Path planning is crucial for the efficient operation of Autonomous Mobile Robots (AMRs) in factory environments. Many existing algorithms rely on Dubins paths, which have been adapted for various applications. However, an efficient method for directly determining the shortest Dubins path remains underdeveloped. This paper presents a comprehensive approach to efficiently identify the shortest path within the Dubins set. We classify the initial and final configurations into six equivalency groups based on the quadrants formed by their orientation angle pairs. Paths within each group exhibit shared topological properties, enabling a reduction in the number of candidate cases to analyze. This pre-classification step simplifies the problem and eliminates the need to explicitly compute and compare the lengths of all possible paths. As a result, the proposed method significantly lowers computational complexity. Extensive experiments confirm that our approach consistently outperforms existing methods in terms of computational efficiency.
comment: 9 pages, 10 figures
♻ ☆ A real-time anomaly detection method for robots based on a flexible and sparse latent space
The growing demand for robots to operate effectively in diverse environments necessitates the need for robust real-time anomaly detection techniques during robotic operations. However, deep learning-based models in robotics face significant challenges due to limited training data and highly noisy signal features. In this paper, we present Sparse Masked Autoregressive Flow-based Adversarial AutoEncoder model to address these problems. This approach integrates Masked Autoregressive Flow model into Adversarial AutoEncoders to construct a flexible latent space and utilize Sparse autoencoder to efficiently focus on important features, even in scenarios with limited feature space. Our experiments demonstrate that the proposed model achieves a 4.96% to 9.75% higher area under the receiver operating characteristic curve for pick-and-place robotic operations with randomly placed cans, compared to existing state-of-the-art methods. Notably, it showed up to 19.67% better performance in scenarios involving collisions with lightweight objects. Additionally, unlike the existing state-of-the-art model, our model performs inferences within 1 millisecond, ensuring real-time anomaly detection. These capabilities make our model highly applicable to machine learning-based robotic safety systems in dynamic environments. The code is available at https://github.com/twkang43/sparse-maf-aae.
comment: 20 pages, 11 figures
♻ ☆ DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning
In complex driving environments, autonomous vehicles must navigate safely. Relying on a single predicted path, as in regression-based approaches, usually does not explicitly assess the safety of the predicted trajectory. Selection-based methods address this by generating and scoring multiple trajectory candidates and predicting the safety score for each, but face optimization challenges in precisely selecting the best option from thousands of possibilities and distinguishing subtle but safety-critical differences, especially in rare or underrepresented scenarios. We propose DriveSuprim to overcome these challenges and advance the selection-based paradigm through a coarse-to-fine paradigm for progressive candidate filtering, a rotation-based augmentation method to improve robustness in out-of-distribution scenarios, and a self-distillation framework to stabilize training. DriveSuprim achieves state-of-the-art performance, reaching 93.5% PDMS in NAVSIM v1 and 87.1% EPDMS in NAVSIM v2 without extra data, demonstrating superior safetycritical capabilities, including collision avoidance and compliance with rules, while maintaining high trajectory quality in various driving scenarios.
comment: 15 pages, 6 figures
♻ ☆ G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation CVPR 2025
Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.
comment: Webpage: https://tianxingchen.github.io/G3Flow/, accepted to CVPR 2025
Robotics 18
☆ Generative Grasp Detection and Estimation with Concept Learning-based Safety Criteria
Neural networks are often regarded as universal equations that can estimate any function. This flexibility, however, comes with the drawback of high complexity, rendering these networks into black box models, which is especially relevant in safety-centric applications. To that end, we propose a pipeline for a collaborative robot (Cobot) grasping algorithm that detects relevant tools and generates the optimal grasp. To increase the transparency and reliability of this approach, we integrate an explainable AI method that provides an explanation for the underlying prediction of a model by extracting the learned features and correlating them to corresponding classes from the input. These concepts are then used as additional criteria to ensure the safe handling of work tools. In this paper, we show the consistency of this approach and the criterion for improving the handover position. This approach was tested in an industrial environment, where a camera system was set up to enable a robot to pick up certain tools and objects.
comment: RAAD 2025: 34th International Conference on Robotics in Alpe-Adria-Danube Region
☆ Leveling the Playing Field: Carefully Comparing Classical and Learned Controllers for Quadrotor Trajectory Tracking
Learning-based control approaches like reinforcement learning (RL) have recently produced a slew of impressive results for tasks like quadrotor trajectory tracking and drone racing. Naturally, it is common to demonstrate the advantages of these new controllers against established methods like analytical controllers. We observe, however, that reliably comparing the performance of such very different classes of controllers is more complicated than might appear at first sight. As a case study, we take up the problem of agile tracking of an end-effector for a quadrotor with a fixed arm. We develop a set of best practices for synthesizing the best-in-class RL and geometric controllers (GC) for benchmarking. In the process, we resolve widespread RL-favoring biases in prior studies that provide asymmetric access to: (1) the task definition, in the form of an objective function, (2) representative datasets, for parameter optimization, and (3) feedforward information, describing the desired future trajectory. The resulting findings are the following: our improvements to the experimental protocol for comparing learned and classical controllers are critical, and each of the above asymmetries can yield misleading conclusions. Prior works have claimed that RL outperforms GC, but we find the gaps between the two controller classes are much smaller than previously published when accounting for symmetric comparisons. Geometric control achieves lower steady-state error than RL, while RL has better transient performance, resulting in GC performing better in relatively slow or less agile tasks, but RL performing better when greater agility is required. Finally, we open-source implementations of geometric and RL controllers for these aerial vehicles, implementing best practices for future development. Website and code is available at https://pratikkunapuli.github.io/rl-vs-gc/
comment: Accepted for publication to RSS 2025. 10 pages, 5 figures. Project website: https://pratikkunapuli.github.io/rl-vs-gc/
☆ Engagement and Disclosures in LLM-Powered Cognitive Behavioral Therapy Exercises: A Factorial Design Comparing the Influence of a Robot vs. Chatbot Over Time
Many researchers are working to address the worldwide mental health crisis by developing therapeutic technologies that increase the accessibility of care, including leveraging large language model (LLM) capabilities in chatbots and socially assistive robots (SARs) used for therapeutic applications. Yet, the effects of these technologies over time remain unexplored. In this study, we use a factorial design to assess the impact of embodiment and time spent engaging in therapeutic exercises on participant disclosures. We assessed transcripts gathered from a two-week study in which 26 university student participants completed daily interactive Cognitive Behavioral Therapy (CBT) exercises in their residences using either an LLM-powered SAR or a disembodied chatbot. We evaluated the levels of active engagement and high intimacy of their disclosures (opinions, judgments, and emotions) during each session and over time. Our findings show significant interactions between time and embodiment for both outcome measures: participant engagement and intimacy increased over time in the physical robot condition, while both measures decreased in the chatbot condition.
☆ Learning to Dock: A Simulation-based Study on Closing the Sim2Real Gap in Autonomous Underwater Docking
Autonomous Underwater Vehicle (AUV) docking in dynamic and uncertain environments is a critical challenge for underwater robotics. Reinforcement learning is a promising method for developing robust controllers, but the disparity between training simulations and the real world, or the sim2real gap, often leads to a significant deterioration in performance. In this work, we perform a simulation study on reducing the sim2real gap in autonomous docking through training various controllers and then evaluating them under realistic disturbances. In particular, we focus on the real-world challenge of docking under different payloads that are potentially outside the original training distribution. We explore existing methods for improving robustness including randomization techniques and history-conditioned controllers. Our findings provide insights into mitigating the sim2real gap when training docking controllers. Furthermore, our work indicates areas of future research that may be beneficial to the marine robotics community.
comment: Advancing Quantitative and Qualitative Simulators for Marine Applications Workshop Paper at International Conference on Robotics and Automation 2025
☆ RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based action verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 8% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
☆ Optimizing Exploration with a New Uncertainty Framework for Active SLAM Systems
Accurate reconstruction of the environment is a central goal of Simultaneous Localization and Mapping (SLAM) systems. However, the agent's trajectory can significantly affect estimation accuracy. This paper presents a new method to model map uncertainty in Active SLAM systems using an Uncertainty Map (UM). The UM uses probability distributions to capture where the map is uncertain, allowing Uncertainty Frontiers (UF) to be defined as key exploration-exploitation objectives and potential stopping criteria. In addition, the method introduces the Signed Relative Entropy (SiREn), based on the Kullback-Leibler divergence, to measure both coverage and uncertainty together. This helps balance exploration and exploitation through an easy-to-understand parameter. Unlike methods that depend on particular SLAM setups, the proposed approach is compatible with different types of sensors, such as cameras, LiDARs, and multi-sensor fusion. It also addresses common problems in exploration planning and stopping conditions. Furthermore, integrating this map modeling approach with a UF-based planning system enables the agent to autonomously explore open spaces, a behavior not previously observed in the Active SLAM literature. Code and implementation details are available as a ROS node, and all generated data are openly available for public use, facilitating broader adoption and validation of the proposed approach.
☆ Quantification of Sim2Real Gap via Neural Simulation Gap Function
In this paper, we introduce the notion of neural simulation gap functions, which formally quantifies the gap between the mathematical model and the model in the high-fidelity simulator, which closely resembles reality. Many times, a controller designed for a mathematical model does not work in reality because of the unmodelled gap between the two systems. With the help of this simulation gap function, one can use existing model-based tools to design controllers for the mathematical system and formally guarantee a decent transition from the simulation to the real world. Although in this work, we have quantified this gap using a neural network, which is trained using a finite number of data points, we give formal guarantees on the simulation gap function for the entire state space including the unseen data points. We collect data from high-fidelity simulators leveraging recent advancements in Real-to-Sim transfer to ensure close alignment with reality. We demonstrate our results through two case studies - a Mecanum bot and a Pendulum.
☆ RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models
Vision-Language-Action models (VLA) have demonstrated remarkable capabilities and promising potential in solving complex robotic manipulation tasks. However, their substantial parameter sizes and high inference latency pose significant challenges for real-world deployment, particularly on resource-constrained robotic platforms. To address this issue, we begin by conducting an extensive empirical study to explore the effectiveness of model compression techniques when applied to VLAs. Building on the insights gained from these preliminary experiments, we propose RLRC, a three-stage recovery method for compressed VLAs, including structured pruning, performance recovery based on SFT and RL, and further quantization. RLRC achieves up to an 8x reduction in memory usage and a 2.3x improvement in inference throughput, while maintaining or even surpassing the original VLA's task success rate. Extensive experiments show that RLRC consistently outperforms existing compression baselines, demonstrating strong potential for on-device deployment of VLAs. Project website: https://rlrc-vla.github.io
☆ Imitation Learning for Active Neck Motion Enabling Robot Manipulation beyond the Field of View
Most prior research in deep imitation learning has predominantly utilized fixed cameras for image input, which constrains task performance to the predefined field of view. However, enabling a robot to actively maneuver its neck can significantly expand the scope of imitation learning to encompass a wider variety of tasks and expressive actions such as neck gestures. To facilitate imitation learning in robots capable of neck movement while simultaneously performing object manipulation, we propose a teaching system that systematically collects datasets incorporating neck movements while minimizing discomfort caused by dynamic viewpoints during teleoperation. In addition, we present a novel network model for learning manipulation tasks including active neck motion. Experimental results showed that our model can achieve a high success rate of around 90\%, regardless of the distraction from the viewpoint variations by active neck motion. Moreover, the proposed model proved particularly effective in challenging scenarios, such as when objects were situated at the periphery or beyond the standard field of view, where traditional models struggled. The proposed approach contributes to the efficiency of dataset collection and extends the applicability of imitation learning to more complex and dynamic scenarios.
comment: 6 pages
☆ Risk-Guided Diffusion: Toward Deploying Robot Foundation Models in Space, Where Failure Is Not An Option
Safe, reliable navigation in extreme, unfamiliar terrain is required for future robotic space exploration missions. Recent generative-AI methods learn semantically aware navigation policies from large, cross-embodiment datasets, but offer limited safety guarantees. Inspired by human cognitive science, we propose a risk-guided diffusion framework that fuses a fast, learned "System-1" with a slow, physics-based "System-2", sharing computation at both training and inference to couple adaptability with formal safety. Hardware experiments conducted at the NASA JPL's Mars-analog facility, Mars Yard, show that our approach reduces failure rates by up to $4\times$ while matching the goal-reaching performance of learning-based robotic models by leveraging inference-time compute without any additional training.
☆ DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving
Understanding the short-term motion of vulnerable road users (VRUs) like pedestrians and cyclists is critical for safe autonomous driving, especially in urban scenarios with ambiguous or high-risk behaviors. While vision-language models (VLMs) have enabled open-vocabulary perception, their utility for fine-grained intent reasoning remains underexplored. Notably, no existing benchmark evaluates multi-class intent prediction in safety-critical situations, To address this gap, we introduce DRAMA-X, a fine-grained benchmark constructed from the DRAMA dataset via an automated annotation pipeline. DRAMA-X contains 5,686 accident-prone frames labeled with object bounding boxes, a nine-class directional intent taxonomy, binary risk scores, expert-generated action suggestions for the ego vehicle, and descriptive motion summaries. These annotations enable a structured evaluation of four interrelated tasks central to autonomous decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a reference baseline, we propose SGG-Intent, a lightweight, training-free framework that mirrors the ego vehicle's reasoning pipeline. It sequentially generates a scene graph from visual input using VLM-backed detectors, infers intent, assesses risk, and recommends an action using a compositional reasoning stage powered by a large language model. We evaluate a range of recent VLMs, comparing performance across all four DRAMA-X tasks. Our experiments demonstrate that scene-graph-based reasoning enhances intent prediction and risk assessment, especially when contextual cues are explicitly modeled.
comment: 19 pages, 5 figures, Preprint under review. Code available at: https://github.com/taco-group/DRAMA-X
☆ Accelerating Residual Reinforcement Learning with Uncertainty Estimation
Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned polices in the real world to demonstrate their robustness with zero-shot sim-to-real transfer.
☆ VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models
Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline involving task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in terms of network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and components to be further improved. To systematically investigate the impacts of different planning paradigms and representations isolating from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves superior or comparable performance than other paradigms on task performance, pretraining, generalization ability, scalability, and continual learning ability, albeit at the cost of slower training and inference speeds.
♻ ☆ Trajectory Prediction for Autonomous Driving: Progress, Limitations, and Future Directions
As the potential for autonomous vehicles to be integrated on a large scale into modern traffic systems continues to grow, ensuring safe navigation in dynamic environments is crucial for smooth integration. To guarantee safety and prevent collisions, autonomous vehicles must be capable of accurately predicting the trajectories of surrounding traffic agents. Over the past decade, significant efforts from both academia and industry have been dedicated to designing solutions for precise trajectory forecasting. These efforts have produced a diverse range of approaches, raising questions about the differences between these methods and whether trajectory prediction challenges have been fully addressed. This paper reviews a substantial portion of recent trajectory prediction methods proposing a taxonomy to classify existing solutions. A general overview of the prediction pipeline is also provided, covering input and output modalities, modeling features, and prediction paradigms existing in the literature. In addition, the paper discusses active research areas within trajectory prediction, addresses the posed research questions, and highlights the remaining research gaps and challenges.
♻ ☆ Learning Aerodynamics for the Control of Flying Humanoid Robots
Robots with multi-modal locomotion are an active research field due to their versatility in diverse environments. In this context, additional actuation can provide humanoid robots with aerial capabilities. Flying humanoid robots face challenges in modeling and control, particularly with aerodynamic forces. This paper addresses these challenges from a technological and scientific standpoint. The technological contribution includes the mechanical design of iRonCub-Mk1, a jet-powered humanoid robot, optimized for jet engine integration, and hardware modifications for wind tunnel experiments on humanoid robots for precise aerodynamic forces and surface pressure measurements. The scientific contribution offers a comprehensive approach to model and control aerodynamic forces using classical and learning techniques. Computational Fluid Dynamics (CFD) simulations calculate aerodynamic forces, validated through wind tunnel experiments on iRonCub-Mk1. An automated CFD framework expands the aerodynamic dataset, enabling the training of a Deep Neural Network and a linear regression model. These models are integrated into a simulator for designing aerodynamic-aware controllers, validated through flight simulations and balancing experiments on the iRonCub-Mk1 physical prototype.
♻ ☆ Autonomous Navigation of Quadrupeds Using Coverage Path Planning with Morphological Skeleton Map
This paper proposes a novel method of coverage path planning for the purpose of scanning an unstructured environment autonomously. The method uses the morphological skeleton of the prior 2D navigation map via SLAM to generate a sequence of points of interest (POIs). This sequence is then ordered to create an optimal path given the robot's current position. To control the high-level operation, a finite state machine is used to switch between two modes: navigating towards a POI using Nav2, and scanning the local surrounding. We validate the method in a leveled indoor obstacle-free non-convex environment on time efficiency and reachability over five trials. The map reader and the path planner can quickly process maps of width and height ranging between [196,225] pixels and [185,231] pixels in 2.52 ms/pixel and 1.7 ms/pixel, respectively, where their computation time increases with 22.0 ns/pixel and 8.17 $\mu$s/pixel, respectively. The robot managed to reach 86.5% of all waypoints over all five runs. The proposed method suffers from drift occurring in the 2D navigation map.
comment: 15 pages, published to Fronters In Robotics (currently in production), major revision: title change, abstract revised, grammar fixed, mathematical notations fixed and made consistent, conclusion revised, related works extended, Algorithm 1-3 revised
♻ ☆ Object State Estimation Through Robotic Active Interaction for Biological Autonomous Drilling
Estimating the state of biological specimens is challenging due to limited observation through microscopic vision. For instance, during mouse skull drilling, the appearance alters little when thinning bone tissue because of its semi-transparent property and the high-magnification microscopic vision. To obtain the object's state, we introduce an object state estimation method for biological specimens through active interaction based on the deflection. The method is integrated to enhance the autonomous drilling system developed in our previous work. The method and integrated system were evaluated through 12 autonomous eggshell drilling experiment trials. The results show that the system achieved a 91.7% successful ratio and 75% detachable ratio, showcasing its potential applicability in more complex surgical procedures such as mouse skull craniotomy. This research paves the way for further development of autonomous robotic systems capable of estimating the object's state through active interaction.
comment: 6 pages, 5 figures, accepted by RA-L. Please refer to the DOI to access the accepted version
♻ ☆ OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable in many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need for generalization to open-ended instructions and environments, as well as the systematic complexity to integrate high-level decision making with low-level robot control based on both global scene understanding and current agent state. To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. A second challenge is the hallucination from domain shift. To enhance the agent performance, we further introduce an agentic data synthesis pipeline for the OWMM task to adapt the VLM model to our task domain with instruction fine-tuning. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model. Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models including GPT-4o and strong zero-shot generalization in real world. The project page is at https://github.com/HHYHRHY/OWMM-Agent
comment: 9 pages of main content, 19 pages in total
Robotics 43
☆ Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen
comment: Preprint. Project page: https://orangesodahub.github.io/InfGen Code: https://github.com/OrangeSodahub/infgen
☆ Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part$^{2}$GS consistently outperforms state-of-the-art methods by up to 10$\times$ in Chamfer Distance for movable parts.
☆ Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation
Generating large-scale demonstrations for dexterous hand manipulation remains challenging, and several approaches have been proposed in recent years to address this. Among them, generative models have emerged as a promising paradigm, enabling the efficient creation of diverse and physically plausible demonstrations. In this paper, we introduce Dex1B, a large-scale, diverse, and high-quality demonstration dataset produced with generative models. The dataset contains one billion demonstrations for two fundamental tasks: grasping and articulation. To construct it, we propose a generative model that integrates geometric constraints to improve feasibility and applies additional conditions to enhance diversity. We validate the model on both established and newly introduced simulation benchmarks, where it significantly outperforms prior state-of-the-art methods. Furthermore, we demonstrate its effectiveness and robustness through real-world robot experiments. Our project page is at https://jianglongye.com/dex1b
comment: Accepted to RSS 2025. Project page: https://jianglongye.com/dex1b
☆ Judo: A User-Friendly Open-Source Package for Sampling-Based Model Predictive Control
Recent advancements in parallel simulation and successful robotic applications are spurring a resurgence in sampling-based model predictive control. To build on this progress, however, the robotics community needs common tooling for prototyping, evaluating, and deploying sampling-based controllers. We introduce Judo, a software package designed to address this need. To facilitate rapid prototyping and evaluation, Judo provides robust implementations of common sampling-based MPC algorithms and standardized benchmark tasks. It further emphasizes usability with simple but extensible interfaces for controller and task definitions, asynchronous execution for straightforward simulation-to-hardware transfer, and a highly customizable interactive GUI for tuning controllers interactively. While written in Python, the software leverages MuJoCo as its physics backend to achieve real-time performance, which we validate across both consumer and server-grade hardware. Code at https://github.com/bdaiinstitute/judo.
comment: Accepted at the 2025 RSS Workshop on Fast Motion Planning and Control in the Era of Parallelism. 5 Pages
☆ RGBTrack: Fast, Robust Depth-Free 6D Pose Estimation and Tracking IROS 2025
We introduce a robust framework, RGBTrack, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input for such dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTrack integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTrack's scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTrack's novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution candidate for application areas including robotics, augmented reality, and computer vision. The source code for our implementation will be made publicly available at https://github.com/GreatenAnoymous/RGBTrack.git.
comment: Accepted to IROS 2025
☆ Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping IROS 2025
Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, and stereo-vision, which can be expensive, produce noisy output (as compared with RGB cameras), and fail to handle transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets, but fail to generalize. We propose a novel framework, Monocular One-shot Metric-depth Alignment (MOMA), to recover metric depth from a single RGB image, through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignments during camera calibration, guided by sparse ground-truth depth points, enabling accurate depth estimation without additional data collection or model retraining on the testing setup. MOMA supports fine-tuning the MDEM on transparent objects, demonstrating strong generalization capabilities. Real-world experiments on tabletop 2-finger grasping and suction-based bin-picking applications show MOMA achieves high success rates in diverse tasks, confirming its effectiveness.
comment: Accepted to IROS 2025
☆ Learning Accurate Whole-body Throwing with High-frequency Residual Policy and Pullback Tube Acceleration IROS 2025
Throwing is a fundamental skill that enables robots to manipulate objects in ways that extend beyond the reach of their arms. We present a control framework that combines learning and model-based control for prehensile whole-body throwing with legged mobile manipulators. Our framework consists of three components: a nominal tracking policy for the end-effector, a high-frequency residual policy to enhance tracking accuracy, and an optimization-based module to improve end-effector acceleration control. The proposed controller achieved the average of 0.28 m landing error when throwing at targets located 6 m away. Furthermore, in a comparative study with university students, the system achieved a velocity tracking error of 0.398 m/s and a success rate of 56.8%, hitting small targets randomly placed at distances of 3-5 m while throwing at a specified speed of 6 m/s. In contrast, humans have a success rate of only 15.2%. This work provides an early demonstration of prehensile throwing with quantified accuracy on hardware, contributing to progress in dynamic whole-body manipulation.
comment: 8 pages, IROS 2025
☆ SDDiff: Boost Radar Perception via Spatial-Doppler Diffusion
Point cloud extraction (PCE) and ego velocity estimation (EVE) are key capabilities gaining attention in 3D radar perception. However, existing work typically treats these two tasks independently, which may neglect the interplay between radar's spatial and Doppler domain features, potentially introducing additional bias. In this paper, we observe an underlying correlation between 3D points and ego velocity, which offers reciprocal benefits for PCE and EVE. To fully unlock such inspiring potential, we take the first step to design a Spatial-Doppler Diffusion (SDDiff) model for simultaneously dense PCE and accurate EVE. To seamlessly tailor it to radar perception, SDDiff improves the conventional latent diffusion process in three major aspects. First, we introduce a representation that embodies both spatial occupancy and Doppler features. Second, we design a directional diffusion with radar priors to streamline the sampling. Third, we propose Iterative Doppler Refinement to enhance the model's adaptability to density variations and ghosting effects. Extensive evaluations show that SDDiff significantly outperforms state-of-the-art baselines by achieving 59% higher in EVE accuracy, 4X greater in valid generation density while boosting PCE effectiveness and reliability.
☆ Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
comment: 14 pages, 6 figures, under review
☆ Orbital Collision: An Indigenously Developed Web-based Space Situational Awareness Platform SC 2025
This work presents an indigenous web based platform Orbital Collision (OrCo), created by the Space Systems Laboratory at IIIT Delhi, to enhance Space Situational Awareness (SSA) by predicting collision probabilities of space objects using Two Line Elements (TLE) data. The work highlights the growing challenges of congestion in the Earth's orbital environment, mainly due to space debris and defunct satellites, which increase collision risks. It employs several methods for propagating orbital uncertainty and calculating the collision probability. The performance of the platform is evaluated through accuracy assessments and efficiency metrics, in order to improve the tracking of space objects and ensure the safety of the satellite in congested space.
comment: This work has been already submitted for STEP-IPSC 2025 Conference Proceedings
☆ ROS 2 Agnocast: Supporting Unsized Message Types for True Zero-Copy Publish/Subscribe IPC
Robot applications, comprising independent components that mutually publish/subscribe messages, are built on inter-process communication (IPC) middleware such as Robot Operating System 2 (ROS 2). In large-scale ROS 2 systems like autonomous driving platforms, true zero-copy communication -- eliminating serialization and deserialization -- is crucial for efficiency and real-time performance. However, existing true zero-copy middleware solutions lack widespread adoption as they fail to meet three essential requirements: 1) Support for all ROS 2 message types including unsized ones; 2) Minimal modifications to existing application code; 3) Selective implementation of zero-copy communication between specific nodes while maintaining conventional communication mechanisms for other inter-node communications including inter-host node communications. This first requirement is critical, as production-grade ROS 2 projects like Autoware rely heavily on unsized message types throughout their codebase to handle diverse use cases (e.g., various sensors), and depend on the broader ROS 2 ecosystem, where unsized message types are pervasive in libraries. The remaining requirements facilitate seamless integration with existing projects. While IceOryx middleware, a practical true zero-copy solution, meets all but the first requirement, other studies achieving the first requirement fail to satisfy the remaining criteria. This paper presents Agnocast, a true zero-copy IPC framework applicable to ROS 2 C++ on Linux that fulfills all these requirements. Our evaluation demonstrates that Agnocast maintains constant IPC overhead regardless of message size, even for unsized message types. In Autoware PointCloud Preprocessing, Agnocast achieves a 16% improvement in average response time and a 25% improvement in worst-case response time.
comment: 10 pages, 13 figures. Accepted for IEEE ISORC 2025; this is the author-accepted manuscript
☆ Vision-Based Multirotor Control for Spherical Target Tracking: A Bearing-Angle Approach
This work addresses the problem of designing a visual servo controller for a multirotor vehicle, with the end goal of tracking a moving spherical target with unknown radius. To address this problem, we first transform two bearing measurements provided by a camera sensor into a bearing-angle pair. We then use this information to derive the system's dynamics in a new set of coordinates, where the angle measurement is used to quantify a relative distance to the target. Building on this system representation, we design an adaptive nonlinear control algorithm that takes advantage of the properties of the new system geometry and assumes that the target follows a constant acceleration model. Simulation results illustrate the performance of the proposed control algorithm.
comment: This paper has been accepted for presentation at the 2025 IEEE European Control Conference (ECC)
☆ Camera Calibration via Circular Patterns: A Comprehensive Framework with Measurement Uncertainty and Unbiased Projection Model
Camera calibration using planar targets has been widely favored, and two types of control points have been mainly considered as measurements: the corners of the checkerboard and the centroid of circles. Since a centroid is derived from numerous pixels, the circular pattern provides more precise measurements than the checkerboard. However, the existing projection model of circle centroids is biased under lens distortion, resulting in low performance. To surmount this limitation, we propose an unbiased projection model of the circular pattern and demonstrate its superior accuracy compared to the checkerboard. Complementing this, we introduce uncertainty into circular patterns to enhance calibration robustness and completeness. Defining centroid uncertainty improves the performance of calibration components, including pattern detection, optimization, and evaluation metrics. We also provide guidelines for performing good camera calibration based on the evaluation metric. The core concept of this approach is to model the boundary points of a two-dimensional shape as a Markov random field, considering its connectivity. The shape distribution is propagated to the centroid uncertainty through an appropriate shape representation based on the Green theorem. Consequently, the resulting framework achieves marked gains in calibration accuracy and robustness. The complete source code and demonstration video are available at https://github.com/chaehyeonsong/discocal.
☆ AnyTraverse: An off-road traversability framework with VLM and human operator in the loop
Off-road traversability segmentation enables autonomous navigation with applications in search-and-rescue, military operations, wildlife exploration, and agriculture. Current frameworks struggle due to significant variations in unstructured environments and uncertain scene changes, and are not adaptive to be used for different robot types. We present AnyTraverse, a framework combining natural language-based prompts with human-operator assistance to determine navigable regions for diverse robotic vehicles. The system segments scenes for a given set of prompts and calls the operator only when encountering previously unexplored scenery or unknown class not part of the prompt in its region-of-interest, thus reducing active supervision load while adapting to varying outdoor scenes. Our zero-shot learning approach eliminates the need for extensive data collection or retraining. Our experimental validation includes testing on RELLIS-3D, Freiburg Forest, and RUGD datasets and demonstrate real-world deployment on multiple robot platforms. The results show that AnyTraverse performs better than GA-NAV and Off-seg while offering a vehicle-agnostic approach to off-road traversability that balances automation with targeted human supervision.
☆ Learning Dexterous Object Handover
Object handover is an important skill that we use daily when interacting with other humans. To deploy robots in collaborative setting, like houses, being able to receive and handing over objects safely and efficiently becomes a crucial skill. In this work, we demonstrate the use of Reinforcement Learning (RL) for dexterous object handover between two multi-finger hands. Key to this task is the use of a novel reward function based on dual quaternions to minimize the rotation distance, which outperforms other rotation representations such as Euler and rotation matrices. The robustness of the trained policy is experimentally evaluated by testing w.r.t. objects that are not included in the training distribution, and perturbations during the handover process. The results demonstrate that the trained policy successfully perform this task, achieving a total success rate of 94% in the best-case scenario after 100 experiments, thereby showing the robustness of our policy with novel objects. In addition, the best-case performance of the policy decreases by only 13.8% when the other robot moves during the handover, proving that our policy is also robust to this type of perturbation, which is common in real-world object handovers.
comment: Paper accepted for presentation in RoMan 2025
☆ Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation ICML2025
Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent's cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.
comment: ICML2025 poster, 39 pages, 6 figures, 13 tables. arXiv admin note: text overlap with arXiv:2409.00418
☆ A Scalable Post-Processing Pipeline for Large-Scale Free-Space Multi-Agent Path Planning with PiBT
Free-space multi-agent path planning remains challenging at large scales. Most existing methods either offer optimality guarantees but do not scale beyond a few dozen agents, or rely on grid-world assumptions that do not generalize well to continuous space. In this work, we propose a hybrid, rule-based planning framework that combines Priority Inheritance with Backtracking (PiBT) with a novel safety-aware path smoothing method. Our approach extends PiBT to 8-connected grids and selectively applies string-pulling based smoothing while preserving collision safety through local interaction awareness and a fallback collision resolution step based on Safe Interval Path Planning (SIPP). This design allows us to reduce overall path lengths while maintaining real-time performance. We demonstrate that our method can scale to over 500 agents in large free-space environments, outperforming existing any-angle and optimal methods in terms of runtime, while producing near-optimal trajectories in sparse domains. Our results suggest this framework is a promising building block for scalable, real-time multi-agent navigation in robotics systems operating beyond grid constraints.
☆ IsoNet: Causal Analysis of Multimodal Transformers for Neuromuscular Gesture Classification
Hand gestures are a primary output of the human motor system, yet the decoding of their neuromuscular signatures remains a bottleneck for basic neuroscience and assistive technologies such as prosthetics. Traditional human-machine interface pipelines rely on a single biosignal modality, but multimodal fusion can exploit complementary information from sensors. We systematically compare linear and attention-based fusion strategies across three architectures: a Multimodal MLP, a Multimodal Transformer, and a Hierarchical Transformer, evaluating performance on scenarios with unimodal and multimodal inputs. Experiments use two publicly available datasets: NinaPro DB2 (sEMG and accelerometer) and HD-sEMG 65-Gesture (high-density sEMG and force). Across both datasets, the Hierarchical Transformer with attention-based fusion consistently achieved the highest accuracy, surpassing the multimodal and best single-modality linear-fusion MLP baseline by over 10% on NinaPro DB2 and 3.7% on HD-sEMG. To investigate how modalities interact, we introduce an Isolation Network that selectively silences unimodal or cross-modal attention pathways, quantifying each group of token interactions' contribution to downstream decisions. Ablations reveal that cross-modal interactions contribute approximately 30% of the decision signal across transformer layers, highlighting the importance of attention-driven fusion in harnessing complementary modality information. Together, these findings reveal when and how multimodal fusion would enhance biosignal classification and also provides mechanistic insights of human muscle activities. The study would be beneficial in the design of sensor arrays for neurorobotic systems.
☆ DRARL: Disengagement-Reason-Augmented Reinforcement Learning for Efficient Improvement of Autonomous Driving Policy
With the increasing presence of automated vehicles on open roads under driver supervision, disengagement cases are becoming more prevalent. While some data-driven planning systems attempt to directly utilize these disengagement cases for policy improvement, the inherent scarcity of disengagement data (often occurring as a single instances) restricts training effectiveness. Furthermore, some disengagement data should be excluded since the disengagement may not always come from the failure of driving policies, e.g. the driver may casually intervene for a while. To this end, this work proposes disengagement-reason-augmented reinforcement learning (DRARL), which enhances driving policy improvement process according to the reason of disengagement cases. Specifically, the reason of disengagement is identified by a out-of-distribution (OOD) state estimation model. When the reason doesn't exist, the case will be identified as a casual disengagement case, which doesn't require additional policy adjustment. Otherwise, the policy can be updated under a reason-augmented imagination environment, improving the policy performance of disengagement cases with similar reasons. The method is evaluated using real-world disengagement cases collected by autonomous driving robotaxi. Experimental results demonstrate that the method accurately identifies policy-related disengagement reasons, allowing the agent to handle both original and semantically similar cases through reason-augmented training. Furthermore, the approach prevents the agent from becoming overly conservative after policy adjustments. Overall, this work provides an efficient way to improve driving policy performance with disengagement cases.
☆ Experimental Setup and Software Pipeline to Evaluate Optimization based Autonomous Multi-Robot Search Algorithms
Signal source localization has been a problem of interest in the multi-robot systems domain given its applications in search \& rescue and hazard localization in various industrial and outdoor settings. A variety of multi-robot search algorithms exist that usually formulate and solve the associated autonomous motion planning problem as a heuristic model-free or belief model-based optimization process. Most of these algorithms however remains tested only in simulation, thereby losing the opportunity to generate knowledge about how such algorithms would compare/contrast in a real physical setting in terms of search performance and real-time computing performance. To address this gap, this paper presents a new lab-scale physical setup and associated open-source software pipeline to evaluate and benchmark multi-robot search algorithms. The presented physical setup innovatively uses an acoustic source (that is safe and inexpensive) and small ground robots (e-pucks) operating in a standard motion-capture environment. This setup can be easily recreated and used by most robotics researchers. The acoustic source also presents interesting uncertainty in terms of its noise-to-signal ratio, which is useful to assess sim-to-real gaps. The overall software pipeline is designed to readily interface with any multi-robot search algorithm with minimal effort and is executable in parallel asynchronous form. This pipeline includes a framework for distributed implementation of multi-robot or swarm search algorithms, integrated with a ROS (Robotics Operating System)-based software stack for motion capture supported localization. The utility of this novel setup is demonstrated by using it to evaluate two state-of-the-art multi-robot search algorithms, based on swarm optimization and batch-Bayesian Optimization (called Bayes-Swarm), as well as a random walk baseline.
comment: to be published in IDETC 2025 conference proceedings
☆ VLM-Empowered Multi-Mode System for Efficient and Safe Planetary Navigation IROS 2025
The increasingly complex and diverse planetary exploration environment requires more adaptable and flexible rover navigation strategy. In this study, we propose a VLM-empowered multi-mode system to achieve efficient while safe autonomous navigation for planetary rovers. Vision-Language Model (VLM) is used to parse scene information by image inputs to achieve a human-level understanding of terrain complexity. Based on the complexity classification, the system switches to the most suitable navigation mode, composing of perception, mapping and planning modules designed for different terrain types, to traverse the terrain ahead before reaching the next waypoint. By integrating the local navigation system with a map server and a global waypoint generation module, the rover is equipped to handle long-distance navigation tasks in complex scenarios. The navigation system is evaluated in various simulation environments. Compared to the single-mode conservative navigation method, our multi-mode system is able to bootstrap the time and energy efficiency in a long-distance traversal with varied type of obstacles, enhancing efficiency by 79.5%, while maintaining its avoidance capabilities against terrain hazards to guarantee rover safety. More system information is shown at https://chengsn1234.github.io/multi-mode-planetary-navigation/.
comment: accepted by IROS 2025
☆ Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections
We address key challenges in Dataset Aggregation (DAgger) for real-world contact-rich manipulation: how to collect informative human correction data and how to effectively update policies with this new data. We introduce Compliant Residual DAgger (CR-DAgger), which contains two novel components: 1) a Compliant Intervention Interface that leverages compliance control, allowing humans to provide gentle, accurate delta action corrections without interrupting the ongoing robot policy execution; and 2) a Compliant Residual Policy formulation that learns from human corrections while incorporating force feedback and force control. Our system significantly enhances performance on precise contact-rich manipulation tasks using minimal correction data, improving base policy success rates by over 50\% on two challenging tasks (book flipping and belt assembly) while outperforming both retraining-from-scratch and finetuning approaches. Through extensive real-world experiments, we provide practical guidance for implementing effective DAgger in real-world robot learning tasks. Result videos are available at: https://compliant-residual-dagger.github.io/
☆ PPTP: Performance-Guided Physiological Signal-Based Trust Prediction in Human-Robot Collaboration
Trust prediction is a key issue in human-robot collaboration, especially in construction scenarios where maintaining appropriate trust calibration is critical for safety and efficiency. This paper introduces the Performance-guided Physiological signal-based Trust Prediction (PPTP), a novel framework designed to improve trust assessment. We designed a human-robot construction scenario with three difficulty levels to induce different trust states. Our approach integrates synchronized multimodal physiological signals (ECG, GSR, and EMG) with collaboration performance evaluation to predict human trust levels. Individual physiological signals are processed using collaboration performance information as guiding cues, leveraging the standardized nature of collaboration performance to compensate for individual variations in physiological responses. Extensive experiments demonstrate the efficacy of our cross-modality fusion method in significantly improving trust classification performance. Our model achieves over 81% accuracy in three-level trust classification, outperforming the best baseline method by 6.7%, and notably reaches 74.3% accuracy in high-resolution seven-level classification, which is a first in trust prediction research. Ablation experiments further validate the superiority of physiological signal processing guided by collaboration performance assessment.
☆ On the Power of Spatial Locality on Online Routing Problems
We consider the online versions of two fundamental routing problems, traveling salesman (TSP) and dial-a-ride (DARP), which have a variety of relevant applications in logistics and robotics. The online versions of these problems concern with efficiently serving a sequence of requests presented in a real-time on-line fashion located at points of a metric space by servers (salesmen/vehicles/robots). In this paper, motivated from real-world applications, such as Uber/Lyft rides, where some limited knowledge is available on the future requests, we propose the {\em spatial locality} model that provides in advance the distance within which new request(s) will be released from the current position of server(s). We study the usefulness of this advanced information on achieving the improved competitive ratios for both the problems with $k\geq 1$ servers, compared to the competitive results established in the literature without such spatial locality consideration. We show that small locality is indeed useful in obtaining improved competitive ratios irrespective of the metric space.
comment: 13 pages
☆ EASE: Embodied Active Event Perception via Self-Supervised Energy Minimization
Active event perception, the ability to dynamically detect, track, and summarize events in real time, is essential for embodied intelligence in tasks such as human-AI collaboration, assistive robotics, and autonomous navigation. However, existing approaches often depend on predefined action spaces, annotated datasets, and extrinsic rewards, limiting their adaptability and scalability in dynamic, real-world scenarios. Inspired by cognitive theories of event perception and predictive coding, we propose EASE, a self-supervised framework that unifies spatiotemporal representation learning and embodied control through free energy minimization. EASE leverages prediction errors and entropy as intrinsic signals to segment events, summarize observations, and actively track salient actors, operating without explicit annotations or external rewards. By coupling a generative perception model with an action-driven control policy, EASE dynamically aligns predictions with observations, enabling emergent behaviors such as implicit memory, target continuity, and adaptability to novel environments. Extensive evaluations in simulation and real-world settings demonstrate EASE's ability to achieve privacy-preserving and scalable event perception, providing a robust foundation for embodied systems in unscripted, dynamic tasks.
comment: Accepted to IEEE Robotics and Automation Letters, 2025
☆ Public Perceptions of Autonomous Vehicles: A Survey of Pedestrians and Cyclists in Pittsburgh
This study investigates how autonomous vehicle(AV) technology is perceived by pedestrians and bicyclists in Pittsburgh. Using survey data from over 1200 respondents, the research explores the interplay between demographics, AV interactions, infrastructural readiness, safety perceptions, and trust. Findings highlight demographic divides, infrastructure gaps, and the crucial role of communication and education in AV adoption.
☆ Online Adaptation for Flying Quadrotors in Tight Formations
The task of flying in tight formations is challenging for teams of quadrotors because the complex aerodynamic wake interactions can destabilize individual team members as well as the team. Furthermore, these aerodynamic effects are highly nonlinear and fast-paced, making them difficult to model and predict. To overcome these challenges, we present L1 KNODE-DW MPC, an adaptive, mixed expert learning based control framework that allows individual quadrotors to accurately track trajectories while adapting to time-varying aerodynamic interactions during formation flights. We evaluate L1 KNODE-DW MPC in two different three-quadrotor formations and show that it outperforms several MPC baselines. Our results show that the proposed framework is capable of enabling the three-quadrotor team to remain vertically aligned in close proximity throughout the flight. These findings show that the L1 adaptive module compensates for unmodeled disturbances most effectively when paired with an accurate dynamics model. A video showcasing our framework and the physical experiments is available here: https://youtu.be/9QX1Q5Ut9Rs
comment: 10 pages, 4 figures
☆ Distilling On-device Language Models for Robot Planning with Minimal Human Intervention
Large language models (LLMs) provide robots with powerful contextual reasoning abilities and a natural human interface. Yet, current LLM-enabled robots typically depend on cloud-hosted models, limiting their usability in environments with unreliable communication infrastructure, such as outdoor or industrial settings. We present PRISM, a framework for distilling small language model (SLM)-enabled robot planners that run on-device with minimal human supervision. Starting from an existing LLM-enabled planner, PRISM automatically synthesizes diverse tasks and environments, elicits plans from the LLM, and uses this synthetic dataset to distill a compact SLM as a drop-in replacement of the source model. We apply PRISM to three LLM-enabled planners for mapping and exploration, manipulation, and household assistance, and we demonstrate that PRISM improves the performance of Llama-3.2-3B from 10-20% of GPT-4o's performance to over 93% - using only synthetic data. We further demonstrate that the distilled planners generalize across heterogeneous robotic platforms (ground and aerial) and diverse environments (indoor and outdoor). We release all software, trained models, and datasets at https://zacravichandran.github.io/PRISM.
☆ DiLQR: Differentiable Iterative Linear Quadratic Regulator via Implicit Differentiation ICML 2025
While differentiable control has emerged as a powerful paradigm combining model-free flexibility with model-based efficiency, the iterative Linear Quadratic Regulator (iLQR) remains underexplored as a differentiable component. The scalability of differentiating through extended iterations and horizons poses significant challenges, hindering iLQR from being an effective differentiable controller. This paper introduces DiLQR, a framework that facilitates differentiation through iLQR, allowing it to serve as a trainable and differentiable module, either as or within a neural network. A novel aspect of this framework is the analytical solution that it provides for the gradient of an iLQR controller through implicit differentiation, which ensures a constant backward cost regardless of iteration, while producing an accurate gradient. We evaluate our framework on imitation tasks on famous control benchmarks. Our analytical method demonstrates superior computational performance, achieving up to 128x speedup and a minimum of 21x speedup compared to automatic differentiation. Our method also demonstrates superior learning performance ($10^6$x) compared to traditional neural network policies and better model loss with differentiable controllers that lack exact analytical gradients. Furthermore, we integrate our module into a larger network with visual inputs to demonstrate the capacity of our method for high-dimensional, fully end-to-end tasks. Codes can be found on the project homepage https://sites.google.com/view/dilqr/.
comment: Accepted at ICML 2025. Official conference page: https://icml.cc/virtual/2025/poster/44176. OpenReview page: https://openreview.net/forum?id=m2EfTrbv4o
☆ General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting
Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed data flows, limiting generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge suitable for reasoning and planning. Yet, prior LVLM-robot integrations typically depend on pre-mapped spaces, hard-coded representations, and myopic exploration. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose navigation framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools available within modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query the robotic modules, reason over multimodal inputs, and select appropriate navigation actions. This approach enables robust navigation and reasoning in previously unmapped environments, providing a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA achieves state-of-the-art performance, demonstrating effective exploration, navigation, and embodied question answering without relying on handcrafted plans, fixed input representations, or pre-existing maps.
☆ Kinematic Model Optimization via Differentiable Contact Manifold for In-Space Manipulation
Robotic manipulation in space is essential for emerging applications such as debris removal and in-space servicing, assembly, and manufacturing (ISAM). A key requirement for these tasks is the ability to perform precise, contact-rich manipulation under significant uncertainty. In particular, thermal-induced deformation of manipulator links and temperature-dependent encoder bias introduce kinematic parameter errors that significantly degrade end-effector accuracy. Traditional calibration techniques rely on external sensors or dedicated calibration procedures, which can be infeasible or risky in dynamic, space-based operational scenarios. This paper proposes a novel method for kinematic parameter estimation that only requires encoder measurements and binary contact detection. The approach focuses on estimating link thermal deformation strain and joint encoder biases by leveraging information of the contact manifold - the set of relative SE(3) poses at which contact between the manipulator and environment occurs. We present two core contributions: (1) a differentiable, learning-based model of the contact manifold, and (2) an optimization-based algorithm for estimating kinematic parameters from encoder measurements at contact instances. By enabling parameter estimation using only encoder measurements and contact detection, this method provides a robust, interpretable, and data-efficient solution for safe and accurate manipulation in the challenging conditions of space.
comment: Accepted and presented in RSS 2025 Space Robotics Workshop (https://albee.github.io/space-robotics-rss/). 3 pages with 1 figure
☆ A workflow for generating synthetic LiDAR datasets in simulation environments
This paper presents a simulation workflow for generating synthetic LiDAR datasets to support autonomous vehicle perception, robotics research, and sensor security analysis. Leveraging the CoppeliaSim simulation environment and its Python API, we integrate time-of-flight LiDAR, image sensors, and two dimensional scanners onto a simulated vehicle platform operating within an urban scenario. The workflow automates data capture, storage, and annotation across multiple formats (PCD, PLY, CSV), producing synchronized multimodal datasets with ground truth pose information. We validate the pipeline by generating large-scale point clouds and corresponding RGB and depth imagery. The study examines potential security vulnerabilities in LiDAR data, such as adversarial point injection and spoofing attacks, and demonstrates how synthetic datasets can facilitate the evaluation of defense strategies. Finally, limitations related to environmental realism, sensor noise modeling, and computational scalability are discussed, and future research directions, such as incorporating weather effects, real-world terrain models, and advanced scanner configurations, are proposed. The workflow provides a versatile, reproducible framework for generating high-fidelity synthetic LiDAR datasets to advance perception research and strengthen sensor security in autonomous systems. Documentation and examples accompany this framework; samples of animated cloud returns and image sensor data can be found at this Link.
♻ ☆ Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
AI agents today are mostly siloed - they either retrieve and reason over vast amount of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.
♻ ☆ ASAP-MO:Advanced Situational Awareness and Perception for Mission-critical Operations ICRA
Deploying robotic missions can be challenging due to the complexity of controlling robots with multiple degrees of freedom, fusing diverse sensory inputs, and managing communication delays and interferences. In nuclear inspection, robots can be crucial in assessing environments where human presence is limited, requiring precise teleoperation and coordination. Teleoperation requires extensive training, as operators must process multiple outputs while ensuring safe interaction with critical assets. These challenges are amplified when operating a fleet of heterogeneous robots across multiple environments, as each robot may have distinct control interfaces, sensory systems, and operational constraints. Efficient coordination in such settings remains an open problem. This paper presents a field report on how we integrated robot fleet capabilities - including mapping, localization, and telecommunication - toward a joint mission. We simulated a nuclear inspection scenario for exposed areas, using lights to represent a radiation source. We deployed two Unmanned Ground Vehicles (UGVs) tasked with mapping indoor and outdoor environments while remotely controlled from a single base station. Despite having distinct operational goals, the robots produced a unified map output, demonstrating the feasibility of coordinated multi-robot missions. Our results highlight key operational challenges and provide insights into improving adaptability and situational awareness in remote robotic deployments.
comment: 6 pages + references, 7 figures, Presented at the 2025 IEEE ICRA Workshop on Field Robotics
♻ ☆ Safe Guaranteed Exploration for Non-linear Systems
Safely exploring environments with a-priori unknown constraints is a fundamental challenge that restricts the autonomy of robots. While safety is paramount, guarantees on sufficient exploration are also crucial for ensuring autonomous task completion. To address these challenges, we propose a novel safe guaranteed exploration framework using optimal control, which achieves first-of-its-kind results: guaranteed exploration for non-linear systems with finite time sample complexity bounds, while being provably safe with arbitrarily high probability. The framework is general and applicable to many real-world scenarios with complex non-linear dynamics and unknown domains. We improve the efficiency of this general framework by proposing an algorithm, SageMPC, SAfe Guaranteed Exploration using Model Predictive Control. SageMPC leverages three key techniques: i) exploiting a Lipschitz bound, ii) goal-directed exploration, and iii) receding horizon style re-planning, all while maintaining the desired sample complexity, safety and exploration guarantees of the framework. Lastly, we demonstrate safe efficient exploration in challenging unknown environments using SageMPC with a car model.
comment: Accepted paper in IEEE Transactions on Automatic Control, 2025
♻ ☆ Problem Space Transformations for Out-of-Distribution Generalisation in Behavioural Cloning
The combination of behavioural cloning and neural networks has driven significant progress in robotic manipulation. As these algorithms may require a large number of demonstrations for each task of interest, they remain fundamentally inefficient in complex scenarios, in which finite datasets can hardly cover the state space. One of the remaining challenges is thus out-of-distribution (OOD) generalisation, i.e. the ability to predict correct actions for states with a low likelihood with respect to the state occupancy induced by the dataset. This issue is aggravated when the system to control is treated as a black-box, ignoring its physical properties. This work characterises widespread properties of robotic manipulation, specifically pose equivariance and locality. We investigate the effect of the choice of problem space on OOD performance of BC policies and how transformations arising from characteristic properties of manipulation could be employed for its improvement. We empirically demonstrate that these transformations allow behaviour cloning policies, using either standard MLP-based one-step action prediction or diffusion-based action-sequence prediction, to generalise better to OOD problem instances.
♻ ☆ SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping
Recent advancements in 3D robotic manipulation have improved grasping of everyday objects, but transparent and specular materials remain challenging due to depth sensing limitations. While several 3D reconstruction and depth completion approaches address these challenges, they suffer from setup complexity or limited observation information utilization. To address this, leveraging the power of single view 3D object reconstruction approaches, we propose a training free framework SR3D that enables robotic grasping of transparent and specular objects from a single view observation. Specifically, given single view RGB and depth images, SR3D first uses the external visual models to generate 3D reconstructed object mesh based on RGB image. Then, the key idea is to determine the 3D object's pose and scale to accurately localize the reconstructed object back into its original depth corrupted 3D scene. Therefore, we propose view matching and keypoint matching mechanisms,which leverage both the 2D and 3D's inherent semantic and geometric information in the observation to determine the object's 3D state within the scene, thereby reconstructing an accurate 3D depth map for effective grasp detection. Experiments in both simulation and real world show the reconstruction effectiveness of SR3D.
♻ ☆ Rejecting Outliers in 2D-3D Point Correspondences from 2D Forward-Looking Sonar Observations
Rejecting outliers before applying classical robust methods is a common approach to increase the success rate of estimation, particularly when the outlier ratio is extremely high (e.g. 90%). However, this method often relies on sensor- or task-specific characteristics, which may not be easily transferable across different scenarios. In this paper, we focus on the problem of rejecting 2D-3D point correspondence outliers from 2D forward-looking sonar (2D FLS) observations, which is one of the most popular perception device in the underwater field but has a significantly different imaging mechanism compared to widely used perspective cameras and LiDAR. We fully leverage the narrow field of view in the elevation of 2D FLS and develop two compatibility tests for different 3D point configurations: (1) In general cases, we design a pairwise length in-range test to filter out overly long or short edges formed from point sets; (2) In coplanar cases, we design a coplanarity test to check if any four correspondences are compatible under a coplanar setting. Both tests are integrated into outlier rejection pipelines, where they are followed by maximum clique searching to identify the largest consistent measurement set as inliers. Extensive simulations demonstrate that the proposed methods for general and coplanar cases perform effectively under outlier ratios of 80% and 90%, respectively.
♻ ☆ DualGuard MPPI: Safe and Performant Optimal Control by Combining Sampling-Based MPC and Hamilton-Jacobi Reachability
Designing controllers that are both safe and performant is inherently challenging. This co-optimization can be formulated as a constrained optimal control problem, where the cost function represents the performance criterion and safety is specified as a constraint. While sampling-based methods, such as Model Predictive Path Integral (MPPI) control, have shown great promise in tackling complex optimal control problems, they often struggle to enforce safety constraints. To address this limitation, we propose DualGuard-MPPI, a novel framework for solving safety-constrained optimal control problems. Our approach integrates Hamilton-Jacobi reachability analysis within the MPPI sampling process to ensure that all generated samples are provably safe for the system. On the one hand, this integration allows DualGuard-MPPI to enforce strict safety constraints; at the same time, it facilitates a more effective exploration of the environment with the same number of samples, reducing the effective sampling variance and leading to better performance optimization. Through several simulations and hardware experiments, we demonstrate that the proposed approach achieves much higher performance compared to existing MPPI methods, without compromising safety.
comment: 8 pages, 7 figures
♻ ☆ Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies
Recent years have witnessed impressive robotic manipulation systems driven by advances in imitation learning and generative modeling, such as diffusion- and flow-based approaches. As robot policy performance increases, so does the complexity and time horizon of achievable tasks, inducing unexpected and diverse failure modes that are difficult to predict a priori. To enable trustworthy policy deployment in safety-critical human environments, reliable runtime failure detection becomes important during policy inference. However, most existing failure detection approaches rely on prior knowledge of failure modes and require failure data during training, which imposes a significant challenge in practicality and scalability. In response to these limitations, we present FAIL-Detect, a modular two-stage approach for failure detection in imitation learning-based robotic manipulation. To accurately identify failures from successful training data alone, we frame the problem as sequential out-of-distribution (OOD) detection. We first distill policy inputs and outputs into scalar signals that correlate with policy failures and capture epistemic uncertainty. FAIL-Detect then employs conformal prediction (CP) as a versatile framework for uncertainty quantification with statistical guarantees. Empirically, we thoroughly investigate both learned and post-hoc scalar signal candidates on diverse robotic manipulation tasks. Our experiments show learned signals to be mostly consistently effective, particularly when using our novel flow-based density estimator. Furthermore, our method detects failures more accurately and faster than state-of-the-art (SOTA) failure detection baselines. These results highlight the potential of FAIL-Detect to enhance the safety and reliability of imitation learning-based robotic systems as they progress toward real-world deployment.
comment: Accepted by Robotics: Science and Systems 2025
♻ ☆ SD++: Enhancing Standard Definition Maps by Incorporating Road Knowledge using LLMs
High-definition maps (HD maps) are detailed and informative maps capturing lane centerlines and road elements. Although very useful for autonomous driving, HD maps are costly to build and maintain. Furthermore, access to these high-quality maps is usually limited to the firms that build them. On the other hand, standard definition (SD) maps provide road centerlines with an accuracy of a few meters. In this paper, we explore the possibility of enhancing SD maps by incorporating information from road manuals using LLMs. We develop SD++, an end-to-end pipeline to enhance SD maps with location-dependent road information obtained from a road manual. We suggest and compare several ways of using LLMs for such a task. Furthermore, we show the generalization ability of SD++ by showing results from both California and Japan.
comment: 7 pages, 8 figures, 1 table, Accepted at IEEE Intelligent Vehicles Symposium 2025
♻ ☆ Using Language and Road Manuals to Inform Map Reconstruction for Autonomous Driving
Lane-topology prediction is a critical component of safe and reliable autonomous navigation. An accurate understanding of the road environment aids this task. We observe that this information often follows conventions encoded in natural language, through design codes that reflect the road structure and road names that capture the road functionality. We augment this information in a lightweight manner to SMERF, a map-prior-based online lane-topology prediction model, by combining structured road metadata from OSM maps and lane-width priors from Road design manuals with the road centerline encodings. We evaluate our method on two geo-diverse complex intersection scenarios. Our method shows improvement in both lane and traffic element detection and their association. We report results using four topology-aware metrics to comprehensively assess the model performance. These results demonstrate the ability of our approach to generalize and scale to diverse topologies and conditions.
comment: 4 pages, 3 figures, Accepted at RSS 2025 Workshop - RobotEvaluation@RSS 2025
♻ ☆ Uncertainty-Aware Planning for Heterogeneous Robot Teams using Dynamic Topological Graphs and Mixed-Integer Programming
Multi-robot planning and coordination in uncertain environments is a fundamental computational challenge, since the belief space increases exponentially with the number of robots. In this paper, we address the problem of planning in uncertain environments with a heterogeneous robot team of fast scout vehicles for information gathering and more risk-averse carrier robots from which the scouts vehicles are deployed. To overcome the computational challenges, we represent the environment and operational scenario using a topological graph, where the parameters of the edge weight distributions vary with the state of the robot team on the graph, and we formulate a computationally efficient mixed-integer program which removes the dependence on the number of robots from its decision space. Our formulation results in the capability to generate optimal multi-robot, long-horizon plans in seconds that could otherwise be computationally intractable. Ultimately our approach enables real-time re-planning, since the computation time is significantly faster than the time to execute one step. We evaluate our algorithm in a scenario where the robot team must traverse an environment while minimizing detection by observers in positions that are uncertain to the robot team. We demonstrate that our method is computationally tractable, can improve performance in the presence of imperfect information, and can be adjusted for different risk profiles.
comment: \copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Robotics 53
☆ CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity
Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction "Hang a mug on the mug tree" may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code interfaces with the perception module to produce 3D attention maps that highlight task-relevant regions by integrating spatial and semantic information, effectively resolving ambiguities in instructions. Through extensive experiments, we identify key limitations of current imitation learning methods, such as poor adaptation to language and environmental variations. We show that our approach excels across challenging manipulation tasks involving language ambiguity, contact-rich manipulation, and multi-object interactions.
comment: Accepted to Robotics: Science and Systems (RSS) 2025. The first three authors contributed equally. Project Page: https://robopil.github.io/code-diffuser/
☆ See What I Mean? Expressiveness and Clarity in Robot Display Design
Nonverbal visual symbols and displays play an important role in communication when humans and robots work collaboratively. However, few studies have investigated how different types of non-verbal cues affect objective task performance, especially in a dynamic environment that requires real time decision-making. In this work, we designed a collaborative navigation task where the user and the robot only had partial information about the map on each end and thus the users were forced to communicate with a robot to complete the task. We conducted our study in a public space and recruited 37 participants who randomly passed by our setup. Each participant collaborated with a robot utilizing either animated anthropomorphic eyes and animated icons, or static anthropomorphic eyes and static icons. We found that participants that interacted with a robot with animated displays reported the greatest level of trust and satisfaction; that participants interpreted static icons the best; and that participants with a robot with static eyes had the highest completion success. These results suggest that while animation can foster trust with robots, human-robot communication can be optimized by the addition of familiar static icons that may be easier for users to interpret. We published our code, designed symbols, and collected results online at: https://github.com/mattufts/huamn_Cozmo_interaction.
☆ History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation
Object Goal Navigation (ObjectNav) challenges robots to find objects in unseen environments, demanding sophisticated reasoning. While Vision-Language Models (VLMs) show potential, current ObjectNav methods often employ them superficially, primarily using vision-language embeddings for object-scene similarity checks rather than leveraging deeper reasoning. This limits contextual understanding and leads to practical issues like repetitive navigation behaviors. This paper introduces a novel zero-shot ObjectNav framework that pioneers the use of dynamic, history-aware prompting to more deeply integrate VLM reasoning into frontier-based exploration. Our core innovation lies in providing the VLM with action history context, enabling it to generate semantic guidance scores for navigation actions while actively avoiding decision loops. We also introduce a VLM-assisted waypoint generation mechanism for refining the final approach to detected objects. Evaluated on the HM3D dataset within Habitat, our approach achieves a 46% Success Rate (SR) and 24.8% Success weighted by Path Length (SPL). These results are comparable to state-of-the-art zero-shot methods, demonstrating the significant potential of our history-augmented VLM prompting strategy for more robust and context-aware robotic navigation.
☆ DRIVE Through the Unpredictability:From a Protocol Investigating Slip to a Metric Estimating Command Uncertainty
Off-road autonomous navigation is a challenging task as it is mainly dependent on the accuracy of the motion model. Motion model performances are limited by their ability to predict the interaction between the terrain and the UGV, which an onboard sensor can not directly measure. In this work, we propose using the DRIVE protocol to standardize the collection of data for system identification and characterization of the slip state space. We validated this protocol by acquiring a dataset with two platforms (from 75 kg to 470 kg) on six terrains (i.e., asphalt, grass, gravel, ice, mud, sand) for a total of 4.9 hours and 14.7 km. Using this data, we evaluate the DRIVE protocol's ability to explore the velocity command space and identify the reachable velocities for terrain-robot interactions. We investigated the transfer function between the command velocity space and the resulting steady-state slip for an SSMR. An unpredictability metric is proposed to estimate command uncertainty and help assess risk likelihood and severity in deployment. Finally, we share our lessons learned on running system identification on large UGV to help the community.
comment: This version is the preprint of a journal article with the same title, accepted in the IEEE Transactions on Field Robotics. To have a look at the early access version, use the following link https://ieeexplore.ieee.org/document/11037776
☆ Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control
World models enable robots to "imagine" future observations given current observations and planned actions, and have been increasingly adopted as generalized dynamics models to facilitate robot learning. Despite their promise, these models remain brittle when encountering novel visual distractors such as objects and background elements rarely seen during training. Specifically, novel distractors can corrupt action outcome predictions, causing downstream failures when robots rely on the world model imaginations for planning or action verification. In this work, we propose Reimagination with Observation Intervention (ReOI), a simple yet effective test-time strategy that enables world models to predict more reliable action outcomes in open-world scenarios where novel and unanticipated visual distractors are inevitable. Given the current robot observation, ReOI first detects visual distractors by identifying which elements of the scene degrade in physically implausible ways during world model prediction. Then, it modifies the current observation to remove these distractors and bring the observation closer to the training distribution. Finally, ReOI "reimagines" future outcomes with the modified observation and reintroduces the distractors post-hoc to preserve visual consistency for downstream planning and verification. We validate our approach on a suite of robotic manipulation tasks in the context of action verification, where the verifier needs to select desired action plans based on predictions from a world model. Our results show that ReOI is robust to both in-distribution and out-of-distribution visual distractors. Notably, it improves task success rates by up to 3x in the presence of novel distractors, significantly outperforming action verification that relies on world model predictions without imagination interventions.
☆ An Optimization-Augmented Control Framework for Single and Coordinated Multi-Arm Robotic Manipulation IROS 2025
Robotic manipulation demands precise control over both contact forces and motion trajectories. While force control is essential for achieving compliant interaction and high-frequency adaptation, it is limited to operations in close proximity to the manipulated object and often fails to maintain stable orientation during extended motion sequences. Conversely, optimization-based motion planning excels in generating collision-free trajectories over the robot's configuration space but struggles with dynamic interactions where contact forces play a crucial role. To address these limitations, we propose a multi-modal control framework that combines force control and optimization-augmented motion planning to tackle complex robotic manipulation tasks in a sequential manner, enabling seamless switching between control modes based on task requirements. Our approach decomposes complex tasks into subtasks, each dynamically assigned to one of three control modes: Pure optimization for global motion planning, pure force control for precise interaction, or hybrid control for tasks requiring simultaneous trajectory tracking and force regulation. This framework is particularly advantageous for bimanual and multi-arm manipulation, where synchronous motion and coordination among arms are essential while considering both the manipulated object and environmental constraints. We demonstrate the versatility of our method through a range of long-horizon manipulation tasks, including single-arm, bimanual, and multi-arm applications, highlighting its ability to handle both free-space motion and contact-rich manipulation with robustness and precision.
comment: 8 pages, 8 figures, accepted for oral presentation at IROS 2025. Supplementary site: https://sites.google.com/view/komo-force/home
☆ BIDA: A Bi-level Interaction Decision-making Algorithm for Autonomous Vehicles in Dynamic Traffic Scenarios
In complex real-world traffic environments, autonomous vehicles (AVs) need to interact with other traffic participants while making real-time and safety-critical decisions accordingly. The unpredictability of human behaviors poses significant challenges, particularly in dynamic scenarios, such as multi-lane highways and unsignalized T-intersections. To address this gap, we design a bi-level interaction decision-making algorithm (BIDA) that integrates interactive Monte Carlo tree search (MCTS) with deep reinforcement learning (DRL), aiming to enhance interaction rationality, efficiency and safety of AVs in dynamic key traffic scenarios. Specifically, we adopt three types of DRL algorithms to construct a reliable value network and policy network, which guide the online deduction process of interactive MCTS by assisting in value update and node selection. Then, a dynamic trajectory planner and a trajectory tracking controller are designed and implemented in CARLA to ensure smooth execution of planned maneuvers. Experimental evaluations demonstrate that our BIDA not only enhances interactive deduction and reduces computational costs, but also outperforms other latest benchmarks, which exhibits superior safety, efficiency and interaction rationality under varying traffic conditions.
comment: 6 pages, 3 figures, 4 tables, accepted for IEEE Intelligent Vehicles (IV) Symposium 2025
☆ Agile, Autonomous Spacecraft Constellations with Disruption Tolerant Networking to Monitor Precipitation and Urban Floods
Fully re-orientable small spacecraft are now supported by commercial technologies, allowing them to point their instruments in any direction and capture images, with short notice. When combined with improved onboard processing, and implemented on a constellation of inter-communicable satellites, this intelligent agility can significantly increase responsiveness to transient or evolving phenomena. We demonstrate a ground-based and onboard algorithmic framework that combines orbital mechanics, attitude control, inter-satellite communication, intelligent prediction and planning to schedule the time-varying, re-orientation of agile, small satellites in a constellation. Planner intelligence is improved by updating the predictive value of future space-time observations based on shared observations of evolving episodic precipitation and urban flood forecasts. Reliable inter-satellite communication within a fast, dynamic constellation topology is modeled in the physical, access control and network layer. We apply the framework on a representative 24-satellite constellation observing 5 global regions. Results show appropriately low latency in information exchange (average within 1/3rd available time for implicit consensus), enabling the onboard scheduler to observe ~7% more flood magnitude than a ground-based implementation. Both onboard and offline versions performed ~98% better than constellations without agility.
☆ eCAV: An Edge-Assisted Evaluation Platform for Connected Autonomous Vehicles
As autonomous vehicles edge closer to widespread adoption, enhancing road safety through collision avoidance and minimization of collateral damage becomes imperative. Vehicle-to-everything (V2X) technologies, which include vehicle-to-vehicle (V2V), vehicle-to-infrastructure (V2I), and vehicle-to-cloud (V2C), are being proposed as mechanisms to achieve this safety improvement. Simulation-based testing is crucial for early-stage evaluation of Connected Autonomous Vehicle (CAV) control systems, offering a safer and more cost-effective alternative to real-world tests. However, simulating large 3D environments with many complex single- and multi-vehicle sensors and controllers is computationally intensive. There is currently no evaluation framework that can effectively evaluate realistic scenarios involving large numbers of autonomous vehicles. We propose eCAV -- an efficient, modular, and scalable evaluation platform to facilitate both functional validation of algorithmic approaches to increasing road safety, as well as performance prediction of algorithms of various V2X technologies, including a futuristic Vehicle-to-Edge control plane and correspondingly designed control algorithms. eCAV can model up to 256 vehicles running individual control algorithms without perception enabled, which is $8\times$ more vehicles than what is possible with state-of-the-art alternatives. %faster than state-of-the-art alternatives that can simulate $8\times$ fewer vehicles. With perception enabled, eCAV simulates up to 64 vehicles with a step time under 800ms, which is $4\times$ more and $1.5\times$ faster than the state-of-the-art OpenCDA framework.
☆ Grounding Language Models with Semantic Digital Twins for Robotic Planning
We introduce a novel framework that integrates Semantic Digital Twins (SDTs) with Large Language Models (LLMs) to enable adaptive and goal-driven robotic task execution in dynamic environments. The system decomposes natural language instructions into structured action triplets, which are grounded in contextual environmental data provided by the SDT. This semantic grounding allows the robot to interpret object affordances and interaction rules, enabling action planning and real-time adaptability. In case of execution failures, the LLM utilizes error feedback and SDT insights to generate recovery strategies and iteratively revise the action plan. We evaluate our approach using tasks from the ALFRED benchmark, demonstrating robust performance across various household scenarios. The proposed framework effectively combines high-level reasoning with semantic environment understanding, achieving reliable task completion in the face of uncertainty and failure.
☆ Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining
Quadrupedal robots have demonstrated impressive locomotion capabilities in complex environments, but equipping them with autonomous versatile manipulation skills in a scalable way remains a significant challenge. In this work, we introduce a cross-embodiment imitation learning system for quadrupedal manipulation, leveraging data collected from both humans and LocoMan, a quadruped equipped with multiple manipulation modes. Specifically, we develop a teleoperation and data collection pipeline, which unifies and modularizes the observation and action spaces of the human and the robot. To effectively leverage the collected data, we propose an efficient modularized architecture that supports co-training and pretraining on structured modality-aligned data across different embodiments. Additionally, we construct the first manipulation dataset for the LocoMan robot, covering various household tasks in both unimanual and bimanual modes, supplemented by a corresponding human dataset. We validate our system on six real-world manipulation tasks, where it achieves an average success rate improvement of 41.9% overall and 79.7% under out-of-distribution (OOD) settings compared to the baseline. Pretraining with human data contributes a 38.6% success rate improvement overall and 82.7% under OOD settings, enabling consistently better performance with only half the amount of robot data. Our code, hardware, and data are open-sourced at: https://human2bots.github.io.
☆ Full-Pose Tracking via Robust Control for Over-Actuated Multirotors
This paper presents a robust cascaded control architecture for over-actuated multirotors. It extends the Incremental Nonlinear Dynamic Inversion (INDI) control combined with structured H_inf control, initially proposed for under-actuated multirotors, to a broader range of multirotor configurations. To achieve precise and robust attitude and position tracking, we employ a weighted least-squares geometric guidance control allocation method, formulated as a quadratic optimization problem, enabling full-pose tracking. The proposed approach effectively addresses key challenges, such as preventing infeasible pose references and enhancing robustness against disturbances, as well as considering multirotor's actual physical limitations. Numerical simulations with an over-actuated hexacopter validate the method's effectiveness, demonstrating its adaptability to diverse mission scenarios and its potential for real-world aerial applications.
☆ IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
☆ CSC-MPPI: A Novel Constrained MPPI Framework with DBSCAN for Reliable Obstacle Avoidance
This paper proposes Constrained Sampling Cluster Model Predictive Path Integral (CSC-MPPI), a novel constrained formulation of MPPI designed to enhance trajectory optimization while enforcing strict constraints on system states and control inputs. Traditional MPPI, which relies on a probabilistic sampling process, often struggles with constraint satisfaction and generates suboptimal trajectories due to the weighted averaging of sampled trajectories. To address these limitations, the proposed framework integrates a primal-dual gradient-based approach and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to steer sampled input trajectories into feasible regions while mitigating risks associated with weighted averaging. First, to ensure that sampled trajectories remain within the feasible region, the primal-dual gradient method is applied to iteratively shift sampled inputs while enforcing state and control constraints. Then, DBSCAN groups the sampled trajectories, enabling the selection of representative control inputs within each cluster. Finally, among the representative control inputs, the one with the lowest cost is chosen as the optimal action. As a result, CSC-MPPI guarantees constraint satisfaction, improves trajectory selection, and enhances robustness in complex environments. Simulation and real-world experiments demonstrate that CSC-MPPI outperforms traditional MPPI in obstacle avoidance, achieving improved reliability and efficiency. The experimental videos are available at https://cscmppi.github.io
☆ Comparison between External and Internal Single Stage Planetary gearbox actuators for legged robots
Legged robots, such as quadrupeds and humanoids, require high-performance actuators for efficient locomotion. Quasi-Direct-Drive (QDD) actuators with single-stage planetary gearboxes offer low inertia, high efficiency, and transparency. Among planetary gearbox architectures, Internal (ISSPG) and External Single-Stage Planetary Gearbox (ESSPG) are the two predominant designs. While ISSPG is often preferred for its compactness and high torque density at certain gear ratios, no objective comparison between the two architectures exists. Additionally, existing designs rely on heuristics rather than systematic optimization. This paper presents a design framework for optimally selecting actuator parameters based on given performance requirements and motor specifications. Using this framework, we generate and analyze various optimized gearbox designs for both architectures. Our results demonstrate that for the T-motor U12, ISSPG is the superior choice within the lower gear ratio range of 5:1 to 7:1, offering a lighter design. However, for gear ratios exceeding 7:1, ISSPG becomes infeasible, making ESSPG the better option in the 7:1 to 11:1 range. To validate our approach, we designed and optimized two actuators for manufacturing: an ISSPG with a 6.0:1 gear ratio and an ESSPG with a 7.2:1 gear ratio. Their respective masses closely align with our optimization model predictions, confirming the effectiveness of our methodology.
comment: 6 pages, 5 figures, Accepted at Advances in Robotics 2025
☆ Goal-conditioned Hierarchical Reinforcement Learning for Sample-efficient and Safe Autonomous Driving at Intersections
Reinforcement learning (RL) exhibits remarkable potential in addressing autonomous driving tasks. However, it is difficult to train a sample-efficient and safe policy in complex scenarios. In this article, we propose a novel hierarchical reinforcement learning (HRL) framework with a goal-conditioned collision prediction (GCCP) module. In the hierarchical structure, the GCCP module predicts collision risks according to different potential subgoals of the ego vehicle. A high-level decision-maker choose the best safe subgoal. A low-level motion-planner interacts with the environment according to the subgoal. Compared to traditional RL methods, our algorithm is more sample-efficient, since its hierarchical structure allows reusing the policies of subgoals across similar tasks for various navigation scenarios. In additional, the GCCP module's ability to predict both the ego vehicle's and surrounding vehicles' future actions according to different subgoals, ensures the safety of the ego vehicle throughout the decision-making process. Experimental results demonstrate that the proposed method converges to an optimal policy faster and achieves higher safety than traditional RL methods.
☆ M-Predictive Spliner: Enabling Spatiotemporal Multi-Opponent Overtaking for Autonomous Racing
Unrestricted multi-agent racing presents a significant research challenge, requiring decision-making at the limits of a robot's operational capabilities. While previous approaches have either ignored spatiotemporal information in the decision-making process or been restricted to single-opponent scenarios, this work enables arbitrary multi-opponent head-to-head racing while considering the opponents' future intent. The proposed method employs a KF-based multi-opponent tracker to effectively perform opponent ReID by associating them across observations. Simultaneously, spatial and velocity GPR is performed on all observed opponent trajectories, providing predictive information to compute the overtaking maneuvers. This approach has been experimentally validated on a physical 1:10 scale autonomous racing car, achieving an overtaking success rate of up to 91.65% and demonstrating an average 10.13%-point improvement in safety at the same speed as the previous SotA. These results highlight its potential for high-performance autonomous racing.
☆ Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images
Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at https://github.com/zhaoyiww/fusion4landslide.
comment: 20 pages, 16 figures. Preprint under peer review. Example data and code available at [GitHub](https://github.com/zhaoyiww/fusion4landslide)
☆ CapsDT: Diffusion-Transformer for Capsule Robot Manipulation IROS 2025
Vision-Language-Action (VLA) models have emerged as a prominent research area, showcasing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly endoscopy capsule robots that perform actions within the digestive system, remains unexplored. The integration of VLA models into endoscopy robots allows more intuitive and efficient interactions between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs, and textual instructions, CapsDT can infer corresponding robotic control signals to facilitate endoscopy tasks. In addition, we developed a capsule endoscopy robot system, a capsule robot controlled by a robotic arm-held magnet, addressing different levels of four endoscopy tasks and creating corresponding capsule robot datasets within the stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, achieving state-of-the-art performance in various levels of endoscopy tasks while achieving a 26.25% success rate in real-world simulation manipulation.
comment: IROS 2025
☆ Probabilistic Collision Risk Estimation for Pedestrian Navigation
Intelligent devices for supporting persons with vision impairment are becoming more widespread, but they are lacking behind the advancements in intelligent driver assistant system. To make a first step forward, this work discusses the integration of the risk model technology, previously used in autonomous driving and advanced driver assistance systems, into an assistance device for persons with vision impairment. The risk model computes a probabilistic collision risk given object trajectories which has previously been shown to give better indications of an object's collision potential compared to distance or time-to-contact measures in vehicle scenarios. In this work, we show that the risk model is also superior in warning persons with vision impairment about dangerous objects. Our experiments demonstrate that the warning accuracy of the risk model is 67% while both distance and time-to-contact measures reach only 51% accuracy for real-world data.
☆ ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models
Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations -- a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
comment: Website: https://controlvla.github.io
☆ FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation
Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which allows adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we integrate state space models to integrate multimodal information, while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, simplifying the learning process while maintaining performance. We verify the effectiveness of the FlowRAM in the RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in less than 4 time steps, significantly increasing inference speed.
☆ Single-Microphone-Based Sound Source Localization for Mobile Robots in Reverberant Environments IROS
Accurately estimating sound source positions is crucial for robot audition. However, existing sound source localization methods typically rely on a microphone array with at least two spatially preconfigured microphones. This requirement hinders the applicability of microphone-based robot audition systems and technologies. To alleviate these challenges, we propose an online sound source localization method that uses a single microphone mounted on a mobile robot in reverberant environments. Specifically, we develop a lightweight neural network model with only 43k parameters to perform real-time distance estimation by extracting temporal information from reverberant signals. The estimated distances are then processed using an extended Kalman filter to achieve online sound source localization. To the best of our knowledge, this is the first work to achieve online sound source localization using a single microphone on a moving robot, a gap that we aim to fill in this work. Extensive experiments demonstrate the effectiveness and merits of our approach. To benefit the broader research community, we have open-sourced our code at https://github.com/JiangWAV/single-mic-SSL.
comment: This paper was accepted and going to appear in the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
☆ From Theory to Practice: Identifying the Optimal Approach for Offset Point Tracking in the Context of Agricultural Robotics ICRA
Modern agriculture faces escalating challenges: increasing demand for food, labor shortages, and the urgent need to reduce environmental impact. Agricultural robotics has emerged as a promising response to these pressures, enabling the automation of precise and suitable field operations. In particular, robots equipped with implements for tasks such as weeding or sowing must interact delicately and accurately with the crops and soil. Unlike robots in other domains, these agricultural platforms typically use rigidly mounted implements, where the implement's position is more critical than the robot's center in determining task success. Yet, most control strategies in the literature focus on the vehicle body, often neglecting the acctual working point of the system. This is particularly important when considering new agriculture practices where crops row are not necessary straights. This paper presents a predictive control strategy targeting the implement's reference point. The method improves tracking performance by anticipating the motion of the implement, which, due to its offset from the vehicle's center of rotation, is prone to overshooting during turns if not properly accounted for.
comment: Presented at the 2025 IEEE ICRA Workshop on Field Robotics
☆ Investigating Lagrangian Neural Networks for Infinite Horizon Planning in Quadrupedal Locomotion
Lagrangian Neural Networks (LNNs) present a principled and interpretable framework for learning the system dynamics by utilizing inductive biases. While traditional dynamics models struggle with compounding errors over long horizons, LNNs intrinsically preserve the physical laws governing any system, enabling accurate and stable predictions essential for sustainable locomotion. This work evaluates LNNs for infinite horizon planning in quadrupedal robots through four dynamics models: (1) full-order forward dynamics (FD) training and inference, (2) diagonalized representation of Mass Matrix in full order FD, (3) full-order inverse dynamics (ID) training with FD inference, (4) reduced-order modeling via torso centre-of-mass (CoM) dynamics. Experiments demonstrate that LNNs bring improvements in sample efficiency (10x) and superior prediction accuracy (up to 2-10x) compared to baseline methods. Notably, the diagonalization approach of LNNs reduces computational complexity while retaining some interpretability, enabling real-time receding horizon control. These findings highlight the advantages of LNNs in capturing the underlying structure of system dynamics in quadrupeds, leading to improved performance and efficiency in locomotion planning and control. Additionally, our approach achieves a higher control frequency than previous LNN methods, demonstrating its potential for real-world deployment on quadrupeds.
comment: 6 pages, 5 figures, Accepted at Advances in Robotics (AIR) Conference 2025
☆ Noise Fusion-based Distillation Learning for Anomaly Detection in Complex Industrial Environments IROS 2025
Anomaly detection and localization in automated industrial manufacturing can significantly enhance production efficiency and product quality. Existing methods are capable of detecting surface defects in pre-defined or controlled imaging environments. However, accurately detecting workpiece defects in complex and unstructured industrial environments with varying views, poses and illumination remains challenging. We propose a novel anomaly detection and localization method specifically designed to handle inputs with perturbative patterns. Our approach introduces a new framework based on a collaborative distillation heterogeneous teacher network (HetNet), an adaptive local-global feature fusion module, and a local multivariate Gaussian noise generation module. HetNet can learn to model the complex feature distribution of normal patterns using limited information about local disruptive changes. We conducted extensive experiments on mainstream benchmarks. HetNet demonstrates superior performance with approximately 10% improvement across all evaluation metrics on MSC-AD under industrial conditions, while achieving state-of-the-art results on other datasets, validating its resilience to environmental fluctuations and its capability to enhance the reliability of industrial anomaly detection systems across diverse scenarios. Tests in real-world environments further confirm that HetNet can be effectively integrated into production lines to achieve robust and real-time anomaly detection. Codes, images and videos are published on the project website at: https://zihuatanejoyu.github.io/HetNet/
comment: IROS 2025 Oral
☆ Human-Centered Shared Autonomy for Motor Planning, Learning, and Control Applications
With recent advancements in AI and computational tools, intelligent paradigms have emerged to enhance fields like shared autonomy and human-machine teaming in healthcare. Advanced AI algorithms (e.g., reinforcement learning) can autonomously make decisions to achieve planning and motion goals. However, in healthcare, where human intent is crucial, fully independent machine decisions may not be ideal. This chapter presents a comprehensive review of human-centered shared autonomy AI frameworks, focusing on upper limb biosignal-based machine interfaces and associated motor control systems, including computer cursors, robotic arms, and planar platforms. We examine motor planning, learning (rehabilitation), and control, covering conceptual foundations of human-machine teaming in reach-and-grasp tasks and analyzing both theoretical and practical implementations. Each section explores how human and machine inputs can be blended for shared autonomy in healthcare applications. Topics include human factors, biosignal processing for intent detection, shared autonomy in brain-computer interfaces (BCI), rehabilitation, assistive robotics, and Large Language Models (LLMs) as the next frontier. We propose adaptive shared autonomy AI as a high-performance paradigm for collaborative human-AI systems, identify key implementation challenges, and outline future directions, particularly regarding AI reasoning agents. This analysis aims to bridge neuroscientific insights with robotics to create more intuitive, effective, and ethical human-machine teaming frameworks.
☆ EndoMUST: Monocular Depth Estimation for Robotic Endoscopy via End-to-end Multi-step Self-supervised Training IROS 2025
Monocular depth estimation and ego-motion estimation are significant tasks for scene perception and navigation in stable, accurate and efficient robot-assisted endoscopy. To tackle lighting variations and sparse textures in endoscopic scenes, multiple techniques including optical flow, appearance flow and intrinsic image decomposition have been introduced into the existing methods. However, the effective training strategy for multiple modules are still critical to deal with both illumination issues and information interference for self-supervised depth estimation in endoscopy. Therefore, a novel framework with multistep efficient finetuning is proposed in this work. In each epoch of end-to-end training, the process is divided into three steps, including optical flow registration, multiscale image decomposition and multiple transformation alignments. At each step, only the related networks are trained without interference of irrelevant information. Based on parameter-efficient finetuning on the foundation model, the proposed method achieves state-of-the-art performance on self-supervised depth estimation on SCARED dataset and zero-shot depth estimation on Hamlyn dataset, with 4\%$\sim$10\% lower error. The evaluation code of this work has been published on https://github.com/BaymaxShao/EndoMUST.
comment: Accepted by IROS 2025
☆ DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning
Developing embodied agents capable of performing complex interactive tasks in real-world scenarios remains a fundamental challenge in embodied AI. Although recent advances in simulation platforms have greatly enhanced task diversity to train embodied Vision Language Models (VLMs), most platforms rely on simplified robot morphologies and bypass the stochastic nature of low-level execution, which limits their transferability to real-world robots. To address these issues, we present a physics-based simulation platform DualTHOR for complex dual-arm humanoid robots, built upon an extended version of AI2-THOR. Our simulator includes real-world robot assets, a task suite for dual-arm collaboration, and inverse kinematics solvers for humanoid robots. We also introduce a contingency mechanism that incorporates potential failures through physics-based low-level execution, bridging the gap to real-world scenarios. Our simulator enables a more comprehensive evaluation of the robustness and generalization of VLMs in household environments. Extensive evaluations reveal that current VLMs struggle with dual-arm coordination and exhibit limited robustness in realistic environments with contingencies, highlighting the importance of using our simulator to develop more capable VLMs for embodied tasks. The code is available at https://github.com/ds199895/DualTHOR.git.
☆ Quantum Artificial Intelligence for Secure Autonomous Vehicle Navigation: An Architectural Proposal
Navigation is a very crucial aspect of autonomous vehicle ecosystem which heavily relies on collecting and processing large amounts of data in various states and taking a confident and safe decision to define the next vehicle maneuver. In this paper, we propose a novel architecture based on Quantum Artificial Intelligence by enabling quantum and AI at various levels of navigation decision making and communication process in Autonomous vehicles : Quantum Neural Networks for multimodal sensor fusion, Nav-Q for Quantum reinforcement learning for navigation policy optimization and finally post-quantum cryptographic protocols for secure communication. Quantum neural networks uses quantum amplitude encoding to fuse data from various sensors like LiDAR, radar, camera, GPS and weather etc., This approach gives a unified quantum state representation between heterogeneous sensor modalities. Nav-Q module processes the fused quantum states through variational quantum circuits to learn optimal navigation policies under swift dynamic and complex conditions. Finally, post quantum cryptographic protocols are used to secure communication channels for both within vehicle communication and V2X (Vehicle to Everything) communications and thus secures the autonomous vehicle communication from both classical and quantum security threats. Thus, the proposed framework addresses fundamental challenges in autonomous vehicles navigation by providing quantum performance and future proof security. Index Terms Quantum Computing, Autonomous Vehicles, Sensor Fusion
comment: 5 pages, 2 figures, 17 references. Architectural proposal for quantum AI integration in autonomous vehicle navigation systems for secured navigation
☆ Adversarial Attacks and Detection in Visual Place Recognition for Safer Robot Navigation
Stand-alone Visual Place Recognition (VPR) systems have little defence against a well-designed adversarial attack, which can lead to disastrous consequences when deployed for robot navigation. This paper extensively analyzes the effect of four adversarial attacks common in other perception tasks and four novel VPR-specific attacks on VPR localization performance. We then propose how to close the loop between VPR, an Adversarial Attack Detector (AAD), and active navigation decisions by demonstrating the performance benefit of simulated AADs in a novel experiment paradigm -- which we detail for the robotics community to use as a system framework. In the proposed experiment paradigm, we see the addition of AADs across a range of detection accuracies can improve performance over baseline; demonstrating a significant improvement -- such as a ~50% reduction in the mean along-track localization error -- can be achieved with True Positive and False Positive detection rates of only 75% and up to 25% respectively. We examine a variety of metrics including: Along-Track Error, Percentage of Time Attacked, Percentage of Time in an `Unsafe' State, and Longest Continuous Time Under Attack. Expanding further on these results, we provide the first investigation into the efficacy of the Fast Gradient Sign Method (FGSM) adversarial attack for VPR. The analysis in this work highlights the need for AADs in real-world systems for trustworthy navigation, and informs quantitative requirements for system design.
☆ A Low-Cost Portable Lidar-based Mobile Mapping System on an Android Smartphone SP
The rapid advancement of the metaverse, digital twins, and robotics underscores the demand for low-cost, portable mapping systems for reality capture. Current mobile solutions, such as the Leica BLK2Go and lidar-equipped smartphones, either come at a high cost or are limited in range and accuracy. Leveraging the proliferation and technological evolution of mobile devices alongside recent advancements in lidar technology, we introduce a novel, low-cost, portable mobile mapping system. Our system integrates a lidar unit, an Android smartphone, and an RTK-GNSS stick. Running on the Android platform, it features lidar-inertial odometry built with the NDK, and logs data from the lidar, wide-angle camera, IMU, and GNSS. With a total bill of materials (BOM) cost under 2,000 USD and a weight of about 1 kilogram, the system achieves a good balance between affordability and portability. We detail the system design, multisensor calibration, synchronization, and evaluate its performance for tracking and mapping. To further contribute to the community, the system's design and software are made open source at: https://github.com/OSUPCVLab/marslogger_android/releases/tag/v2.1
comment: ISPRS GSW2025 Dubai UAE
☆ Contactless Precision Steering of Particles in a Fluid inside a Cube with Rotating Walls
Contactless manipulation of small objects is essential for biomedical and chemical applications, such as cell analysis, assisted fertilisation, and precision chemistry. Established methods, including optical, acoustic, and magnetic tweezers, are now complemented by flow control techniques that use flow-induced motion to enable precise and versatile manipulation. However, trapping multiple particles in fluid remains a challenge. This study introduces a novel control algorithm capable of steering multiple particles in flow. The system uses rotating disks to generate flow fields that transport particles to precise locations. Disk rotations are governed by a feedback control policy based on the Optimising a Discrete Loss (ODIL) framework, which combines fluid dynamics equations with path objectives into a single loss function. Our experiments, conducted in both simulations and with the physical device, demonstrate the capability of the approach to transport two beads simultaneously to predefined locations, advancing robust contactless particle manipulation for biomedical applications.
☆ ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation
Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.
☆ KARL: Kalman-Filter Assisted Reinforcement Learner for Dynamic Object Tracking and Grasping
We present Kalman-filter Assisted Reinforcement Learner (KARL) for dynamic object tracking and grasping over eye-on-hand (EoH) systems, significantly expanding such systems capabilities in challenging, realistic environments. In comparison to the previous state-of-the-art, KARL (1) incorporates a novel six-stage RL curriculum that doubles the system's motion range, thereby greatly enhancing the system's grasping performance, (2) integrates a robust Kalman filter layer between the perception and reinforcement learning (RL) control modules, enabling the system to maintain an uncertain but continuous 6D pose estimate even when the target object temporarily exits the camera's field-of-view or undergoes rapid, unpredictable motion, and (3) introduces mechanisms to allow retries to gracefully recover from unavoidable policy execution failures. Extensive evaluations conducted in both simulation and real-world experiments qualitatively and quantitatively corroborate KARL's advantage over earlier systems, achieving higher grasp success rates and faster robot execution speed. Source code and supplementary materials for KARL will be made available at: https://github.com/arc-l/karl.
♻ ☆ Context Matters: Learning Generalizable Rewards via Calibrated Features
A key challenge in reward learning from human input is that desired agent behavior often changes based on context. Traditional methods typically treat each new context as a separate task with its own reward function. For example, if a previously ignored stove becomes too hot to be around, the robot must learn a new reward from scratch, even though the underlying preference for prioritizing safety over efficiency remains unchanged. We observe that context influences not the underlying preference itself, but rather the $\textit{saliency}$--or importance--of reward features. For instance, stove heat affects the importance of the robot's proximity, yet the human's safety preference stays the same. Existing multi-task and meta IRL methods learn context-dependent representations $\textit{implicitly}$--without distinguishing between preferences and feature importance--resulting in substantial data requirements. Instead, we propose $\textit{explicitly}$ modeling context-invariant preferences separately from context-dependent feature saliency, creating modular reward representations that adapt to new contexts. To achieve this, we introduce $\textit{calibrated features}$--representations that capture contextual effects on feature saliency--and present specialized paired comparison queries that isolate saliency from preference for efficient learning. Experiments with simulated users show our method significantly improves sample efficiency, requiring 10x fewer preference queries than baselines to achieve equivalent reward accuracy, with up to 15% better performance in low-data regimes (5-10 queries). An in-person user study (N=12) demonstrates that participants can effectively teach their unique personal contextual preferences using our method, enabling more adaptable and personalized reward learning.
comment: 30 pages, 21 figures
♻ ☆ AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment
Current service robots suffer from limited natural language communication abilities, heavy reliance on predefined commands, ongoing human intervention, and, most notably, a lack of proactive collaboration awareness in human-populated environments. This results in narrow applicability and low utility. In this paper, we introduce AssistantX, an LLM-powered proactive assistant designed for autonomous operation in realworld scenarios with high accuracy. AssistantX employs a multi-agent framework consisting of 4 specialized LLM agents, each dedicated to perception, planning, decision-making, and reflective review, facilitating advanced inference capabilities and comprehensive collaboration awareness, much like a human assistant by your side. We built a dataset of 210 real-world tasks to validate AssistantX, which includes instruction content and status information on whether relevant personnel are available. Extensive experiments were conducted in both text-based simulations and a real office environment over the course of a month and a half. Our experiments demonstrate the effectiveness of the proposed framework, showing that AssistantX can reactively respond to user instructions, actively adjust strategies to adapt to contingencies, and proactively seek assistance from humans to ensure successful task completion. More details and videos can be found at https://assistantx-agent.github.io/AssistantX/.
comment: 8 pages, 10 figures, 6 tables
♻ ☆ ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes
Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization. We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop target-oriented dexterous grasping in cluttered scenes. The framework features a teacher policy trained in simulation using clutter density curriculum learning, incorporating both a geometry and spatially-embedded scene representation and a novel comprehensive safety curriculum, enabling general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher's knowledge into a student 3D diffusion policy (DP3) that operates on partial point cloud observations. To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts. More details and videos are available at https://clutterdexgrasp.github.io/.
♻ ☆ LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction
Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, yet most existing systems assume an accurate low-level controller with hand-crafted action "vocabulary" such as end-effector pose or root velocity. This assumption confines prior work to quasi-static tasks and precludes the agile, whole-body behaviors required by humanoid whole-body control (WBC) tasks. To capture this gap in the literature, we start by introducing the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories. We then propose LeVERB: Latent Vision-Language-Encoded Robot Behavior, a hierarchical latent instruction-following framework for humanoid vision-language WBC, the first of its kind. At the top level, a vision-language policy learns a latent action vocabulary from synthetically rendered kinematic demonstrations; at the low level, a reinforcement-learned WBC policy consumes these latent verbs to generate dynamics-level commands. In our benchmark, LeVERB can zero-shot attain a 80% success rate on simple visual navigation tasks, and 58.5% success rate overall, outperforming naive hierarchical whole-body VLA implementation by 7.8 times.
comment: https://ember-lab-berkeley.github.io/LeVERB-Website/
♻ ☆ From Movement Primitives to Distance Fields to Dynamical Systems
Developing autonomous robots capable of learning and reproducing complex motions from demonstrations remains a fundamental challenge in robotics. On the one hand, movement primitives (MPs) provide a compact and modular representation of continuous trajectories. On the other hand, autonomous systems provide control policies that are time independent. We propose in this paper a simple and flexible approach that gathers the advantages of both representations by transforming MPs into autonomous systems. The key idea is to transform the explicit representation of a trajectory as an implicit shape encoded as a distance field. This conversion from a time-dependent motion to a spatial representation enables the definition of an autonomous dynamical system with modular reactions to perturbation. Asymptotic stability guarantees are provided by using Bernstein basis functions in the MPs, representing trajectories as concatenated quadratic B\'ezier curves, which provide an analytical method for computing distance fields. This approach bridges conventional MPs with distance fields, ensuring smooth and precise motion encoding, while maintaining a continuous spatial representation. By simply leveraging the analytic gradients of the curve and its distance field, a stable dynamical system can be computed to reproduce the demonstrated trajectories while handling perturbations, without requiring a model of the dynamical system to be estimated. Numerical simulations and real-world robotic experiments validate our method's ability to encode complex motion patterns while ensuring trajectory stability, together with the flexibility of designing the desired reaction to perturbations. An interactive project page demonstrating our approach is available at https://mp-df-ds.github.io/.
comment: 8 pages, 8 Figures
♻ ☆ Your Ride, Your Rules: Psychology and Cognition Enabled Automated Driving Systems
Despite rapid advances in autonomous driving technology, current autonomous vehicles (AVs) lack effective bidirectional human-machine communication, limiting their ability to personalize the riding experience and recover from uncertain or immobilized states. This limitation undermines occupant comfort and trust, potentially hindering the adoption of AV technologies. We propose PACE-ADS (Psychology and Cognition Enabled Automated Driving Systems), a human-centered autonomy framework enabling AVs to sense, interpret, and respond to both external traffic conditions and internal occupant states. PACE-ADS uses an agentic workflow where three foundation model agents collaborate: the Driver Agent interprets the external environment; the Psychologist Agent decodes passive psychological signals (e.g., EEG, heart rate, facial expressions) and active cognitive inputs (e.g., verbal commands); and the Coordinator Agent synthesizes these inputs to generate high-level decisions that enhance responsiveness and personalize the ride. PACE-ADS complements, rather than replaces, conventional AV modules. It operates at the semantic planning layer, while delegating low-level control to native systems. The framework activates only when changes in the rider's psychological state are detected or when occupant instructions are issued. It integrates into existing AV platforms with minimal adjustments, positioning PACE-ADS as a scalable enhancement. We evaluate it in closed-loop simulations across diverse traffic scenarios, including intersections, pedestrian interactions, work zones, and car-following. Results show improved ride comfort, dynamic behavioral adjustment, and safe recovery from edge-case scenarios via autonomous reasoning or rider input. PACE-ADS bridges the gap between technical autonomy and human-centered mobility.
comment: 10 figures,13 pages, two colummns
♻ ☆ A Survey of World Models for Autonomous Driving
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling, fundamentally transforming how vehicles interpret dynamic scenes and execute safe decision-making. World models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving, proposing a three-tiered taxonomy: (i) Generation of Future Physical World, covering Image-, BEV-, OG-, and PC-based generation methods that enhance scene evolution modeling through diffusion models and 4D occupancy forecasting; (ii) Behavior Planning for Intelligent Agents, combining rule-driven and learning-based paradigms with cost map optimization and reinforcement learning for trajectory generation in complex traffic conditions; (ii) Interaction between Prediction and Planning, achieving multi-agent collaborative decision-making through latent space diffusion and memory-augmented architectures. The study further analyzes training paradigms, including self-supervised learning, multimodal pretraining, and generative data augmentation, while evaluating world models' performance in scene understanding and motion prediction tasks. Future research must address key challenges in self-supervised representation learning, long-tail scenario generation, and multimodal fusion to advance the practical deployment of world models in complex urban environments. Overall, the comprehensive analysis provides a technical roadmap for harnessing the transformative potential of world models in advancing safe and reliable autonomous driving solutions.
comment: Ongoing project. Paper list: https://github.com/FengZicai/AwesomeWMAD; Benchmark: https://github.com/FengZicai/WMAD-Benchmarks
♻ ☆ Learning Attentive Neural Processes for Planning with Pushing Actions
Our goal is to enable robots to plan sequences of tabletop actions to push a block with unknown physical properties to a desired goal pose. We approach this problem by learning the constituent models of a Partially-Observable Markov Decision Process (POMDP), where the robot can observe the outcome of a push, but the physical properties of the block that govern the dynamics remain unknown. A common solution approach is to train an observation model in a supervised fashion, and do inference with a general inference technique such as particle filters. However, supervised training requires knowledge of the relevant physical properties that determine the problem dynamics, which we do not assume to be known. Planning also requires simulating many belief updates, which becomes expensive when using particle filters to represent the belief. We propose to learn an Attentive Neural Process that computes the belief over a learned latent representation of the relevant physical properties given a history of actions. To address the pushing planning problem, we integrate a trained Neural Process with a double-progressive widening sampling strategy. Simulation results indicate that Neural Process Tree with Double Progressive Widening (NPT-DPW) generates better-performing plans faster than traditional particle-filter methods that use a supervised-trained observation model, even in complex pushing scenarios.
comment: To be presented at RSS 2025 RoboReps Workshop
♻ ☆ An Iterative Task-Driven Framework for Resilient LiDAR Place Recognition in Adverse Weather
LiDAR place recognition (LPR) plays a vital role in autonomous navigation. However, existing LPR methods struggle to maintain robustness under adverse weather conditions such as rain, snow, and fog, where weather-induced noise and point cloud degradation impair LiDAR reliability and perception accuracy. To tackle these challenges, we propose an Iterative Task-Driven Framework (ITDNet), which integrates a LiDAR Data Restoration (LDR) module and a LiDAR Place Recognition (LPR) module through an iterative learning strategy. These modules are jointly trained end-to-end, with alternating optimization to enhance performance. The core rationale of ITDNet is to leverage the LDR module to recover the corrupted point clouds while preserving structural consistency with clean data, thereby improving LPR accuracy in adverse weather. Simultaneously, the LPR task provides feature pseudo-labels to guide the LDR module's training, aligning it more effectively with the LPR task. To achieve this, we first design a task-driven LPR loss and a reconstruction loss to jointly supervise the optimization of the LDR module. Furthermore, for the LDR module, we propose a Dual-Domain Mixer (DDM) block for frequency-spatial feature fusion and a Semantic-Aware Generator (SAG) block for semantic-guided restoration. In addition, for the LPR module, we introduce a Multi-Frequency Transformer (MFT) block and a Wavelet Pyramid NetVLAD (WPN) block to aggregate multi-scale, robust global descriptors. Finally, extensive experiments on Weather-KITTI, Boreas, and our proposed Weather-Apollo datasets demonstrate that, ITDNet outperforms existing LPR methods, achieving state-of-the-art performance in adverse weather.
comment: Submitted to IEEE TVT
♻ ☆ Dual-arm Motion Generation for Repositioning Care based on Deep Predictive Learning with Somatosensory Attention Mechanism
Caregiving is a vital role for domestic robots, especially the repositioning care has immense societal value, critically improving the health and quality of life of individuals with limited mobility. However, repositioning task is a challenging area of research, as it requires robots to adapt their motions while interacting flexibly with patients. The task involves several key challenges: (1) applying appropriate force to specific target areas; (2) performing multiple actions seamlessly, each requiring different force application policies; and (3) motion adaptation under uncertain positional conditions. To address these, we propose a deep neural network (DNN)-based architecture utilizing proprioceptive and visual attention mechanisms, along with impedance control to regulate the robot's movements. Using the dual-arm humanoid robot Dry-AIREC, the proposed model successfully generated motions to insert the robot's hand between the bed and a mannequin's back without applying excessive force, and it supported the transition from a supine to a lifted-up position. The project page is here: https://sites.google.com/view/caregiving-robot-airec/repositioning
♻ ☆ JammingSnake: A follow-the-leader continuum robot with variable stiffness based on fiber jamming
Follow-the-leader (FTL) motion is essential for continuum robots operating in fragile and confined environments. It allows the robot to exert minimal force on its surroundings, reducing the risk of damage. This paper presents a novel design of a snake-like robot capable of achieving FTL motion by integrating fiber jamming modules (FJMs). The proposed robot can dynamically adjust its stiffness during propagation and interaction with the environment. An algorithm is developed to independently control the tendon and FJM insertion movements, allowing the robot to maintain its shape while minimizing the forces exerted on surrounding structures. To validate the proposed design, comparative tests were conducted between a traditional tendon-driven robot and the novel design under different configurations. The results demonstrate that our design relies significantly less on contact with the surroundings to maintain its shape. This highlights its potential for safer and more effective operations in delicate environments, such as minimally invasive surgery (MIS) or industrial in-situ inspection.
comment: 8 pages, 4 figures, published in T-MECH
♻ ☆ General Force Sensation for Tactile Robot
Robotic tactile sensors, including vision-based and taxel-based sensors, enable agile manipulation and safe human-robot interaction through force sensation. However, variations in structural configurations, measured signals, and material properties create domain gaps that limit the transferability of learned force sensation across different tactile sensors. Here, we introduce GenForce, a general framework for achieving transferable force sensation across both homogeneous and heterogeneous tactile sensors in robotic systems. By unifying tactile signals into marker-based binary tactile images, GenForce enables the transfer of existing force labels to arbitrary target sensors using a marker-to-marker translation technique with a few paired data. This process equips uncalibrated tactile sensors with force prediction capabilities through spatiotemporal force prediction models trained on the transferred data. Extensive experimental results validate GenForce's generalizability, accuracy, and robustness across sensors with diverse marker patterns, structural designs, material properties, and sensing principles. The framework significantly reduces the need for costly and labor-intensive labeled data collection, enabling the rapid deployment of multiple tactile sensors on robotic hands requiring force sensing capabilities.
♻ ☆ Traversability-aware path planning in dynamic environments
Planning in environments with moving obstacles remains a significant challenge in robotics. While many works focus on navigation and path planning in obstacle-dense spaces, traversing such congested regions is often avoidable by selecting alternative routes. This paper presents Traversability-aware FMM (Tr-FMM), a path planning method that computes paths in dynamic environments, avoiding crowded regions. The method operates in two steps: first, it discretizes the environment, identifying regions and their distribution; second, it computes the traversability of regions, aiming to minimize both obstacle risks and goal deviation. The path is then computed by propagating the wavefront through regions with higher traversability. Simulated and real-world experiments demonstrate that the approach enhances significant safety by keeping the robot away from regions with obstacles while reducing unnecessary deviations from the goal.
♻ ☆ Enhanced Trust Region Sequential Convex Optimization for Multi-Drone Thermal Screening Trajectory Planning in Urban Environments
The rapid detection of abnormal body temperatures in urban populations is essential for managing public health risks, especially during outbreaks of infectious diseases. Multi-drone thermal screening systems offer promising solutions for fast, large-scale, and non-intrusive human temperature monitoring. However, trajectory planning for multiple drones in complex urban environments poses significant challenges, including collision avoidance, coverage efficiency, and constrained flight environments. In this study, we propose an enhanced trust region sequential convex optimization (TR-SCO) algorithm for optimal trajectory planning of multiple drones performing thermal screening tasks. Our improved algorithm integrates a refined convex optimization formulation within a trust region framework, effectively balancing trajectory smoothness, obstacle avoidance, altitude constraints, and maximum screening coverage. Simulation results demonstrate that our approach significantly improves trajectory optimality and computational efficiency compared to conventional convex optimization methods. This research provides critical insights and practical contributions toward deploying efficient multi-drone systems for real-time thermal screening in urban areas. For reader who are interested in our research, we release our source code at https://github.com/Cherry0302/Enhanced-TR-SCO.
♻ ☆ ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning
We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov Process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to fine-tune diverse flow model variants, including Rectified Flow [35] and Shortcut Models [19], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies obtained an average net growth of 135.36% after fine-tuning in challenging legged locomotion tasks while saving denoising steps and 82.63% of wall time compared to state-of-the-art diffusion RL fine-tuning method DPPO [43]. The success rate of the Shortcut Model policies in state and visual manipulation tasks achieved an average net increase of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, whose performance is comparable to fine-tuned DDIM policies while saving computation time for an average of 23.20%. Project webpage: https://reinflow.github.io/
comment: 31 pages, 13 figures, 10 tables
♻ ☆ From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots
Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments on two simulations and a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world.
♻ ☆ V2X-VLM: End-to-End V2X Cooperative Autonomous Driving Through Large Vision-Language Models
Vehicle-to-everything (V2X) cooperation has emerged as a promising paradigm to overcome the perception limitations of classical autonomous driving by leveraging information from both ego-vehicle and infrastructure sensors. However, effectively fusing heterogeneous visual and semantic information while ensuring robust trajectory planning remains a significant challenge. This paper introduces V2X-VLM, a novel end-to-end (E2E) cooperative autonomous driving framework based on vision-language models (VLMs). V2X-VLM integrates multiperspective camera views from vehicles and infrastructure with text-based scene descriptions to enable a more comprehensive understanding of driving environments. Specifically, we propose a contrastive learning-based mechanism to reinforce the alignment of heterogeneous visual and textual characteristics, which enhances the semantic understanding of complex driving scenarios, and employ a knowledge distillation strategy to stabilize training. Experiments on a large real-world dataset demonstrate that V2X-VLM achieves state-of-the-art trajectory planning accuracy, significantly reducing L2 error and collision rate compared to existing cooperative autonomous driving baselines. Ablation studies validate the contributions of each component. Moreover, the evaluation of robustness and efficiency highlights the practicality of V2X-VLM for real-world deployment to enhance overall autonomous driving safety and decision-making.
♻ ☆ Hierarchical and Modular Network on Non-prehensile Manipulation in General Environments
For robots to operate in general environments like households, they must be able to perform non-prehensile manipulation actions such as toppling and rolling to manipulate ungraspable objects. However, prior works on non-prehensile manipulation cannot yet generalize across environments with diverse geometries. The main challenge lies in adapting to varying environmental constraints: within a cabinet, the robot must avoid walls and ceilings; to lift objects to the top of a step, the robot must account for the step's pose and extent. While deep reinforcement learning (RL) has demonstrated impressive success in non-prehensile manipulation, accounting for such variability presents a challenge for the generalist policy, as it must learn diverse strategies for each new combination of constraints. To address this, we propose a modular and reconfigurable architecture that adaptively reconfigures network modules based on task requirements. To capture the geometric variability in environments, we extend the contact-based object representation (CORN) to environment geometries, and propose a procedural algorithm for generating diverse environments to train our agent. Taken together, the resulting policy can zero-shot transfer to novel real-world environments and objects despite training entirely within a simulator. We additionally release a simulation-based benchmark featuring nine digital twins of real-world scenes with 353 objects to facilitate non-prehensile manipulation research in realistic domains.
comment: http://unicorn-hamnet.github.io/
Robotics 58
☆ Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos
Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splattings for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd .
comment: Project page: https://kywind.github.io/pgnd
☆ Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
AI agents today are mostly siloed - they either retrieve and reason over vast amount of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.
☆ Vision in Action: Learning Active Perception from Human Demonstrations
We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
☆ FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low level navigation policies, assessing their performance on these memory-intensive tasks and highlight areas for improvement.
comment: Our dataset and code will be made available at: https://findingdory-benchmark.github.io/
☆ GRIM: Task-Oriented Grasping with Conditioning on Generative Examples
Task-Oriented Grasping (TOG) presents a significant challenge, requiring a nuanced understanding of task semantics, object affordances, and the functional constraints dictating how an object should be grasped for a specific task. To address these challenges, we introduce GRIM (Grasp Re-alignment via Iterative Matching), a novel training-free framework for task-oriented grasping. Initially, a coarse alignment strategy is developed using a combination of geometric cues and principal component analysis (PCA)-reduced DINO features for similarity scoring. Subsequently, the full grasp pose associated with the retrieved memory instance is transferred to the aligned scene object and further refined against a set of task-agnostic, geometrically stable grasps generated for the scene object, prioritizing task compatibility. In contrast to existing learning-based methods, GRIM demonstrates strong generalization capabilities, achieving robust performance with only a small number of conditioning examples.
☆ RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation
Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.
comment: 9 pages, 7 figures
☆ Aerial Grasping via Maximizing Delta-Arm Workspace Utilization
The workspace limits the operational capabilities and range of motion for the systems with robotic arms. Maximizing workspace utilization has the potential to provide more optimal solutions for aerial manipulation tasks, increasing the system's flexibility and operational efficiency. In this paper, we introduce a novel planning framework for aerial grasping that maximizes workspace utilization. We formulate an optimization problem to optimize the aerial manipulator's trajectory, incorporating task constraints to achieve efficient manipulation. To address the challenge of incorporating the delta arm's non-convex workspace into optimization constraints, we leverage a Multilayer Perceptron (MLP) to map position points to feasibility probabilities.Furthermore, we employ Reversible Residual Networks (RevNet) to approximate the complex forward kinematics of the delta arm, utilizing efficient model gradients to eliminate workspace constraints. We validate our methods in simulations and real-world experiments to demonstrate their effectiveness.
comment: 8 pages, 7 figures
☆ Real-Time Initialization of Unknown Anchors for UWB-aided Navigation
This paper presents a framework for the real-time initialization of unknown Ultra-Wideband (UWB) anchors in UWB-aided navigation systems. The method is designed for localization solutions where UWB modules act as supplementary sensors. Our approach enables the automatic detection and calibration of previously unknown anchors during operation, removing the need for manual setup. By combining an online Positional Dilution of Precision (PDOP) estimation, a lightweight outlier detection method, and an adaptive robust kernel for non-linear optimization, our approach significantly improves robustness and suitability for real-world applications compared to state-of-the-art. In particular, we show that our metric which triggers an initialization decision is more conservative than current ones commonly based on initial linear or non-linear initialization guesses. This allows for better initialization geometry and subsequently lower initialization errors. We demonstrate the proposed approach on two different mobile robots: an autonomous forklift and a quadcopter equipped with a UWB-aided Visual-Inertial Odometry (VIO) framework. The results highlight the effectiveness of the proposed method with robust initialization and low positioning error. We open-source our code in a C++ library including a ROS wrapper.
☆ SurfAAV: Design and Implementation of a Novel Multimodal Surfing Aquatic-Aerial Vehicle
Despite significant advancements in the research of aquatic-aerial robots, existing configurations struggle to efficiently perform underwater, surface, and aerial movement simultaneously. In this paper, we propose a novel multimodal surfing aquatic-aerial vehicle, SurfAAV, which efficiently integrates underwater navigation, surface gliding, and aerial flying capabilities. Thanks to the design of the novel differential thrust vectoring hydrofoil, SurfAAV can achieve efficient surface gliding and underwater navigation without the need for a buoyancy adjustment system. This design provides flexible operational capabilities for both surface and underwater tasks, enabling the robot to quickly carry out underwater monitoring activities. Additionally, when it is necessary to reach another water body, SurfAAV can switch to aerial mode through a gliding takeoff, flying to the target water area to perform corresponding tasks. The main contribution of this letter lies in proposing a new solution for underwater, surface, and aerial movement, designing a novel hybrid prototype concept, developing the required control laws, and validating the robot's ability to successfully perform surface gliding and gliding takeoff. SurfAAV achieves a maximum surface gliding speed of 7.96 m/s and a maximum underwater speed of 3.1 m/s. The prototype's surface gliding maneuverability and underwater cruising maneuverability both exceed those of existing aquatic-aerial vehicles.
☆ Model Predictive Path-Following Control for a Quadrotor
Automating drone-assisted processes is a complex task. Many solutions rely on trajectory generation and tracking, whereas in contrast, path-following control is a particularly promising approach, offering an intuitive and natural approach to automate tasks for drones and other vehicles. While different solutions to the path-following problem have been proposed, most of them lack the capability to explicitly handle state and input constraints, are formulated in a conservative two-stage approach, or are only applicable to linear systems. To address these challenges, the paper is built upon a Model Predictive Control-based path-following framework and extends its application to the Crazyflie quadrotor, which is investigated in hardware experiments. A cascaded control structure including an underlying attitude controller is included in the Model Predictive Path-Following Control formulation to meet the challenging real-time demands of quadrotor control. The effectiveness of the proposed method is demonstrated through real-world experiments, representing, to the best of the authors' knowledge, a novel application of this MPC-based path-following approach to the quadrotor. Additionally, as an extension to the original method, to allow for deviations of the path in cases where the precise following of the path might be overly restrictive, a corridor path-following approach is presented.
comment: 15 pages, 11 figures, submitted to PAMM 2025
☆ MCOO-SLAM: A Multi-Camera Omnidirectional Object SLAM System
Object-level SLAM offers structured and semantically meaningful environment representations, making it more interpretable and suitable for high-level robotic tasks. However, most existing approaches rely on RGB-D sensors or monocular views, which suffer from narrow fields of view, occlusion sensitivity, and limited depth perception-especially in large-scale or outdoor environments. These limitations often restrict the system to observing only partial views of objects from limited perspectives, leading to inaccurate object modeling and unreliable data association. In this work, we propose MCOO-SLAM, a novel Multi-Camera Omnidirectional Object SLAM system that fully leverages surround-view camera configurations to achieve robust, consistent, and semantically enriched mapping in complex outdoor scenarios. Our approach integrates point features and object-level landmarks enhanced with open-vocabulary semantics. A semantic-geometric-temporal fusion strategy is introduced for robust object association across multiple views, leading to improved consistency and accurate object modeling, and an omnidirectional loop closure module is designed to enable viewpoint-invariant place recognition using scene-level descriptors. Furthermore, the constructed map is abstracted into a hierarchical 3D scene graph to support downstream reasoning tasks. Extensive experiments in real-world demonstrate that MCOO-SLAM achieves accurate localization and scalable object-level mapping with improved robustness to occlusion, pose variation, and environmental complexity.
☆ Efficient Navigation Among Movable Obstacles using a Mobile Manipulator via Hierarchical Policy Learning IROS 2025
We propose a hierarchical reinforcement learning (HRL) framework for efficient Navigation Among Movable Obstacles (NAMO) using a mobile manipulator. Our approach combines interaction-based obstacle property estimation with structured pushing strategies, facilitating the dynamic manipulation of unforeseen obstacles while adhering to a pre-planned global path. The high-level policy generates pushing commands that consider environmental constraints and path-tracking objectives, while the low-level policy precisely and stably executes these commands through coordinated whole-body movements. Comprehensive simulation-based experiments demonstrate improvements in performing NAMO tasks, including higher success rates, shortened traversed path length, and reduced goal-reaching times, compared to baselines. Additionally, ablation studies assess the efficacy of each component, while a qualitative analysis further validates the accuracy and reliability of the real-time obstacle property estimation.
comment: 8 pages, 6 figures, Accepted to IROS 2025. Supplementary Video: https://youtu.be/sZ8_z7sYVP0
☆ Comparison of Innovative Strategies for the Coverage Problem: Path Planning, Search Optimization, and Applications in Underwater Robotics
In many applications, including underwater robotics, the coverage problem requires an autonomous vehicle to systematically explore a defined area while minimizing redundancy and avoiding obstacles. This paper investigates coverage path planning strategies to enhance the efficiency of underwater gliders, particularly in maximizing the probability of detecting a radioactive source while ensuring safe navigation. We evaluate three path-planning approaches: the Traveling Salesman Problem (TSP), Minimum Spanning Tree (MST), and Optimal Control Problem (OCP). Simulations were conducted in MATLAB, comparing processing time, uncovered areas, path length, and traversal time. Results indicate that OCP is preferable when traversal time is constrained, although it incurs significantly higher computational costs. Conversely, MST-based approaches provide faster but less optimal solutions. These findings offer insights into selecting appropriate algorithms based on mission priorities, balancing efficiency and computational feasibility.
☆ Offensive Robot Cybersecurity
Offensive Robot Cybersecurity introduces a groundbreaking approach by advocating for offensive security methods empowered by means of automation. It emphasizes the necessity of understanding attackers' tactics and identifying vulnerabilities in advance to develop effective defenses, thereby improving robots' security posture. This thesis leverages a decade of robotics experience, employing Machine Learning and Game Theory to streamline the vulnerability identification and exploitation process. Intrinsically, the thesis uncovers a profound connection between robotic architecture and cybersecurity, highlighting that the design and creation aspect of robotics deeply intertwines with its protection against attacks. This duality -- whereby the architecture that shapes robot behavior and capabilities also necessitates a defense mechanism through offensive and defensive cybersecurity strategies -- creates a unique equilibrium. Approaching cybersecurity with a dual perspective of defense and attack, rooted in an understanding of systems architecture, has been pivotal. Through comprehensive analysis, including ethical considerations, the development of security tools, and executing cyber attacks on robot software, hardware, and industry deployments, this thesis proposes a novel architecture for cybersecurity cognitive engines. These engines, powered by advanced game theory and machine learning, pave the way for autonomous offensive cybersecurity strategies for robots, marking a significant shift towards self-defending robotic systems. This research not only underscores the importance of offensive measures in enhancing robot cybersecurity but also sets the stage for future advancements where robots are not just resilient to cyber threats but are equipped to autonomously safeguard themselves.
comment: Doctoral thesis
☆ Designing Intent: A Multimodal Framework for Human-Robot Cooperation in Industrial Workspaces
As robots enter collaborative workspaces, ensuring mutual understanding between human workers and robotic systems becomes a prerequisite for trust, safety, and efficiency. In this position paper, we draw on the cooperation scenario of the AIMotive project in which a human and a cobot jointly perform assembly tasks to argue for a structured approach to intent communication. Building on the Situation Awareness-based Agent Transparency (SAT) framework and the notion of task abstraction levels, we propose a multidimensional design space that maps intent content (SAT1, SAT3), planning horizon (operational to strategic), and modality (visual, auditory, haptic). We illustrate how this space can guide the design of multimodal communication strategies tailored to dynamic collaborative work contexts. With this paper, we lay the conceptual foundation for a future design toolkit aimed at supporting transparent human-robot interaction in the workplace. We highlight key open questions and design challenges, and propose a shared agenda for multimodal, adaptive, and trustworthy robotic collaboration in hybrid work environments.
comment: 9 pages
☆ Minimizing Structural Vibrations via Guided Flow Matching Design Optimization
Structural vibrations are a source of unwanted noise in engineering systems like cars, trains or airplanes. Minimizing these vibrations is crucial for improving passenger comfort. This work presents a novel design optimization approach based on guided flow matching for reducing vibrations by placing beadings (indentations) in plate-like structures. Our method integrates a generative flow matching model and a surrogate model trained to predict structural vibrations. During the generation process, the flow matching model pushes towards manufacturability while the surrogate model pushes to low-vibration solutions. The flow matching model and its training data implicitly define the design space, enabling a broader exploration of potential solutions as no optimization of manually-defined design parameters is required. We apply our method to a range of differentiable optimization objectives, including direct optimization of specific eigenfrequencies through careful construction of the objective function. Results demonstrate that our method generates diverse and manufacturable plate designs with reduced structural vibrations compared to designs from random search, a criterion-based design heuristic and genetic optimization. The code and data are available from https://github.com/ecker-lab/Optimizing_Vibrating_Plates.
☆ Context-Aware Deep Lagrangian Networks for Model Predictive Control IROS
Controlling a robot based on physics-informed dynamic models, such as deep Lagrangian networks (DeLaN), can improve the generalizability and interpretability of the resulting behavior. However, in complex environments, the number of objects to potentially interact with is vast, and their physical properties are often uncertain. This complexity makes it infeasible to employ a single global model. Therefore, we need to resort to online system identification of context-aware models that capture only the currently relevant aspects of the environment. While physical principles such as the conservation of energy may not hold across varying contexts, ensuring physical plausibility for any individual context-aware model can still be highly desirable, particularly when using it for receding horizon control methods such as Model Predictive Control (MPC). Hence, in this work, we extend DeLaN to make it context-aware, combine it with a recurrent network for online system identification, and integrate it with a MPC for adaptive, physics-informed control. We also combine DeLaN with a residual dynamics model to leverage the fact that a nominal model of the robot is typically available. We evaluate our method on a 7-DOF robot arm for trajectory tracking under varying loads. Our method reduces the end-effector tracking error by 39%, compared to a 21% improvement achieved by a baseline that uses an extended Kalman filter.
comment: Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
☆ Multi-Agent Reinforcement Learning for Autonomous Multi-Satellite Earth Observation: A Realistic Case Study
The exponential growth of Low Earth Orbit (LEO) satellites has revolutionised Earth Observation (EO) missions, addressing challenges in climate monitoring, disaster management, and more. However, autonomous coordination in multi-satellite systems remains a fundamental challenge. Traditional optimisation approaches struggle to handle the real-time decision-making demands of dynamic EO missions, necessitating the use of Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL). In this paper, we investigate RL-based autonomous EO mission planning by modelling single-satellite operations and extending to multi-satellite constellations using MARL frameworks. We address key challenges, including energy and data storage limitations, uncertainties in satellite observations, and the complexities of decentralised coordination under partial observability. By leveraging a near-realistic satellite simulation environment, we evaluate the training stability and performance of state-of-the-art MARL algorithms, including PPO, IPPO, MAPPO, and HAPPO. Our results demonstrate that MARL can effectively balance imaging and resource management while addressing non-stationarity and reward interdependency in multi-satellite coordination. The insights gained from this study provide a foundation for autonomous satellite operations, offering practical guidelines for improving policy learning in decentralised EO missions.
☆ SHeRLoc: Synchronized Heterogeneous Radar Place Recognition for Cross-Modal Localization
Despite the growing adoption of radar in robotics, the majority of research has been confined to homogeneous sensor types, overlooking the integration and cross-modality challenges inherent in heterogeneous radar technologies. This leads to significant difficulties in generalizing across diverse radar data types, with modality-aware approaches that could leverage the complementary strengths of heterogeneous radar remaining unexplored. To bridge these gaps, we propose SHeRLoc, the first deep network tailored for heterogeneous radar, which utilizes RCS polar matching to align multimodal radar data. Our hierarchical optimal transport-based feature aggregation method generates rotationally robust multi-scale descriptors. By employing FFT-similarity-based data mining and adaptive margin-based triplet loss, SHeRLoc enables FOV-aware metric learning. SHeRLoc achieves an order of magnitude improvement in heterogeneous radar place recognition, increasing recall@1 from below 0.1 to 0.9 on a public dataset and outperforming state of-the-art methods. Also applicable to LiDAR, SHeRLoc paves the way for cross-modal place recognition and heterogeneous sensor SLAM. The source code will be available upon acceptance.
comment: This work has been submitted to the IEEE for possible publication
☆ Robust Instant Policy: Leveraging Student's t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation IROS
Imitation learning (IL) aims to enable robots to perform tasks autonomously by observing a few human demonstrations. Recently, a variant of IL, called In-Context IL, utilized off-the-shelf large language models (LLMs) as instant policies that understand the context from a few given demonstrations to perform a new task, rather than explicitly updating network models with large-scale demonstrations. However, its reliability in the robotics domain is undermined by hallucination issues such as LLM-based instant policy, which occasionally generates poor trajectories that deviate from the given demonstrations. To alleviate this problem, we propose a new robust in-context imitation learning algorithm called the robust instant policy (RIP), which utilizes a Student's t-regression model to be robust against the hallucinated trajectories of instant policies to allow reliable trajectory generation. Specifically, RIP generates several candidate robot trajectories to complete a given task from an LLM and aggregates them using the Student's t-distribution, which is beneficial for ignoring outliers (i.e., hallucinations); thereby, a robust trajectory against hallucinations is generated. Our experiments, conducted in both simulated and real-world environments, show that RIP significantly outperforms state-of-the-art IL methods, with at least $26\%$ improvement in task success rates, particularly in low-data scenarios for everyday tasks. Video results available at https://sites.google.com/view/robustinstantpolicy.
comment: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025 accepted
☆ Human Locomotion Implicit Modeling Based Real-Time Gait Phase Estimation
Gait phase estimation based on inertial measurement unit (IMU) signals facilitates precise adaptation of exoskeletons to individual gait variations. However, challenges remain in achieving high accuracy and robustness, particularly during periods of terrain changes. To address this, we develop a gait phase estimation neural network based on implicit modeling of human locomotion, which combines temporal convolution for feature extraction with transformer layers for multi-channel information fusion. A channel-wise masked reconstruction pre-training strategy is proposed, which first treats gait phase state vectors and IMU signals as joint observations of human locomotion, thus enhancing model generalization. Experimental results demonstrate that the proposed method outperforms existing baseline approaches, achieving a gait phase RMSE of $2.729 \pm 1.071%$ and phase rate MAE of $0.037 \pm 0.016%$ under stable terrain conditions with a look-back window of 2 seconds, and a phase RMSE of $3.215 \pm 1.303%$ and rate MAE of $0.050 \pm 0.023%$ under terrain transitions. Hardware validation on a hip exoskeleton further confirms that the algorithm can reliably identify gait cycles and key events, adapting to various continuous motion scenarios. This research paves the way for more intelligent and adaptive exoskeleton systems, enabling safer and more efficient human-robot interaction across diverse real-world environments.
☆ Probabilistic Trajectory GOSPA: A Metric for Uncertainty-Aware Multi-Object Tracking Performance Evaluation
This paper presents a generalization of the trajectory general optimal sub-pattern assignment (GOSPA) metric for evaluating multi-object tracking algorithms that provide trajectory estimates with track-level uncertainties. This metric builds on the recently introduced probabilistic GOSPA metric to account for both the existence and state estimation uncertainties of individual object states. Similar to trajectory GOSPA (TGOSPA), it can be formulated as a multidimensional assignment problem, and its linear programming relaxation--also a valid metric--is computable in polynomial time. Additionally, this metric retains the interpretability of TGOSPA, and we show that its decomposition yields intuitive costs terms associated to expected localization error and existence probability mismatch error for properly detected objects, expected missed and false detection error, and track switch error. The effectiveness of the proposed metric is demonstrated through a simulation study.
comment: 7 pages, 4 figures
☆ TACT: Humanoid Whole-body Contact Manipulation through Deep Imitation Learning with Tactile Modality
Manipulation with whole-body contact by humanoid robots offers distinct advantages, including enhanced stability and reduced load. On the other hand, we need to address challenges such as the increased computational cost of motion generation and the difficulty of measuring broad-area contact. We therefore have developed a humanoid control system that allows a humanoid robot equipped with tactile sensors on its upper body to learn a policy for whole-body manipulation through imitation learning based on human teleoperation data. This policy, named tactile-modality extended ACT (TACT), has a feature to take multiple sensor modalities as input, including joint position, vision, and tactile measurements. Furthermore, by integrating this policy with retargeting and locomotion control based on a biped model, we demonstrate that the life-size humanoid robot RHP7 Kaleido is capable of achieving whole-body contact manipulation while maintaining balance and walking. Through detailed experimental verification, we show that inputting both vision and tactile modalities into the policy contributes to improving the robustness of manipulation involving broad and delicate contact.
☆ Booster Gym: An End-to-End Reinforcement Learning Framework for Humanoid Robot Locomotion
Recent advancements in reinforcement learning (RL) have led to significant progress in humanoid robot locomotion, simplifying the design and training of motion policies in simulation. However, the numerous implementation details make transferring these policies to real-world robots a challenging task. To address this, we have developed a comprehensive code framework that covers the entire process from training to deployment, incorporating common RL training methods, domain randomization, reward function design, and solutions for handling parallel structures. This library is made available as a community resource, with detailed descriptions of its design and experimental results. We validate the framework on the Booster T1 robot, demonstrating that the trained policies seamlessly transfer to the physical platform, enabling capabilities such as omnidirectional walking, disturbance resistance, and terrain adaptability. We hope this work provides a convenient tool for the robotics community, accelerating the development of humanoid robots. The code can be found in https://github.com/BoosterRobotics/booster_gym.
☆ VIMS: A Visual-Inertial-Magnetic-Sonar SLAM System in Underwater Environments IROS 2025
In this study, we present a novel simultaneous localization and mapping (SLAM) system, VIMS, designed for underwater navigation. Conventional visual-inertial state estimators encounter significant practical challenges in perceptually degraded underwater environments, particularly in scale estimation and loop closing. To address these issues, we first propose leveraging a low-cost single-beam sonar to improve scale estimation. Then, VIMS integrates a high-sampling-rate magnetometer for place recognition by utilizing magnetic signatures generated by an economical magnetic field coil. Building on this, a hierarchical scheme is developed for visual-magnetic place recognition, enabling robust loop closure. Furthermore, VIMS achieves a balance between local feature tracking and descriptor-based loop closing, avoiding additional computational burden on the front end. Experimental results highlight the efficacy of the proposed VIMS, demonstrating significant improvements in both the robustness and accuracy of state estimation within underwater environments.
comment: This work has been accepted for publication at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
☆ I Know You're Listening: Adaptive Voice for HRI IROS 23
While the use of social robots for language teaching has been explored, there remains limited work on a task-specific synthesized voices for language teaching robots. Given that language is a verbal task, this gap may have severe consequences for the effectiveness of robots for language teaching tasks. We address this lack of L2 teaching robot voices through three contributions: 1. We address the need for a lightweight and expressive robot voice. Using a fine-tuned version of Matcha-TTS, we use emoji prompting to create an expressive voice that shows a range of expressivity over time. The voice can run in real time with limited compute resources. Through case studies, we found this voice more expressive, socially appropriate, and suitable for long periods of expressive speech, such as storytelling. 2. We explore how to adapt a robot's voice to physical and social ambient environments to deploy our voices in various locations. We found that increasing pitch and pitch rate in noisy and high-energy environments makes the robot's voice appear more appropriate and makes it seem more aware of its current environment. 3. We create an English TTS system with improved clarity for L2 listeners using known linguistic properties of vowels that are difficult for these listeners. We used a data-driven, perception-based approach to understand how L2 speakers use duration cues to interpret challenging words with minimal tense (long) and lax (short) vowels in English. We found that the duration of vowels strongly influences the perception for L2 listeners and created an "L2 clarity mode" for Matcha-TTS that applies a lengthening to tense vowels while leaving lax vowels unchanged. Our clarity mode was found to be more respectful, intelligible, and encouraging than base Matcha-TTS while reducing transcription errors in these challenging tense/lax minimal pairs.
comment: PhD Thesis Simon Fraser University https://summit.sfu.ca/item/39353 Read the Room: Adapting a Robot's Voice to Ambient and Social Contexts IROS 23 Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation INTERSPEECH 24 Emojivoice: Towards long-term controllable expressivity in robot speech RO-MAN 25
☆ DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory
We present DyNaVLM, an end-to-end vision-language navigation framework using Vision-Language Models (VLM). In contrast to prior methods constrained by fixed angular or distance intervals, our system empowers agents to freely select navigation targets via visual-language reasoning. At its core lies a self-refining graph memory that 1) stores object locations as executable topological relations, 2) enables cross-robot memory sharing through distributed graph updates, and 3) enhances VLM's decision-making via retrieval augmentation. Operating without task-specific training or fine-tuning, DyNaVLM demonstrates high performance on GOAT and ObjectNav benchmarks. Real-world tests further validate its robustness and generalization. The system's three innovations: dynamic action space formulation, collaborative graph memory, and training-free deployment, establish a new paradigm for scalable embodied robot, bridging the gap between discrete VLN tasks and continuous real-world navigation.
☆ 3D Vision-tactile Reconstruction from Infrared and Visible Images for Robotic Fine-grained Tactile Perception
To achieve human-like haptic perception in anthropomorphic grippers, the compliant sensing surfaces of vision tactile sensor (VTS) must evolve from conventional planar configurations to biomimetically curved topographies with continuous surface gradients. However, planar VTSs have challenges when extended to curved surfaces, including insufficient lighting of surfaces, blurring in reconstruction, and complex spatial boundary conditions for surface structures. With an end goal of constructing a human-like fingertip, our research (i) develops GelSplitter3D by expanding imaging channels with a prism and a near-infrared (NIR) camera, (ii) proposes a photometric stereo neural network with a CAD-based normal ground truth generation method to calibrate tactile geometry, and (iii) devises a normal integration method with boundary constraints of depth prior information to correcting the cumulative error of surface integrals. We demonstrate better tactile sensing performance, a 40$\%$ improvement in normal estimation accuracy, and the benefits of sensor shapes in grasping and manipulation tasks.
☆ EmojiVoice: Towards long-term controllable expressivity in robot speech
Humans vary their expressivity when speaking for extended periods to maintain engagement with their listener. Although social robots tend to be deployed with ``expressive'' joyful voices, they lack this long-term variation found in human speech. Foundation model text-to-speech systems are beginning to mimic the expressivity in human speech, but they are difficult to deploy offline on robots. We present EmojiVoice, a free, customizable text-to-speech (TTS) toolkit that allows social roboticists to build temporally variable, expressive speech on social robots. We introduce emoji-prompting to allow fine-grained control of expressivity on a phase level and use the lightweight Matcha-TTS backbone to generate speech in real-time. We explore three case studies: (1) a scripted conversation with a robot assistant, (2) a storytelling robot, and (3) an autonomous speech-to-speech interactive agent. We found that using varied emoji prompting improved the perception and expressivity of speech over a long period in a storytelling task, but expressive voice was not preferred in the assistant use case.
comment: Accepted to RO-MAN 2025, Demo at HRI 2025 : https://dl.acm.org/doi/10.5555/3721488.3721774
☆ HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models
Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts. Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies-highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.
☆ Assigning Multi-Robot Tasks to Multitasking Robots
One simplifying assumption in existing and well-performing task allocation methods is that the robots are single-tasking: each robot operates on a single task at any given time. While this assumption is harmless to make in some situations, it can be inefficient or even infeasible in others. In this paper, we consider assigning multi-robot tasks to multitasking robots. The key contribution is a novel task allocation framework that incorporates the consideration of physical constraints introduced by multitasking. This is in contrast to the existing work where such constraints are largely ignored. After formulating the problem, we propose a compilation to weighted MAX-SAT, which allows us to leverage existing solvers for a solution. A more efficient greedy heuristic is then introduced. For evaluation, we first compare our methods with a modern baseline that is efficient for single-tasking robots to validate the benefits of multitasking in synthetic domains. Then, using a site-clearing scenario in simulation, we further illustrate the complex task interaction considered by the multitasking robots in our approach to demonstrate its performance. Finally, we demonstrate a physical experiment to show how multitasking enabled by our approach can benefit task efficiency in a realistic setting.
☆ Learning from Planned Data to Improve Robotic Pick-and-Place Planning Efficiency
This work proposes a learning method to accelerate robotic pick-and-place planning by predicting shared grasps. Shared grasps are defined as grasp poses feasible to both the initial and goal object configurations in a pick-and-place task. Traditional analytical methods for solving shared grasps evaluate grasp candidates separately, leading to substantial computational overhead as the candidate set grows. To overcome the limitation, we introduce an Energy-Based Model (EBM) that predicts shared grasps by combining the energies of feasible grasps at both object poses. This formulation enables early identification of promising candidates and significantly reduces the search space. Experiments show that our method improves grasp selection performance, offers higher data efficiency, and generalizes well to unseen grasps and similarly shaped objects.
☆ Optimal Navigation in Microfluidics via the Optimization of a Discrete Loss
Optimal path planning and control of microscopic devices navigating in fluid environments is essential for applications ranging from targeted drug delivery to environmental monitoring. These tasks are challenging due to the complexity of microdevice-flow interactions. We introduce a closed-loop control method that optimizes a discrete loss (ODIL) in terms of dynamics and path objectives. In comparison with reinforcement learning, ODIL is more robust, up to three orders faster, and excels in high-dimensional action/state spaces, making it a powerful tool for navigating complex flow environments.
comment: 21 pages, 13 figures
☆ Advancing Autonomous Racing: A Comprehensive Survey of the RoboRacer (F1TENTH) Platform
The RoboRacer (F1TENTH) platform has emerged as a leading testbed for advancing autonomous driving research, offering a scalable, cost-effective, and community-driven environment for experimentation. This paper presents a comprehensive survey of the platform, analyzing its modular hardware and software architecture, diverse research applications, and role in autonomous systems education. We examine critical aspects such as bridging the simulation-to-reality (Sim2Real) gap, integration with simulation environments, and the availability of standardized datasets and benchmarks. Furthermore, the survey highlights advancements in perception, planning, and control algorithms, as well as insights from global competitions and collaborative research efforts. By consolidating these contributions, this study positions RoboRacer as a versatile framework for accelerating innovation and bridging the gap between theoretical research and real-world deployment. The findings underscore the platform's significance in driving forward developments in autonomous racing and robotics.
☆ Challenges and Research Directions from the Operational Use of a Machine Learning Damage Assessment System via Small Uncrewed Aerial Systems at Hurricanes Debby and Helene
This paper details four principal challenges encountered with machine learning (ML) damage assessment using small uncrewed aerial systems (sUAS) at Hurricanes Debby and Helene that prevented, degraded, or delayed the delivery of data products during operations and suggests three research directions for future real-world deployments. The presence of these challenges is not surprising given that a review of the literature considering both datasets and proposed ML models suggests this is the first sUAS-based ML system for disaster damage assessment actually deployed as a part of real-world operations. The sUAS-based ML system was applied by the State of Florida to Hurricanes Helene (2 orthomosaics, 3.0 gigapixels collected over 2 sorties by a Wintra WingtraOne sUAS) and Debby (1 orthomosaic, 0.59 gigapixels collected via 1 sortie by a Wintra WingtraOne sUAS) in Florida. The same model was applied to crewed aerial imagery of inland flood damage resulting from post-tropical remnants of Hurricane Debby in Pennsylvania (436 orthophotos, 136.5 gigapixels), providing further insights into the advantages and limitations of sUAS for disaster response. The four challenges (variationin spatial resolution of input imagery, spatial misalignment between imagery and geospatial data, wireless connectivity, and data product format) lead to three recommendations that specify research needed to improve ML model capabilities to accommodate the wide variation of potential spatial resolutions used in practice, handle spatial misalignment, and minimize the dependency on wireless connectivity. These recommendations are expected to improve the effective operational use of sUAS and sUAS-based ML damage assessment systems for disaster response.
comment: 6 pages, 5 Figures, 1 Table
☆ A Small-Scale Robot for Autonomous Driving: Design, Challenges, and Best Practices
Small-scale autonomous vehicle platforms provide a cost-effective environment for developing and testing advanced driving systems. However, specific configurations within this scale are underrepresented, limiting full awareness of their potential. This paper focuses on a one-sixth-scale setup, offering a high-level overview of its design, hardware and software integration, and typical challenges encountered during development. We discuss methods for addressing mechanical and electronic issues common to this scale and propose guidelines for improving reliability and performance. By sharing these insights, we aim to expand the utility of small-scale vehicles for testing autonomous driving algorithms and to encourage further research in this domain.
☆ CooperRisk: A Driving Risk Quantification Pipeline with Multi-Agent Cooperative Perception and Prediction IROS2025
Risk quantification is a critical component of safe autonomous driving, however, constrained by the limited perception range and occlusion of single-vehicle systems in complex and dense scenarios. Vehicle-to-everything (V2X) paradigm has been a promising solution to sharing complementary perception information, nevertheless, how to ensure the risk interpretability while understanding multi-agent interaction with V2X remains an open question. In this paper, we introduce the first V2X-enabled risk quantification pipeline, CooperRisk, to fuse perception information from multiple agents and quantify the scenario driving risk in future multiple timestamps. The risk is represented as a scenario risk map to ensure interpretability based on risk severity and exposure, and the multi-agent interaction is captured by the learning-based cooperative prediction model. We carefully design a risk-oriented transformer-based prediction model with multi-modality and multi-agent considerations. It aims to ensure scene-consistent future behaviors of multiple agents and avoid conflicting predictions that could lead to overly conservative risk quantification and cause the ego vehicle to become overly hesitant to drive. Then, the temporal risk maps could serve to guide a model predictive control planner. We evaluate the CooperRisk pipeline in a real-world V2X dataset V2XPnP, and the experiments demonstrate its superior performance in risk quantification, showing a 44.35% decrease in conflict rate between the ego vehicle and background traffic participants.
comment: IROS2025
☆ Improving Robotic Manipulation: Techniques for Object Pose Estimation, Accommodating Positional Uncertainty, and Disassembly Tasks from Examples
To use robots in more unstructured environments, we have to accommodate for more complexities. Robotic systems need more awareness of the environment to adapt to uncertainty and variability. Although cameras have been predominantly used in robotic tasks, the limitations that come with them, such as occlusion, visibility and breadth of information, have diverted some focus to tactile sensing. In this thesis, we explore the use of tactile sensing to determine the pose of the object using the temporal features. We then use reinforcement learning with tactile collisions to reduce the number of attempts required to grasp an object resulting from positional uncertainty from camera estimates. Finally, we use information provided by these tactile sensors to a reinforcement learning agent to determine the trajectory to take to remove an object from a restricted passage while reducing training time by pertaining from human examples.
comment: Thesis
☆ Semantic and Feature Guided Uncertainty Quantification of Visual Localization for Autonomous Vehicles ICRA 2025
The uncertainty quantification of sensor measurements coupled with deep learning networks is crucial for many robotics systems, especially for safety-critical applications such as self-driving cars. This paper develops an uncertainty quantification approach in the context of visual localization for autonomous driving, where locations are selected based on images. Key to our approach is to learn the measurement uncertainty using light-weight sensor error model, which maps both image feature and semantic information to 2-dimensional error distribution. Our approach enables uncertainty estimation conditioned on the specific context of the matched image pair, implicitly capturing other critical, unannotated factors (e.g., city vs highway, dynamic vs static scenes, winter vs summer) in a latent manner. We demonstrate the accuracy of our uncertainty prediction framework using the Ithaca365 dataset, which includes variations in lighting and weather (sunny, night, snowy). Both the uncertainty quantification of the sensor+network is evaluated, along with Bayesian localization filters using unique sensor gating method. Results show that the measurement error does not follow a Gaussian distribution with poor weather and lighting conditions, and is better predicted by our Gaussian Mixture model.
comment: Accepted by ICRA 2025
☆ PRISM-Loc: a Lightweight Long-range LiDAR Localization in Urban Environments with Topological Maps IROS 2025
Localization in the environment is one of the crucial tasks of navigation of a mobile robot or a self-driving vehicle. For long-range routes, performing localization within a dense global lidar map in real time may be difficult, and the creation of such a map may require much memory. To this end, leveraging topological maps may be useful. In this work, we propose PRISM-Loc -- a topological map-based approach for localization in large environments. The proposed approach leverages a twofold localization pipeline, which consists of global place recognition and estimation of the local pose inside the found location. For local pose estimation, we introduce an original lidar scan matching algorithm, which is based on 2D features and point-based optimization. We evaluate the proposed method on the ITLP-Campus dataset on a 3 km route, and compare it against the state-of-the-art metric map-based and place recognition-based competitors. The results of the experiments show that the proposed method outperforms its competitors both quality-wise and computationally-wise.
comment: This version was submitted and rejected from IROS 2025 conference
☆ SafeMimic: Towards Safe and Autonomous Human-to-Robot Imitation for Mobile Manipulation
For robots to become efficient helpers in the home, they must learn to perform new mobile manipulation tasks simply by watching humans perform them. Learning from a single video demonstration from a human is challenging as the robot needs to first extract from the demo what needs to be done and how, translate the strategy from a third to a first-person perspective, and then adapt it to be successful with its own morphology. Furthermore, to mitigate the dependency on costly human monitoring, this learning process should be performed in a safe and autonomous manner. We present SafeMimic, a framework to learn new mobile manipulation skills safely and autonomously from a single third-person human video. Given an initial human video demonstration of a multi-step mobile manipulation task, SafeMimic first parses the video into segments, inferring both the semantic changes caused and the motions the human executed to achieve them and translating them to an egocentric reference. Then, it adapts the behavior to the robot's own morphology by sampling candidate actions around the human ones, and verifying them for safety before execution in a receding horizon fashion using an ensemble of safety Q-functions trained in simulation. When safe forward progression is not possible, SafeMimic backtracks to previous states and attempts a different sequence of actions, adapting both the trajectory and the grasping modes when required for its morphology. As a result, SafeMimic yields a strategy that succeeds in the demonstrated behavior and learns task-specific actions that reduce exploration in future attempts. Our experiments show that our method allows robots to safely and efficiently learn multi-step mobile manipulation behaviors from a single human demonstration, from different users, and in different environments, with improvements over state-of-the-art baselines across seven tasks
☆ Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning
Classical planning in AI and Robotics addresses complex tasks by shifting from imperative to declarative approaches (e.g., PDDL). However, these methods often fail in real scenarios due to limited robot perception and the need to ground perceptions to planning predicates. This often results in heavily hard-coded behaviors that struggle to adapt, even with scenarios where goals can be achieved through relaxed planning. Meanwhile, Large Language Models (LLMs) lead to planning systems that leverage commonsense reasoning but often at the cost of generating unfeasible and/or unsafe plans. To address these limitations, we present an approach integrating classical planning with LLMs, leveraging their ability to extract commonsense knowledge and ground actions. We propose a hierarchical formulation that enables robots to make unfeasible tasks tractable by defining functionally equivalent goals through gradual relaxation. This mechanism supports partial achievement of the intended objective, suited to the agent's specific context. Our method demonstrates its ability to adapt and execute tasks effectively within environments modeled using 3D Scene Graphs through comprehensive qualitative and quantitative evaluations. We also show how this method succeeds in complex scenarios where other benchmark methods are more likely to fail. Code, dataset, and additional material are released to the community.
☆ Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
☆ Robust control for multi-legged elongate robots in noisy environments
Modern two and four legged robots exhibit impressive mobility on complex terrain, largely attributed to advancement in learning algorithms. However, these systems often rely on high-bandwidth sensing and onboard computation to perceive/respond to terrain uncertainties. Further, current locomotion strategies typically require extensive robot-specific training, limiting their generalizability across platforms. Building on our prior research connecting robot-environment interaction and communication theory, we develop a new paradigm to construct robust and simply controlled multi-legged elongate robots (MERs) capable of operating effectively in cluttered, unstructured environments. In this framework, each leg-ground contact is thought of as a basic active contact (bac), akin to bits in signal transmission. Reliable locomotion can be achieved in open-loop on "noisy" landscapes via sufficient redundancy in bacs. In such situations, robustness is achieved through passive mechanical responses. We term such processes as those displaying mechanical intelligence (MI) and analogize these processes to forward error correction (FEC) in signal transmission. To augment MI, we develop feedback control schemes, which we refer to as computational intelligence (CI) and such processes analogize automatic repeat request (ARQ) in signal transmission. Integration of these analogies between locomotion and communication theory allow analysis, design, and prediction of embodied intelligence control schemes (integrating MI and CI) in MERs, showing effective and reliable performance (approximately half body lengths per cycle) on complex landscapes with terrain "noise" over twice the robot's height. Our work provides a foundation for systematic development of MER control, paving the way for terrain-agnostic, agile, and resilient robotic systems capable of operating in extreme environments.
♻ ☆ An Advanced Framework for Ultra-Realistic Simulation and Digital Twinning for Autonomous Vehicles
Simulation is a fundamental tool in developing autonomous vehicles, enabling rigorous testing without the logistical and safety challenges associated with real-world trials. As autonomous vehicle technologies evolve and public safety demands increase, advanced, realistic simulation frameworks are critical. Current testing paradigms employ a mix of general-purpose and specialized simulators, such as CARLA and IVRESS, to achieve high-fidelity results. However, these tools often struggle with compatibility due to differing platform, hardware, and software requirements, severely hampering their combined effectiveness. This paper introduces BlueICE, an advanced framework for ultra-realistic simulation and digital twinning, to address these challenges. BlueICE's innovative architecture allows for the decoupling of computing platforms, hardware, and software dependencies while offering researchers customizable testing environments to meet diverse fidelity needs. Key features include containerization to ensure compatibility across different systems, a unified communication bridge for seamless integration of various simulation tools, and synchronized orchestration of input and output across simulators. This framework facilitates the development of sophisticated digital twins for autonomous vehicle testing and sets a new standard in simulation accuracy and flexibility. The paper further explores the application of BlueICE in two distinct case studies: the ICAT indoor testbed and the STAR campus outdoor testbed at the University of Delaware. These case studies demonstrate BlueICE's capability to create sophisticated digital twins for autonomous vehicle testing and underline its potential as a standardized testbed for future autonomous driving technologies.
comment: 6 Pages. 5 Figures, 1 Table
♻ ☆ Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer via Skill Library and Tactile Representation
Developing general robotic systems capable of manipulating in unstructured environments is a significant challenge, particularly as the tasks involved are typically long-horizon and rich-contact, requiring efficient skill transfer across different task scenarios. To address these challenges, we propose knowledge graph-based skill library construction method. This method hierarchically organizes manipulation knowledge using "task graph" and "scene graph" to represent task-specific and scene-specific information, respectively. Additionally, we introduce "state graph" to facilitate the interaction between high-level task planning and low-level scene information. Building upon this foundation, we further propose a novel hierarchical skill transfer framework based on the skill library and tactile representation, which integrates high-level reasoning for skill transfer and low-level precision for execution. At the task level, we utilize large language models (LLMs) and combine contextual learning with a four-stage chain-of-thought prompting paradigm to achieve subtask sequence transfer. At the motion level, we develop an adaptive trajectory transfer method based on the skill library and the heuristic path planning algorithm. At the physical level, we propose an adaptive contour extraction and posture perception method based on tactile representation. This method dynamically acquires high-precision contour and posture information from visual-tactile images, adjusting parameters such as contact position and posture to ensure the effectiveness of transferred skills in new environments. Experiments demonstrate the skill transfer and adaptability capabilities of the proposed methods across different task scenarios. Project website: https://github.com/MingchaoQi/skill_transfer
♻ ☆ PP-Tac: Paper Picking Using Tactile Feedback in Dexterous Robotic Hands
Robots are increasingly envisioned as human companions, assisting with everyday tasks that often involve manipulating deformable objects. Although recent advances in robotic hardware and embodied AI have expanded their capabilities, current systems still struggle with handling thin, flat, and deformable objects such as paper and fabric. This limitation arises from the lack of suitable perception techniques for robust state estimation under diverse object appearances, as well as the absence of planning techniques for generating appropriate grasp motions. To bridge these gaps, this paper introduces PP-Tac, a robotic system for picking up paper-like objects. PP-Tac features a multi-fingered robotic hand with high-resolution omnidirectional tactile sensors \sensorname. This hardware configuration enables real-time slip detection and online frictional force control that mitigates such slips. Furthermore, grasp motion generation is achieved through a trajectory synthesis pipeline, which first constructs a dataset of finger's pinching motions. Based on this dataset, a diffusion-based policy is trained to control the hand-arm robotic system. Experiments demonstrate that PP-Tac can effectively grasp paper-like objects of varying material, thickness, and stiffness, achieving an overall success rate of 87.5\%. To our knowledge, this work is the first attempt to grasp paper-like deformable objects using a tactile dexterous hand. Our project webpage can be found at: https://peilin-666.github.io/projects/PP-Tac/
comment: accepted by Robotics: Science and Systems(RSS) 2025 url: https://peilin-666.github.io/projects/PP-Tac/
♻ ☆ Learning the Geometric Mechanics of Robot Motion Using Gaussian Mixtures
Data-driven models of robot motion constructed using principles from Geometric Mechanics have been shown to produce useful predictions of robot motion for a variety of robots. For robots with a useful number of DoF, these geometric mechanics models can only be constructed in the neighborhood of a gait. Here we show how Gaussian Mixture Models (GMM) can be used as a form of manifold learning that learns the structure of the Geometric Mechanics "motility map" and demonstrate: [i] a sizable improvement in prediction quality when compared to the previously published methods; [ii] a method that can be applied to any motion dataset and not only periodic gait data; [iii] a way to pre-process the data-set to facilitate extrapolation in places where the motility map is known to be linear. Our results can be applied anywhere a data-driven geometric motion model might be useful.
comment: 16 pages, 10 figures
♻ ☆ An Actionable Hierarchical Scene Representation Enhancing Autonomous Inspection Missions in Unknown Environments IROS 2025
In this article, we present the Layered Semantic Graphs (LSG), a novel actionable hierarchical scene graph, fully integrated with a multi-modal mission planner, the FLIE: A First-Look based Inspection and Exploration planner. The novelty of this work stems from aiming to address the task of maintaining an intuitive and multi-resolution scene representation, while simultaneously offering a tractable foundation for planning and scene understanding during an ongoing inspection mission of apriori unknown targets-of-interest in an unknown environment. The proposed LSG scheme is composed of locally nested hierarchical graphs, at multiple layers of abstraction, with the abstract concepts grounded on the functionality of the integrated FLIE planner. Furthermore, LSG encapsulates real-time semantic segmentation models that offer extraction and localization of desired semantic elements within the hierarchical representation. This extends the capability of the inspection planner, which can then leverage LSG to make an informed decision to inspect a particular semantic of interest. We also emphasize the hierarchical and semantic path-planning capabilities of LSG, which could extend inspection missions by improving situational awareness for human operators in an unknown environment. The validity of the proposed scheme is proven through extensive evaluations of the proposed architecture in simulations, as well as experimental field deployments on a Boston Dynamics Spot quadruped robot in urban outdoor environment settings.
comment: Accepted to IROS 2025
♻ ☆ LLM-as-BT-Planner: Leveraging LLMs for Behavior Tree Generation in Robot Task Planning ICRA 2025
Robotic assembly tasks remain an open challenge due to their long horizon nature and complex part relations. Behavior trees (BTs) are increasingly used in robot task planning for their modularity and flexibility, but creating them manually can be effort-intensive. Large language models (LLMs) have recently been applied to robotic task planning for generating action sequences, yet their ability to generate BTs has not been fully investigated. To this end, we propose LLM-as-BT-Planner, a novel framework that leverages LLMs for BT generation in robotic assembly task planning. Four in-context learning methods are introduced to utilize the natural language processing and inference capabilities of LLMs for producing task plans in BT format, reducing manual effort while ensuring robustness and comprehensibility. Additionally, we evaluate the performance of fine-tuned smaller LLMs on the same tasks. Experiments in both simulated and real-world settings demonstrate that our framework enhances LLMs' ability to generate BTs, improving success rate through in-context learning and supervised fine-tuning.
comment: 7 pages. presented in ICRA 2025
♻ ☆ A compact neuromorphic system for ultra-energy-efficient, on-device robot localization
Neuromorphic computing offers a transformative pathway to overcome the computational and energy challenges faced in deploying robotic localization and navigation systems at the edge. Visual place recognition, a critical component for navigation, is often hampered by the high resource demands of conventional systems, making them unsuitable for small-scale robotic platforms which still require accurate long-endurance localization. Although neuromorphic approaches offer potential for greater efficiency, real-time edge deployment remains constrained by the complexity of bio-realistic networks. In order to overcome this challenge, fusion of hardware and algorithms is critical to employ this specialized computing paradigm. Here, we demonstrate a neuromorphic localization system that performs competitive place recognition in up to 8 kilometers of traversal using models as small as 180 kilobytes with 44,000 parameters, while consuming less than 8% of the energy required by conventional methods. Our Locational Encoding with Neuromorphic Systems (LENS) integrates spiking neural networks, an event-based dynamic vision sensor, and a neuromorphic processor within a single SynSense Speck chip, enabling real-time, energy-efficient localization on a hexapod robot. When compared to a benchmark place recognition method, Sum-of-Absolute-Differences (SAD), LENS performs comparably in overall precision. LENS represents an accurate fully neuromorphic localization system capable of large-scale, on-device deployment for energy efficient robotic place recognition. Neuromorphic computing enables resource-constrained robots to perform energy efficient, accurate localization.
comment: 42 pages, 5 main figures, 8 supplementary figures, 2 supplementary tables, and 1 movie
♻ ☆ Map Space Belief Prediction for Manipulation-Enhanced Mapping
Searching for objects in cluttered environments requires selecting efficient viewpoints and manipulation actions to remove occlusions and reduce uncertainty in object locations, shapes, and categories. In this work, we address the problem of manipulation-enhanced semantic mapping, where a robot has to efficiently identify all objects in a cluttered shelf. Although Partially Observable Markov Decision Processes~(POMDPs) are standard for decision-making under uncertainty, representing unstructured interactive worlds remains challenging in this formalism. To tackle this, we define a POMDP whose belief is summarized by a metric-semantic grid map and propose a novel framework that uses neural networks to perform map-space belief updates to reason efficiently and simultaneously about object geometries, locations, categories, occlusions, and manipulation physics. Further, to enable accurate information gain analysis, the learned belief updates should maintain calibrated estimates of uncertainty. Therefore, we propose Calibrated Neural-Accelerated Belief Updates (CNABUs) to learn a belief propagation model that generalizes to novel scenarios and provides confidence-calibrated predictions for unknown areas. Our experiments show that our novel POMDP planner improves map completeness and accuracy over existing methods in challenging simulations and successfully transfers to real-world cluttered shelves in zero-shot fashion.
comment: 14 pages, 10 figures; Published at RSS 2025 - this version contains a small fix to figure 6 which was missing a plot in the original submission
♻ ☆ Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling ICML 2025
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions to long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with improved scaling w.r.t. inference-time computation. Code is available at https://github.com/Singularity0104/equilibrium-planner.
comment: ICML 2025
♻ ☆ SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation
Surgical video generation can enhance medical education and research, but existing methods lack fine-grained motion control and realism. We introduce SurgSora, a framework that generates high-fidelity, motion-controllable surgical videos from a single input frame and user-specified motion cues. Unlike prior approaches that treat objects indiscriminately or rely on ground-truth segmentation masks, SurgSora leverages self-predicted object features and depth information to refine RGB appearance and optical flow for precise video synthesis. It consists of three key modules: (1) the Dual Semantic Injector, which extracts object-specific RGB-D features and segmentation cues to enhance spatial representations; (2) the Decoupled Flow Mapper, which fuses multi-scale optical flow with semantic features for realistic motion dynamics; and (3) the Trajectory Controller, which estimates sparse optical flow and enables user-guided object movement. By conditioning these enriched features within the Stable Video Diffusion, SurgSora achieves state-of-the-art visual authenticity and controllability in advancing surgical video synthesis, as demonstrated by extensive quantitative and qualitative comparisons. Our human evaluation in collaboration with expert surgeons further demonstrates the high realism of SurgSora-generated videos, highlighting the potential of our method for surgical training and education. Our project is available at https://surgsora.github.io/surgsora.github.io.
♻ ☆ Tailless Flapping-Wing Robot With Bio-Inspired Elastic Passive Legs for Multi-Modal Locomotion
Flapping-wing robots offer significant versatility; however, achieving efficient multi-modal locomotion remains challenging. This paper presents the design, modeling, and experimentation of a novel tailless flapping-wing robot with three independently actuated pairs of wings. Inspired by the leg morphology of juvenile water striders, the robot incorporates bio-inspired elastic passive legs that convert flapping-induced vibrations into directional ground movement, enabling locomotion without additional actuators. This vibration-driven mechanism facilitates lightweight, mechanically simplified multi-modal mobility. An SE(3)-based controller coordinates flight and mode transitions with minimal actuation. To validate the robot's feasibility, a functional prototype was developed, and experiments were conducted to evaluate its flight, ground locomotion, and mode-switching capabilities. Results show satisfactory performance under constrained actuation, highlighting the potential of multi-modal flapping-wing designs for future aerial-ground robotic applications. These findings provide a foundation for future studies on frequency-based terrestrial control and passive yaw stabilization in hybrid locomotion systems.
comment: 8 pages, 11 figures, accepted by IEEE Robotics and Automation Letters (RAL)
♻ ☆ Human-Robot Co-Transportation using Disturbance-Aware MPC with Pose Optimization
This paper proposes a new control algorithm for human-robot co-transportation using a robot manipulator equipped with a mobile base and a robotic arm. We integrate the regular Model Predictive Control (MPC) with a novel pose optimization mechanism to more efficiently mitigate disturbances (such as human behavioral uncertainties or robot actuation noise) during the task. The core of our methodology involves a two-step iterative design: At each planning horizon, we determine the optimal pose of the robotic arm (joint angle configuration) from a candidate set, aiming to achieve the lowest estimated control cost. This selection is based on solving a disturbance-aware Discrete Algebraic Ricatti Equation (DARE), which also determines the optimal inputs for the robot's whole body control (including both the mobile base and the robotic arm). To validate the effectiveness of the proposed approach, we provide theoretical derivation for the disturbance-aware DARE and perform simulated experiments and hardware demos using a Fetch robot under varying conditions, including different trajectories and different levels of disturbances. The results reveal that our proposed approach outperforms baseline algorithms.
comment: 8 pages, 6 figures
♻ ☆ Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm - using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground-truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, task goal, and the target object. The properties are used to generate representative VQA queries - images with textural multiple-choice questions - based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
♻ ☆ RefAV: Towards Planning-Centric Scenario Mining
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recent CVPR 2025 competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
comment: Project Page: https://cainand.github.io/RefAV/
Robotics 67
☆ GMT: General Motion Tracking for Humanoid Whole-Body Control
The ability to track general whole-body motions in the real world is a useful way to build general-purpose humanoid robots. However, achieving this can be challenging due to the temporal and kinematic diversity of the motions, the policy's capability, and the difficulty of coordination of the upper and lower bodies. To address these issues, we propose GMT, a general and scalable motion-tracking framework that trains a single unified policy to enable humanoid robots to track diverse motions in the real world. GMT is built upon two core components: an Adaptive Sampling strategy and a Motion Mixture-of-Experts (MoE) architecture. The Adaptive Sampling automatically balances easy and difficult motions during training. The MoE ensures better specialization of different regions of the motion manifold. We show through extensive experiments in both simulation and the real world the effectiveness of GMT, achieving state-of-the-art performance across a broad spectrum of motions using a unified general policy. Videos and additional information can be found at https://gmt-humanoid.github.io.
☆ CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
Diffusion Policy (DP) enables robots to learn complex behaviors by imitating expert demonstrations through action diffusion. However, in practical applications, hardware limitations often degrade data quality, while real-time constraints restrict model inference to instantaneous state and scene observations. These limitations seriously reduce the efficacy of learning from expert demonstrations, resulting in failures in object localization, grasp planning, and long-horizon task execution. To address these challenges, we propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion model that enhances action prediction by conditioning on historical action sequences, thereby enabling more coherent and context-aware visuomotor policy learning. To further mitigate the computational cost associated with autoregressive inference, a caching mechanism is also introduced to store attention key-value pairs from previous timesteps, substantially reducing redundant computations during execution. Extensive experiments in both simulated and real-world environments, spanning diverse 2D and 3D manipulation tasks, demonstrate that CDP uniquely leverages historical action sequences to achieve significantly higher accuracy than existing methods. Moreover, even when faced with degraded input observation quality, CDP maintains remarkable precision by reasoning through temporal continuity, which highlights its practical robustness for robotic control under realistic, imperfect conditions.
☆ RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills
Endowing robots with tool design abilities is critical for enabling them to solve complex manipulation tasks that would otherwise be intractable. While recent generative frameworks can automatically synthesize task settings, such as 3D scenes and reward functions, they have not yet addressed the challenge of tool-use scenarios. Simply retrieving human-designed tools might not be ideal since many tools (e.g., a rolling pin) are difficult for robotic manipulators to handle. Furthermore, existing tool design approaches either rely on predefined templates with limited parameter tuning or apply generic 3D generation methods that are not optimized for tool creation. To address these limitations, we propose RobotSmith, an automated pipeline that leverages the implicit physical knowledge embedded in vision-language models (VLMs) alongside the more accurate physics provided by physics simulations to design and use tools for robotic manipulation. Our system (1) iteratively proposes tool designs using collaborative VLM agents, (2) generates low-level robot trajectories for tool use, and (3) jointly optimizes tool geometry and usage for task performance. We evaluate our approach across a wide range of manipulation tasks involving rigid, deformable, and fluid objects. Experiments show that our method consistently outperforms strong baselines in terms of both task success rate and overall performance. Notably, our approach achieves a 50.0\% average success rate, significantly surpassing other baselines such as 3D generation (21.4%) and tool retrieval (11.1%). Finally, we deploy our system in real-world settings, demonstrating that the generated tools and their usage plans transfer effectively to physical execution, validating the practicality and generalization capabilities of our approach.
☆ Markov Regime-Switching Intelligent Driver Model for Interpretable Car-Following Behavior
Accurate and interpretable car-following models are essential for traffic simulation and autonomous vehicle development. However, classical models like the Intelligent Driver Model (IDM) are fundamentally limited by their parsimonious and single-regime structure. They fail to capture the multi-modal nature of human driving, where a single driving state (e.g., speed, relative speed, and gap) can elicit many different driver actions. This forces the model to average across distinct behaviors, reducing its fidelity and making its parameters difficult to interpret. To overcome this, we introduce a regime-switching framework that allows driving behavior to be governed by different IDM parameter sets, each corresponding to an interpretable behavioral mode. This design enables the model to dynamically switch between interpretable behavioral modes, rather than averaging across diverse driving contexts. We instantiate the framework using a Factorial Hidden Markov Model with IDM dynamics (FHMM-IDM), which explicitly separates intrinsic driving regimes (e.g., aggressive acceleration, steady-state following) from external traffic scenarios (e.g., free-flow, congestion, stop-and-go) through two independent latent Markov processes. Bayesian inference via Markov chain Monte Carlo (MCMC) is used to jointly estimate the regime-specific parameters, transition dynamics, and latent state trajectories. Experiments on the HighD dataset demonstrate that FHMM-IDM uncovers interpretable structure in human driving, effectively disentangling internal driver actions from contextual traffic conditions and revealing dynamic regime-switching patterns. This framework provides a tractable and principled solution to modeling context-dependent driving behavior under uncertainty, offering improvements in the fidelity of traffic simulations, the efficacy of safety analyses, and the development of more human-centric ADAS.
☆ Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation
We present Sparsh-X, the first multisensory touch representations across four tactile modalities: image, audio, motion, and pressure. Trained on ~1M contact-rich interactions collected with the Digit 360 sensor, Sparsh-X captures complementary touch signals at diverse temporal and spatial scales. By leveraging self-supervised learning, Sparsh-X fuses these modalities into a unified representation that captures physical properties useful for robot manipulation tasks. We study how to effectively integrate real-world touch representations for both imitation learning and tactile adaptation of sim-trained policies, showing that Sparsh-X boosts policy success rates by 63% over an end-to-end model using tactile images and improves robustness by 90% in recovering object states from touch. Finally, we benchmark Sparsh-X ability to make inferences about physical properties, such as object-action identification, material-quantity estimation, and force estimation. Sparsh-X improves accuracy in characterizing physical properties by 48% compared to end-to-end approaches, demonstrating the advantages of multisensory pretraining for capturing features essential for dexterous manipulation.
☆ Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models
Assistive teleoperation, where control is shared between a human and a robot, enables efficient and intuitive human-robot collaboration in diverse and unstructured environments. A central challenge in real-world assistive teleoperation is for the robot to infer a wide range of human intentions from user control inputs and to assist users with correct actions. Existing methods are either confined to simple, predefined scenarios or restricted to task-specific data distributions at training, limiting their support for real-world assistance. We introduce Casper, an assistive teleoperation system that leverages commonsense knowledge embedded in pre-trained visual language models (VLMs) for real-time intent inference and flexible skill execution. Casper incorporates an open-world perception module for a generalized understanding of novel objects and scenes, a VLM-powered intent inference mechanism that leverages commonsense reasoning to interpret snippets of teleoperated user input, and a skill library that expands the scope of prior assistive teleoperation systems to support diverse, long-horizon mobile manipulation tasks. Extensive empirical evaluation, including human studies and system ablations, demonstrates that Casper improves task performance, reduces human cognitive load, and achieves higher user satisfaction than direct teleoperation and assistive teleoperation baselines.
☆ DiFuse-Net: RGB and Dual-Pixel Depth Estimation using Window Bi-directional Parallax Attention and Cross-modal Transfer Learning IROS 2025
Depth estimation is crucial for intelligent systems, enabling applications from autonomous navigation to augmented reality. While traditional stereo and active depth sensors have limitations in cost, power, and robustness, dual-pixel (DP) technology, ubiquitous in modern cameras, offers a compelling alternative. This paper introduces DiFuse-Net, a novel modality decoupled network design for disentangled RGB and DP based depth estimation. DiFuse-Net features a window bi-directional parallax attention mechanism (WBiPAM) specifically designed to capture the subtle DP disparity cues unique to smartphone cameras with small aperture. A separate encoder extracts contextual information from the RGB image, and these features are fused to enhance depth prediction. We also propose a Cross-modal Transfer Learning (CmTL) mechanism to utilize large-scale RGB-D datasets in the literature to cope with the limitations of obtaining large-scale RGB-DP-D dataset. Our evaluation and comparison of the proposed method demonstrates its superiority over the DP and stereo-based baseline methods. Additionally, we contribute a new, high-quality, real-world RGB-DP-D training dataset, named Dual-Camera Dual-Pixel (DCDP) dataset, created using our novel symmetric stereo camera hardware setup, stereo calibration and rectification protocol, and AI stereo disparity estimation method.
comment: Accepted in IROS 2025
☆ AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions
The rapid advancement of vision-language models (VLMs) and their integration into embodied agents have unlocked powerful capabilities for decision-making. However, as these systems are increasingly deployed in real-world environments, they face mounting safety concerns, particularly when responding to hazardous instructions. In this work, we propose AGENTSAFE, the first comprehensive benchmark for evaluating the safety of embodied VLM agents under hazardous instructions. AGENTSAFE simulates realistic agent-environment interactions within a simulation sandbox and incorporates a novel adapter module that bridges the gap between high-level VLM outputs and low-level embodied controls. Specifically, it maps recognized visual entities to manipulable objects and translates abstract planning into executable atomic actions in the environment. Building on this, we construct a risk-aware instruction dataset inspired by Asimovs Three Laws of Robotics, including base risky instructions and mutated jailbroken instructions. The benchmark includes 45 adversarial scenarios, 1,350 hazardous tasks, and 8,100 hazardous instructions, enabling systematic testing under adversarial conditions ranging from perception, planning, and action execution stages.
comment: 11 pages
☆ Factor-Graph-Based Passive Acoustic Navigation for Decentralized Cooperative Localization Using Bearing Elevation Depth Difference
Accurate and scalable underwater multi-agent localization remains a critical challenge due to the constraints of underwater communication. In this work, we propose a multi-agent localization framework using a factor-graph representation that incorporates bearing, elevation, and depth difference (BEDD). Our method leverages inverted ultra-short baseline (inverted-USBL) derived azimuth and elevation measurements from incoming acoustic signals and relative depth measurements to enable cooperative localization for a multi-robot team of autonomous underwater vehicles (AUVs). We validate our approach in the HoloOcean underwater simulator with a fleet of AUVs, demonstrating improved localization accuracy compared to dead reckoning. Additionally, we investigate the impact of azimuth and elevation measurement outliers, highlighting the need for robust outlier rejection techniques for acoustic signals.
☆ SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning
Preference-based Reinforcement Learning (PbRL) methods provide a solution to avoid reward engineering by learning reward models based on human preferences. However, poor feedback- and sample- efficiency still remain the problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which could select the meaningful and easy-to-comparison behavior segment pairs to improve human feedback-efficiency and accelerate policy learning with the designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We designed a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion and different directions through kernel density estimation of states, which is more task-related and easy for human preference labeling; (2) We proposed a novel preference-guided exploration method (PGE). It encourages the exploration towards the states with high preference and low visits and continuously guides the agent achieving the valuable samples. The synergy between the two mechanisms could significantly accelerate the progress of reward and policy learning. Our experiments show that SENIOR outperforms other five existing methods in both human feedback-efficiency and policy convergence speed on six complex robot manipulation tasks from simulation and four real-worlds.
comment: 8 pages, 8 figures
☆ Latent Action Diffusion for Cross-Embodiment Manipulation
End-to-end learning approaches offer great potential for robotic manipulation, but their impact is constrained by data scarcity and heterogeneity across different embodiments. In particular, diverse action spaces across different end-effectors create barriers for cross-embodiment learning and skill transfer. We address this challenge through diffusion policies learned in a latent action space that unifies diverse end-effector actions. We first show that we can learn a semantically aligned latent action space for anthropomorphic robotic hands, a human hand, and a parallel jaw gripper using encoders trained with a contrastive loss. Second, we show that by using our proposed latent action space for co-training on manipulation data from different end-effectors, we can utilize a single policy for multi-robot control and obtain up to 13% improved manipulation success rates, indicating successful skill transfer despite a significant embodiment gap. Our approach using latent cross-embodiment policies presents a new method to unify different action spaces across embodiments, enabling efficient multi-robot control and data sharing across robot setups. This unified representation significantly reduces the need for extensive data collection for each new robot morphology, accelerates generalization across embodiments, and ultimately facilitates more scalable and efficient robotic learning.
comment: 14 pages, 6 figures
☆ NetRoller: Interfacing General and Specialized Models for End-to-End Autonomous Driving
Integrating General Models (GMs) such as Large Language Models (LLMs), with Specialized Models (SMs) in autonomous driving tasks presents a promising approach to mitigating challenges in data diversity and model capacity of existing specialized driving models. However, this integration leads to problems of asynchronous systems, which arise from the distinct characteristics inherent in GMs and SMs. To tackle this challenge, we propose NetRoller, an adapter that incorporates a set of novel mechanisms to facilitate the seamless integration of GMs and specialized driving models. Specifically, our mechanisms for interfacing the asynchronous GMs and SMs are organized into three key stages. NetRoller first harvests semantically rich and computationally efficient representations from the reasoning processes of LLMs using an early stopping mechanism, which preserves critical insights on driving context while maintaining low overhead. It then applies learnable query embeddings, nonsensical embeddings, and positional layer embeddings to facilitate robust and efficient cross-modality translation. At last, it employs computationally efficient Query Shift and Feature Shift mechanisms to enhance the performance of SMs through few-epoch fine-tuning. Based on the mechanisms formalized in these three stages, NetRoller enables specialized driving models to operate at their native frequencies while maintaining situational awareness of the GM. Experiments conducted on the nuScenes dataset demonstrate that integrating GM through NetRoller significantly improves human similarity and safety in planning tasks, and it also achieves noticeable precision improvements in detection and mapping tasks for end-to-end autonomous driving. The code and models are available at https://github.com/Rex-sys-hk/NetRoller .
comment: This work has been submitted to the IEEE for possible publication
☆ VisLanding: Monocular 3D Perception for UAV Safe Landing via Depth-Normal Synergy IROS2025
This paper presents VisLanding, a monocular 3D perception-based framework for safe UAV (Unmanned Aerial Vehicle) landing. Addressing the core challenge of autonomous UAV landing in complex and unknown environments, this study innovatively leverages the depth-normal synergy prediction capabilities of the Metric3D V2 model to construct an end-to-end safe landing zones (SLZ) estimation framework. By introducing a safe zone segmentation branch, we transform the landing zone estimation task into a binary semantic segmentation problem. The model is fine-tuned and annotated using the WildUAV dataset from a UAV perspective, while a cross-domain evaluation dataset is constructed to validate the model's robustness. Experimental results demonstrate that VisLanding significantly enhances the accuracy of safe zone identification through a depth-normal joint optimization mechanism, while retaining the zero-shot generalization advantages of Metric3D V2. The proposed method exhibits superior generalization and robustness in cross-domain testing compared to other approaches. Furthermore, it enables the estimation of landing zone area by integrating predicted depth and normal information, providing critical decision-making support for practical applications.
comment: Accepted by IROS2025
☆ GAMORA: A Gesture Articulated Meta Operative Robotic Arm for Hazardous Material Handling in Containment-Level Environments
The convergence of robotics and virtual reality (VR) has enabled safer and more efficient workflows in high-risk laboratory settings, particularly virology labs. As biohazard complexity increases, minimizing direct human exposure while maintaining precision becomes essential. We propose GAMORA (Gesture Articulated Meta Operative Robotic Arm), a novel VR-guided robotic system that enables remote execution of hazardous tasks using natural hand gestures. Unlike existing scripted automation or traditional teleoperation, GAMORA integrates the Oculus Quest 2, NVIDIA Jetson Nano, and Robot Operating System (ROS) to provide real-time immersive control, digital twin simulation, and inverse kinematics-based articulation. The system supports VR-based training and simulation while executing precision tasks in physical environments via a 3D-printed robotic arm. Inverse kinematics ensure accurate manipulation for delicate operations such as specimen handling and pipetting. The pipeline includes Unity-based 3D environment construction, real-time motion planning, and hardware-in-the-loop testing. GAMORA achieved a mean positional discrepancy of 2.2 mm (improved from 4 mm), pipetting accuracy within 0.2 mL, and repeatability of 1.2 mm across 50 trials. Integrated object detection via YOLOv8 enhances spatial awareness, while energy-efficient operation (50% reduced power output) ensures sustainable deployment. The system's digital-physical feedback loop enables safe, precise, and repeatable automation of high-risk lab tasks. GAMORA offers a scalable, immersive solution for robotic control and biosafety in biomedical research environments.
☆ Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?
Foundation models have revolutionized robotics by providing rich semantic representations without task-specific training. While many approaches integrate pretrained vision-language models (VLMs) with specialized navigation architectures, the fundamental question remains: can these pretrained embeddings alone successfully guide navigation without additional fine-tuning or specialized modules? We present a minimalist framework that decouples this question by training a behavior cloning policy directly on frozen vision-language embeddings from demonstrations collected by a privileged expert. Our approach achieves a 74% success rate in navigation to language-specified targets, compared to 100% for the state-aware expert, though requiring 3.2 times more steps on average. This performance gap reveals that pretrained embeddings effectively support basic language grounding but struggle with long-horizon planning and spatial reasoning. By providing this empirical baseline, we highlight both the capabilities and limitations of using foundation models as drop-in representations for embodied tasks, offering critical insights for robotics researchers facing practical design tradeoffs between system complexity and performance in resource-constrained scenarios. Our code is available at https://github.com/oadamharoon/text2nav
comment: 6 figures, 2 tables, Accepted to Robotics: Science and Systems (RSS) 2025 Workshop on Robot Planning in the Era of Foundation Models (FM4RoboPlan)
☆ ros2 fanuc interface: Design and Evaluation of a Fanuc CRX Hardware Interface in ROS2
This paper introduces the ROS2 control and the Hardware Interface (HW) integration for the Fanuc CRX- robot family. It explains basic implementation details and communication protocols, and its integration with the Moveit2 motion planning library. We conducted a series of experiments to evaluate relevant performances in the robotics field. We tested the developed ros2_fanuc_interface for four relevant robotics cases: step response, trajectory tracking, collision avoidance integrated with Moveit2, and dynamic velocity scaling, respectively. Results show that, despite a non-negligible delay between command and feedback, the robot can track the defined path with negligible errors (if it complies with joint velocity limits), ensuring collision avoidance. Full code is open source and available at https://github.com/paolofrance/ros2_fanuc_interface.
☆ Automatic Cannulation of Femoral Vessels in a Porcine Shock Model
Rapid and reliable vascular access is critical in trauma and critical care. Central vascular catheterization enables high-volume resuscitation, hemodynamic monitoring, and advanced interventions like ECMO and REBOA. While peripheral access is common, central access is often necessary but requires specialized ultrasound-guided skills, posing challenges in prehospital settings. The complexity arises from deep target vessels and the precision needed for needle placement. Traditional techniques, like the Seldinger method, demand expertise to avoid complications. Despite its importance, ultrasound-guided central access is underutilized due to limited field expertise. While autonomous needle insertion has been explored for peripheral vessels, only semi-autonomous methods exist for femoral access. This work advances toward full automation, integrating robotic ultrasound for minimally invasive emergency procedures. Our key contribution is the successful femoral vein and artery cannulation in a porcine hemorrhagic shock model.
comment: 2 pages, 2 figures, conference
☆ Enhancing Object Search in Indoor Spaces via Personalized Object-factored Ontologies
Personalization is critical for the advancement of service robots. Robots need to develop tailored understandings of the environments they are put in. Moreover, they need to be aware of changes in the environment to facilitate long-term deployment. Long-term understanding as well as personalization is necessary to execute complex tasks like prepare dinner table or tidy my room. A precursor to such tasks is that of Object Search. Consequently, this paper focuses on locating and searching multiple objects in indoor environments. In this paper, we propose two crucial novelties. Firstly, we propose a novel framework that can enable robots to deduce Personalized Ontologies of indoor environments. Our framework consists of a personalization schema that enables the robot to tune its understanding of ontologies. Secondly, we propose an Adaptive Inferencing strategy. We integrate Dynamic Belief Updates into our approach which improves performance in multi-object search tasks. The cumulative effect of personalization and adaptive inferencing is an improved capability in long-term object search. This framework is implemented on top of a multi-layered semantic map. We conduct experiments in real environments and compare our results against various state-of-the-art (SOTA) methods to demonstrate the effectiveness of our approach. Additionally, we show that personalization can act as a catalyst to enhance the performance of SOTAs. Video Link: https://bit.ly/3WHk9i9
comment: 8 pages, 9 figures. Accepted for publication in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems
☆ Adaptive Reinforcement Learning for Unobservable Random Delays
In standard Reinforcement Learning (RL) settings, the interaction between the agent and the environment is typically modeled as a Markov Decision Process (MDP), which assumes that the agent observes the system state instantaneously, selects an action without delay, and executes it immediately. In real-world dynamic environments, such as cyber-physical systems, this assumption often breaks down due to delays in the interaction between the agent and the system. These delays can vary stochastically over time and are typically unobservable, meaning they are unknown when deciding on an action. Existing methods deal with this uncertainty conservatively by assuming a known fixed upper bound on the delay, even if the delay is often much lower. In this work, we introduce the interaction layer, a general framework that enables agents to adaptively and seamlessly handle unobservable and time-varying delays. Specifically, the agent generates a matrix of possible future actions to handle both unpredictable delays and lost action packets sent over networks. Building on this framework, we develop a model-based algorithm, Actor-Critic with Delay Adaptation (ACDA), which dynamically adjusts to delay patterns. Our method significantly outperforms state-of-the-art approaches across a wide range of locomotion benchmark environments.
☆ Data Driven Approach to Input Shaping for Vibration Suppression in a Flexible Robot Arm
This paper presents a simple and effective method for setting parameters for an input shaper to suppress the residual vibrations in flexible robot arms using a data-driven approach. The parameters are adaptively tuned in the workspace of the robot by interpolating previously measured data of the robot's residual vibrations. Input shaping is a simple and robust technique to generate vibration-reduced shaped commands by a convolution of an impulse sequence with the desired input command. The generated impulses create waves in the material countering the natural vibrations of the system. The method is demonstrated with a flexible 3D-printed robot arm with multiple different materials, achieving a significant reduction in the residual vibrations.
comment: 6 pages, 11 figures, robosoft2025 conference
☆ Barrier Method for Inequality Constrained Factor Graph Optimization with Application to Model Predictive Control
Factor graphs have demonstrated remarkable efficiency for robotic perception tasks, particularly in localization and mapping applications. However, their application to optimal control problems -- especially Model Predictive Control (MPC) -- has remained limited due to fundamental challenges in constraint handling. This paper presents a novel integration of the Barrier Interior Point Method (BIPM) with factor graphs, implemented as an open-source extension to the widely adopted g2o framework. Our approach introduces specialized inequality factor nodes that encode logarithmic barrier functions, thereby overcoming the quadratic-form limitations of conventional factor graph formulations. To the best of our knowledge, this is the first g2o-based implementation capable of efficiently handling both equality and inequality constraints within a unified optimization backend. We validate the method through a multi-objective adaptive cruise control application for autonomous vehicles. Benchmark comparisons with state-of-the-art constraint-handling techniques demonstrate faster convergence and improved computational efficiency. (Code repository: https://github.com/snt-arg/bipm_g2o)
☆ ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes
Dexterous grasping in cluttered scenes presents significant challenges due to diverse object geometries, occlusions, and potential collisions. Existing methods primarily focus on single-object grasping or grasp-pose prediction without interaction, which are insufficient for complex, cluttered scenes. Recent vision-language-action models offer a potential solution but require extensive real-world demonstrations, making them costly and difficult to scale. To address these limitations, we revisit the sim-to-real transfer pipeline and develop key techniques that enable zero-shot deployment in reality while maintaining robust generalization. We propose ClutterDexGrasp, a two-stage teacher-student framework for closed-loop target-oriented dexterous grasping in cluttered scenes. The framework features a teacher policy trained in simulation using clutter density curriculum learning, incorporating both a novel geometry and spatially-embedded scene representation and a comprehensive safety curriculum, enabling general, dynamic, and safe grasping behaviors. Through imitation learning, we distill the teacher's knowledge into a student 3D diffusion policy (DP3) that operates on partial point cloud observations. To the best of our knowledge, this represents the first zero-shot sim-to-real closed-loop system for target-oriented dexterous grasping in cluttered scenes, demonstrating robust performance across diverse objects and layouts. More details and videos are available at https://clutterdexgrasp.github.io/.
☆ Socially Aware Robot Crowd Navigation via Online Uncertainty-Driven Risk Adaptation
Navigation in human-robot shared crowded environments remains challenging, as robots are expected to move efficiently while respecting human motion conventions. However, many existing approaches emphasize safety or efficiency while overlooking social awareness. This article proposes Learning-Risk Model Predictive Control (LR-MPC), a data-driven navigation algorithm that balances efficiency, safety, and social awareness. LR-MPC consists of two phases: an offline risk learning phase, where a Probabilistic Ensemble Neural Network (PENN) is trained using risk data from a heuristic MPC-based baseline (HR-MPC), and an online adaptive inference phase, where local waypoints are sampled and globally guided by a Multi-RRT planner. Each candidate waypoint is evaluated for risk by PENN, and predictions are filtered using epistemic and aleatoric uncertainty to ensure robust decision-making. The safest waypoint is selected as the MPC input for real-time navigation. Extensive experiments demonstrate that LR-MPC outperforms baseline methods in success rate and social awareness, enabling robots to navigate complex crowds with high adaptability and low disruption. A website about this work is available at https://sites.google.com/view/lr-mpc.
☆ Uncertainty-Driven Radar-Inertial Fusion for Instantaneous 3D Ego-Velocity Estimation
We present a method for estimating ego-velocity in autonomous navigation by integrating high-resolution imaging radar with an inertial measurement unit. The proposed approach addresses the limitations of traditional radar-based ego-motion estimation techniques by employing a neural network to process complex-valued raw radar data and estimate instantaneous linear ego-velocity along with its associated uncertainty. This uncertainty-aware velocity estimate is then integrated with inertial measurement unit data using an Extended Kalman Filter. The filter leverages the network-predicted uncertainty to refine the inertial sensor's noise and bias parameters, improving the overall robustness and accuracy of the ego-motion estimation. We evaluated the proposed method on the publicly available ColoRadar dataset. Our approach achieves significantly lower error compared to the closest publicly available method and also outperforms both instantaneous and scan matching-based techniques.
comment: This paper has been accepted for presentation at the 28th International Conference on Information Fusion (Fusion 2025)
☆ Steering Robots with Inference-Time Interactions
Imitation learning has driven the development of generalist policies capable of autonomously solving multiple tasks. However, when a pretrained policy makes errors during deployment, there are limited mechanisms for users to correct its behavior. While collecting additional data for finetuning can address such issues, doing so for each downstream use case is inefficient at deployment. My research proposes an alternative: keeping pretrained policies frozen as a fixed skill repertoire while allowing user interactions to guide behavior generation toward user preferences at inference time. By making pretrained policies steerable, users can help correct policy errors when the model struggles to generalize-without needing to finetune the policy. Specifically, I propose (1) inference-time steering, which leverages user interactions to switch between discrete skills, and (2) task and motion imitation, which enables user interactions to edit continuous motions while satisfying task constraints defined by discrete symbolic plans. These frameworks correct misaligned policy predictions without requiring additional training, maximizing the utility of pretrained models while achieving inference-time user objectives.
comment: MIT Robotics PhD Thesis
☆ Whole-Body Control Framework for Humanoid Robots with Heavy Limbs: A Model-Based Approach
Humanoid robots often face significant balance issues due to the motion of their heavy limbs. These challenges are particularly pronounced when attempting dynamic motion or operating in environments with irregular terrain. To address this challenge, this manuscript proposes a whole-body control framework for humanoid robots with heavy limbs, using a model-based approach that combines a kino-dynamics planner and a hierarchical optimization problem. The kino-dynamics planner is designed as a model predictive control (MPC) scheme to account for the impact of heavy limbs on mass and inertia distribution. By simplifying the robot's system dynamics and constraints, the planner enables real-time planning of motion and contact forces. The hierarchical optimization problem is formulated using Hierarchical Quadratic Programming (HQP) to minimize limb control errors and ensure compliance with the policy generated by the kino-dynamics planner. Experimental validation of the proposed framework demonstrates its effectiveness. The humanoid robot with heavy limbs controlled by the proposed framework can achieve dynamic walking speeds of up to 1.2~m/s, respond to external disturbances of up to 60~N, and maintain balance on challenging terrains such as uneven surfaces, and outdoor environments.
☆ Public Acceptance of Cybernetic Avatars in the service sector: Evidence from a Large-Scale Survey in Dubai
Cybernetic avatars are hybrid interaction robots or digital representations that combine autonomous capabilities with teleoperated control. This study investigates the acceptance of cybernetic avatars in the highly multicultural society of Dubai, with particular emphasis on robotic avatars for customer service. Specifically, we explore how acceptance varies as a function of robot appearance (e.g., android, robotic-looking, cartoonish), deployment settings (e.g., shopping malls, hotels, hospitals), and functional tasks (e.g., providing information, patrolling). To this end, we conducted a large-scale survey with over 1,000 participants. Overall, cybernetic avatars received a high level of acceptance, with physical robot avatars receiving higher acceptance than digital avatars. In terms of appearance, robot avatars with a highly anthropomorphic robotic appearance were the most accepted, followed by cartoonish designs and androids. Animal-like appearances received the lowest level of acceptance. Among the tasks, providing information and guidance was rated as the most valued. Shopping malls, airports, public transport stations, and museums were the settings with the highest acceptance, whereas healthcare-related spaces received lower levels of support. An analysis by community cluster revealed among others that Emirati respondents showed significantly greater acceptance of android appearances compared to the overall sample, while participants from the 'Other Asia' cluster were significantly more accepting of cartoonish appearances. Our study underscores the importance of incorporating citizen feedback into the design and deployment of cybernetic avatars from the early stages to enhance acceptance of this technology in society.
comment: 25 pages, 3 Figures
☆ Robust Adaptive Time-Varying Control Barrier Function with Application to Robotic Surface Treatment
Set invariance techniques such as control barrier functions (CBFs) can be used to enforce time-varying constraints such as keeping a safe distance from dynamic objects. However, existing methods for enforcing time-varying constraints often overlook model uncertainties. To address this issue, this paper proposes a CBFs-based robust adaptive controller design endowing time-varying constraints while considering parametric uncertainty and additive disturbances. To this end, we first leverage Robust adaptive Control Barrier Functions (RaCBFs) to handle model uncertainty, along with the concept of Input-to-State Safety (ISSf) to ensure robustness towards input disturbances. Furthermore, to alleviate the inherent conservatism in robustness, we also incorporate a set membership identification scheme. We demonstrate the proposed method on robotic surface treatment that requires time-varying force bounds to ensure uniform quality, in numerical simulation and real robotic setup, showing that the quality is formally guaranteed within an acceptable range.
comment: This work has been accepted to ECC 2025
☆ A Novel Indicator for Quantifying and Minimizing Information Utility Loss of Robot Teams
The timely exchange of information among robots within a team is vital, but it can be constrained by limited wireless capacity. The inability to deliver information promptly can result in estimation errors that impact collaborative efforts among robots. In this paper, we propose a new metric termed Loss of Information Utility (LoIU) to quantify the freshness and utility of information critical for cooperation. The metric enables robots to prioritize information transmissions within bandwidth constraints. We also propose the estimation of LoIU using belief distributions and accordingly optimize both transmission schedule and resource allocation strategy for device-to-device transmissions to minimize the time-average LoIU within a robot team. A semi-decentralized Multi-Agent Deep Deterministic Policy Gradient framework is developed, where each robot functions as an actor responsible for scheduling transmissions among its collaborators while a central critic periodically evaluates and refines the actors in response to mobility and interference. Simulations validate the effectiveness of our approach, demonstrating an enhancement of information freshness and utility by 98%, compared to alternative methods.
☆ Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments
Large Vision-Language Models (VLMs) have demonstrated potential in enhancing mobile robot navigation in human-centric environments by understanding contextual cues, human intentions, and social dynamics while exhibiting reasoning capabilities. However, their computational complexity and limited sensitivity to continuous numerical data impede real-time performance and precise motion control. To this end, we propose Narrate2Nav, a novel real-time vision-action model that leverages a novel self-supervised learning framework based on the Barlow Twins redundancy reduction loss to embed implicit natural language reasoning, social cues, and human intentions within a visual encoder-enabling reasoning in the model's latent space rather than token space. The model combines RGB inputs, motion commands, and textual signals of scene context during training to bridge from robot observations to low-level motion commands for short-horizon point-goal navigation during deployment. Extensive evaluation of Narrate2Nav across various challenging scenarios in both offline unseen dataset and real-world experiments demonstrates an overall improvement of 52.94 percent and 41.67 percent, respectively, over the next best baseline. Additionally, qualitative comparative analysis of Narrate2Nav's visual encoder attention map against four other baselines demonstrates enhanced attention to navigation-critical scene elements, underscoring its effectiveness in human-centric navigation tasks.
☆ Pose State Perception of Interventional Robot for Cardio-cerebrovascular Procedures
In response to the increasing demand for cardiocerebrovascular interventional surgeries, precise control of interventional robots has become increasingly important. Within these complex vascular scenarios, the accurate and reliable perception of the pose state for interventional robots is particularly crucial. This paper presents a novel vision-based approach without the need of additional sensors or markers. The core of this paper's method consists of a three-part framework: firstly, a dual-head multitask U-Net model for simultaneous vessel segment and interventional robot detection; secondly, an advanced algorithm for skeleton extraction and optimization; and finally, a comprehensive pose state perception system based on geometric features is implemented to accurately identify the robot's pose state and provide strategies for subsequent control. The experimental results demonstrate the proposed method's high reliability and accuracy in trajectory tracking and pose state perception.
☆ AMPLIFY: Actionless Motion Priors for Robot Learning from Videos
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at https://amplify-robotics.github.io/.
☆ Hard Contacts with Soft Gradients: Refining Differentiable Simulators for Learning and Control
Contact forces pose a major challenge for gradient-based optimization of robot dynamics as they introduce jumps in the system's velocities. Penalty-based simulators, such as MuJoCo, simplify gradient computation by softening the contact forces. However, realistically simulating hard contacts requires very stiff contact settings, which leads to incorrect gradients when using automatic differentiation. On the other hand, using non-stiff settings strongly increases the sim-to-real gap. We analyze the contact computation of penalty-based simulators to identify the causes of gradient errors. Then, we propose DiffMJX, which combines adaptive integration with MuJoCo XLA, to notably improve gradient quality in the presence of hard contacts. Finally, we address a key limitation of contact gradients: they vanish when objects do not touch. To overcome this, we introduce Contacts From Distance (CFD), a mechanism that enables the simulator to generate informative contact gradients even before objects are in contact. To preserve physical realism, we apply CFD only in the backward pass using a straight-through trick, allowing us to compute useful gradients without modifying the forward simulation.
☆ Non-Overlap-Aware Egocentric Pose Estimation for Collaborative Perception in Connected Autonomy IROS 2025
Egocentric pose estimation is a fundamental capability for multi-robot collaborative perception in connected autonomy, such as connected autonomous vehicles. During multi-robot operations, a robot needs to know the relative pose between itself and its teammates with respect to its own coordinates. However, different robots usually observe completely different views that contains similar objects, which leads to wrong pose estimation. In addition, it is unrealistic to allow robots to share their raw observations to detect overlap due to the limited communication bandwidth constraint. In this paper, we introduce a novel method for Non-Overlap-Aware Egocentric Pose Estimation (NOPE), which performs egocentric pose estimation in a multi-robot team while identifying the non-overlap views and satifying the communication bandwidth constraint. NOPE is built upon an unified hierarchical learning framework that integrates two levels of robot learning: (1) high-level deep graph matching for correspondence identification, which allows to identify if two views are overlapping or not, (2) low-level position-aware cross-attention graph learning for egocentric pose estimation. To evaluate NOPE, we conduct extensive experiments in both high-fidelity simulation and real-world scenarios. Experimental results have demonstrated that NOPE enables the novel capability for non-overlapping-aware egocentric pose estimation and achieves state-of-art performance compared with the existing methods. Our project page at https://hongh0.github.io/NOPE/.
comment: IROS 2025
☆ TACS-Graphs: Traversability-Aware Consistent Scene Graphs for Ground Robot Indoor Localization and Mapping IROS 2025
Scene graphs have emerged as a powerful tool for robots, providing a structured representation of spatial and semantic relationships for advanced task planning. Despite their potential, conventional 3D indoor scene graphs face critical limitations, particularly under- and over-segmentation of room layers in structurally complex environments. Under-segmentation misclassifies non-traversable areas as part of a room, often in open spaces, while over-segmentation fragments a single room into overlapping segments in complex environments. These issues stem from naive voxel-based map representations that rely solely on geometric proximity, disregarding the structural constraints of traversable spaces and resulting in inconsistent room layers within scene graphs. To the best of our knowledge, this work is the first to tackle segmentation inconsistency as a challenge and address it with Traversability-Aware Consistent Scene Graphs (TACS-Graphs), a novel framework that integrates ground robot traversability with room segmentation. By leveraging traversability as a key factor in defining room boundaries, the proposed method achieves a more semantically meaningful and topologically coherent segmentation, effectively mitigating the inaccuracies of voxel-based scene graph approaches in complex environments. Furthermore, the enhanced segmentation consistency improves loop closure detection efficiency in the proposed Consistent Scene Graph-leveraging Loop Closure Detection (CoSG-LCD) leading to higher pose estimation accuracy. Experimental results confirm that the proposed approach outperforms state-of-the-art methods in terms of scene graph consistency and pose graph optimization performance.
comment: Accepted by IROS 2025
☆ Lasso Gripper: A String Shooting-Retracting Mechanism for Shape-Adaptive Grasping
Handling oversized, variable-shaped, or delicate objects in transportation, grasping tasks is extremely challenging, mainly due to the limitations of the gripper's shape and size. This paper proposes a novel gripper, Lasso Gripper. Inspired by traditional tools like the lasso and the uurga, Lasso Gripper captures objects by launching and retracting a string. Contrary to antipodal grippers, which concentrate force on a limited area, Lasso Gripper applies uniform pressure along the length of the string for a more gentle grasp. The gripper is controlled by four motors-two for launching the string inward and two for launching it outward. By adjusting motor speeds, the size of the string loop can be tuned to accommodate objects of varying sizes, eliminating the limitations imposed by the maximum gripper separation distance. To address the issue of string tangling during rapid retraction, a specialized mechanism was incorporated. Additionally, a dynamic model was developed to estimate the string's curve, providing a foundation for the kinematic analysis of the workspace. In grasping experiments, Lasso Gripper, mounted on a robotic arm, successfully captured and transported a range of objects, including bull and horse figures as well as delicate vegetables. The demonstration video is available here: https://youtu.be/PV1J76mNP9Y.
comment: 6 pages, 13 figures
☆ GAF: Gaussian Action Field as a Dvnamic World Model for Robotic Mlanipulation
Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we propose a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing simultaneous modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF supports three key query types: reconstruction of the current scene, prediction of future frames, and estimation of initial action via robot motion. Furthermore, the high-quality current and future frames generated by GAF facilitate manipulation action refinement through a GAF-guided diffusion model. Extensive experiments demonstrate significant improvements, with GAF achieving +11.5385 dB PSNR and -0.5574 LPIPS improvements in reconstruction quality, while boosting the average success rate in robotic manipulation tasks by 10.33% over state-of-the-art methods. Project page: http://chaiying1.github.io/GAF.github.io/project_page/
comment: http://chaiying1.github.io/GAF.github.io/project_page/
☆ KDMOS:Knowledge Distillation for Motion Segmentation
Motion Object Segmentation (MOS) is crucial for autonomous driving, as it enhances localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real-time inference remains a challenge. To address this, we propose a logits-based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real-time efficiency. Specifically, we adopt a Bird's Eye View (BEV) projection-based model as the student and a non-projection model as the teacher. To handle the severe imbalance between moving and non-moving classes, we decouple them and apply tailored distillation strategies, allowing the teacher model to better learn key motion-related features. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling, optimize the network architecture, and achieve a 7.69% reduction in parameter count, mitigating overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and delivers competitive results on the Apollo dataset. The KDMOS implementation is available at https://github.com/SCNU-RISLAB/KDMOS.
☆ Haptic-Based User Authentication for Tele-robotic System
Tele-operated robots rely on real-time user behavior mapping for remote tasks, but ensuring secure authentication remains a challenge. Traditional methods, such as passwords and static biometrics, are vulnerable to spoofing and replay attacks, particularly in high-stakes, continuous interactions. This paper presents a novel anti-spoofing and anti-replay authentication approach that leverages distinctive user behavioral features extracted from haptic feedback during human-robot interactions. To evaluate our authentication approach, we collected a time-series force feedback dataset from 15 participants performing seven distinct tasks. We then developed a transformer-based deep learning model to extract temporal features from the haptic signals. By analyzing user-specific force dynamics, our method achieves over 90 percent accuracy in both user identification and task classification, demonstrating its potential for enhancing access control and identity assurance in tele-robotic systems.
☆ A Hierarchical Test Platform for Vision Language Model (VLM)-Integrated Real-World Autonomous Driving
Vision-Language Models (VLMs) have demonstrated notable promise in autonomous driving by offering the potential for multimodal reasoning through pretraining on extensive image-text pairs. However, adapting these models from broad web-scale data to the safety-critical context of driving presents a significant challenge, commonly referred to as domain shift. Existing simulation-based and dataset-driven evaluation methods, although valuable, often fail to capture the full complexity of real-world scenarios and cannot easily accommodate repeatable closed-loop testing with flexible scenario manipulation. In this paper, we introduce a hierarchical real-world test platform specifically designed to evaluate VLM-integrated autonomous driving systems. Our approach includes a modular, low-latency on-vehicle middleware that allows seamless incorporation of various VLMs, a clearly separated perception-planning-control architecture that can accommodate both VLM-based and conventional modules, and a configurable suite of real-world testing scenarios on a closed track that facilitates controlled yet authentic evaluations. We demonstrate the effectiveness of the proposed platform`s testing and evaluation ability with a case study involving a VLM-enabled autonomous vehicle, highlighting how our test framework supports robust experimentation under diverse conditions.
☆ ReLCP: Scalable Complementarity-Based Collision Resolution for Smooth Rigid Bodies
We present a complementarity-based collision resolution algorithm for smooth, non-spherical, rigid bodies. Unlike discrete surface representation approaches, which approximate surfaces using discrete elements (e.g., tessellations or sub-spheres) with constraints between nearby faces, edges, nodes, or sub-objects, our algorithm solves a recursively generated linear complementarity problem (ReLCP) to adaptively identify potential collision locations during the collision resolution procedure. Despite adaptively and in contrast to Newton-esque schemes, we prove conditions under which the resulting solution exists and the center of mass translational and rotational dynamics are unique. Our ReLCP also converges to classical LCP-based collision resolution for sufficiently small timesteps. Because increasing the surface resolution in discrete representation methods necessitates subdividing geometry into finer elements -- leading to a super-linear increase in the number of collision constraints -- these approaches scale poorly with increased surface resolution. In contrast, our adaptive ReLCP framework begins with a single constraint per pair of nearby bodies and introduces new constraints only when unconstrained motion would lead to overlap, circumventing the oversampling required by discrete methods. By requiring one to two orders of magnitude fewer collision constraints to achieve the same surface resolution, we observe 10-100x speedup in densely packed applications. We validate our ReLCP method against multisphere and single-constraint methods, comparing convergence in a two-ellipsoid collision test, scalability and performance in a compacting ellipsoid suspension and growing bacterial colony, and stability in a taut chainmail network, highlighting our ability to achieve high-fidelity surface representations without suffering from poor scalability or artificial surface roughness.
☆ Context Matters: Learning Generalizable Rewards via Calibrated Features
A key challenge in reward learning from human input is that desired agent behavior often changes based on context. Traditional methods typically treat each new context as a separate task with its own reward function. For example, if a previously ignored stove becomes too hot to be around, the robot must learn a new reward from scratch, even though the underlying preference for prioritizing safety over efficiency remains unchanged. We observe that context influences not the underlying preference itself, but rather the $\textit{saliency}$--or importance--of reward features. For instance, stove heat affects the importance of the robot's proximity, yet the human's safety preference stays the same. Existing multi-task and meta IRL methods learn context-dependent representations $\textit{implicitly}$--without distinguishing between preferences and feature importance--resulting in substantial data requirements. Instead, we propose $\textit{explicitly}$ modeling context-invariant preferences separately from context-dependent feature saliency, creating modular reward representations that adapt to new contexts. To achieve this, we introduce $\textit{calibrated features}$--representations that capture contextual effects on feature saliency--and present specialized paired comparison queries that isolate saliency from preference for efficient learning. Experiments with simulated users show our method significantly improves sample efficiency, requiring 10x fewer preference queries than baselines to achieve equivalent reward accuracy, with up to 15% better performance in low-data regimes (5-10 queries). An in-person user study (N=12) demonstrates that participants can effectively teach their unique personal contextual preferences using our method, enabling more adaptable and personalized reward learning.
comment: 30 pages, 21 figures
☆ Six-DoF Hand-Based Teleoperation for Omnidirectional Aerial Robots IROS 2025
Omnidirectional aerial robots offer full 6-DoF independent control over position and orientation, making them popular for aerial manipulation. Although advancements in robotic autonomy, operating by human remains essential in complex aerial environments. Existing teleoperation approaches for multirotors fail to fully leverage the additional DoFs provided by omnidirectional rotation. Additionally, the dexterity of human fingers should be exploited for more engaged interaction. In this work, we propose an aerial teleoperation system that brings the omnidirectionality of human hands into the unbounded aerial workspace. Our system includes two motion-tracking marker sets -- one on the shoulder and one on the hand -- along with a data glove to capture hand gestures. Using these inputs, we design four interaction modes for different tasks, including Spherical Mode and Cartesian Mode for long-range moving as well as Operation Mode and Locking Mode for precise manipulation, where the hand gestures are utilized for seamless mode switching. We evaluate our system on a valve-turning task in real world, demonstrating how each mode contributes to effective aerial manipulation. This interaction framework bridges human dexterity with aerial robotics, paving the way for enhanced teleoperated aerial manipulation in unstructured environments.
comment: 7 pages, 9 figures. This work has been accepted to IROS 2025. The video will be released soon
☆ Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors
Compliance is a critical parameter for describing objects in engineering, agriculture, and biomedical applications. Traditional compliance detection methods are limited by their lack of portability and scalability, rely on specialized, often expensive equipment, and are unsuitable for robotic applications. Moreover, existing neural network-based approaches using vision-based tactile sensors still suffer from insufficient prediction accuracy. In this paper, we propose two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures that leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. We validate the performance of these models using multiple metrics and demonstrate their effectiveness in accurately estimating compliance. The proposed models exhibit significant performance improvement over the baseline. Additionally, we investigated the correlation between sensor compliance and object compliance estimation, which revealed that objects that are harder than the sensor are more challenging to estimate.
comment: Accepted in the IEEE International Conference on Development and Learning (ICDL). The paper contains 8 pages and 7 figures
☆ Time-Optimized Safe Navigation in Unstructured Environments through Learning Based Depth Completion
Quadrotors hold significant promise for several applications such as agriculture, search and rescue, and infrastructure inspection. Achieving autonomous operation requires systems to navigate safely through complex and unfamiliar environments. This level of autonomy is particularly challenging due to the complexity of such environments and the need for real-time decision making especially for platforms constrained by size, weight, and power (SWaP), which limits flight time and precludes the use of bulky sensors like Light Detection and Ranging (LiDAR) for mapping. Furthermore, computing globally optimal, collision-free paths and translating them into time-optimized, safe trajectories in real time adds significant computational complexity. To address these challenges, we present a fully onboard, real-time navigation system that relies solely on lightweight onboard sensors. Our system constructs a dense 3D map of the environment using a novel visual depth estimation approach that fuses stereo and monocular learning-based depth, yielding longer-range, denser, and less noisy depth maps than conventional stereo methods. Building on this map, we introduce a novel planning and trajectory generation framework capable of rapidly computing time-optimal global trajectories. As the map is incrementally updated with new depth information, our system continuously refines the trajectory to maintain safety and optimality. Both our planner and trajectory generator outperforms state-of-the-art methods in terms of computational efficiency and guarantee obstacle-free trajectories. We validate our system through robust autonomous flight experiments in diverse indoor and outdoor environments, demonstrating its effectiveness for safe navigation in previously unknown settings.
☆ FEAST: A Flexible Mealtime-Assistance System Towards In-the-Wild Personalization
Physical caregiving robots hold promise for improving the quality of life of millions worldwide who require assistance with feeding. However, in-home meal assistance remains challenging due to the diversity of activities (e.g., eating, drinking, mouth wiping), contexts (e.g., socializing, watching TV), food items, and user preferences that arise during deployment. In this work, we propose FEAST, a flexible mealtime-assistance system that can be personalized in-the-wild to meet the unique needs of individual care recipients. Developed in collaboration with two community researchers and informed by a formative study with a diverse group of care recipients, our system is guided by three key tenets for in-the-wild personalization: adaptability, transparency, and safety. FEAST embodies these principles through: (i) modular hardware that enables switching between assisted feeding, drinking, and mouth-wiping, (ii) diverse interaction methods, including a web interface, head gestures, and physical buttons, to accommodate diverse functional abilities and preferences, and (iii) parameterized behavior trees that can be safely and transparently adapted using a large language model. We evaluate our system based on the personalization requirements identified in our formative study, demonstrating that FEAST offers a wide range of transparent and safe adaptations and outperforms a state-of-the-art baseline limited to fixed customizations. To demonstrate real-world applicability, we conduct an in-home user study with two care recipients (who are community researchers), feeding them three meals each across three diverse scenarios. We further assess FEAST's ecological validity by evaluating with an Occupational Therapist previously unfamiliar with the system. In all cases, users successfully personalize FEAST to meet their individual needs and preferences. Website: https://emprise.cs.cornell.edu/feast
comment: RSS 2025 - Outstanding Paper Award & Outstanding Systems Paper Award Finalist
☆ Efficient and Real-Time Motion Planning for Robotics Using Projection-Based Optimization IROS 2025
Generating motions for robots interacting with objects of various shapes is a complex challenge, further complicated by the robot geometry and multiple desired behaviors. While current robot programming tools (such as inverse kinematics, collision avoidance, and manipulation planning) often treat these problems as constrained optimization, many existing solvers focus on specific problem domains or do not exploit geometric constraints effectively. We propose an efficient first-order method, Augmented Lagrangian Spectral Projected Gradient Descent (ALSPG), which leverages geometric projections via Euclidean projections, Minkowski sums, and basis functions. We show that by using geometric constraints rather than full constraints and gradients, ALSPG significantly improves real-time performance. Compared to second-order methods like iLQR, ALSPG remains competitive in the unconstrained case. We validate our method through toy examples and extensive simulations, and demonstrate its effectiveness on a 7-axis Franka robot, a 6-axis P-Rob robot and a 1:10 scale car in real-world experiments. Source codes, experimental data and videos are available on the project webpage: https://sites.google.com/view/alspg-oc
comment: submitted to IROS 2025
☆ Towards Perception-based Collision Avoidance for UAVs when Guiding the Visually Impaired IROS
Autonomous navigation by drones using onboard sensors combined with machine learning and computer vision algorithms is impacting a number of domains, including agriculture, logistics, and disaster management. In this paper, we examine the use of drones for assisting visually impaired people (VIPs) in navigating through outdoor urban environments. Specifically, we present a perception-based path planning system for local planning around the neighborhood of the VIP, integrated with a global planner based on GPS and maps for coarse planning. We represent the problem using a geometric formulation and propose a multi DNN based framework for obstacle avoidance of the UAV as well as the VIP. Our evaluations conducted on a drone human system in a university campus environment verifies the feasibility of our algorithms in three scenarios; when the VIP walks on a footpath, near parked vehicles, and in a crowded street.
comment: 16 pages, 7 figures; Accepted as Late-Breaking Results at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2023
☆ Feedback-MPPI: Fast Sampling-Based MPC via Rollout Differentiation -- Adios low-level controllers
Model Predictive Path Integral control is a powerful sampling-based approach suitable for complex robotic tasks due to its flexibility in handling nonlinear dynamics and non-convex costs. However, its applicability in real-time, highfrequency robotic control scenarios is limited by computational demands. This paper introduces Feedback-MPPI (F-MPPI), a novel framework that augments standard MPPI by computing local linear feedback gains derived from sensitivity analysis inspired by Riccati-based feedback used in gradient-based MPC. These gains allow for rapid closed-loop corrections around the current state without requiring full re-optimization at each timestep. We demonstrate the effectiveness of F-MPPI through simulations and real-world experiments on two robotic platforms: a quadrupedal robot performing dynamic locomotion on uneven terrain and a quadrotor executing aggressive maneuvers with onboard computation. Results illustrate that incorporating local feedback significantly improves control performance and stability, enabling robust, high-frequency operation suitable for complex robotic systems.
♻ ☆ Opt2Skill: Imitating Dynamically-feasible Whole-Body Trajectories for Versatile Humanoid Loco-Manipulation
Humanoid robots are designed to perform diverse loco-manipulation tasks. However, they face challenges due to their high-dimensional and unstable dynamics, as well as the complex contact-rich nature of the tasks. Model-based optimal control methods offer flexibility to define precise motion but are limited by high computational complexity and accurate contact sensing. On the other hand, reinforcement learning (RL) handles high-dimensional spaces with strong robustness but suffers from inefficient learning, unnatural motion, and sim-to-real gaps. To address these challenges, we introduce Opt2Skill, an end-to-end pipeline that combines model-based trajectory optimization with RL to achieve robust whole-body loco-manipulation. Opt2Skill generates dynamic feasible and contact-consistent reference motions for the Digit humanoid robot using differential dynamic programming (DDP) and trains RL policies to track these optimal trajectories. Our results demonstrate that Opt2Skill outperforms baselines that rely on human demonstrations and inverse kinematics-based references, both in motion tracking and task success rates. Furthermore, we show that incorporating trajectories with torque information improves contact force tracking in contact-involved tasks, such as wiping a table. We have successfully transferred our approach to real-world applications.
♻ ☆ IKDiffuser: Fast and Diverse Inverse Kinematics Solution Generation for Multi-arm Robotic Systems
Solving Inverse Kinematics (IK) problems is fundamental to robotics, but has primarily been successful with single serial manipulators. For multi-arm robotic systems, IK remains challenging due to complex self-collisions, coupled joints, and high-dimensional redundancy. These complexities make traditional IK solvers slow, prone to failure, and lacking in solution diversity. In this paper, we present IKDiffuser, a diffusion-based model designed for fast and diverse IK solution generation for multi-arm robotic systems. IKDiffuser learns the joint distribution over the configuration space, capturing complex dependencies and enabling seamless generalization to multi-arm robotic systems of different structures. In addition, IKDiffuser can incorporate additional objectives during inference without retraining, offering versatility and adaptability for task-specific requirements. In experiments on 6 different multi-arm systems, the proposed IKDiffuser achieves superior solution accuracy, precision, diversity, and computational efficiency compared to existing solvers. The proposed IKDiffuser framework offers a scalable, unified approach to solving multi-arm IK problems, facilitating the potential of multi-arm robotic systems in real-time manipulation tasks.
comment: under review
♻ ☆ LBAP: Improved Uncertainty Alignment of LLM Planners using Bayesian Inference
Large language models (LLMs) showcase many desirable traits for intelligent and helpful robots. However, they are also known to hallucinate predictions. This issue is exacerbated in robotics where LLM hallucinations may result in robots confidently executing plans that are contrary to user goals or relying more frequently on human assistance. In this work, we present LBAP, a novel approach for utilizing off-the-shelf LLMs, alongside Bayesian inference for uncertainty Alignment in robotic Planners that minimizes hallucinations and human intervention. Our key finding is that we can use Bayesian inference to more accurately calibrate a robots confidence measure through accounting for both scene grounding and world knowledge. This process allows us to mitigate hallucinations and better align the LLM's confidence measure with the probability of success. Through experiments in both simulation and the real world on tasks with a variety of ambiguities, we show that LBAP significantly increases success rate and decreases the amount of human intervention required relative to prior art. For example, in our real-world testing paradigm, LBAP decreases the human help rate of previous methods by over 33% at a success rate of 70%.
♻ ☆ Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models AAAI'25
Large Language Models (LLMs) such as GPT-4, trained on huge amount of datasets spanning multiple domains, exhibit significant reasoning, understanding, and planning capabilities across various tasks. This study presents the first-ever work in Arabic language integration within the Vision-and-Language Navigation (VLN) domain in robotics, an area that has been notably underexplored in existing research. We perform a comprehensive evaluation of state-of-the-art multi-lingual Small Language Models (SLMs), including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the Arabic-centric LLM, Jais. Our approach utilizes the NavGPT framework, a pure LLM-based instruction-following navigation agent, to assess the impact of language on navigation reasoning through zero-shot sequential action prediction using the R2R dataset. Through comprehensive experiments, we demonstrate that our framework is capable of high-level planning for navigation tasks when provided with instructions in both English and Arabic. However, certain models struggled with reasoning and planning in the Arabic language due to inherent limitations in their capabilities, sub-optimal performance, and parsing issues. These findings highlight the importance of enhancing planning and reasoning capabilities in language models for effective navigation, emphasizing this as a key area for further development while also unlocking the potential of Arabic-language models for impactful real-world applications.
comment: This work has been accepted for presentation at LM4Plan@AAAI'25. For more details, please check: https://llmforplanning.github.io/
♻ ☆ ClearDepth: Enhanced Stereo Perception of Transparent Objects for Robotic Manipulation
Transparent object depth perception poses a challenge in everyday life and logistics, primarily due to the inability of standard 3D sensors to accurately capture depth on transparent or reflective surfaces. This limitation significantly affects depth map and point cloud-reliant applications, especially in robotic manipulation. We developed a vision transformer-based algorithm for stereo depth recovery of transparent objects. This approach is complemented by an innovative feature post-fusion module, which enhances the accuracy of depth recovery by structural features in images. To address the high costs associated with dataset collection for stereo camera-based perception of transparent objects, our method incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation, accelerated by AI algorithm. Our experimental results demonstrate the model's exceptional Sim2Real generalizability in real-world scenarios, enabling precise depth mapping of transparent objects to assist in robotic manipulation. Project details are available at https://sites.google.com/view/cleardepth/ .
comment: 7 pages, 7 figures
♻ ☆ AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment
Current service robots suffer from limited natural language communication abilities, heavy reliance on predefined commands, ongoing human intervention, and, most notably, a lack of proactive collaboration awareness in human-populated environments. This results in narrow applicability and low utility. In this paper, we introduce AssistantX, an LLM-powered proactive assistant designed for autonomous operation in realworld scenarios with high accuracy. AssistantX employs a multi-agent framework consisting of 4 specialized LLM agents, each dedicated to perception, planning, decision-making, and reflective review, facilitating advanced inference capabilities and comprehensive collaboration awareness, much like a human assistant by your side. We built a dataset of 210 real-world tasks to validate AssistantX, which includes instruction content and status information on whether relevant personnel are available. Extensive experiments were conducted in both text-based simulations and a real office environment over the course of a month and a half. Our experiments demonstrate the effectiveness of the proposed framework, showing that AssistantX can reactively respond to user instructions, actively adjust strategies to adapt to contingencies, and proactively seek assistance from humans to ensure successful task completion. More details and videos can be found at https://assistantx-agent. github.io/AssistantX/.
comment: 8 pages, 10 figures, 6 tables
♻ ☆ Human-robot collaborative transport personalization via Dynamic Movement Primitives and velocity scaling
Nowadays, industries are showing a growing interest in human-robot collaboration, particularly for shared tasks. This requires intelligent strategies to plan a robot's motions, considering both task constraints and human-specific factors such as height and movement preferences. This work introduces a novel approach to generate personalized trajectories using Dynamic Movement Primitives (DMPs), enhanced with real-time velocity scaling based on human feedback. The method was rigorously tested in industrial-grade experiments, focusing on the collaborative transport of an engine cowl lip section. Comparative analysis between DMP-generated trajectories and a state-of-the-art motion planner (BiTRRT) highlights their adaptability combined with velocity scaling. Subjective user feedback further demonstrates a clear preference for DMP- based interactions. Objective evaluations, including physiological measurements from brain and skin activity, reinforce these findings, showcasing the advantages of DMPs in enhancing human-robot interaction and improving user experience.
♻ ☆ Design and Evaluation of an Uncertainty-Aware Shared-Autonomy System with Hierarchical Conservative Skill Inference
Shared-autonomy imitation learning lets a human correct a robot in real time, mitigating covariate-shift errors. Yet existing approaches ignore two critical factors: (i) the operator's cognitive load and (ii) the risk created by delayed or erroneous interventions. We present an uncertainty-aware shared-autonomy system in which the robot modulates its behaviour according to a learned estimate of latent-space skill uncertainty. A hierarchical policy first infers a conservative skill embedding and then decodes it into low-level actions, enabling rapid task execution while automatically slowing down when uncertainty is high. We detail a full, open-source VR-teleoperation pipeline that is compatible with multi-configuration manipulators such as UR-series arms. Experiments on pouring and pick-and-place tasks demonstrate 70-90% success in dynamic scenes with moving targets, and a qualitative study shows a marked reduction in collision events compared with a non-conservative baseline. Although a dedicated ablation that isolates uncertainty is impractical on hardware for safety and cost reasons, the reported gains in stability and operator workload already validate the design and motivate future large-scale studies.
comment: ArXiv 2024
♻ ☆ DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation CVPR 2025
Dexterous manipulation with contact-rich interactions is crucial for advanced robotics. While recent diffusion-based planning approaches show promise for simple manipulation tasks, they often produce unrealistic ghost states (e.g., the object automatically moves without hand contact) or lack adaptability when handling complex sequential interactions. In this work, we introduce DexHandDiff, an interaction-aware diffusion planning framework for adaptive dexterous manipulation. DexHandDiff models joint state-action dynamics through a dual-phase diffusion process which consists of pre-interaction contact alignment and post-contact goal-directed control, enabling goal-adaptive generalizable dexterous manipulation. Additionally, we incorporate dynamics model-based dual guidance and leverage large language models for automated guidance function generation, enhancing generalizability for physical interactions and facilitating diverse goal adaptation through language cues. Experiments on physical interaction tasks such as door opening, pen and block re-orientation, object relocation, and hammer striking demonstrate DexHandDiff's effectiveness on goals outside training distributions, achieving over twice the average success rate (59.2% vs. 29.5%) compared to existing methods. Our framework achieves an average of 70.7% success rate on goal adaptive dexterous tasks, highlighting its robustness and flexibility in contact-rich manipulation.
comment: Accepted by CVPR 2025. Camera ready version. Previous DexDiffuser. Project page: https://dexdiffuser.github.io/
♻ ☆ Sketch-Plan-Generalize: Learning and Planning with Neuro-Symbolic Programmatic Representations for Inductive Spatial Concepts ICML 2025
Effective human-robot collaboration requires the ability to learn personalized concepts from a limited number of demonstrations, while exhibiting inductive generalization, hierarchical composition, and adaptability to novel constraints. Existing approaches that use code generation capabilities of pre-trained large (vision) language models as well as purely neural models show poor generalization to \emph{a-priori} unseen complex concepts. Neuro-symbolic methods (Grand et al., 2023) offer a promising alternative by searching in program space, but face challenges in large program spaces due to the inability to effectively guide the search using demonstrations. Our key insight is to factor inductive concept learning as: (i) {\it Sketch:} detecting and inferring a coarse signature of a new concept (ii) {\it Plan:} performing an MCTS search over grounded action sequences guided by human demonstrations (iii) {\it Generalize:} abstracting out grounded plans as inductive programs. Our pipeline facilitates generalization and modular re-use, enabling continual concept learning. Our approach combines the benefits of code generation ability of large language models (LLMs) along with grounded neural representations, resulting in neuro-symbolic programs that show stronger inductive generalization on the task of constructing complex structures vis-\'a-vis LLM-only and purely neural approaches. Further, we demonstrate reasoning and planning capabilities with learned concepts for embodied instruction following.
comment: Programmatic Representations for Agent Learning Worskop, ICML 2025
♻ ☆ SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning
Research on autonomous surgery has largely focused on simple task automation in controlled environments. However, real-world surgical applications demand dexterous manipulation over extended durations and robust generalization to the inherent variability of human tissue. These challenges remain difficult to address using existing logic-based or conventional end-to-end learning strategies. To address this gap, we propose a hierarchical framework for performing dexterous, long-horizon surgical steps. Our approach utilizes a high-level policy for task planning and a low-level policy for generating low-level trajectories. The high-level planner plans in language space, generating task or corrective instructions to guide the robot through the long-horizon steps and correct for the low-level policy's errors. We validate our framework through ex vivo experiments on cholecystectomy, a commonly-practiced minimally invasive procedure, and conduct ablation studies to evaluate key components of the system. Our method achieves a 100% success rate across n=8 different ex vivo gallbladders, operating fully autonomously without human intervention. The hierarchical approach improves the policy's ability to recover from suboptimal states that are inevitable in the highly dynamic environment of realistic surgical applications. This work demonstrates step-level autonomy in a surgical procedure, marking a milestone toward clinical deployment of autonomous surgical systems.
♻ ☆ Fast Contact Detection via Fusion of Joint and Inertial Sensors for Parallel Robots in Human-Robot Collaboration
Fast contact detection is crucial for safe human-robot collaboration. Observers based on proprioceptive information can be used for contact detection but have first-order error dynamics, which results in delays. Sensor fusion based on inertial measurement units (IMUs) consisting of accelerometers and gyroscopes is advantageous for reducing delays. The acceleration estimation enables the direct calculation of external forces. For serial robots, the installation of multiple accelerometers and gyroscopes is required for dynamics modeling since the joint coordinates are the minimal coordinates. Alternatively, parallel robots (PRs) offer the potential to use only one IMU on the end-effector platform, which already presents the minimal coordinates of the PR. This work introduces a sensor-fusion method for contact detection using encoders and only one low-cost, consumer-grade IMU for a PR. The end-effector accelerations are estimated by an extended Kalman filter and incorporated into the dynamics to calculate external forces. In real-world experiments with a planar PR, we demonstrate that this approach reduces the detection duration by up to 50% compared to a momentum observer and enables the collision and clamping detection within 3-39ms.
comment: Preprint of a publication accepted for IEEE Robotics and Automation Letters
♻ ☆ H$^3$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce $\textbf{Triply-Hierarchical Diffusion Policy}~(\textbf{H$^{\mathbf{3}}$DP})$, a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H$^{3}$DP contains $\mathbf{3}$ levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H$^{3}$DP yields a $\mathbf{+27.5\%}$ average relative improvement over baselines across $\mathbf{44}$ simulation tasks and achieves superior performance in $\mathbf{4}$ challenging bimanual real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/.
♻ ☆ Hierarchical Intention Tracking with Switching Trees for Real-Time Adaptation to Dynamic Human Intentions during Collaboration
During collaborative tasks, human behavior is guided by multiple levels of intentions that evolve over time, such as task sequence preferences and interaction strategies. To adapt to these changing preferences and promptly correct any inaccurate estimations, collaborative robots must accurately track these dynamic human intentions in real time. We propose a Hierarchical Intention Tracking (HIT) algorithm for collaborative robots to track dynamic and hierarchical human intentions effectively in real time. HIT represents human intentions as intention trees with arbitrary depth, and probabilistically tracks human intentions by Bayesian filtering, upward measurement propagation, and downward posterior propagation across all levels. We develop a HIT-based robotic system that dynamically switches between Interaction-Task and Verification-Task trees for a collaborative assembly task, allowing the robot to effectively coordinate human intentions at three levels: task-level (subtask goal locations), interaction-level (mode of engagement with the robot), and verification-level (confirming or correcting intention recognition). Our user study shows that our HIT-based collaborative robot system surpasses existing collaborative robot solutions by achieving a balance between efficiency, physical workload, and user comfort while ensuring safety and task completion. Post-experiment surveys further reveal that the HIT-based system enhances the user trust and minimizes interruptions to user's task flow through its effective understanding of human intentions across multiple levels.
comment: 15 pages, 10 figures
♻ ☆ SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation IROS 2025
Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. However, current waypoint predictors struggle with spatial awareness, while navigators lack historical reasoning and backtracking capabilities, limiting adaptability. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder, masked cross-attention fusion, and an occupancy-aware loss for better waypoint quality. The navigator incorporates history-aware reasoning and adaptive path planning with backtracking, improving robustness. Experiments on R2R-CE and MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in zero-shot settings, demonstrating competitive results compared to fully supervised methods. Real-world validation on Turtlebot 4 further highlights its adaptability.
comment: Accepted by IROS 2025. Project website: https://sxyxs.github.io/smartway/
♻ ☆ DreamGen: Unlocking Generalization in Robot Learning through Video World Models
We introduce DreamGen, a simple yet highly effective 4-stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories - synthetic robot data generated from video world models. DreamGen leverages state-of-the-art image-to-video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo-action sequences using either a latent action model or an inverse-dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick-and-place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection. Code available at https://github.com/NVIDIA/GR00T-Dreams.
comment: See website for videos: https://research.nvidia.com/labs/gear/dreamgen
♻ ☆ Mass-Adaptive Admittance Control for Robotic Manipulators
Handling objects with unknown or changing masses is a common challenge in robotics, often leading to errors or instability if the control system cannot adapt in real-time. In this paper, we present a novel approach that enables a six-degrees-of-freedom robotic manipulator to reliably follow waypoints while automatically estimating and compensating for unknown payload weight. Our method integrates an admittance control framework with a mass estimator, allowing the robot to dynamically update an excitation force to compensate for the payload mass. This strategy mitigates end-effector sagging and preserves stability when handling objects of unknown weights. We experimentally validated our approach in a challenging pick-and-place task on a shelf with a crossbar, improved accuracy in reaching waypoints and compliant motion compared to a baseline admittance-control scheme. By safely accommodating unknown payloads, our work enhances flexibility in robotic automation and represents a significant step forward in adaptive control for uncertain environments.
comment: 6 pages, 7 figures
♻ ☆ Semantic Mapping in Indoor Embodied AI -- A Survey on Advances, Challenges, and Future Directions
Intelligent embodied agents (e.g. robots) need to perform complex semantic tasks in unfamiliar environments. Among many skills that the agents need to possess, building and maintaining a semantic map of the environment is most crucial in long-horizon tasks. A semantic map captures information about the environment in a structured way, allowing the agent to reference it for advanced reasoning throughout the task. While existing surveys in embodied AI focus on general advancements or specific tasks like navigation and manipulation, this paper provides a comprehensive review of semantic map-building approaches in embodied AI, specifically for indoor navigation. We categorize these approaches based on their structural representation (spatial grids, topological graphs, dense point-clouds or hybrid maps) and the type of information they encode (implicit features or explicit environmental data). We also explore the strengths and limitations of the map building techniques, highlight current challenges, and propose future research directions. We identify that the field is moving towards developing open-vocabulary, queryable, task-agnostic map representations, while high memory demands and computational inefficiency still remaining to be open challenges. This survey aims to guide current and future researchers in advancing semantic mapping techniques for embodied AI systems.
Robotics 69
☆ Touch begins where vision ends: Generalizable policies for contact-rich manipulation
Data-driven approaches struggle with precise manipulation; imitation learning requires many hard-to-obtain demonstrations, while reinforcement learning yields brittle, non-generalizable policies. We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks by decomposing them into two phases: a reaching phase, where a vision-language model (VLM) enables scene-level reasoning to localize the object of interest, and a local interaction phase, where a reusable, scene-agnostic ViTaL policy performs contact-rich manipulation using egocentric vision and tactile sensing. This approach is motivated by the observation that while scene context varies, the low-level interaction remains consistent across task instances. By training local policies once in a canonical setting, they can generalize via a localize-then-execute strategy. ViTaL achieves around 90% success on contact-rich tasks in unseen environments and is robust to distractors. ViTaL's effectiveness stems from three key insights: (1) foundation models for segmentation enable training robust visual encoders via behavior cloning; (2) these encoders improve the generalizability of policies learned using residual RL; and (3) tactile sensing significantly boosts performance in contact-rich tasks. Ablation studies validate each of these insights, and we demonstrate that ViTaL integrates well with high-level VLMs, enabling robust, reusable low-level skills. Results and videos are available at https://vitalprecise.github.io.
Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins
Recent advancements in open-world robot manipulation have been largely driven by vision-language models (VLMs). While these models exhibit strong generalization ability in high-level planning, they struggle to predict low-level robot controls due to limited physical-world understanding. To address this issue, we propose a model predictive control framework for open-world manipulation that combines the semantic reasoning capabilities of VLMs with physically-grounded, interactive digital twins of the real-world environments. By constructing and simulating the digital twins, our approach generates feasible motion trajectories, simulates corresponding outcomes, and prompts the VLM with future observations to evaluate and select the most suitable outcome based on language instructions of the task. To further enhance the capability of pre-trained VLMs in understanding complex scenes for robotic control, we leverage the flexible rendering capabilities of the digital twin to synthesize the scene at various novel, unoccluded viewpoints. We validate our approach on a diverse set of complex manipulation tasks, demonstrating superior performance compared to baseline methods for language-conditioned robotic control using VLMs.
☆ Edge Nearest Neighbor in Sampling-Based Motion Planning
Neighborhood finders and nearest neighbor queries are fundamental parts of sampling based motion planning algorithms. Using different distance metrics or otherwise changing the definition of a neighborhood produces different algorithms with unique empiric and theoretical properties. In \cite{l-pa-06} LaValle suggests a neighborhood finder for the Rapidly-exploring Random Tree RRT algorithm \cite{l-rrtnt-98} which finds the nearest neighbor of the sampled point on the swath of the tree, that is on the set of all of the points on the tree edges, using a hierarchical data structure. In this paper we implement such a neighborhood finder and show, theoretically and experimentally, that this results in more efficient algorithms, and suggest a variant of the Rapidly-exploring Random Graph RRG algorithm \cite{f-isaom-10} that better exploits the exploration properties of the newly described subroutine for finding narrow passages.
☆ LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction
Vision-language-action (VLA) models have demonstrated strong semantic understanding and zero-shot generalization, yet most existing systems assume an accurate low-level controller with hand-crafted action "vocabulary" such as end-effector pose or root velocity. This assumption confines prior work to quasi-static tasks and precludes the agile, whole-body behaviors required by humanoid whole-body control (WBC) tasks. To capture this gap in the literature, we start by introducing the first sim-to-real-ready, vision-language, closed-loop benchmark for humanoid WBC, comprising over 150 tasks from 10 categories. We then propose LeVERB: Latent Vision-Language-Encoded Robot Behavior, a hierarchical latent instruction-following framework for humanoid vision-language WBC, the first of its kind. At the top level, a vision-language policy learns a latent action vocabulary from synthetically rendered kinematic demonstrations; at the low level, a reinforcement-learned WBC policy consumes these latent verbs to generate dynamics-level commands. In our benchmark, LeVERB can zero-shot attain a 80% success rate on simple visual navigation tasks, and 58.5% success rate overall, outperforming naive hierarchical whole-body VLA implementation by 7.8 times.
☆ Critical Insights about Robots for Mental Wellbeing
Social robots are increasingly being explored as tools to support emotional wellbeing, particularly in non-clinical settings. Drawing on a range of empirical studies and practical deployments, this paper outlines six key insights that highlight both the opportunities and challenges in using robots to promote mental wellbeing. These include (1) the lack of a single, objective measure of wellbeing, (2) the fact that robots don't need to act as companions to be effective, (3) the growing potential of virtual interactions, (4) the importance of involving clinicians in the design process, (5) the difference between one-off and long-term interactions, and (6) the idea that adaptation and personalization are not always necessary for positive outcomes. Rather than positioning robots as replacements for human therapists, we argue that they are best understood as supportive tools that must be designed with care, grounded in evidence, and shaped by ethical and psychological considerations. Our aim is to inform future research and guide responsible, effective use of robots in mental health and wellbeing contexts.
☆ CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding
In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations. To address it, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. Besides, we design mixed-label supervision to mitigate the error accumulation during distillation. Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
comment: 16 pages
☆ HARMONI: Haptic-Guided Assistance for Unified Robotic Tele-Manipulation and Tele-Navigation
Shared control, which combines human expertise with autonomous assistance, is critical for effective teleoperation in complex environments. While recent advances in haptic-guided teleoperation have shown promise, they are often limited to simplified tasks involving 6- or 7-DoF manipulators and rely on separate control strategies for navigation and manipulation. This increases both cognitive load and operational overhead. In this paper, we present a unified tele-mobile manipulation framework that leverages haptic-guided shared control. The system integrates a 9-DoF follower mobile manipulator and a 7-DoF leader robotic arm, enabling seamless transitions between tele-navigation and tele-manipulation through real-time haptic feedback. A user study with 20 participants under real-world conditions demonstrates that our framework significantly improves task accuracy and efficiency without increasing cognitive load. These findings highlight the potential of haptic-guided shared control for enhancing operator performance in demanding teleoperation scenarios.
comment: To appear in IEEE CASE 2025
☆ ROSA: Harnessing Robot States for Vision-Language and Action Alignment
Vision-Language-Action (VLA) models have recently made significant advance in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using expert demonstrations. However, this strategy suffers from a spatio-temporal gap, resulting in considerable data inefficiency and heavy reliance on human labor. Spatially, VLMs operate within a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space; temporally, VLMs primarily interpret the present, while VLA models anticipate future actions. To overcome these challenges, we propose a novel training paradigm, ROSA, which leverages robot state estimation to improve alignment between vision-language and action spaces. By integrating robot state estimation data obtained via an automated process, ROSA enables the VLA model to gain enhanced spatial understanding and self-awareness, thereby boosting performance and generalization. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ROSA, particularly in low-data regimes.
☆ Towards Efficient Occupancy Mapping via Gaussian Process Latent Field Shaping
Occupancy mapping has been a key enabler of mobile robotics. Originally based on a discrete grid representation, occupancy mapping has evolved towards continuous representations that can predict the occupancy status at any location and account for occupancy correlations between neighbouring areas. Gaussian Process (GP) approaches treat this task as a binary classification problem using both observations of occupied and free space. Conceptually, a GP latent field is passed through a logistic function to obtain the output class without actually manipulating the GP latent field. In this work, we propose to act directly on the latent function to efficiently integrate free space information as a prior based on the shape of the sensor's field-of-view. A major difference with existing methods is the change in the classification problem, as we distinguish between free and unknown space. The `occupied' area is the infinitesimally thin location where the class transitions from free to unknown. We demonstrate in simulated environments that our approach is sound and leads to competitive reconstruction accuracy.
comment: Presented at RSS 2025 Workshop: Gaussian Representations for Robot Autonomy: Challenges and Opportunities
☆ Parallel Branch Model Predictive Control on GPUs
We present a parallel GPU-accelerated solver for branch Model Predictive Control problems. Based on iterative LQR methods, our solver exploits the tree-sparse structure and implements temporal parallelism using the parallel scan algorithm. Consequently, the proposed solver enables parallelism across both the prediction horizon and the scenarios. In addition, we utilize an augmented Lagrangian method to handle general inequality constraints. We compare our solver with state-of-the-art numerical solvers in two automated driving applications. The numerical results demonstrate that, compared to CPU-based solvers, our solver achieves competitive performance for problems with short horizons and small-scale trees, while outperforming other solvers on large-scale problems.
comment: 12 pages, 9 figures
☆ Disturbance-aware minimum-time planning strategies for motorsport vehicles with probabilistic safety certificates
This paper presents a disturbance-aware framework that embeds robustness into minimum-lap-time trajectory optimization for motorsport. Two formulations are introduced. (i) Open-loop, horizon-based covariance propagation uses worst-case uncertainty growth over a finite window to tighten tire-friction and track-limit constraints. (ii) Closed-loop, covariance-aware planning incorporates a time-varying LQR feedback law in the optimizer, providing a feedback-consistent estimate of disturbance attenuation and enabling sharper yet reliable constraint tightening. Both methods yield reference trajectories for human or artificial drivers: in autonomous applications the modelled controller can replicate the on-board implementation, while for human driving accuracy increases with the extent to which the driver can be approximated by the assumed time-varying LQR policy. Computational tests on a representative Barcelona-Catalunya sector show that both schemes meet the prescribed safety probability, yet the closed-loop variant incurs smaller lap-time penalties than the more conservative open-loop solution, while the nominal (non-robust) trajectory remains infeasible under the same uncertainties. By accounting for uncertainty growth and feedback action during planning, the proposed framework delivers trajectories that are both performance-optimal and probabilistically safe, advancing minimum-time optimization toward real-world deployment in high-performance motorsport and autonomous racing.
comment: 24 pages, 11 figures, paper under review
☆ Can you see how I learn? Human observers' inferences about Reinforcement Learning agents' learning processes
Reinforcement Learning (RL) agents often exhibit learning behaviors that are not intuitively interpretable by human observers, which can result in suboptimal feedback in collaborative teaching settings. Yet, how humans perceive and interpret RL agent's learning behavior is largely unknown. In a bottom-up approach with two experiments, this work provides a data-driven understanding of the factors of human observers' understanding of the agent's learning process. A novel, observation-based paradigm to directly assess human inferences about agent learning was developed. In an exploratory interview study (\textit{N}=9), we identify four core themes in human interpretations: Agent Goals, Knowledge, Decision Making, and Learning Mechanisms. A second confirmatory study (\textit{N}=34) applied an expanded version of the paradigm across two tasks (navigation/manipulation) and two RL algorithms (tabular/function approximation). Analyses of 816 responses confirmed the reliability of the paradigm and refined the thematic framework, revealing how these themes evolve over time and interrelate. Our findings provide a human-centered understanding of how people make sense of agent learning, offering actionable insights for designing interpretable RL systems and improving transparency in Human-Robot Interaction.
☆ What Matters in Learning from Large-Scale Datasets for Robot Manipulation
Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a systematic understanding of what data should be collected to improve the utility of a robotics dataset and facilitate downstream policy learning. In this work, we conduct a large-scale dataset composition study to answer this question. We develop a data generation framework to procedurally emulate common sources of diversity in existing datasets (such as sensor placements and object types and arrangements), and use it to generate large-scale robot datasets with controlled compositions, enabling a suite of dataset composition studies that would be prohibitively expensive in the real world. We focus on two practical settings: (1) what types of diversity should be emphasized when future researchers collect large-scale datasets for robotics, and (2) how should current practitioners retrieve relevant demonstrations from existing datasets to maximize downstream policy performance on tasks of interest. Our study yields several critical insights -- for example, we find that camera poses and spatial arrangements are crucial dimensions for both diversity in collection and alignment in retrieval. In real-world robot learning settings, we find that not only do our insights from simulation carry over, but our retrieval strategies on existing datasets such as DROID allow us to consistently outperform existing training strategies by up to 70%. More results at https://robo-mimiclabs.github.io/
☆ UAV Object Detection and Positioning in a Mining Industrial Metaverse with Custom Geo-Referenced Data
The mining sector increasingly adopts digital tools to improve operational efficiency, safety, and data-driven decision-making. One of the key challenges remains the reliable acquisition of high-resolution, geo-referenced spatial information to support core activities such as extraction planning and on-site monitoring. This work presents an integrated system architecture that combines UAV-based sensing, LiDAR terrain modeling, and deep learning-based object detection to generate spatially accurate information for open-pit mining environments. The proposed pipeline includes geo-referencing, 3D reconstruction, and object localization, enabling structured spatial outputs to be integrated into an industrial digital twin platform. Unlike traditional static surveying methods, the system offers higher coverage and automation potential, with modular components suitable for deployment in real-world industrial contexts. While the current implementation operates in post-flight batch mode, it lays the foundation for real-time extensions. The system contributes to the development of AI-enhanced remote sensing in mining by demonstrating a scalable and field-validated geospatial data workflow that supports situational awareness and infrastructure safety.
☆ A Survey on Imitation Learning for Contact-Rich Tasks in Robotics
This paper comprehensively surveys research trends in imitation learning for contact-rich robotic tasks. Contact-rich tasks, which require complex physical interactions with the environment, represent a central challenge in robotics due to their nonlinear dynamics and sensitivity to small positional deviations. The paper examines demonstration collection methodologies, including teaching methods and sensory modalities crucial for capturing subtle interaction dynamics. We then analyze imitation learning approaches, highlighting their applications to contact-rich manipulation. Recent advances in multimodal learning and foundation models have significantly enhanced performance in complex contact tasks across industrial, household, and healthcare domains. Through systematic organization of current research and identification of challenges, this survey provides a foundation for future advancements in contact-rich robotic manipulation.
comment: 47pages, 1 figures
☆ Learning Swing-up Maneuvers for a Suspended Aerial Manipulation Platform in a Hierarchical Control Framework
In this work, we present a novel approach to augment a model-based control method with a reinforcement learning (RL) agent and demonstrate a swing-up maneuver with a suspended aerial manipulation platform. These platforms are targeted towards a wide range of applications on construction sites involving cranes, with swing-up maneuvers allowing it to perch at a given location, inaccessible with purely the thrust force of the platform. Our proposed approach is based on a hierarchical control framework, which allows different tasks to be executed according to their assigned priorities. An RL agent is then subsequently utilized to adjust the reference set-point of the lower-priority tasks to perform the swing-up maneuver, which is confined in the nullspace of the higher-priority tasks, such as maintaining a specific orientation and position of the end-effector. Our approach is validated using extensive numerical simulation studies.
comment: 6 pages, 10 figures
☆ Block-wise Adaptive Caching for Accelerating Diffusion Policy
Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose Block-wise Adaptive Caching(BAC), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities vary non-uniformly across timesteps and locks. To operationalize this insight, we first propose the Adaptive Caching Scheduler, designed to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to signiffcant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with signiffcant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3x inference speedup for free.
☆ Towards a Formal Specification for Self-organized Shape Formation in Swarm Robotics
The self-organization of robots for the formation of structures and shapes is a stimulating application of the swarm robotic system. It involves a large number of autonomous robots of heterogeneous behavior, coordination among them, and their interaction with the dynamic environment. This process of complex structure formation is considered a complex system, which needs to be modeled by using any modeling approach. Although the formal specification approach along with other formal methods has been used to model the behavior of robots in a swarm. However, to the best of our knowledge, the formal specification approach has not been used to model the self-organization process in swarm robotic systems for shape formation. In this paper, we use a formal specification approach to model the shape formation task of swarm robots. We use Z (Zed) language of formal specification, which is a state-based language, to model the states of the entities of the systems. We demonstrate the effectiveness of Z for the self-organized shape formation. The presented formal specification model gives the outlines for designing and implementing the swarm robotic system for the formation of complex shapes and structures. It also provides the foundation for modeling the complex shape formation process for swarm robotics using a multi-agent system in a simulation-based environment. Keywords: Swarm robotics, Self-organization, Formal specification, Complex systems
☆ Adaptive Model-Base Control of Quadrupeds via Online System Identification using Kalman Filter IROS 2025
Many real-world applications require legged robots to be able to carry variable payloads. Model-based controllers such as model predictive control (MPC) have become the de facto standard in research for controlling these systems. However, most model-based control architectures use fixed plant models, which limits their applicability to different tasks. In this paper, we present a Kalman filter (KF) formulation for online identification of the mass and center of mass (COM) of a four-legged robot. We evaluate our method on a quadrupedal robot carrying various payloads and find that it is more robust to strong measurement noise than classical recursive least squares (RLS) methods. Moreover, it improves the tracking performance of the model-based controller with varying payloads when the model parameters are adjusted at runtime.
comment: 6 pages, 5 figures, 1 table, accepted for IEEE IROS 2025
☆ VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation
Dual-arm cooperative manipulation holds great promise for tackling complex real-world tasks that demand seamless coordination and adaptive dynamics. Despite substantial progress in learning-based motion planning, most approaches struggle to generalize across diverse manipulation tasks and adapt to dynamic, unstructured environments, particularly in scenarios involving interactions between two objects such as assembly, tool use, and bimanual grasping. To address these challenges, we introduce a novel VLM-Assisted Siamese Flow Diffusion (VLM-SFD) framework for efficient imitation learning in dual-arm cooperative manipulation. The proposed VLM-SFD framework exhibits outstanding adaptability, significantly enhancing the ability to rapidly adapt and generalize to diverse real-world tasks from only a minimal number of human demonstrations. Specifically, we propose a Siamese Flow Diffusion Network (SFDNet) employs a dual-encoder-decoder Siamese architecture to embed two target objects into a shared latent space, while a diffusion-based conditioning process-conditioned by task instructions-generates two-stream object-centric motion flows that guide dual-arm coordination. We further design a dynamic task assignment strategy that seamlessly maps the predicted 2D motion flows into 3D space and incorporates a pre-trained vision-language model (VLM) to adaptively assign the optimal motion to each robotic arm over time. Experiments validate the effectiveness of the proposed method, demonstrating its ability to generalize to diverse manipulation tasks while maintaining high efficiency and adaptability. The code and demo videos are publicly available on our project website https://sites.google.com/view/vlm-sfd/.
☆ JENGA: Object selection and pose estimation for robotic grasping from a stack
Vision-based robotic object grasping is typically investigated in the context of isolated objects or unstructured object sets in bin picking scenarios. However, there are several settings, such as construction or warehouse automation, where a robot needs to interact with a structured object formation such as a stack. In this context, we define the problem of selecting suitable objects for grasping along with estimating an accurate 6DoF pose of these objects. To address this problem, we propose a camera-IMU based approach that prioritizes unobstructed objects on the higher layers of stacks and introduce a dataset for benchmarking and evaluation, along with a suitable evaluation metric that combines object selection with pose accuracy. Experimental results show that although our method can perform quite well, this is a challenging problem if a completely error-free solution is needed. Finally, we show results from the deployment of our method for a brick-picking application in a construction scenario.
☆ Delayed Expansion AGT: Kinodynamic Planning with Application to Tractor-Trailer Parking
Kinodynamic planning of articulated vehicles in cluttered environments faces additional challenges arising from high-dimensional state space and complex system dynamics. Built upon [1],[2], this work proposes the DE-AGT algorithm that grows a tree using pre-computed motion primitives (MPs) and A* heuristics. The first feature of DE-AGT is a delayed expansion of MPs. In particular, the MPs are divided into different modes, which are ranked online. With the MP classification and prioritization, DE-AGT expands the most promising mode of MPs first, which eliminates unnecessary computation and finds solutions faster. To obtain the cost-to-go heuristic for nonholonomic articulated vehicles, we rely on supervised learning and train neural networks for fast and accurate cost-to-go prediction. The learned heuristic is used for online mode ranking and node selection. Another feature of DE-AGT is the improved goal-reaching. Exactly reaching a goal state usually requires a constant connection checking with the goal by solving steering problems -- non-trivial and time-consuming for articulated vehicles. The proposed termination scheme overcomes this challenge by tightly integrating a light-weight trajectory tracking controller with the search process. DE-AGT is implemented for autonomous parking of a general car-like tractor with 3-trailer. Simulation results show an average of 10x acceleration compared to a previous method.
☆ Observability-Aware Active Calibration of Multi-Sensor Extrinsics for Ground Robots via Online Trajectory Optimization
Accurate calibration of sensor extrinsic parameters for ground robotic systems (i.e., relative poses) is crucial for ensuring spatial alignment and achieving high-performance perception. However, existing calibration methods typically require complex and often human-operated processes to collect data. Moreover, most frameworks neglect acoustic sensors, thereby limiting the associated systems' auditory perception capabilities. To alleviate these issues, we propose an observability-aware active calibration method for ground robots with multimodal sensors, including a microphone array, a LiDAR (exteroceptive sensors), and wheel encoders (proprioceptive sensors). Unlike traditional approaches, our method enables active trajectory optimization for online data collection and calibration, contributing to the development of more intelligent robotic systems. Specifically, we leverage the Fisher information matrix (FIM) to quantify parameter observability and adopt its minimum eigenvalue as an optimization metric for trajectory generation via B-spline curves. Through planning and replanning of robot trajectory online, the method enhances the observability of multi-sensor extrinsic parameters. The effectiveness and advantages of our method have been demonstrated through numerical simulations and real-world experiments. For the benefit of the community, we have also open-sourced our code and data at https://github.com/AISLAB-sustech/Multisensor-Calibration.
comment: Accepted and to appear in the IEEE Sensors Journal
☆ Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation
Mobile robots exploring indoor environments increasingly rely on vision-language models to perceive high-level semantic cues in camera images, such as object categories. Such models offer the potential to substantially advance robot behaviour for tasks such as object-goal navigation (ObjectNav), where the robot must locate objects specified in natural language by exploring the environment. Current ObjectNav methods heavily depend on prompt engineering for perception and do not address the semantic uncertainty induced by variations in prompt phrasing. Ignoring semantic uncertainty can lead to suboptimal exploration, which in turn limits performance. Hence, we propose a semantic uncertainty-informed active perception pipeline for ObjectNav in indoor environments. We introduce a novel probabilistic sensor model for quantifying semantic uncertainty in vision-language models and incorporate it into a probabilistic geometric-semantic map to enhance spatial understanding. Based on this map, we develop a frontier exploration planner with an uncertainty-informed multi-armed bandit objective to guide efficient object search. Experimental results demonstrate that our method achieves ObjectNav success rates comparable to those of state-of-the-art approaches, without requiring extensive prompt engineering.
comment: 7 pages, 3 figures
☆ Open-Set LiDAR Panoptic Segmentation Guided by Uncertainty-Aware Learning
Autonomous vehicles that navigate in open-world environments may encounter previously unseen object classes. However, most existing LiDAR panoptic segmentation models rely on closed-set assumptions, failing to detect unknown object instances. In this work, we propose ULOPS, an uncertainty-guided open-set panoptic segmentation framework that leverages Dirichlet-based evidential learning to model predictive uncertainty. Our architecture incorporates separate decoders for semantic segmentation with uncertainty estimation, embedding with prototype association, and instance center prediction. During inference, we leverage uncertainty estimates to identify and segment unknown instances. To strengthen the model's ability to differentiate between known and unknown objects, we introduce three uncertainty-driven loss functions. Uniform Evidence Loss to encourage high uncertainty in unknown regions. Adaptive Uncertainty Separation Loss ensures a consistent difference in uncertainty estimates between known and unknown objects at a global scale. Contrastive Uncertainty Loss refines this separation at the fine-grained level. To evaluate open-set performance, we extend benchmark settings on KITTI-360 and introduce a new open-set evaluation for nuScenes. Extensive experiments demonstrate that ULOPS consistently outperforms existing open-set LiDAR panoptic segmentation methods.
☆ C2TE: Coordinated Constrained Task Execution Design for Ordering-Flexible Multi-Vehicle Platoon Merging
In this paper, we propose a distributed coordinated constrained task execution (C2TE) algorithm that enables a team of vehicles from different lanes to cooperatively merge into an {\it ordering-flexible platoon} maneuvering on the desired lane. Therein, the platoon is flexible in the sense that no specific spatial ordering sequences of vehicles are predetermined. To attain such a flexible platoon, we first separate the multi-vehicle platoon (MVP) merging mission into two stages, namely, pre-merging regulation and {\it ordering-flexible platoon} merging, and then formulate them into distributed constraint-based optimization problems. Particularly, by encoding longitudinal-distance regulation and same-lane collision avoidance subtasks into the corresponding control barrier function (CBF) constraints, the proposed algorithm in Stage 1 can safely enlarge sufficient longitudinal distances among adjacent vehicles. Then, by encoding lateral convergence, longitudinal-target attraction, and neighboring collision avoidance subtasks into CBF constraints, the proposed algorithm in Stage~2 can efficiently achieve the {\it ordering-flexible platoon}. Note that the {\it ordering-flexible platoon} is realized through the interaction of the longitudinal-target attraction and time-varying neighboring collision avoidance constraints simultaneously. Feasibility guarantee and rigorous convergence analysis are both provided under strong nonlinear couplings induced by flexible orderings. Finally, experiments using three autonomous mobile vehicles (AMVs) are conducted to verify the effectiveness and flexibility of the proposed algorithm, and extensive simulations are performed to demonstrate its robustness, adaptability, and scalability when tackling vehicles' sudden breakdown, new appearing, different number of lanes, mixed autonomy, and large-scale scenarios, respectively.
☆ Equilibrium-Driven Smooth Separation and Navigation of Marsupial Robotic Systems
In this paper, we propose an equilibrium-driven controller that enables a marsupial carrier-passenger robotic system to achieve smooth carrier-passenger separation and then to navigate the passenger robot toward a predetermined target point. Particularly, we design a potential gradient in the form of a cubic polynomial for the passenger's controller as a function of the carrier-passenger and carrier-target distances in the moving carrier's frame. This introduces multiple equilibrium points corresponding to the zero state of the error dynamic system during carrier-passenger separation. The change of equilibrium points is associated with the change in their attraction regions, enabling smooth carrier-passenger separation and afterwards seamless navigation toward the target. Finally, simulations demonstrate the effectiveness and adaptability of the proposed controller in environments containing obstacles.
☆ Multimodal "Puppeteer": An Exploration of Robot Teleoperation Via Virtual Counterpart with LLM-Driven Voice and Gesture Interaction in Augmented Reality
The integration of robotics and augmented reality (AR) holds transformative potential for advancing human-robot interaction (HRI), offering enhancements in usability, intuitiveness, accessibility, and collaborative task performance. This paper introduces and evaluates a novel multimodal AR-based robot puppeteer framework that enables intuitive teleoperation via virtual counterpart through large language model (LLM)-driven voice commands and hand gesture interactions. Utilizing the Meta Quest 3, users interact with a virtual counterpart robot in real-time, effectively "puppeteering" its physical counterpart within an AR environment. We conducted a within-subject user study with 42 participants performing robotic cube pick-and-place with pattern matching tasks under two conditions: gesture-only interaction and combined voice-and-gesture interaction. Both objective performance metrics and subjective user experience (UX) measures were assessed, including an extended comparative analysis between roboticists and non-roboticists. The results provide key insights into how multimodal input influences contextual task efficiency, usability, and user satisfaction in AR-based HRI. Our findings offer practical design implications for designing effective AR-enhanced HRI systems.
comment: This work has been submitted to the IEEE TVCG for possible publication
☆ Cognitive Synergy Architecture: SEGO for Human-Centric Collaborative Robots
This paper presents SEGO (Semantic Graph Ontology), a cognitive mapping architecture designed to integrate geometric perception, semantic reasoning, and explanation generation into a unified framework for human-centric collaborative robotics. SEGO constructs dynamic cognitive scene graphs that represent not only the spatial configuration of the environment but also the semantic relations and ontological consistency among detected objects. The architecture seamlessly combines SLAM-based localization, deep-learning-based object detection and tracking, and ontology-driven reasoning to enable real-time, semantically coherent mapping.
☆ Autonomous 3D Moving Target Encirclement and Interception with Range measurement IROS 2025
Commercial UAVs are an emerging security threat as they are capable of carrying hazardous payloads or disrupting air traffic. To counter UAVs, we introduce an autonomous 3D target encirclement and interception strategy. Unlike traditional ground-guided systems, this strategy employs autonomous drones to track and engage non-cooperative hostile UAVs, which is effective in non-line-of-sight conditions, GPS denial, and radar jamming, where conventional detection and neutralization from ground guidance fail. Using two noisy real-time distances measured by drones, guardian drones estimate the relative position from their own to the target using observation and velocity compensation methods, based on anti-synchronization (AS) and an X$-$Y circular motion combined with vertical jitter. An encirclement control mechanism is proposed to enable UAVs to adaptively transition from encircling and protecting a target to encircling and monitoring a hostile target. Upon breaching a warning threshold, the UAVs may even employ a suicide attack to neutralize the hostile target. We validate this strategy through real-world UAV experiments and simulated analysis in MATLAB, demonstrating its effectiveness in detecting, encircling, and intercepting hostile drones. More details: https://youtu.be/5eHW56lPVto.
comment: Paper has been accepted into IROS 2025
☆ Underwater target 6D State Estimation via UUV Attitude Enhance Observability IROS 2025
Accurate relative state observation of Unmanned Underwater Vehicles (UUVs) for tracking uncooperative targets remains a significant challenge due to the absence of GPS, complex underwater dynamics, and sensor limitations. Existing localization approaches rely on either global positioning infrastructure or multi-UUV collaboration, both of which are impractical for a single UUV operating in large or unknown environments. To address this, we propose a novel persistent relative 6D state estimation framework that enables a single UUV to estimate its relative motion to a non-cooperative target using only successive noisy range measurements from two monostatic sonar sensors. Our key contribution is an observability-enhanced attitude control strategy, which optimally adjusts the UUV's orientation to improve the observability of relative state estimation using a Kalman filter, effectively mitigating the impact of sensor noise and drift accumulation. Additionally, we introduce a rigorously proven Lyapunov-based tracking control strategy that guarantees long-term stability by ensuring that the UUV maintains an optimal measurement range, preventing localization errors from diverging over time. Through theoretical analysis and simulations, we demonstrate that our method significantly improves 6D relative state estimation accuracy and robustness compared to conventional approaches. This work provides a scalable, infrastructure-free solution for UUVs tracking uncooperative targets underwater.
comment: Paper has been accepted in IROS 2025
☆ A Novel ViDAR Device With Visual Inertial Encoder Odometry and Reinforcement Learning-Based Active SLAM Method
In the field of multi-sensor fusion for simultaneous localization and mapping (SLAM), monocular cameras and IMUs are widely used to build simple and effective visual-inertial systems. However, limited research has explored the integration of motor-encoder devices to enhance SLAM performance. By incorporating such devices, it is possible to significantly improve active capability and field of view (FOV) with minimal additional cost and structural complexity. This paper proposes a novel visual-inertial-encoder tightly coupled odometry (VIEO) based on a ViDAR (Video Detection and Ranging) device. A ViDAR calibration method is introduced to ensure accurate initialization for VIEO. In addition, a platform motion decoupled active SLAM method based on deep reinforcement learning (DRL) is proposed. Experimental data demonstrate that the proposed ViDAR and the VIEO algorithm significantly increase cross-frame co-visibility relationships compared to its corresponding visual-inertial odometry (VIO) algorithm, improving state estimation accuracy. Additionally, the DRL-based active SLAM algorithm, with the ability to decouple from platform motion, can increase the diversity weight of the feature points and further enhance the VIEO algorithm's performance. The proposed methodology sheds fresh insights into both the updated platform design and decoupled approach of active SLAM systems in complex environments.
comment: 12 pages, 13 figures
☆ SuperPoint-SLAM3: Augmenting ORB-SLAM3 with Deep Features, Adaptive NMS, and Learning-Based Loop Closure
Visual simultaneous localization and mapping (SLAM) must remain accurate under extreme viewpoint, scale and illumination variations. The widely adopted ORB-SLAM3 falters in these regimes because it relies on hand-crafted ORB keypoints. We introduce SuperPoint-SLAM3, a drop-in upgrade that (i) replaces ORB with the self-supervised SuperPoint detector--descriptor, (ii) enforces spatially uniform keypoints via adaptive non-maximal suppression (ANMS), and (iii) integrates a lightweight NetVLAD place-recognition head for learning-based loop closure. On the KITTI Odometry benchmark SuperPoint-SLAM3 reduces mean translational error from 4.15% to 0.34% and mean rotational error from 0.0027 deg/m to 0.0010 deg/m. On the EuRoC MAV dataset it roughly halves both errors across every sequence (e.g., V2\_03: 1.58% -> 0.79%). These gains confirm that fusing modern deep features with a learned loop-closure module markedly improves ORB-SLAM3 accuracy while preserving its real-time operation. Implementation, pretrained weights and reproducibility scripts are available at https://github.com/shahram95/SuperPointSLAM3.
comment: 10 pages, 6 figures, code at https://github.com/shahram95/SuperPointSLAM3
☆ CHARM: Considering Human Attributes for Reinforcement Modeling
Reinforcement Learning from Human Feedback has recently achieved significant success in various fields, and its performance is highly related to feedback quality. While much prior work acknowledged that human teachers' characteristics would affect human feedback patterns, there is little work that has closely investigated the actual effects. In this work, we designed an exploratory study investigating how human feedback patterns are associated with human characteristics. We conducted a public space study with two long horizon tasks and 46 participants. We found that feedback patterns are not only correlated with task statistics, such as rewards, but also correlated with participants' characteristics, especially robot experience and educational background. Additionally, we demonstrated that human feedback value can be more accurately predicted with human characteristics compared to only using task statistics. All human feedback and characteristics we collected, and codes for our data collection and predicting more accurate human feedback are available at https://github.com/AABL-Lab/CHARM
☆ Constrained Optimal Planning to Minimize Battery Degradation of Autonomous Mobile Robots
This paper proposes an optimization framework that addresses both cycling degradation and calendar aging of batteries for autonomous mobile robot (AMR) to minimize battery degradation while ensuring task completion. A rectangle method of piecewise linear approximation is employed to linearize the bilinear optimization problem. We conduct a case study to validate the efficiency of the proposed framework in achieving an optimal path planning for AMRs while reducing battery aging.
☆ A Point Cloud Completion Approach for the Grasping of Partially Occluded Objects and Its Applications in Robotic Strawberry Harvesting
In robotic fruit picking applications, managing object occlusion in unstructured settings poses a substantial challenge for designing grasping algorithms. Using strawberry harvesting as a case study, we present an end-to-end framework for effective object detection, segmentation, and grasp planning to tackle this issue caused by partially occluded objects. Our strategy begins with point cloud denoising and segmentation to accurately locate fruits. To compensate for incomplete scans due to occlusion, we apply a point cloud completion model to create a dense 3D reconstruction of the strawberries. The target selection focuses on ripe strawberries while categorizing others as obstacles, followed by converting the refined point cloud into an occupancy map for collision-aware motion planning. Our experimental results demonstrate high shape reconstruction accuracy, with the lowest Chamfer Distance compared to state-of-the-art methods with 1.10 mm, and significantly improved grasp success rates of 79.17%, yielding an overall success-to-attempt ratio of 89.58\% in real-world strawberry harvesting. Additionally, our method reduces the obstacle hit rate from 43.33% to 13.95%, highlighting its effectiveness in improving both grasp quality and safety compared to prior approaches. This pipeline substantially improves autonomous strawberry harvesting, advancing more efficient and reliable robotic fruit picking systems.
☆ Quadrotor Morpho-Transition: Learning vs Model-Based Control Strategies
Quadrotor Morpho-Transition, or the act of transitioning from air to ground through mid-air transformation, involves complex aerodynamic interactions and a need to operate near actuator saturation, complicating controller design. In recent work, morpho-transition has been studied from a model-based control perspective, but these approaches remain limited due to unmodeled dynamics and the requirement for planning through contacts. Here, we train an end-to-end Reinforcement Learning (RL) controller to learn a morpho-transition policy and demonstrate successful transfer to hardware. We find that the RL control policy achieves agile landing, but only transfers to hardware if motor dynamics and observation delays are taken into account. On the other hand, a baseline MPC controller transfers out-of-the-box without knowledge of the actuator dynamics and delays, at the cost of reduced recovery from disturbances in the event of unknown actuator failures. Our work opens the way for more robust control of agile in-flight quadrotor maneuvers that require mid-air transformation.
☆ GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics
Autonomous drones capable of interpreting and executing high-level language instructions in unstructured environments remain a long-standing goal. Yet existing approaches are constrained by their dependence on hand-crafted skills, extensive parameter tuning, or computationally intensive models unsuitable for onboard use. We introduce GRaD-Nav++, a lightweight Vision-Language-Action (VLA) framework that runs fully onboard and follows natural-language commands in real time. Our policy is trained in a photorealistic 3D Gaussian Splatting (3DGS) simulator via Differentiable Reinforcement Learning (DiffRL), enabling efficient learning of low-level control from visual and linguistic inputs. At its core is a Mixture-of-Experts (MoE) action head, which adaptively routes computation to improve generalization while mitigating forgetting. In multi-task generalization experiments, GRaD-Nav++ achieves a success rate of 83% on trained tasks and 75% on unseen tasks in simulation. When deployed on real hardware, it attains 67% success on trained tasks and 50% on unseen ones. In multi-environment adaptation experiments, GRaD-Nav++ achieves an average success rate of 81% across diverse simulated environments and 67% across varied real-world settings. These results establish a new benchmark for fully onboard Vision-Language-Action (VLA) flight and demonstrate that compact, efficient models can enable reliable, language-guided navigation without relying on external infrastructure.
☆ Diffusion-based Inverse Observation Model for Artificial Skin
Contact-based estimation of object pose is challenging due to discontinuities and ambiguous observations that can correspond to multiple possible system states. This multimodality makes it difficult to efficiently sample valid hypotheses while respecting contact constraints. Diffusion models can learn to generate samples from such multimodal probability distributions through denoising algorithms. We leverage these probabilistic modeling capabilities to learn an inverse observation model conditioned on tactile measurements acquired from a distributed artificial skin. We present simulated experiments demonstrating efficient sampling of contact hypotheses for object pose estimation through touch.
comment: Accepted to RSS 2025 workshop on Navigating Contact Dynamics in Robotics
☆ A Cooperative Contactless Object Transport with Acoustic Robots IROS 2025
Cooperative transport, the simultaneous movement of an object by multiple agents, has been widely observed in biological systems such as ant colonies, which improve efficiency and adaptability in dynamic environments. Inspired by these natural phenomena, we present a novel acoustic robotic system for the transport of contactless objects in mid-air. Our system leverages phased ultrasonic transducers and a robotic control system onboard to generate localized acoustic pressure fields, enabling precise manipulation of airborne particles and robots. We categorize contactless object-transport strategies into independent transport (uncoordinated) and forward-facing cooperative transport (coordinated), drawing parallels with biological systems to optimize efficiency and robustness. The proposed system is experimentally validated by evaluating levitation stability using a microphone in the measurement lab, transport efficiency through a phase-space motion capture system, and clock synchronization accuracy via an oscilloscope. The results demonstrate the feasibility of both independent and cooperative airborne object transport. This research contributes to the field of acoustophoretic robotics, with potential applications in contactless material handling, micro-assembly, and biomedical applications.
comment: This paper has been accepted for publication in the Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025) as oral presentation, 8 pages with 8 figures
☆ ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection
When designing robots to assist in everyday human activities, it is crucial to enhance user requests with visual cues from their surroundings for improved intent understanding. This process is defined as a multimodal classification task. However, gathering a large-scale dataset encompassing both visual and linguistic elements for model training is challenging and time-consuming. To address this issue, our paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios, encompassing both dialogues and related environmental imagery. This approach involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts, followed by the use of a stable diffusion model to create images depicting these environments. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions in response to user interactions with the limited target data. Our experimental results, based on a dataset collected from real-world scenarios, demonstrate that our methodology significantly enhances the robot's action selection capabilities, achieving the state-of-the-art performance.
comment: IWSDS 2024 Best Paper Award
☆ Socially-aware Object Transportation by a Mobile Manipulator in Static Planar Environments with Obstacles
Socially-aware robotic navigation is essential in environments where humans and robots coexist, ensuring both safety and comfort. However, most existing approaches have been primarily developed for mobile robots, leaving a significant gap in research that addresses the unique challenges posed by mobile manipulators. In this paper, we tackle the challenge of navigating a robotic mobile manipulator, carrying a non-negligible load, within a static human-populated environment while adhering to social norms. Our goal is to develop a method that enables the robot to simultaneously manipulate an object and navigate between locations in a socially-aware manner. We propose an approach based on the Risk-RRT* framework that enables the coordinated actuation of both the mobile base and manipulator. This approach ensures collision-free navigation while adhering to human social preferences. We compared our approach in a simulated environment to socially-aware mobile-only methods applied to a mobile manipulator. The results highlight the necessity for mobile manipulator-specific techniques, with our method outperforming mobile-only approaches. Our method enabled the robot to navigate, transport an object, avoid collisions, and minimize social discomfort effectively.
comment: Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (ROMAN)
☆ Beyond the Plane: A 3D Representation of Human Personal Space for Socially-Aware Robotics
The increasing presence of robots in human environments requires them to exhibit socially appropriate behavior, adhering to social norms. A critical aspect in this context is the concept of personal space, a psychological boundary around an individual that influences their comfort based on proximity. This concept extends to human-robot interaction, where robots must respect personal space to avoid causing discomfort. While much research has focused on modeling personal space in two dimensions, almost none have considered the vertical dimension. In this work, we propose a novel three-dimensional personal space model that integrates both height (introducing a discomfort function along the Z-axis) and horizontal proximity (via a classic XY-plane formulation) to quantify discomfort. To the best of our knowledge, this is the first work to compute discomfort in 3D space at any robot component's position, considering the person's configuration and height.
comment: Accepted by the 2025 34th IEEE International Conference on Robot and Human Interactive Communication (ROMAN)
☆ TUM Teleoperation: Open Source Software for Remote Driving and Assistance of Automated Vehicles
Teleoperation is a key enabler for future mobility, supporting Automated Vehicles in rare and complex scenarios beyond the capabilities of their automation. Despite ongoing research, no open source software currently combines Remote Driving, e.g., via steering wheel and pedals, Remote Assistance through high-level interaction with automated driving software modules, and integration with a real-world vehicle for practical testing. To address this gap, we present a modular, open source teleoperation software stack that can interact with an automated driving software, e.g., Autoware, enabling Remote Assistance and Remote Driving. The software featuresstandardized interfaces for seamless integration with various real-world and simulation platforms, while allowing for flexible design of the human-machine interface. The system is designed for modularity and ease of extension, serving as a foundation for collaborative development on individual software components as well as realistic testing and user studies. To demonstrate the applicability of our software, we evaluated the latency and performance of different vehicle platforms in simulation and real-world. The source code is available on GitHub
☆ DynaGuide: Steering Diffusion Polices with Active Dynamic Guidance
Deploying large, complex policies in the real world requires the ability to steer them to fit the needs of a situation. Most common steering approaches, like goal-conditioning, require training the robot policy with a distribution of test-time objectives in mind. To overcome this limitation, we present DynaGuide, a steering method for diffusion policies using guidance from an external dynamics model during the diffusion denoising process. DynaGuide separates the dynamics model from the base policy, which gives it multiple advantages, including the ability to steer towards multiple objectives, enhance underrepresented base policy behaviors, and maintain robustness on low-quality objectives. The separate guidance signal also allows DynaGuide to work with off-the-shelf pretrained diffusion policies. We demonstrate the performance and features of DynaGuide against other steering approaches in a series of simulated and real experiments, showing an average steering success of 70% on a set of articulated CALVIN tasks and outperforming goal-conditioning by 5.4x when steered with low-quality objectives. We also successfully steer an off-the-shelf real robot policy to express preference for particular objects and even create novel behavior. Videos and more can be found on the project website: https://dynaguide.github.io
comment: 9 pages main, 21 pages with appendix and citations. 9 figures. Submitted to Neurips 2025
☆ Sequence Modeling for Time-Optimal Quadrotor Trajectory Optimization with Sampling-based Robustness Analysis
Time-optimal trajectories drive quadrotors to their dynamic limits, but computing such trajectories involves solving non-convex problems via iterative nonlinear optimization, making them prohibitively costly for real-time applications. In this work, we investigate learning-based models that imitate a model-based time-optimal trajectory planner to accelerate trajectory generation. Given a dataset of collision-free geometric paths, we show that modeling architectures can effectively learn the patterns underlying time-optimal trajectories. We introduce a quantitative framework to analyze local analytic properties of the learned models, and link them to the Backward Reachable Tube of the geometric tracking controller. To enhance robustness, we propose a data augmentation scheme that applies random perturbations to the input paths. Compared to classical planners, our method achieves substantial speedups, and we validate its real-time feasibility on a hardware quadrotor platform. Experiments demonstrate that the learned models generalize to previously unseen path lengths. The code for our approach can be found here: https://github.com/maokat12/lbTOPPQuad
☆ Scaling Algorithm Distillation for Continuous Control with Mamba
Algorithm Distillation (AD) was recently proposed as a new approach to perform In-Context Reinforcement Learning (ICRL) by modeling across-episodic training histories autoregressively with a causal transformer model. However, due to practical limitations induced by the attention mechanism, experiments were bottlenecked by the transformer's quadratic complexity and limited to simple discrete environments with short time horizons. In this work, we propose leveraging the recently proposed Selective Structured State Space Sequence (S6) models, which achieved state-of-the-art (SOTA) performance on long-range sequence modeling while scaling linearly in sequence length. Through four complex and continuous Meta Reinforcement Learning environments, we demonstrate the overall superiority of Mamba, a model built with S6 layers, over a transformer model for AD. Additionally, we show that scaling AD to very long contexts can improve ICRL performance and make it competitive even with a SOTA online meta RL baseline.
☆ ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning
Visuomotor policies often suffer from perceptual challenges, where visual differences between training and evaluation environments degrade policy performance. Policies relying on state estimations, like 6D pose, require task-specific tracking and are difficult to scale, while raw sensor-based policies may lack robustness to small visual disturbances.In this work, we leverage 2D keypoints - spatially consistent features in the image frame - as a flexible state representation for robust policy learning and apply it to both sim-to-real transfer and real-world imitation learning. However, the choice of which keypoints to use can vary across objects and tasks. We propose a novel method, ATK, to automatically select keypoints in a task-driven manner so that the chosen keypoints are predictive of optimal behavior for the given task. Our proposal optimizes for a minimal set of keypoints that focus on task-relevant parts while preserving policy performance and robustness. We distill expert data (either from an expert policy in simulation or a human expert) into a policy that operates on RGB images while tracking the selected keypoints. By leveraging pre-trained visual modules, our system effectively encodes states and transfers policies to the real-world evaluation scenario despite wide scene variations and perceptual challenges such as transparent objects, fine-grained tasks, and deformable objects manipulation. We validate ATK on various robotic tasks, demonstrating that these minimal keypoint representations significantly improve robustness to visual disturbances and environmental variations. See all experiments and more details on our website.
☆ A Survey on World Models Grounded in Acoustic Physical Information
This survey provides a comprehensive overview of the emerging field of world models grounded in the foundation of acoustic physical information. It examines the theoretical underpinnings, essential methodological frameworks, and recent technological advancements in leveraging acoustic signals for high-fidelity environmental perception, causal physical reasoning, and predictive simulation of dynamic events. The survey explains how acoustic signals, as direct carriers of mechanical wave energy from physical events, encode rich, latent information about material properties, internal geometric structures, and complex interaction dynamics. Specifically, this survey establishes the theoretical foundation by explaining how fundamental physical laws govern the encoding of physical information within acoustic signals. It then reviews the core methodological pillars, including Physics-Informed Neural Networks (PINNs), generative models, and self-supervised multimodal learning frameworks. Furthermore, the survey details the significant applications of acoustic world models in robotics, autonomous driving, healthcare, and finance. Finally, it systematically outlines the important technical and ethical challenges while proposing a concrete roadmap for future research directions toward robust, causal, uncertainty-aware, and responsible acoustic intelligence. These elements collectively point to a research pathway towards embodied active acoustic intelligence, empowering AI systems to construct an internal "intuitive physics" engine through sound.
comment: 28 pages,11 equations
♻ ☆ EmbodiedGen: Towards a Generative 3D World Engine for Embodied Intelligence
Constructing a physically realistic and accurately scaled simulated 3D world is crucial for the training and evaluation of embodied intelligence tasks. The diversity, realism, low cost accessibility and affordability of 3D data assets are critical for achieving generalization and scalability in embodied AI. However, most current embodied intelligence tasks still rely heavily on traditional 3D computer graphics assets manually created and annotated, which suffer from high production costs and limited realism. These limitations significantly hinder the scalability of data driven approaches. We present EmbodiedGen, a foundational platform for interactive 3D world generation. It enables the scalable generation of high-quality, controllable and photorealistic 3D assets with accurate physical properties and real-world scale in the Unified Robotics Description Format (URDF) at low cost. These assets can be directly imported into various physics simulation engines for fine-grained physical control, supporting downstream tasks in training and evaluation. EmbodiedGen is an easy-to-use, full-featured toolkit composed of six key modules: Image-to-3D, Text-to-3D, Texture Generation, Articulated Object Generation, Scene Generation and Layout Generation. EmbodiedGen generates diverse and interactive 3D worlds composed of generative 3D assets, leveraging generative AI to address the challenges of generalization and evaluation to the needs of embodied intelligence related research. Code is available at https://horizonrobotics.github.io/robot_lab/embodied_gen/index.html.
♻ ☆ Pursuit-Evasion for Car-like Robots with Sensor Constraints IROS 2025
We study a pursuit-evasion game between two players with car-like dynamics and sensing limitations by formalizing it as a partially observable stochastic zero-sum game. The partial observability caused by the sensing constraints is particularly challenging. As an example, in a situation where the agents have no visibility of each other, they would need to extract information from their sensor coverage history to reason about potential locations of their opponents. However, keeping historical information greatly increases the size of the state space. To mitigate the challenges encountered with such partially observable problems, we develop a new learning-based method that encodes historical information to a belief state and uses it to generate agent actions. Through experiments we show that the learned strategies improve over existing multi-agent RL baselines by up to 16 % in terms of capture rate for the pursuer. Additionally, we present experimental results showing that learned belief states are strong state estimators for extending existing game theory solvers and demonstrate our method's competitiveness for problems where existing fully observable game theory solvers are computationally feasible. Finally, we deploy the learned policies on physical robots for a game between the F1TENTH and JetRacer platforms moving as fast as $\textbf{2 m/s}$ in indoor environments, showing that they can be executed on real-robots.
comment: Accepted for publication in the Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025) as oral presentation
♻ ☆ JAEGER: Dual-Level Humanoid Whole-Body Controller
This paper presents JAEGER, a dual-level whole-body controller for humanoid robots that addresses the challenges of training a more robust and versatile policy. Unlike traditional single-controller approaches, JAEGER separates the control of the upper and lower bodies into two independent controllers, so that they can better focus on their distinct tasks. This separation alleviates the dimensionality curse and improves fault tolerance. JAEGER supports both root velocity tracking (coarse-grained control) and local joint angle tracking (fine-grained control), enabling versatile and stable movements. To train the controller, we utilize a human motion dataset (AMASS), retargeting human poses to humanoid poses through an efficient retargeting network, and employ a curriculum learning approach. This method performs supervised learning for initialization, followed by reinforcement learning for further exploration. We conduct our experiments on two humanoid platforms and demonstrate the superiority of our approach against state-of-the-art methods in both simulation and real environments.
comment: 15 pages, 2 figures
♻ ☆ AirIO: Learning Inertial Odometry with Enhanced IMU Feature Observability
Inertial odometry (IO) using only Inertial Measurement Units (IMUs) offers a lightweight and cost-effective solution for Unmanned Aerial Vehicle (UAV) applications, yet existing learning-based IO models often fail to generalize to UAVs due to the highly dynamic and non-linear-flight patterns that differ from pedestrian motion. In this work, we identify that the conventional practice of transforming raw IMU data to global coordinates undermines the observability of critical kinematic information in UAVs. By preserving the body-frame representation, our method achieves substantial performance improvements, with a 66.7% average increase in accuracy across three datasets. Furthermore, explicitly encoding attitude information into the motion network results in an additional 23.8% improvement over prior results. Combined with a data-driven IMU correction model (AirIMU) and an uncertainty-aware Extended Kalman Filter (EKF), our approach ensures robust state estimation under aggressive UAV maneuvers without relying on external sensors or control inputs. Notably, our method also demonstrates strong generalizability to unseen data not included in the training set, underscoring its potential for real-world UAV applications.
♻ ☆ Structureless VIO
Visual odometry (VO) is typically considered as a chicken-and-egg problem, as the localization and mapping modules are tightly-coupled. The estimation of a visual map relies on accurate localization information. Meanwhile, localization requires precise map points to provide motion constraints. This classical design principle is naturally inherited by visual-inertial odometry (VIO). Efficient localization solutions that do not require a map have not been fully investigated. To this end, we propose a novel structureless VIO, where the visual map is removed from the odometry framework. Experimental results demonstrated that, compared to the structure-based VIO baseline, our structureless VIO not only substantially improves computational efficiency but also has advantages in accuracy.
comment: Accepted by the SLAM Workshop at RSS 2025
♻ ☆ General agents need world models ICML 2025
Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent's policy, and that increasing the agents performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.
comment: Accepted ICML 2025
♻ ☆ Zero-Shot Temporal Interaction Localization for Egocentric Videos
Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.
♻ ☆ Real-time Seafloor Segmentation and Mapping
Posidonia oceanica meadows are a species of seagrass highly dependent on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts
♻ ☆ Active Perception for Tactile Sensing: A Task-Agnostic Attention-Based Approach
Humans make extensive use of haptic exploration to map and identify the properties of the objects that we touch. In robotics, active tactile perception has emerged as an important research domain that complements vision for tasks such as object classification, shape reconstruction, and manipulation. This work introduces TAP (Task-agnostic Active Perception) -- a novel framework that leverages reinforcement learning (RL) and transformer-based architectures to address the challenges posed by partially observable environments. TAP integrates Soft Actor-Critic (SAC) and CrossQ algorithms within a unified optimization objective, jointly training a perception module and decision-making policy. By design, TAP is completely task-agnostic and can, in principle, generalize to any active perception problem. We evaluate TAP across diverse tasks, including toy examples and realistic applications involving haptic exploration of 3D models from the Tactile MNIST benchmark. Experiments demonstrate the efficacy of TAP, achieving high accuracies on the Tactile MNIST haptic digit recognition task and a tactile pose estimation task. These findings underscore the potential of TAP as a versatile and generalizable framework for advancing active tactile perception in robotics.
comment: 16 pages; 13 figures Under Review
♻ ☆ Cybersecurity and Embodiment Integrity for Modern Robots: A Conceptual Framework
Thanks to new technologies and communication paradigms, such as the Internet of Things (IoT) and the Robotic Operating System (ROS), modern robots can be built by combining heterogeneous standard devices in a single embodiment. Although this approach brings high degrees of modularity, it also yields uncertainty, with regard to providing cybersecurity assurances and guarantees on the integrity of the embodiment. In this paper, first we illustrate how cyberattacks on different devices can have radically different consequences on the robot's ability to complete its tasks and preserve its embodiment. We also claim that modern robots should have self-awareness for what concerns such aspects, and formulate in two propositions the different characteristics that robots should integrate for doing so. Then, we show how these propositions relate to two established cybersecurity frameworks, the NIST Cybersecurity Framework and the MITRE ATT&CK, and we argue that achieving these propositions requires that robots possess at least three properties for mapping devices and tasks. Last, we reflect on how these three properties could be achieved in a larger conceptual framework.
comment: 18 pages, 2 figures, 4 tables
♻ ☆ BiFold: Bimanual Cloth Folding with Language Guidance ICRA 2025
Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.
comment: Accepted at ICRA 2025. Project page at https://barbany.github.io/bifold/
♻ ☆ Canonical Representation and Force-Based Pretraining of 3D Tactile for Dexterous Visuo-Tactile Policy Learning ICRA 2025
Tactile sensing plays a vital role in enabling robots to perform fine-grained, contact-rich tasks. However, the high dimensionality of tactile data, due to the large coverage on dexterous hands, poses significant challenges for effective tactile feature learning, especially for 3D tactile data, as there are no large standardized datasets and no strong pretrained backbones. To address these challenges, we propose a novel canonical representation that reduces the difficulty of 3D tactile feature learning and further introduces a force-based self-supervised pretraining task to capture both local and net force features, which are crucial for dexterous manipulation. Our method achieves an average success rate of 78% across four fine-grained, contact-rich dexterous manipulation tasks in real-world experiments, demonstrating effectiveness and robustness compared to other methods. Further analysis shows that our method fully utilizes both spatial and force information from 3D tactile data to accomplish the tasks. The codes and videos can be viewed at https://3dtacdex.github.io.
comment: Accepted to ICRA 2025
♻ ☆ ActiveSplat: High-Fidelity Scene Reconstruction through Active Gaussian Splatting
We propose ActiveSplat, an autonomous high-fidelity reconstruction system leveraging Gaussian splatting. Taking advantage of efficient and realistic rendering, the system establishes a unified framework for online mapping, viewpoint selection, and path planning. The key to ActiveSplat is a hybrid map representation that integrates both dense information about the environment and a sparse abstraction of the workspace. Therefore, the system leverages sparse topology for efficient viewpoint sampling and path planning, while exploiting view-dependent dense prediction for viewpoint selection, facilitating efficient decision-making with promising accuracy and completeness. A hierarchical planning strategy based on the topological map is adopted to mitigate repetitive trajectories and improve local granularity given limited time budgets, ensuring high-fidelity reconstruction with photorealistic view synthesis. Extensive experiments and ablation studies validate the efficacy of the proposed method in terms of reconstruction accuracy, data coverage, and exploration efficiency. The released code will be available on our project page: https://li-yuetao.github.io/ActiveSplat/.
comment: Accepted to IEEE RA-L. Code: https://github.com/Li-Yuetao/ActiveSplat, Project: https://li-yuetao.github.io/ActiveSplat/
♻ ☆ Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System
Heterogeneous multi-robot systems show great potential in complex tasks requiring hybrid cooperation. However, traditional approaches relying on static models often struggle with task diversity and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical framework integrating a prompted Large Language Model (LLM) and a GridMask-enhanced fine-tuned Vision Language Model (VLM). The LLM decomposes tasks and constructs a global semantic map, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning. Within this framework, the aerial robot follows an optimized global semantic path and continuously provides bird-view images, guiding the ground robot's local semantic navigation and manipulation, including target-absent scenarios where implicit alignment is maintained. Experiments on real-world cube or object arrangement tasks demonstrate the framework's adaptability and robustness in dynamic environments. To the best of our knowledge, this is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.
♻ ☆ Efficient Estimation of Relaxed Model Parameters for Robust UAV Trajectory Optimization
Online trajectory optimization and optimal control methods are crucial for enabling sustainable unmanned aerial vehicle (UAV) services, such as agriculture, environmental monitoring, and transportation, where available actuation and energy are limited. However, optimal controllers are highly sensitive to model mismatch, which can occur due to loaded equipment, packages to be delivered, or pre-existing variability in fundamental structural and thrust-related parameters. To circumvent this problem, optimal controllers can be paired with parameter estimators to improve their trajectory planning performance and perform adaptive control. However, UAV platforms are limited in terms of onboard processing power, oftentimes making nonlinear parameter estimation too computationally expensive to consider. To address these issues, we propose a relaxed, affine-in-parameters multirotor model along with an efficient optimal parameter estimator. We convexify the nominal Moving Horizon Parameter Estimation (MHPE) problem into a linear-quadratic form (LQ-MHPE) via an affine-in-parameter relaxation on the nonlinear dynamics, resulting in fast quadratic programs (QPs) that facilitate adaptive Model Predictve Control (MPC) in real time. We compare this approach to the equivalent nonlinear estimator in Monte Carlo simulations, demonstrating a decrease in average solve time and trajectory optimality cost by 98.2% and 23.9-56.2%, respectively.
comment: 8 pages, 5 figures. Published in IEEE Sustech 2025, see https://ieeexplore.ieee.org/document/11025659
♻ ☆ Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey
Learning-based 3D reconstruction has emerged as a transformative technique in autonomous driving, enabling precise modeling of both dynamic and static environments through advanced neural representations. Despite data augmentation, 3D reconstruction inspires pioneering solution for vital tasks in the field of autonomous driving, such as scene understanding and closed-loop simulation. We investigates the details of 3D reconstruction and conducts a multi-perspective, in-depth analysis of recent advancements. Specifically, we first provide a systematic introduction of preliminaries, including data modalities, benchmarks and technical preliminaries of learning-based 3D reconstruction, facilitating instant identification of suitable methods according to sensor suites. Then, we systematically review learning-based 3D reconstruction methods in autonomous driving, categorizing approaches by subtasks and conducting multi-dimensional analysis and summary to establish a comprehensive technical reference. The development trends and existing challenges are summarized in the context of learning-based 3D reconstruction in autonomous driving. We hope that our review will inspire future researches.
♻ ☆ Towards Infant Sleep-Optimized Driving: Synergizing Wearable and Vehicle Sensing in Intelligent Cruise Control
Automated driving (AD) has substantially improved vehicle safety and driving comfort, but their impact on passenger well-being, particularly infant sleep, is not sufficiently studied. Sudden acceleration, abrupt braking, and sharp maneuvers can disrupt infant sleep, compromising both passenger comfort and parental convenience. To solve this problem, this paper explores the integration of reinforcement learning (RL) within AD to personalize driving behavior and optimally balance occupant comfort and travel efficiency. In particular, we propose an intelligent cruise control framework that adapts to varying driving conditions to enhance infant sleep quality by effectively synergizing wearable sensing and vehicle data. Long short-term memory (LSTM) and transformer-based neural networks are integrated with RL to model the relationship between driving behavior and infant sleep quality under diverse traffic and road conditions. Based on the sleep quality indicators from the wearable sensors, driving action data from vehicle controllers, and map data from map applications, the model dynamically computes the optimal driving aggressiveness level, which is subsequently translated into specific AD control strategies, e.g., the magnitude and frequency of acceleration, lane change, and overtaking. Simulation results demonstrate that the proposed solution significantly improves infant sleep quality compared to baseline methods, while preserving desirable travel efficiency.
♻ ☆ NGD-SLAM: Towards Real-Time Dynamic SLAM without GPU
Many existing visual SLAM methods can achieve high localization accuracy in dynamic environments by leveraging deep learning to mask moving objects. However, these methods incur significant computational overhead as the camera tracking needs to wait for the deep neural network to generate mask at each frame, and they typically require GPUs for real-time operation, which restricts their practicality in real-world robotic applications. Therefore, this paper proposes a real-time dynamic SLAM system that runs exclusively on a CPU. Our approach incorporates a mask propagation mechanism that decouples camera tracking and deep learning-based masking for each frame. We also introduce a hybrid tracking strategy that integrates ORB features with optical flow methods, enhancing both robustness and efficiency by selectively allocating computational resources to input frames. Compared to previous methods, our system maintains high localization accuracy in dynamic environments while achieving a tracking frame rate of 60 FPS on a laptop CPU. These results demonstrate the feasibility of utilizing deep learning for dynamic SLAM without GPU support. Since most existing dynamic SLAM systems are not open-source, we make our code publicly available at: https://github.com/yuhaozhang7/NGD-SLAM
comment: 7 pages, 6 figures
♻ ☆ Semantic Enhancement for Object SLAM with Heterogeneous Multimodal Large Language Model Agents
Object Simultaneous Localization and Mapping (SLAM) systems struggle to correctly associate semantically similar objects in close proximity, especially in cluttered indoor environments and when scenes change. We present Semantic Enhancement for Object SLAM (SEO-SLAM), a novel framework that enhances semantic mapping by integrating heterogeneous multimodal large language model (MLLM) agents. Our method enables scene adaptation while maintaining a semantically rich map. To improve computational efficiency, we propose an asynchronous processing scheme that significantly reduces the agents' inference time without compromising semantic accuracy or SLAM performance. Additionally, we introduce a multi-data association strategy using a cost matrix that combines semantic and Mahalanobis distances, formulating the problem as a Linear Assignment Problem (LAP) to alleviate perceptual aliasing. Experimental results demonstrate that SEO-SLAM consistently achieves higher semantic accuracy and reduces false positives compared to baselines, while our asynchronous MLLM agents significantly improve processing efficiency over synchronous setups. We also demonstrate that SEO-SLAM has the potential to improve downstream tasks such as robotic assistance. Our dataset is publicly available at: jungseokhong.com/SEO-SLAM.
♻ ☆ PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation
Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1302 fine-grained manipulation tasks organized into 16 task classes. Our training set consists of over 10,000 expert demonstrations synthesized in a 3D simulator, where each demonstration is paired with a high-level task instruction, a chain of base part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art robot manipulation approaches, including end-to-end vision-language policy learning and bi-level planning models for robot manipulation on our benchmark. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and face challenges when manipulating object parts in long-horizon tasks.
Robotics 15
☆ Bridging Data-Driven and Physics-Based Models: A Consensus Multi-Model Kalman Filter for Robust Vehicle State Estimation
Vehicle state estimation presents a fundamental challenge for autonomous driving systems, requiring both physical interpretability and the ability to capture complex nonlinear behaviors across diverse operating conditions. Traditional methodologies often rely exclusively on either physics-based or data-driven models, each with complementary strengths and limitations that become most noticeable during critical scenarios. This paper presents a novel consensus multi-model Kalman filter framework that integrates heterogeneous model types to leverage their complementary strengths while minimizing individual weaknesses. We introduce two distinct methodologies for handling covariance propagation in data-driven models: a Koopman operator-based linearization approach enabling analytical covariance propagation, and an ensemble-based method providing unified uncertainty quantification across model types without requiring pretraining. Our approach implements an iterative consensus fusion procedure that dynamically weighs different models based on their demonstrated reliability in current operating conditions. The experimental results conducted on an electric all-wheel-drive Equinox vehicle demonstrate performance improvements over single-model techniques, with particularly significant advantages during challenging maneuvers and varying road conditions, confirming the effectiveness and robustness of the proposed methodology for safety-critical autonomous driving applications.
☆ KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills
Humanoid robots are promising to acquire various skills by imitating human behaviors. However, existing algorithms are only capable of tracking smooth, low-speed human motions, even with delicate reward and curriculum design. This paper presents a physics-based humanoid control framework, aiming to master highly-dynamic human behaviors such as Kungfu and dancing through multi-steps motion processing and adaptive motion tracking. For motion processing, we design a pipeline to extract, filter out, correct, and retarget motions, while ensuring compliance with physical constraints to the maximum extent. For motion imitation, we formulate a bi-level optimization problem to dynamically adjust the tracking accuracy tolerance based on the current tracking error, creating an adaptive curriculum mechanism. We further construct an asymmetric actor-critic framework for policy training. In experiments, we train whole-body control policies to imitate a set of highly-dynamic motions. Our method achieves significantly lower tracking errors than existing approaches and is successfully deployed on the Unitree G1 robot, demonstrating stable and expressive behaviors. The project page is https://kungfu-bot.github.io.
☆ Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models ICML 2025
Designing effective reward functions remains a fundamental challenge in reinforcement learning (RL), as it often requires extensive human effort and domain expertise. While RL from human feedback has been successful in aligning agents with human intent, acquiring high-quality feedback is costly and labor-intensive, limiting its scalability. Recent advancements in foundation models present a promising alternative--leveraging AI-generated feedback to reduce reliance on human supervision in reward learning. Building on this paradigm, we introduce ERL-VLM, an enhanced rating-based RL method that effectively learns reward functions from AI feedback. Unlike prior methods that rely on pairwise comparisons, ERL-VLM queries large vision-language models (VLMs) for absolute ratings of individual trajectories, enabling more expressive feedback and improved sample efficiency. Additionally, we propose key enhancements to rating-based RL, addressing instability issues caused by data imbalance and noisy labels. Through extensive experiments across both low-level and high-level control tasks, we demonstrate that ERL-VLM significantly outperforms existing VLM-based reward generation methods. Our results demonstrate the potential of AI feedback for scaling RL with minimal human intervention, paving the way for more autonomous and efficient reward learning.
comment: Accepted to ICML 2025
☆ From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots
Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments on two simulations and a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world.
☆ RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control
This paper focuses on a critical challenge in robotics: translating text-driven human motions into executable actions for humanoid robots, enabling efficient and cost-effective learning of new behaviors. While existing text-to-motion generation methods achieve semantic alignment between language and motion, they often produce kinematically or physically infeasible motions unsuitable for real-world deployment. To bridge this sim-to-real gap, we propose Reinforcement Learning from Physical Feedback (RLPF), a novel framework that integrates physics-aware motion evaluation with text-conditioned motion generation. RLPF employs a motion tracking policy to assess feasibility in a physics simulator, generating rewards for fine-tuning the motion generator. Furthermore, RLPF introduces an alignment verification module to preserve semantic fidelity to text instructions. This joint optimization ensures both physical plausibility and instruction alignment. Extensive experiments show that RLPF greatly outperforms baseline methods in generating physically feasible motions while maintaining semantic correspondence with text instruction, enabling successful deployment on real humanoid robots.
☆ On-board Sonar Data Classification for Path Following in Underwater Vehicles using Fast Interval Type-2 Fuzzy Extreme Learning Machine
In autonomous underwater missions, the successful completion of predefined paths mainly depends on the ability of underwater vehicles to recognise their surroundings. In this study, we apply the concept of Fast Interval Type-2 Fuzzy Extreme Learning Machine (FIT2-FELM) to train a Takagi-Sugeno-Kang IT2 Fuzzy Inference System (TSK IT2-FIS) for on-board sonar data classification using an underwater vehicle called BlueROV2. The TSK IT2-FIS is integrated into a Hierarchical Navigation Strategy (HNS) as the main navigation engine to infer local motions and provide the BlueROV2 with full autonomy to follow an obstacle-free trajectory in a water container of 2.5m x 2.5m x 3.5m. Compared to traditional navigation architectures, using the proposed method, we observe a robust path following behaviour in the presence of uncertainty and noise. We found that the proposed approach provides the BlueROV with a more complete sensory picture about its surroundings while real-time navigation planning is performed by the concurrent execution of two or more tasks.
☆ Physics-informed Neural Motion Planning via Domain Decomposition in Large Environments
Physics-informed Neural Motion Planners (PiNMPs) provide a data-efficient framework for solving the Eikonal Partial Differential Equation (PDE) and representing the cost-to-go function for motion planning. However, their scalability remains limited by spectral bias and the complex loss landscape of PDE-driven training. Domain decomposition mitigates these issues by dividing the environment into smaller subdomains, but existing methods enforce continuity only at individual spatial points. While effective for function approximation, these methods fail to capture the spatial connectivity required for motion planning, where the cost-to-go function depends on both the start and goal coordinates rather than a single query point. We propose Finite Basis Neural Time Fields (FB-NTFields), a novel neural field representation for scalable cost-to-go estimation. Instead of enforcing continuity in output space, FB-NTFields construct a latent space representation, computing the cost-to-go as a distance between the latent embeddings of start and goal coordinates. This enables global spatial coherence while integrating domain decomposition, ensuring efficient large-scale motion planning. We validate FB-NTFields in complex synthetic and real-world scenarios, demonstrating substantial improvements over existing PiNMPs. Finally, we deploy our method on a Unitree B1 quadruped robot, successfully navigating indoor environments. The supplementary videos can be found at https://youtu.be/OpRuCbLNOwM.
☆ Multimodal Large Language Models-Enabled UAV Swarm: Towards Efficient and Intelligent Autonomous Aerial Systems
Recent breakthroughs in multimodal large language models (MLLMs) have endowed AI systems with unified perception, reasoning and natural-language interaction across text, image and video streams. Meanwhile, Unmanned Aerial Vehicle (UAV) swarms are increasingly deployed in dynamic, safety-critical missions that demand rapid situational understanding and autonomous adaptation. This paper explores potential solutions for integrating MLLMs with UAV swarms to enhance the intelligence and adaptability across diverse tasks. Specifically, we first outline the fundamental architectures and functions of UAVs and MLLMs. Then, we analyze how MLLMs can enhance the UAV system performance in terms of target detection, autonomous navigation, and multi-agent coordination, while exploring solutions for integrating MLLMs into UAV systems. Next, we propose a practical case study focused on the forest fire fighting. To fully reveal the capabilities of the proposed framework, human-machine interaction, swarm task planning, fire assessment, and task execution are investigated. Finally, we discuss the challenges and future research directions for the MLLMs-enabled UAV swarm. An experiment illustration video could be found online at https://youtu.be/zwnB9ZSa5A4.
comment: 8 pages, 5 figures,submitted to IEEE wcm
☆ Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence
End-to-end visuomotor policies trained using behavior cloning have shown a remarkable ability to generate complex, multi-modal low-level robot behaviors. However, at deployment time, these policies still struggle to act reliably when faced with out-of-distribution (OOD) visuals induced by objects, backgrounds, or environment changes. Prior works in interactive imitation learning solicit corrective expert demonstrations under the OOD conditions -- but this can be costly and inefficient. We observe that task success under OOD conditions does not always warrant novel robot behaviors. In-distribution (ID) behaviors can directly be transferred to OOD conditions that share functional similarities with ID conditions. For example, behaviors trained to interact with in-distribution (ID) pens can apply to interacting with a visually-OOD pencil. The key challenge lies in disambiguating which ID observations functionally correspond to the OOD observation for the task at hand. We propose that an expert can provide this OOD-to-ID functional correspondence. Thus, instead of collecting new demonstrations and re-training at every OOD encounter, our method: (1) detects the need for feedback by first checking if current observations are OOD and then identifying whether the most similar training observations show divergent behaviors, (2) solicits functional correspondence feedback to disambiguate between those behaviors, and (3) intervenes on the OOD observations with the functionally corresponding ID observations to perform deployment-time generalization. We validate our method across diverse real-world robotic manipulation tasks with a Franka Panda robotic manipulator. Our results show that test-time functional correspondences can improve the generalization of a vision-based diffusion policy to OOD objects and environment conditions with low feedback.
comment: 15 pages, 11 figures
☆ Goal-based Self-Adaptive Generative Adversarial Imitation Learning (Goal-SAGAIL) for Multi-goal Robotic Manipulation Tasks
Reinforcement learning for multi-goal robot manipulation tasks poses significant challenges due to the diversity and complexity of the goal space. Techniques such as Hindsight Experience Replay (HER) have been introduced to improve learning efficiency for such tasks. More recently, researchers have combined HER with advanced imitation learning methods such as Generative Adversarial Imitation Learning (GAIL) to integrate demonstration data and accelerate training speed. However, demonstration data often fails to provide enough coverage for the goal space, especially when acquired from human teleoperation. This biases the learning-from-demonstration process toward mastering easier sub-tasks instead of tackling the more challenging ones. In this work, we present Goal-based Self-Adaptive Generative Adversarial Imitation Learning (Goal-SAGAIL), a novel framework specifically designed for multi-goal robot manipulation tasks. By integrating self-adaptive learning principles with goal-conditioned GAIL, our approach enhances imitation learning efficiency, even when limited, suboptimal demonstrations are available. Experimental results validate that our method significantly improves learning efficiency across various multi-goal manipulation scenarios -- including complex in-hand manipulation tasks -- using suboptimal demonstrations provided by both simulation and human experts.
comment: 6 pages, 5 figures
♻ ☆ Object State Estimation Through Robotic Active Interaction for Biological Autonomous Drilling
Estimating the state of biological specimens is challenging due to limited observation through microscopic vision. For instance, during mouse skull drilling, the appearance alters little when thinning bone tissue because of its semi-transparent property and the high-magnification microscopic vision. To obtain the object's state, we introduce an object state estimation method for biological specimens through active interaction based on the deflection. The method is integrated to enhance the autonomous drilling system developed in our previous work. The method and integrated system were evaluated through 12 autonomous eggshell drilling experiment trials. The results show that the system achieved a 91.7% successful ratio and 75% detachable ratio, showcasing its potential applicability in more complex surgical procedures such as mouse skull craniotomy. This research paves the way for further development of autonomous robotic systems capable of estimating the object's state through active interaction.
comment: The first and second authors contribute equally to this research. 6 pages, 5 figures, submitted to RA-L
♻ ☆ SceneComplete: Open-World 3D Scene Completion in Cluttered Real World Environments for Robot Manipulation
Careful robot manipulation in every-day cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. SceneComplete is a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, visual-descriptors and pose-estimation) to obtain highly accurate results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand. We release the code on our website https://scenecomplete.github.io/.
♻ ☆ Physics-informed Neural Mapping and Motion Planning in Unknown Environments
Mapping and motion planning are two essential elements of robot intelligence that are interdependent in generating environment maps and navigating around obstacles. The existing mapping methods create maps that require computationally expensive motion planning tools to find a path solution. In this paper, we propose a new mapping feature called arrival time fields, which is a solution to the Eikonal equation. The arrival time fields can directly guide the robot in navigating the given environments. Therefore, this paper introduces a new approach called Active Neural Time Fields (Active NTFields), which is a physics-informed neural framework that actively explores the unknown environment and maps its arrival time field on the fly for robot motion planning. Our method does not require any expert data for learning and uses neural networks to directly solve the Eikonal equation for arrival time field mapping and motion planning. We benchmark our approach against state-of-the-art mapping and motion planning methods and demonstrate its superior performance in both simulated and real-world environments with a differential drive robot and a 6 degrees-of-freedom (DOF) robot manipulator. The supplementary videos can be found at https://youtu.be/qTPL5a6pRKk, and the implementation code repository is available at https://github.com/Rtlyc/antfields-demo.
comment: Published in: IEEE Transactions on Robotics ( Volume: 41)
♻ ☆ X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real
Human videos offer a scalable way to train robot manipulation policies, but lack the action labels needed by standard imitation learning algorithms. Existing cross-embodiment approaches try to map human motion to robot actions, but often fail when the embodiments differ significantly. We propose X-Sim, a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies. X-Sim starts by reconstructing a photorealistic simulation from an RGBD human video and tracking object trajectories to define object-centric rewards. These rewards are used to train a reinforcement learning (RL) policy in simulation. The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting. To transfer to the real world, X-Sim introduces an online domain adaptation technique that aligns real and simulated observations during deployment. Importantly, X-Sim does not require any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2 environments and show that it: (1) improves task progress by 30% on average over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with 10x less data collection time, and (3) generalizes to new camera viewpoints and test-time changes. Code and videos are available at https://portal-cornell.github.io/X-Sim/.
♻ ☆ Diffusion Graph Neural Networks for Robustness in Olfaction Sensors and Datasets
Robotic odour source localization (OSL) is a critical capability for autonomous systems operating in complex environments. However, current OSL methods often suffer from ambiguities, particularly when robots misattribute odours to incorrect objects due to limitations in olfactory datasets and sensor resolutions. To address this challenge, we introduce a novel machine learning method using diffusion-based molecular generation to enhance odour localization accuracy that can be used by itself or with automated olfactory dataset construction pipelines with vision-language models (VLMs) This generative process of our diffusion model expands the chemical space beyond the limitations of both current olfactory datasets and the training data of VLMs, enabling the identification of potential odourant molecules not previously documented. The generated molecules can then be more accurately validated using advanced olfactory sensors which emulate human olfactory recognition through electronic sensor arrays. By integrating visual analysis, language processing, and molecular generation, our framework enhances the ability of olfaction-vision models on robots to accurately associate odours with their correct sources, thereby improving navigation and decision-making through better sensor selection for a target compound. Our methodology represents a foundational advancement in the field of artificial olfaction, offering a scalable solution to the challenges posed by limited olfactory data and sensor ambiguities.