Robotics 36
☆ Bioinspired Soft Quadrotors Jointly Unlock Agility, Squeezability, and Collision Resilience
Natural flyers use soft wings to seamlessly enable a wide range of flight
behaviours, including agile manoeuvres, squeezing through narrow passageways,
and withstanding collisions. In contrast, conventional quadrotor designs rely
on rigid frames that support agile flight but inherently limit collision
resilience and squeezability, thereby constraining flight capabilities in
cluttered environments. Inspired by the anisotropic stiffness and distributed
mass-energy structures observed in biological organisms, we introduce
FlexiQuad, a soft-frame quadrotor design approach that mitigates this trade-off.
We demonstrate a 405-gram FlexiQuad prototype, three orders of magnitude more
compliant than conventional quadrotors, yet capable of acrobatic manoeuvres
with peak speeds above 80 km/h and linear and angular accelerations exceeding 3
g and 300 rad/s$^2$, respectively. Analysis demonstrates it can replicate
accelerations of rigid counterparts up to a thrust-to-weight ratio of 8.
Simultaneously, FlexiQuad exhibits fourfold higher collision resilience,
surviving frontal impacts at 5 m/s without damage and reducing destabilising
forces in glancing collisions by a factor of 39. Its frame can fully compress,
enabling flight through gaps as narrow as 70% of its nominal width. Our
analysis identifies an optimal structural softness range, from 0.006 to 0.77
N/mm, comparable to that of natural flyers' wings, whereby agility,
squeezability, and collision resilience are jointly achieved for FlexiQuad
models from 20 to 3000 grams. FlexiQuad expands hovering drone capabilities in
complex environments, enabling robust physical interactions without
compromising flight performance.
comment: 26 pages, 12 figures, 2 tables, 9 videos (not yet disclosed, awaiting
peer review)
☆ Stable and Robust SLIP Model Control via Energy Conservation-Based Feedback Cancellation for Quadrupedal Applications
In this paper, we present an energy-conservation based control architecture
for stable dynamic motion in quadruped robots. We model the robot as a
Spring-loaded Inverted Pendulum (SLIP), a model well-suited to represent the
bouncing motion characteristic of running gaits observed in various biological
quadrupeds and bio-inspired robotic systems. The model permits leg-orientation
control during flight and leg-length control during stance, a design choice
inspired by natural quadruped behaviors and prevalent in robotic quadruped
systems. Our control algorithm uses the reduced-order SLIP dynamics of the
quadruped to track a stable parabolic spline during stance, which is calculated
using the principle of energy conservation. Through simulations based on the
design specifications of an actual quadruped robot, Ghost Robotics Minitaur, we
demonstrate that our control algorithm generates stable bouncing gaits.
Additionally, we illustrate the robustness of our controller by showcasing its
ability to maintain stable bouncing even when faced with up to a 10% error in
sensor measurements.
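For intuition, the following is a minimal sketch of the stance-phase SLIP dynamics and the energy ledger such a controller builds on; all parameter values and the integration scheme are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Illustrative SLIP parameters (assumed, not from the paper)
m, k, l0, g = 8.0, 4000.0, 0.3, 9.81   # mass [kg], stiffness [N/m], rest leg length [m]

def stance_step(state, foot, dt=1e-4):
    """One Euler step of Cartesian SLIP stance dynamics about a pinned foot."""
    x, y, vx, vy = state
    dx, dy = x - foot[0], y - foot[1]
    l = np.hypot(dx, dy)
    f = k * (l0 - l)                                  # radial spring force
    ax, ay = f * dx / (l * m), f * dy / (l * m) - g
    return np.array([x + vx * dt, y + vy * dt, vx + ax * dt, vy + ay * dt])

def total_energy(state, foot):
    x, y, vx, vy = state
    l = np.hypot(x - foot[0], y - foot[1])
    return 0.5 * m * (vx**2 + vy**2) + m * g * y + 0.5 * k * (l0 - l)**2

state, foot = np.array([0.0, 0.28, 1.0, -0.5]), np.array([0.0, 0.0])
E0 = total_energy(state, foot)
while np.hypot(state[0] - foot[0], state[1] - foot[1]) < l0:  # until the leg unloads
    state = stance_step(state, foot)
print(f"energy drift over stance: {total_energy(state, foot) - E0:.2e} J")
```

Because total energy is ideally conserved through stance, the touchdown state pins down the liftoff state; this is the kind of relation the paper exploits to compute its parabolic reference spline.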
☆ EverydayVLA: A Vision-Language-Action Model for Affordable Robotic Manipulation ICRA 2026
While Vision-Language-Action (VLA) models map visual inputs and language
instructions directly to robot actions, they often rely on costly hardware and
struggle in novel or cluttered scenes. We introduce EverydayVLA, a 6-DOF
manipulator that can be assembled for under $300 and offers a modest payload
and workspace. A single unified model jointly outputs discrete and continuous
actions, and our adaptive-horizon ensemble monitors motion uncertainty to
trigger on-the-fly re-planning for safe, reliable operation. On LIBERO,
EverydayVLA matches state-of-the-art success rates, and in real-world tests it
outperforms prior methods by 49% in-distribution and 34.9% out-of-distribution.
By combining a state-of-the-art VLA with cost-effective hardware, EverydayVLA
democratizes access to a robotic foundation model and paves the way for
economical use in homes and research labs alike. Experiment videos and details:
https://everydayvla.github.io/
comment: Submitted to ICRA 2026
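The abstract leaves the ensemble mechanics unspecified; one plausible sketch of an adaptive-horizon uncertainty check follows, where the disagreement metric, threshold, and array shapes are all assumptions:

```python
import numpy as np

def adaptive_horizon(policy_samples: np.ndarray, sigma_max: float = 0.02) -> int:
    """Given K sampled action chunks of shape (K, H, action_dim), return the
    number of leading steps whose ensemble disagreement stays below sigma_max.
    Hypothetical illustration of uncertainty-gated execution, not the paper's code."""
    per_step_std = policy_samples.std(axis=0).mean(axis=-1)    # (H,)
    below = per_step_std < sigma_max
    return int(below.argmin()) if not below.all() else len(below)

# Usage: execute only the confident prefix of the chunk, then re-plan.
samples = np.random.normal(0.0, 0.01, size=(8, 16, 7))         # K=8 chunks, H=16, 7-DoF
samples[:, 10:] += np.random.normal(0.0, 0.1, size=(8, 6, 7))  # uncertainty grows late
h = adaptive_horizon(samples)
print(f"execute {h} steps before re-planning")                 # typically h = 10
```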
☆ Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction ICML 2025
Off-dynamics reinforcement learning (RL), where training and deployment
transition dynamics are different, can be formulated as learning in a robust
Markov decision process (RMDP) where uncertainties in transition dynamics are
imposed. Existing literature mostly assumes access to generative models
allowing arbitrary state-action queries or pre-collected datasets with a good
state coverage of the deployment environment, bypassing the challenge of
exploration. In this work, we study a more realistic and challenging setting
where the agent is limited to online interaction with the training environment.
To capture the intrinsic difficulty of exploration in online RMDPs, we
introduce the supremal visitation ratio, a novel quantity that measures the
mismatch between the training dynamics and the deployment dynamics. We show
that if this ratio is unbounded, online learning becomes exponentially hard. We
propose the first computationally efficient algorithm that achieves sublinear
regret in online RMDPs with $f$-divergence based transition uncertainties. We
also establish matching regret lower bounds, demonstrating that our algorithm
achieves optimal dependence on both the supremal visitation ratio and the
number of interaction episodes. Finally, we validate our theoretical results
through comprehensive numerical experiments.
comment: 53 pages, 6 figures, 3 tables. Published in Proceedings of the 42nd
International Conference on Machine Learning (ICML 2025)
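The formal definition of the supremal visitation ratio is not given in the abstract; as a rough numerical proxy for dynamics mismatch, one can compare the visitation distributions induced by a fixed policy under two tabular transition kernels (purely illustrative):

```python
import numpy as np

def visitation(P: np.ndarray, mu0: np.ndarray, horizon: int = 50) -> np.ndarray:
    """Average state-visitation distribution of a Markov chain P over a horizon."""
    d, mu = np.zeros_like(mu0), mu0.copy()
    for _ in range(horizon):
        d += mu / horizon
        mu = mu @ P
    return d

rng = np.random.default_rng(0)
P_train = rng.dirichlet(np.ones(5), size=5)                          # training dynamics
P_deploy = 0.9 * P_train + 0.1 * rng.dirichlet(np.ones(5), size=5)   # perturbed deployment
mu0 = np.full(5, 0.2)
ratio = visitation(P_deploy, mu0) / visitation(P_train, mu0)
print(f"max visitation ratio: {ratio.max():.2f}")
```

Large ratios flag states the training dynamics rarely reach, which matches the intuition behind the paper's exponential-hardness result for unbounded ratios.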
☆ ETHOS: A Robotic Encountered-Type Haptic Display for Social Interaction in Virtual Reality
We present ETHOS (Encountered-Type Haptics for On-demand Social Interaction),
a dynamic encountered-type haptic display (ETHD) that enables natural physical
contact in virtual reality (VR) during social interactions such as handovers,
fist bumps, and high-fives. The system integrates a torque-controlled robotic
manipulator with interchangeable passive props (silicone hand replicas and a
baton), marker-based physical-virtual registration via a ChArUco board, and a
safety monitor that gates motion based on the user's head and hand pose. We
introduce two control strategies: (i) a static mode that presents a stationary
prop aligned with its virtual counterpart, consistent with prior ETHD
baselines, and (ii) a dynamic mode that continuously updates prop position by
exponentially blending an initial mid-point trajectory with real-time hand
tracking, generating a unique contact point for each interaction. Bench tests
show static colocation accuracy of 5.09 +/- 0.94 mm, while user interactions
achieved temporal alignment with an average contact latency of 28.53 +/- 31.21
ms across all interaction and control conditions. These results demonstrate the
feasibility of recreating socially meaningful haptics in VR. By incorporating
essential safety and control mechanisms, ETHOS establishes a practical
foundation for high-fidelity, dynamic interpersonal interactions in virtual
environments.
comment: 8 pages
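One plausible reading of the dynamic mode's exponential blending, sketched in Python; the blend form and gain are assumptions, not the ETHOS implementation:

```python
import numpy as np

def blended_target(p_mid: np.ndarray, p_hand: np.ndarray, t: float, lam: float = 3.0):
    """Exponentially shift the commanded prop position from a precomputed
    mid-point trajectory toward the live tracked hand. Illustrative only."""
    alpha = 1.0 - np.exp(-lam * t)        # 0 at reach onset, approaches 1 near contact
    return (1.0 - alpha) * p_mid + alpha * p_hand

# Usage: early in the reach the robot follows the nominal mid-point path,
# then converges onto the user's actual hand position for contact.
p_mid = np.array([0.40, 0.00, 1.20])      # nominal contact point [m]
p_hand = np.array([0.45, 0.05, 1.15])     # tracked hand position [m]
for t in (0.0, 0.3, 1.0):
    print(t, blended_target(p_mid, p_hand, t))
```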
☆ SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning
Tzu-Yuan Huang, Armin Lederer, Dai-Jie Wu, Xiaobing Dai, Sihua Zhang, Stefan Sosnowski, Shao-Hua Sun, Sandra Hirche
Flow matching (FM) has shown promising results in data-driven planning.
However, it inherently lacks formal guarantees for ensuring state and action
constraints, whose satisfaction is a fundamental and crucial requirement for
the safety and admissibility of planned trajectories on various systems.
Moreover, existing FM planners do not ensure dynamical consistency, which can
render trajectories inexecutable. We address these shortcomings by
proposing SAD-Flower, a novel framework for generating Safe, Admissible, and
Dynamically consistent trajectories. Our approach relies on an augmentation of
the flow with a virtual control input. Thereby, principled guidance can be
derived using techniques from nonlinear control theory, providing formal
guarantees for state constraints, action constraints, and dynamic consistency.
Crucially, SAD-Flower operates without retraining, enabling test-time
satisfaction of unseen constraints. Through extensive experiments across
several tasks, we demonstrate that SAD-Flower outperforms various
generative-model-based baselines in ensuring constraint satisfaction.
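A toy sketch of the augmented-flow idea for a simple box state constraint; the corrective input here is plain proportional feedback, not the paper's nonlinear-control-derived guidance:

```python
import numpy as np

def integrate_guided_flow(v_theta, x0, x_lo, x_hi, steps=100, gain=20.0):
    """Euler-integrate dx = v_theta(x, t) + u, where the virtual control input u
    pushes the sample back inside the box [x_lo, x_hi]. Illustrative stand-in
    for the paper's control-theoretic guidance."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        u = gain * (np.clip(x, x_lo, x_hi) - x)   # zero whenever constraints hold
        x = x + (v_theta(x, i * dt) + u) * dt
    return x

# Dummy "learned" velocity field that pulls toward a goal outside the box.
goal = np.array([1.5, 0.5])
v_theta = lambda x, t: goal - x
x_T = integrate_guided_flow(v_theta, np.zeros(2), x_lo=-1.0, x_hi=1.0)
print(x_T)   # ends near the box boundary instead of at the infeasible goal
```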
☆ Cleaning Maintenance Logs with LLM Agents for Improved Predictive Maintenance
Economic constraints, limited availability of datasets for reproducibility
and shortages of specialized expertise have long been recognized as key
challenges to the adoption and advancement of predictive maintenance (PdM) in
the automotive sector. Recent progress in large language models (LLMs) presents
an opportunity to overcome these barriers and speed up the transition of PdM
from research to industrial practice. Under these conditions, we explore the
potential of LLM-based agents to support PdM cleaning pipelines. Specifically,
we focus on maintenance logs, a critical data source for training
well-performing machine learning (ML) models, but one often affected by errors
such as typos, missing fields, near-duplicate entries, and incorrect dates. We
evaluate LLM agents on cleaning tasks involving six distinct types of noise.
Our findings show that LLMs are effective at handling generic cleaning tasks
and offer a promising foundation for future industrial applications. While
domain-specific errors remain challenging, these results highlight the
potential for further improvements through specialized training and enhanced
agentic capabilities.
☆ Force-Safe Environment Maps and Real-Time Detection for Soft Robot Manipulators
Soft robot manipulators have the potential for deployment in delicate
environments to perform complex manipulation tasks. However, existing obstacle
detection and avoidance methods do not consider limits on the forces that
manipulators may exert upon contact with delicate obstacles. This work
introduces a framework that maps force safety criteria from task space (i.e.
positions along the robot's body) to configuration space (i.e. the robot's
joint angles) and enables real-time force safety detection. We incorporate
limits on allowable environmental contact forces for given task-space
obstacles, and map them into configuration space (C-space) through the
manipulator's forward kinematics. This formulation ensures that configurations
classified as safe are provably below the maximum force thresholds, thereby
allowing us to determine force-safe configurations of the soft robot
manipulator in real-time. We validate our approach in simulation and hardware
experiments on a two-segment pneumatic soft robot manipulator. Results
demonstrate that the proposed method accurately detects force safety during
interactions with deformable obstacles, thereby laying the foundation for
real-time safe planning of soft manipulators in delicate, cluttered
environments.
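The task-space-to-C-space mapping can be illustrated on a toy rigid two-link arm with a linear contact model; the paper addresses a soft continuum manipulator, and the kinematics, stiffness, and thresholds below are assumptions:

```python
import numpy as np

l1, l2 = 0.3, 0.3            # link lengths [m] of a toy rigid 2-link planar arm
k_obs = 500.0                # obstacle contact stiffness [N/m] (assumed)
f_max = 2.0                  # allowable contact force [N] (assumed)
obs_c, obs_r = np.array([0.35, 0.2]), 0.1    # circular obstacle in task space

def tip(q):
    """Forward kinematics: joint angles -> end-effector position."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def force_safe(q):
    """Penetration depth -> contact force via F = k * delta; safe iff F <= f_max."""
    delta = max(0.0, obs_r - np.linalg.norm(tip(q) - obs_c))
    return k_obs * delta <= f_max

# Build a boolean C-space safety map over a joint-angle grid.
qs = np.linspace(-np.pi, np.pi, 181)
safety = np.array([[force_safe(np.array([q1, q2])) for q2 in qs] for q1 in qs])
print(f"{100 * safety.mean():.1f}% of sampled configurations are force-safe")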
☆ TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
Vision-language-action models (VLAs) trained on large-scale robotic datasets
have demonstrated strong performance on manipulation tasks, including bimanual
tasks. However, because most public datasets focus on single-arm
demonstrations, adapting VLAs for bimanual tasks typically requires substantial
additional bimanual data and fine-tuning. To address this challenge, we
introduce TwinVLA, a modular framework that composes two copies of a pretrained
single-arm VLA into a coordinated bimanual VLA. Unlike monolithic
cross-embodiment models trained on mixtures of single-arm and bimanual data,
TwinVLA improves both data efficiency and performance by composing pretrained
single-arm policies. Across diverse bimanual tasks in real-world and simulation
settings, TwinVLA outperforms a comparably-sized monolithic RDT-1B model
without requiring any bimanual pretraining. Furthermore, it narrows the gap to
the state-of-the-art model $\pi_0$, which relies on extensive proprietary
bimanual data and compute. These results establish our modular composition approach
as a data-efficient and scalable path toward high-performance bimanual
manipulation, leveraging public single-arm data.
comment: Project webpage : https://jellyho.github.io/TwinVLA/
☆ Context-aware Learned Mesh-based Simulation via Trajectory-Level Meta-Learning
Philipp Dahlinger, Niklas Freymuth, Tai Hoang, Tobias Würth, Michael Volpp, Luise Kärger, Gerhard Neumann
Simulating object deformations is a critical challenge across many scientific
domains, including robotics, manufacturing, and structural mechanics. Learned
Graph Network Simulators (GNSs) offer a promising alternative to traditional
mesh-based physics simulators. Their speed and inherent differentiability make
them particularly well suited for applications that require fast and accurate
simulations, such as robotic manipulation or manufacturing optimization.
However, existing learned simulators typically rely on single-step
observations, which limits their ability to exploit temporal context. Without
this information, these models fail to infer, e.g., material properties.
Further, they rely on auto-regressive rollouts, which quickly accumulate error
for long trajectories. We instead frame mesh-based simulation as a
trajectory-level meta-learning problem. Using Conditional Neural Processes, our
method enables rapid adaptation to new simulation scenarios from limited
initial data while capturing their latent simulation properties. We utilize
movement primitives to directly predict fast, stable and accurate simulations
from a single model call. The resulting approach, Movement-primitive
Meta-MeshGraphNet (M3GN), provides higher simulation accuracy at a fraction of
the runtime cost compared to state-of-the-art GNSs across several tasks.
comment: 35 pages. Submitted to Transactions on Machine Learning Research
(TMLR)
☆ Beyond Master and Apprentice: Grounding Foundation Models for Symbiotic Interactive Learning in a Shared Latent Space
Today's autonomous agents can understand free-form natural language
instructions and execute long-horizon tasks in a manner akin to human-level
reasoning. These capabilities are mostly driven by large-scale pre-trained
foundation models (FMs). However, the approaches with which these models are
grounded for human-robot interaction (HRI) perpetuate a master-apprentice
model, where the apprentice (embodied agent) passively receives and executes
the master's (human's) commands without reciprocal learning. This reactive
interaction approach does not capture the co-adaptive dynamics inherent in
everyday multi-turn human-human interactions. To address this, we propose a
Symbiotic Interactive Learning (SIL) approach that enables both the master and
the apprentice to co-adapt through mutual, bidirectional interactions. We
formalised SIL as a co-adaptation process within a shared latent task space,
where the agent and human maintain joint belief states that evolve based on
interaction history. This enables the agent to move beyond reactive execution
to proactive clarification, adaptive suggestions, and shared plan refinement.
To realise these novel behaviours, we leveraged pre-trained FMs for spatial
perception and reasoning, alongside a lightweight latent encoder that grounds
the models' outputs into task-specific representations. Furthermore, to ensure
stability as the tasks evolve, we augment SIL with a memory architecture that
prevents the forgetting of learned task-space representations. We validate SIL
on both simulated and real-world embodied tasks, including instruction
following, information retrieval, query-oriented reasoning, and interactive
dialogues. Demos and resources are publicly available at
https://linusnep.github.io/SIL/.
☆ Let Me Show You: Learning by Retrieving from Egocentric Video for Robotic Manipulation IROS 2025
Robots operating in complex and uncertain environments face considerable
challenges. Advanced robotic systems often rely on extensive datasets to learn
manipulation tasks. In contrast, when humans are faced with unfamiliar tasks,
such as assembling a chair, a common approach is to learn by watching video
demonstrations. In this paper, we propose a novel method for learning robot
policies by Retrieving-from-Video (RfV), using analogies from human
demonstrations to address manipulation tasks. Our system constructs a video
bank comprising recordings of humans performing diverse daily tasks. To enrich
the knowledge from these videos, we extract mid-level information, such as
object affordance masks and hand motion trajectories, which serve as additional
inputs to enhance the robot model's learning and generalization capabilities.
Our system further features two components: a video retriever that taps into an
external video bank to fetch task-relevant videos based on the task specification,
and a policy generator that integrates this retrieved knowledge into the
learning cycle. This approach enables robots to craft adaptive responses to
various scenarios and generalize to tasks beyond those in the training data.
Through rigorous testing in multiple simulated and real-world settings, our
system demonstrates a marked improvement in performance over conventional
robotic systems, showcasing a significant breakthrough in the field of
robotics.
comment: Accepted by IROS 2025
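The retrieval half of such a system can be sketched as nearest-neighbor search over precomputed video embeddings; the encoder and bank here are stand-ins, not the RfV pipeline:

```python
import numpy as np

def retrieve(task_embedding: np.ndarray, bank: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k bank videos most similar to the task specification,
    ranked by cosine similarity. Stand-in for the paper's video retriever."""
    q = task_embedding / np.linalg.norm(task_embedding)
    B = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argsort(B @ q)[::-1][:k]

rng = np.random.default_rng(0)
video_bank = rng.normal(size=(500, 256))        # 500 precomputed video embeddings
task_spec = rng.normal(size=256)                # embedding of the task description
print(retrieve(task_spec, video_bank))          # indices of top-3 task-relevant videos
```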
☆ Cybersecurity Audit Procedure for Autonomous Systems: Methodology, Threats, and Mitigations SC
Adrián Campazas-Vega, Claudia Álvarez-Aparicio, David Sobrín-Hidalgo, Laura Inyesto-Alonso, Francisco Javier Rodríguez-Lera, Vicente Matellán-Olivera, Ángel Manuel Guerrero-Higueras
The deployment of autonomous systems has experienced remarkable growth in
recent years, driven by their integration into sectors such as industry,
medicine, logistics, and domestic environments. This expansion is accompanied
by a series of security issues that entail significant risks due to the
critical nature of autonomous systems, especially those operating in
human-interaction environments. Furthermore, technological advancement and the
high operational and architectural complexity of autonomous systems have
resulted in an increased attack surface. This article presents a specific
security auditing procedure for autonomous systems, based on a layer-structured
methodology, a threat taxonomy adapted to the robotic context, and a set of
concrete mitigation measures. The validity of the proposed approach is
demonstrated through four practical case studies applied to representative
robotic platforms: the Vision 60 military quadruped from Ghost Robotics, the A1
robot from Unitree Robotics, the UR3 collaborative arm from Universal Robots,
and the Pepper social robot from Aldebaran Robotics.
comment: 32 pages, in Spanish, 7 tables, 12 figures. White paper under the
TESCAC project
☆ Follow-Me in Micro-Mobility with End-to-End Imitation Learning
Autonomous micro-mobility platforms face challenges from the perspective of
the typical deployment environment: large indoor spaces or urban areas that are
potentially crowded and highly dynamic. While social navigation algorithms have
progressed significantly, optimizing user comfort and overall user experience
over other typical robotics metrics (e.g., time or distance traveled) remains
understudied, even though these qualities are critical in commercial
applications. In this paper, we show how imitation learning delivers smoother
and overall better controllers, versus previously used manually-tuned
controllers. We demonstrate how DAAV's autonomous wheelchair achieves
state-of-the-art comfort in follow-me mode, in which it follows a human
operator assisting persons with reduced mobility (PRM). This paper analyzes
different neural network architectures for end-to-end control and demonstrates
their usability in real-world production-level deployments.
☆ Decomposed Object Manipulation via Dual-Actor Policy
Bin Fan, Jianjian Jiang, Zhuohao Li, Yixiang He, Xiaoming Wu, Yihan Yang, Shengbang Liu, Weishi Zheng
Object manipulation, which focuses on learning to perform tasks on similar
parts across different types of objects, can be divided into an approaching
stage and a manipulation stage. However, previous works often ignore this
characteristic of the task and rely on a single policy to directly learn the
whole process of object manipulation. To address this problem, we propose a
novel Dual-Actor Policy, termed DAP, which explicitly considers different
stages and leverages heterogeneous visual priors to enhance each stage.
Specifically, we introduce an affordance-based actor to locate the functional
part in the manipulation task, thereby improving the approaching process.
Following this, we propose a motion flow-based actor to capture the movement of
the component, facilitating the manipulation process. Finally, we introduce a
decision maker to determine the current stage of DAP and select the
corresponding actor. Moreover, existing object manipulation datasets contain
few objects and lack the visual priors needed to support training. To address
this, we construct a simulated dataset, the Dual-Prior Object Manipulation
Dataset, which combines the two visual priors and comprises seven tasks,
including two challenging long-term, multi-stage tasks. Experimental results on
our dataset, the RoboTwin benchmark and real-world scenarios illustrate that
our method consistently outperforms the SOTA method by 5.55%, 14.7% and 10.4%
on average respectively.
comment: 9 pages, 7 figures, 5 tables
☆ TAPOM: Task-Space Topology-Guided Motion Planning for Manipulating Elongated Object in Cluttered Environments
Robotic manipulation in complex, constrained spaces is vital for widespread
applications but challenging, particularly when navigating narrow passages with
elongated objects. Existing planning methods often fail in these low-clearance
scenarios due to sampling difficulties or local minima. This work
proposes Topology-Aware Planning for Object Manipulation (TAPOM), which
explicitly incorporates task-space topological analysis to enable efficient
planning. TAPOM uses a high-level analysis to identify critical pathways and
generate guiding keyframes, which are utilized in a low-level planner to find
feasible configuration space trajectories. Experimental validation demonstrates
significantly higher success rates and improved efficiency over state-of-the-art
methods on low-clearance manipulation tasks. This approach offers broad
implications for enhancing manipulation capabilities of robots in complex
real-world environments.
☆ Epically Powerful: An open-source software and mechatronics infrastructure for wearable robotic systems
Jennifer K. Leestma, Siddharth R. Nathella, Christoph P. O. Nuesslein, Snehil Mathur, Gregory S. Sawicki, Aaron J. Young
Epically Powerful is an open-source robotics infrastructure that streamlines
the underlying framework of wearable robotic systems - managing communication
protocols, clocking, actuator commands, visualization, sensor data acquisition,
data logging, and more - while also providing comprehensive guides for hardware
selection, system assembly, and controller implementation. Epically Powerful
contains a Python code base that simplifies user implementation, seamlessly
interfaces with various commercial state-of-the-art quasi-direct-drive (QDD)
actuators, single-board computers, and common sensors, provides example
controllers, and enables real-time visualization. To further support
device development, the package also includes a recommended parts list, a
compatibility guide, and detailed documentation on hardware and software
implementation. The goal of Epically Powerful is to lower the barrier to
developing and deploying custom wearable robotic systems without a
pre-specified form factor, enabling researchers to go from raw hardware to
modular, robust devices quickly and effectively. Though originally designed
with wearable robotics in mind, Epically Powerful is broadly applicable to
other robotic domains that utilize QDD actuators, single-board computers, and
sensors for closed-loop control.
comment: 11 pages, 5 figures. This work has been submitted to the IEEE for
possible publication
☆ Tunable Passivity Control for Centralized Multiport Networked Systems
Centralized Multiport Networked Dynamic (CMND) systems have emerged as a key
architecture with applications in several complex network systems, such as
multilateral telerobotics and multi-agent control. These systems consist of a
hub node/subsystem connecting with multiple remote nodes/subsystems via a
networked architecture. One challenge for these systems is stability, which can
be affected by non-ideal network artifacts. Conventional passivity-based
approaches can stabilize the system in specialized applications such as
small-scale networked systems. However, those conventional passive stabilizers
have several restrictions: they distribute compensation across subsystems in a
decentralized manner, which limits flexibility, while also relying on the
restrictive assumption of node passivity. This paper synthesizes a
centralized optimal passivity-based stabilization framework for CMND systems.
It consists of a centralized passivity observer monitoring overall energy flow
and an optimal passivity controller that distributes the just-needed
dissipation among various nodes, guaranteeing strict passivity and, thus, L2
stability. The proposed data-driven model-free approach, i.e., Tunable
Centralized Optimal Passivity Control (TCoPC), optimizes total performance
based on the prescribed dissipation distribution strategy while ensuring
stability. The controller can put high dissipation loads on some sub-networks
while relaxing the dissipation on other nodes. Simulation results demonstrate
the proposed framework's performance in a complex task under different
time-varying delay scenarios while relaxing the remote nodes' minimum-phase and
passivity assumptions, enhancing scalability and generalizability.
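For context, the following is a minimal single-port time-domain passivity observer/controller in the classic style; the paper's centralized, optimally distributed TCoPC goes well beyond this sketch:

```python
def po_pc_step(E: float, f: float, v: float, dt: float):
    """Time-domain passivity observer (PO) / controller (PC) on one port.
    Sign convention: f*v > 0 means energy flows into the network. A textbook
    sketch, not the paper's centralized optimal TCoPC.
    Returns the corrected force and the updated energy ledger."""
    E += f * v * dt                          # PO: running energy balance
    if E < 0.0 and abs(v) > 1e-9:            # net energy generated: not passive
        alpha = -E / (v * v * dt)            # PC: just-needed damping gain
        f += alpha * v                       # inject dissipation
        E = 0.0                              # damping returns the ledger to zero
    return f, E

# Usage over a short force/velocity trace (values illustrative).
E = 0.0
for f_k, v_k in [(1.0, 0.2), (-2.0, 0.3), (-1.5, 0.25)]:
    f_out, E = po_pc_step(E, f_k, v_k, dt=0.001)
    print(f"f_cmd={f_out:+.3f} N, ledger E={E:.6f} J")
```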
☆ MoE-DP: An MoE-Enhanced Diffusion Policy for Robust Long-Horizon Robotic Manipulation with Skill Decomposition and Failure Recovery
Baiye Cheng, Tianhai Liang, Suning Huang, Maanping Shao, Feihong Zhang, Botian Xu, Zhengrong Xue, Huazhe Xu
Diffusion policies have emerged as a powerful framework for robotic
visuomotor control, yet they often lack the robustness to recover from subtask
failures in long-horizon, multi-stage tasks and their learned representations
of observations are often difficult to interpret. In this work, we propose the
Mixture of Experts-Enhanced Diffusion Policy (MoE-DP), where the core idea is
to insert a Mixture of Experts (MoE) layer between the visual encoder and the
diffusion model. This layer decomposes the policy's knowledge into a set of
specialized experts, which are dynamically activated to handle different phases
of a task. We demonstrate through extensive experiments that MoE-DP exhibits a
strong capability to recover from disturbances, significantly outperforming
standard baselines in robustness. On a suite of 6 long-horizon simulation
tasks, this leads to a 36% average relative improvement in success rate under
disturbed conditions. This enhanced robustness is further validated in the real
world, where MoE-DP also shows significant performance gains. We further show
that MoE-DP learns an interpretable skill decomposition, where distinct experts
correspond to semantic task primitives (e.g., approaching, grasping). This
learned structure can be leveraged for inference-time control, allowing for the
rearrangement of subtasks without any re-training. Our video and code are
available at https://moe-dp-website.github.io/MoE-DP-Website/.
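A structural sketch of an MoE layer placed between a visual encoder and a diffusion head; the dimensions, expert count, and top-k routing are assumptions, not MoE-DP's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Routes encoder features through specialized expert MLPs via a learned
    gate. Illustrative sketch of the MoE-between-encoder-and-diffusion idea."""
    def __init__(self, dim: int = 512, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        logits = self.gate(z)                           # (B, n_experts)
        w, idx = logits.topk(self.top_k, dim=-1)
        w = F.softmax(w, dim=-1)
        out = torch.zeros_like(z)
        for slot in range(self.top_k):                  # weighted expert mixture
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(z[mask])
        return out

z = torch.randn(8, 512)          # visual-encoder features for a batch
cond = MoELayer()(z)             # conditioning passed on to the diffusion head
print(cond.shape)                # torch.Size([8, 512])
```

Inspecting which expert the gate activates per timestep is what would expose the kind of skill decomposition the abstract describes.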
☆ Multi-agent Coordination via Flow Matching
This work presents MAC-Flow, a simple yet expressive framework for
multi-agent coordination. We argue that requirements of effective coordination
are twofold: (i) a rich representation of the diverse joint behaviors present
in offline data and (ii) the ability to act efficiently in real time. However,
prior approaches often sacrifice one for the other, i.e., denoising
diffusion-based solutions capture complex coordination but are computationally
slow, while Gaussian policy-based solutions are fast but brittle in handling
multi-agent interaction. MAC-Flow addresses this trade-off by first learning a
flow-based representation of joint behaviors, and then distilling it into
decentralized one-step policies that preserve coordination while enabling fast
execution. Across four different benchmarks, including $12$ environments and
$34$ datasets, MAC-Flow alleviates the trade-off between performance and
computational cost, specifically achieving about $14.5\times$
faster inference compared to diffusion-based MARL methods, while maintaining
good performance. At the same time, its inference speed is similar to that of
prior Gaussian policy-based offline multi-agent reinforcement learning (MARL)
methods.
☆ Encoding Biomechanical Energy Margin into Passivity-based Synchronization for Networked Telerobotic Systems
Maintaining system stability and accurate position tracking is imperative in
networked robotic systems, particularly for haptics-enabled human-robot
interaction. Recent literature has integrated human biomechanics into the
stabilizers implemented for teleoperation, enhancing force preservation while
guaranteeing convergence and safety. However, position desynchronization due to
imperfect communication and non-passive behaviors remains a challenge. This
paper proposes a two-port biomechanics-aware passivity-based synchronizer and
stabilizer, referred to as TBPS2. This stabilizer optimizes position
synchronization by leveraging human biomechanics while reducing the
stabilizer's conservatism in its activation. We provide the mathematical design
synthesis of the stabilizer and the proof of stability. We also conducted a
series of grid simulations and systematic experiments, comparing its
performance with that of state-of-the-art solutions under varying time delays
and environmental conditions.
☆ A semi-analytical approach for computing the largest singularity-free spheres of a class of 6-6 Stewart-Gough platforms for specified orientation workspaces
This article presents a method for computing the largest singularity-free
sphere (SFS) of a 6-6 Stewart-Gough platform manipulator (SGPM) over a
specified orientation workspace. For a fixed orientation of the moving
platform, the SFS is computed analytically. This process is repeated over a set
of samples generated within the orientation workspace, and the smallest among
them is designated as the desired SFS for the given orientation workspace.
Numerical experiments are performed on four distinct architectures of the SGPM
to understand their relative performances w.r.t. SFS volumes over the same
orientation workspace. This study demonstrates the potential utility of the
proposed computational method both in analysis and design of SGPMs.
☆ iFlyBot-VLM Technical Report
We introduce iFlyBot-VLM, a general-purpose Vision-Language Model (VLM) aimed
at advancing the field of Embodied Intelligence. The central objective of
iFlyBot-VLM is to bridge the cross-modal semantic gap between high-dimensional
environmental perception and low-level robotic motion control. To this end, the
model abstracts complex visual and spatial information into a body-agnostic and
transferable Operational Language, thereby enabling seamless perception-action
closed-loop coordination across diverse robotic platforms. The architecture of
iFlyBot-VLM is systematically designed to realize four key functional
capabilities essential for embodied intelligence: 1) Spatial Understanding and
Metric Reasoning; 2) Interactive Target Grounding; 3) Action Abstraction and
Control Parameter Generation; 4) Task Planning and Skill Sequencing. We
envision iFlyBot-VLM as a scalable and generalizable foundation model for
embodied AI, facilitating the progression from specialized task-oriented
systems toward generalist, cognitively capable agents. We conducted evaluations
on 10 current mainstream embodied intelligence-related VLM benchmark datasets,
such as Blink and Where2Place, and achieved optimal performance while
preserving the model's general capabilities. We will publicly release both the
training data and model weights to foster further research and development in
the field of Embodied Intelligence.
♻ ☆ Periodic Skill Discovery NeurIPS 2025
Unsupervised skill discovery in reinforcement learning (RL) aims to learn
diverse behaviors without relying on external rewards. However, current methods
often overlook the periodic nature of learned skills, focusing instead on
increasing the mutual dependence between states and skills or maximizing the
distance traveled in latent space. Considering that many robotic tasks -
particularly those involving locomotion - require periodic behaviors across
varying timescales, the ability to discover diverse periodic skills is
essential. Motivated by this, we propose Periodic Skill Discovery (PSD), a
framework that discovers periodic behaviors in an unsupervised manner. The key
idea of PSD is to train an encoder that maps states to a circular latent space,
thereby naturally encoding periodicity in the latent representation. By
capturing temporal distance, PSD can effectively learn skills with diverse
periods in complex robotic tasks, even with pixel-based observations. We
further show that these learned skills achieve high performance on downstream
tasks such as hurdling. Moreover, integrating PSD with an existing skill
discovery method offers more diverse behaviors, thus broadening the agent's
repertoire. Our code and demos are available at
https://jonghaepark.github.io/psd/
comment: NeurIPS 2025
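Encoding periodicity via a circular latent space can be sketched as a 2-D embedding normalized onto the unit circle; this shows the structure only, and PSD's temporal-distance training objective is not reproduced here:

```python
import torch
import torch.nn as nn

class CircularEncoder(nn.Module):
    """Maps states to points on the unit circle, so the latent is a phase.
    Architectural sketch; the skill-discovery losses are omitted."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        z = self.net(s)
        return z / z.norm(dim=-1, keepdim=True).clamp_min(1e-8)  # point on S^1

enc = CircularEncoder(state_dim=17)          # e.g., a locomotion observation
z = enc(torch.randn(4, 17))
phase = torch.atan2(z[:, 1], z[:, 0])        # periodic phase in (-pi, pi]
print(phase)
```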
♻ ☆ Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based Reward
We develop a deep reinforcement learning framework for tactical decision
making in an autonomous truck, specifically for Adaptive Cruise Control (ACC)
and lane change maneuvers in a highway scenario. Our results demonstrate that
it is beneficial to separate high-level decision-making processes and low-level
control actions between the reinforcement learning agent and the low-level
controllers based on physical models. In the following, we study optimizing the
performance with a realistic and multi-objective reward function based on Total
Cost of Operation (TCOP) of the truck using different approaches: adding
weights to reward components, normalizing the reward components, and using
curriculum learning techniques.
comment: Paper is accepted for publication in Artificial Intelligence Review
♻ ☆ Ethics-Aware Safe Reinforcement Learning for Rare-Event Risk Control in Interactive Urban Driving
Autonomous vehicles hold great promise for reducing traffic fatalities and
improving transportation efficiency, yet their widespread adoption hinges on
embedding credible and transparent ethical reasoning into routine and emergency
maneuvers, particularly to protect vulnerable road users (VRUs) such as
pedestrians and cyclists. Here, we present a hierarchical Safe Reinforcement
Learning (Safe RL) framework that augments standard driving objectives with
ethics-aware cost signals. At the decision level, a Safe RL agent is trained
using a composite ethical risk cost, combining collision probability and harm
severity, to generate high-level motion targets. A dynamic, risk-sensitive
Prioritized Experience Replay mechanism amplifies learning from rare but
critical, high-risk events. At the execution level, polynomial path planning
coupled with Proportional-Integral-Derivative (PID) and Stanley controllers
translates these targets into smooth, feasible trajectories, ensuring both
accuracy and comfort. We train and validate our approach on closed-loop
simulation environments derived from large-scale, real-world traffic datasets
encompassing diverse vehicles, cyclists, and pedestrians, and demonstrate that
it outperforms baseline methods in reducing risk to others while maintaining
ego performance and comfort. This work provides a reproducible benchmark for
Safe RL with explicitly ethics-aware objectives in human-mixed traffic
scenarios. Our results highlight the potential of combining formal control
theory and data-driven learning to advance ethically accountable autonomy that
explicitly protects those most at risk in urban traffic environments. Across
two interactive benchmarks and five random seeds, our policy decreases conflict
frequency by 25-45% at matched task success rates while keeping comfort metrics
within 5%.
♻ ☆ Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, Hui En Pang, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
Multimodal models have achieved remarkable progress in recent years.
Nevertheless, they continue to exhibit notable limitations in spatial
understanding and reasoning, the very capability that anchors artificial
general intelligence in the physical world. With the recent release of GPT-5,
allegedly the most powerful AI model to date, it is timely to examine where the
leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path
toward spatial intelligence. We thus propose EASI for holistic Evaluation of
multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive
taxonomy of spatial tasks that unifies existing benchmarks and a standardized
protocol for the fair evaluation of state-of-the-art proprietary and
open-source models. In this report, we conduct the study across eight key
benchmarks, at a cost exceeding ten billion total tokens. Our empirical study
then reveals that (1) GPT-5 demonstrates unprecedented strength in spatial
intelligence (SI), yet (2) still falls short of human performance significantly
across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose
greater model capability deficiency than non-SI tasks, to the extent that (4)
proprietary models do not exhibit a decisive advantage when facing the most
difficult ones. In addition, we conduct a qualitative evaluation across a
diverse set of scenarios that are intuitive for humans, yet fail even the most
advanced multimodal models.
comment: Codebase: https://github.com/EvolvingLMMs-Lab/EASI/
♻ ☆ Pogobot: an Open-Source, Low-Cost Robot for Swarm Robotics and Programmable Active Matter
Alessia Loi, Loona Macabre, Jérémy Fersula, Keivan Amini, Leo Cazenille, Fabien Caura, Alexandre Guerre, Stéphane Gourichon, Laurent Fabre, Olivier Dauchot, Nicolas Bredeche
This paper describes the Pogobot, an open-source platform specifically
designed for research at the interface of swarm robotics and active matter.
Pogobot features vibration-based or wheel-based locomotion, fast infrared
communication, and an array of sensors in a cost-effective package (approx.
250 euros/unit). The platform's modular design, comprehensive API, and
extensible architecture facilitate the implementation of swarm intelligence
algorithms and collective motion. Pogobots offer an accessible alternative to
existing platforms while providing advanced capabilities including directional
communication between units and fast locomotion, all with a compact form
factor. More than 200 Pogobots are already being used on a daily basis in
several Universities to study self-organizing systems, programmable active
matter, discrete reaction-diffusion-advection systems and computational models
of social learning and evolution. This paper details the hardware and software
architecture, communication protocols, locomotion mechanisms, and the
infrastructure built around the Pogobots.
♻ ☆ Mean-Shift Theory and Its Applications in Swarm Robotics: A New Way to Enhance the Efficiency of Multi-Robot Collaboration
Swarms evolving from collective behaviors among multiple individuals are
commonly seen in nature, which enables biological systems to exhibit more
efficient and robust collaboration. Creating similar swarm intelligence in
engineered robots poses challenges to the design of collaborative algorithms
that can be programmed at large scales. The assignment-based method has played
an eminent role for a very long time in solving collaboration problems of robot
swarms. However, it faces fundamental limitations in terms of efficiency and
robustness because it does not scale to swarm variants. This article presents a
tutorial review on recent advances in assignment-free collaboration of robot
swarms, focusing on the problem of shape formation. A key theoretical component
is the recently developed \emph{mean-shift exploration} strategy, which
improves the collaboration efficiency of large-scale swarms by dozens of times.
Further, the efficiency improvement is more significant as the swarm scale
increases. Finally, this article discusses three important applications of the
mean-shift exploration strategy, including precise shape formation, area
coverage formation, and maneuvering formation, as well as their corresponding
industrial scenarios in smart warehousing, area exploration, and cargo
transportation.
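The core kernel-weighted update behind mean-shift exploration, in generic form; the published strategy adds collision handling and target negotiation beyond this sketch:

```python
import numpy as np

def mean_shift_step(x, targets, bandwidth=0.5):
    """One kernel-weighted mean-shift step of a robot at x toward the densest
    nearby cluster of unoccupied target cells. Generic update, not the paper's
    full control law."""
    d2 = np.sum((targets - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth**2))            # Gaussian kernel weights
    return (w[:, None] * targets).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
targets = np.vstack([rng.normal([2.0, 2.0], 0.2, (30, 2)),   # dense free cells
                     rng.normal([0.0, 5.0], 0.2, (3, 2))])   # sparse outliers
x = np.array([1.0, 3.0])
for _ in range(10):
    x = mean_shift_step(x, targets)
print(x)   # drifts to the dense cluster near (2, 2)
```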
♻ ☆ Affordance-based Robot Manipulation with Flow Matching
We present a framework for assistive robot manipulation, which focuses on two
fundamental challenges: first, efficiently adapting large-scale models to
downstream scene affordance understanding tasks, especially in daily living
scenarios where gathering multi-task data involving humans requires strenuous
effort; second, effectively learning robot action trajectories by grounding the
visual affordance model. We tackle the first challenge by employing a
parameter-efficient prompt tuning method that prepends learnable text prompts
to the frozen vision model to predict manipulation affordances in multi-task
scenarios. Then we propose to learn robot action trajectories guided by
affordances in a supervised flow matching method. Flow matching represents a
robot visuomotor policy as a conditional process of flowing random waypoints to
desired robot action trajectories. Finally, we introduce a real-world dataset
with 10 tasks across Activities of Daily Living to test our framework. Our
extensive evaluation highlights that the proposed prompt tuning method for
learning manipulation affordance achieves competitive performance and even
outperforms some other finetuning protocols across data scales, while
satisfying parameter efficiency. Learning multi-task robot action trajectories
with flow matching leads to consistently more favorable results than some
alternative behavior cloning methods across several robot manipulation benchmarks. This
includes more stable training and evaluation, and noticeably faster inference,
while maintaining comparable generalization performance to diffusion policy,
where flow matching performs marginally better in most cases. Our framework
seamlessly unifies affordance learning and action generation with flow matching
for robot manipulation.
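The flow matching recipe referenced here is standard; a minimal conditional training-loss sketch, with placeholder network and shapes:

```python
import torch
import torch.nn as nn

def flow_matching_loss(v_net: nn.Module, x1: torch.Tensor, cond: torch.Tensor):
    """Standard conditional flow matching loss: flow random waypoints x0 toward
    demonstrated trajectories x1 along straight-line probability paths.
    Network architecture and conditioning are placeholders."""
    x0 = torch.randn_like(x1)                       # random waypoints
    t = torch.rand(x1.shape[0], 1)                  # per-sample flow time
    x_t = (1.0 - t) * x0 + t * x1                   # linear interpolation path
    v_target = x1 - x0                              # constant target velocity
    v_pred = v_net(torch.cat([x_t, cond, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

# Usage with a toy velocity network over flattened action trajectories.
traj_dim, cond_dim = 64, 16
v_net = nn.Sequential(nn.Linear(traj_dim + cond_dim + 1, 128), nn.ReLU(),
                      nn.Linear(128, traj_dim))
loss = flow_matching_loss(v_net, torch.randn(32, traj_dim), torch.randn(32, cond_dim))
loss.backward()
```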
♻ ☆ Learning to Navigate Socially Through Proactive Risk Perception
Erjia Xiao, Lingfeng Zhang, Yingbo Tang, Hao Cheng, Renjing Xu, Wenbo Ding, Lei Zhou, Long Chen, Hangjun Ye, Xiaoshuai Hao
In this report, we describe the technical details of our submission to the
IROS 2025 RoboSense Challenge Social Navigation Track. This track focuses on
developing RGBD-based perception and navigation systems that enable autonomous
agents to navigate safely, efficiently, and socially compliantly in dynamic
human-populated indoor environments. The challenge requires agents to operate
from an egocentric perspective using only onboard sensors including RGB-D
observations and odometry, without access to global maps or privileged
information, while maintaining social norm compliance such as safe distances
and collision avoidance. Building upon the Falcon model, we introduce a
Proactive Risk Perception Module to enhance social navigation performance. Our
approach augments Falcon with collision risk understanding that learns to
predict distance-based collision risk scores for surrounding humans, which
enables the agent to develop more robust spatial awareness and proactive
collision avoidance behaviors. The evaluation on the Social-HM3D benchmark
demonstrates that our method improves the agent's ability to maintain personal
space compliance while navigating toward goals in crowded indoor scenes with
dynamic human agents, achieving 2nd place among 16 participating teams in the
challenge.
♻ ☆ GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model
Vision-Language-Action (VLA) models often fail to generalize to novel camera
viewpoints, a limitation stemming from their difficulty in inferring robust 3D
geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective
approach that enhances viewpoint invariance by integrating strong geometric
priors into the vision backbone. Instead of training a visual encoder or
relying on explicit 3D data, we leverage a frozen, pretrained geometric vision
model as a feature extractor. A trainable projection layer then adapts these
geometrically-rich features for the policy decoder, relieving it of the burden
of learning 3D consistency from scratch. Through extensive evaluations on
LIBERO benchmark subsets, we show GeoAware-VLA achieves substantial
improvements in zero-shot generalization to novel camera poses, boosting
success rates by over 2x in simulation. Crucially, these benefits translate to
the physical world; our model shows a significant performance gain on a real
robot, especially when evaluated from unseen camera angles. Our approach proves
effective across both continuous and discrete action spaces, highlighting that
robust geometric grounding is a key component for creating more generalizable
robotic agents.
comment: Under Review, Project Page https://alisharey.github.io/GeoAware-VLA/
♻ ☆ Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild
Derek Ming Siang Tan, Shailesh, Boyang Liu, Alok Raj, Qi Xuan Ang, Weiheng Dai, Tanishq Duhan, Jimmy Chiun, Yuhong Cao, Florian Shkurti, Guillaume Sartoretti
To perform outdoor visual navigation and search, a robot may leverage
satellite imagery to generate visual priors. This can help inform high-level
search strategies, even when such images lack sufficient resolution for target
recognition. However, many existing informative path planning or search-based
approaches either assume no prior information, or use priors without accounting
for how they were obtained. Recent work instead utilizes large Vision Language
Models (VLMs) for generalizable priors, but their outputs can be inaccurate due
to hallucination, leading to inefficient search. To address these challenges,
we introduce Search-TTA, a multimodal test-time adaptation framework with a
flexible plug-and-play interface compatible with various input modalities
(e.g., image, text, sound) and planning methods (e.g., RL-based). First, we
pretrain a satellite image encoder to align with CLIP's visual encoder to
output probability distributions of target presence used for visual search.
Second, our TTA framework dynamically refines CLIP's predictions during search
using uncertainty-weighted gradient updates inspired by Spatial Poisson Point
Processes. To train and evaluate Search-TTA, we curate AVS-Bench, a visual
search dataset based on internet-scale ecological data containing 380k images
and taxonomy data. We find that Search-TTA improves planner performance by up
to 30.0%, particularly in cases with poor initial CLIP predictions due to
domain mismatch and limited training data. It also performs comparably with
significantly larger VLMs, and achieves zero-shot generalization via emergent
alignment to unseen modalities. Finally, we deploy Search-TTA on a real UAV via
hardware-in-the-loop testing, by simulating its operation within a large-scale
simulation that provides onboard sensing.
comment: Accepted for presentation at CORL 2025. Code, models, and data are
available at https://search-tta.github.io/
♻ ☆ Octopus-like Reaching Motion: A Perspective Inspired by Whipping
Shengyao Zhang, Yiyuan Zhang, Chenrui Zhang, Yiming Li, Wenci Xin, Yuliang Liufu, Hong Wei Ng, Cecilia Laschi
The stereotypical reaching motion of the octopus arm has drawn growing
attention for its efficient control of a highly deformable body. Previous
studies suggest that its characteristic bend propagation may share underlying
principles with the dynamics of a whip. This work investigates whether
whip-like passive dynamics in water can reproduce the kinematic features
observed in biological reaching, and examines their similarities and differences.
Platform-based whipping tests were performed in water and air while
systematically varying material stiffness and driving speed. Image-based
quantification revealed that the Ecoflex Gel 2 arm driven at 150 rpm (motor
speed) reproduced curvature propagation similar to that observed in octopus
reaching. However, its bend-point velocity decreased monotonically rather than
exhibiting the biological bell-shaped profile, confirming that the octopus
reaching movement is not merely a passive whipping behavior. The absence of
propagation in air further highlights the critical role of the surrounding
medium in forming octopus-like reaching motion. This study provides a new
perspective for understand biological reaching movement, and offers a potential
platform for future hydrodynamic research.
comment: The first two listed authors contributed equally. Yiyuan Zhang is the
corresponding author
♻ ☆ ReNiL: Event-Driven Pedestrian Bayesian Localization Using IMU for Real-World Applications
Pedestrian inertial localization is key for mobile and IoT services because
it provides infrastructure-free positioning. Yet most learning-based methods
depend on fixed sliding-window integration, struggle to adapt to diverse motion
scales and cadences, and yield inconsistent uncertainty, limiting real-world
use. We present ReNiL, a Bayesian deep-learning framework for accurate,
efficient, and uncertainty-aware pedestrian localization. ReNiL introduces
Inertial Positioning Demand Points (IPDPs) to estimate motion at contextually
meaningful waypoints instead of dense tracking, and supports inference on IMU
sequences at any scale so cadence can match application needs. It couples a
motion-aware orientation filter with an Any-Scale Laplace Estimator (ASLE), a
dual-task network that blends patch-based self-supervision with Bayesian
regression. By modeling displacements with a Laplace distribution, ReNiL
provides homogeneous Euclidean uncertainty that integrates cleanly with other
sensors. A Bayesian inference chain links successive IPDPs into consistent
trajectories. On RoNIN-ds and a new WUDataset covering indoor and outdoor
motion from 28 participants, ReNiL achieves state-of-the-art displacement
accuracy and uncertainty consistency, outperforming TLIO, CTIN, iMoT, and RoNIN
variants while reducing computation. Application studies further show
robustness and practicality for mobile and IoT localization, making ReNiL a
scalable, uncertainty-aware foundation for next-generation positioning.
comment: This work has been submitted to the ACM for possible publication
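Modeling displacements with a Laplace distribution yields an L1-style negative log-likelihood; a sketch assuming per-axis independence (ReNiL's exact regression head may differ):

```python
import torch

def laplace_nll(pred_mu: torch.Tensor, pred_log_b: torch.Tensor, target: torch.Tensor):
    """Negative log-likelihood of displacements under a per-axis Laplace
    distribution: NLL = |x - mu| / b + log(2b). Illustrates the uncertainty
    model described in the abstract."""
    b = pred_log_b.exp()
    return (torch.abs(target - pred_mu) / b + torch.log(2.0 * b)).mean()

# Usage: a network head would emit (mu, log_b) per displacement axis.
mu = torch.tensor([[0.95, 0.10]])          # predicted 2-D displacement [m]
log_b = torch.tensor([[-2.0, -2.3]])       # predicted log-scale (uncertainty)
target = torch.tensor([[1.00, 0.05]])      # ground-truth displacement
print(laplace_nll(mu, log_b, target))
```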
♻ ☆ Generalizing Robot Trajectories from Single-Context Human Demonstrations: A Probabilistic Approach
Qian Ying Lee, Suhas Raghavendra Kulkarni, Kenzhi Iskandar Wong, Lin Yang, Bernardo Noronha, Yongjun Wee, Domenico Campolo
Generalizing robot trajectories from human demonstrations to new contexts
remains a key challenge in Learning from Demonstration (LfD), particularly when
only single-context demonstrations are available. We present a novel Gaussian
Mixture Model (GMM)-based approach that enables systematic generalization from
single-context demonstrations to a wide range of unseen start and goal
configurations. Our method performs component-level reparameterization of the
GMM, adapting both mean vectors and covariance matrices, followed by Gaussian
Mixture Regression (GMR) to generate smooth trajectories. We evaluate the
approach on a dual-arm pick-and-place task with varying box placements,
comparing against several baselines. Results show that our method significantly
outperforms baselines in trajectory success and fidelity, maintaining accuracy
even under combined translational and rotational variations of task
configurations. These results demonstrate that our method generalizes
effectively while ensuring boundary convergence and preserving the intrinsic
structure of demonstrated motions.
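Component-level reparameterization amounts to pushing each Gaussian through the affine map that carries the demonstrated start/goal onto the new ones; a position-only sketch follows, while the paper's full method adapts the time dimension and runs GMR jointly:

```python
import numpy as np

def reparameterize_gmm(means, covs, A, b):
    """Map each GMM component through the affine transform x -> A x + b:
    mu' = A mu + b,  Sigma' = A Sigma A^T. Simplified position-only sketch of
    component-level reparameterization."""
    means_new = means @ A.T + b
    covs_new = np.einsum('ij,kjl,ml->kim', A, covs, A)
    return means_new, covs_new

# Toy example: two components in 2-D, rotated 30 degrees and shifted.
means = np.array([[0.0, 0.0], [1.0, 0.5]])
covs = np.stack([0.01 * np.eye(2)] * 2)
th = np.deg2rad(30.0)
A = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
b = np.array([0.2, -0.1])
means2, covs2 = reparameterize_gmm(means, covs, A, b)
print(means2)   # Gaussian Mixture Regression over the transformed components
                # then yields a smooth trajectory for the unseen configuration.
```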