EKF and UKF NaN divergence in ROS 2 sensor fusion¶
Your filter runs normally for seconds or minutes, then produces NaN on every output topic. The filter is dead and will not recover without a restart. This page explains the three causes and how to fix each.
What NaN divergence looks like¶
[ERROR] [ekf_node]: nan detected in position output
[WARN] [fusioncore]: covariance regularization applied at step 4821
Or more subtly: /fusion/odom stops updating, RViz shows the robot at the origin, and ros2 topic echo /fusion/odom --field pose.pose.position.x prints nan.
Cause 1: Covariance matrix loses positive semi-definiteness¶
The Kalman filter maintains a covariance matrix P that must be positive semi-definite at all times. If floating-point arithmetic accumulates errors (off-diagonal terms drift, rounding errors in predict steps), P can become indefinite. The next square root (Cholesky decomposition in UKF, or matrix inverse in EKF) fails and NaN propagates through everything downstream.
Who is affected: robot_localization UKF is highly sensitive to this. It diverges with NaN at 100 Hz IMU input on GPS-heavy sequences, typically within 30-60 seconds.
Why it happens in robot_localization UKF specifically: the sigma point propagation accumulates floating-point error faster than the scalar Mahalanobis thresholds can prevent. The covariance matrix drifts slightly indefinite per predict step and the error compounds.
Fix in robot_localization: switch to EKF mode. robot_localization EKF is significantly more numerically stable than its UKF. If UKF is required, lower the IMU rate: imu0_queue_size: 10 or process IMU at 50 Hz instead of 100 Hz.
Fix in FusionCore: FusionCore normalizes the quaternion component of the state vector at every predict step and applies a symmetry correction P = (P + P^T) / 2 to prevent covariance drift. If P loses positive semi-definiteness despite this, a regularization floor is applied before the Cholesky. NaN divergence has not occurred on any of the 12 NCLT sequences (940 total minutes at 100 Hz).
Cause 2: Sensor covariance is zero or nearly zero¶
If your IMU driver publishes an Imu message with angular_velocity_covariance[0] = 0 (or very close to zero), the measurement noise matrix R becomes singular. The Kalman gain K = P*H^T * (H*P*H^T + R)^{-1} involves inverting H*P*H^T + R. If R has a zero diagonal, the inversion fails or produces NaN.
Check your sensor covariances:
ros2 topic echo /imu/data --field angular_velocity_covariance --once
ros2 topic echo /imu/data --field linear_acceleration_covariance --once
ros2 topic echo /gnss/fix --field position_covariance --once
If you see all zeros, the driver is not filling covariances. Fix by overriding in the driver config or setting fallback values in FusionCore:
robot_localization uses whatever covariance the driver publishes with no floor. If the driver publishes zero, rl will attempt to invert a singular matrix.
Cause 3: Outlier not rejected, large measurement update¶
A 100m GPS spike arrives, the outlier gate doesn't reject it (either because gating is off or the threshold is too loose), and the filter tries to fuse a position 100m away from the current estimate. The Kalman gain is proportional to P and inversely proportional to innovation covariance. A huge innovation with nominal covariance produces a huge state update. The state jumps discontinuously, P updates incorrectly, and the next predict step produces NaN.
Check: is outlier rejection enabled and calibrated?
In robot_localization:
If this is not set, rl accepts every GPS fix regardless of how far it is from the prediction.
In FusionCore:
FusionCore also bounds the covariance at each step independently of outlier rejection, so even if a large measurement update occurs, P cannot become indefinite on the next step.
Diagnosis steps¶
- Check when NaN first appears:
ros2 bag play your_bag.bagand watch the filter output timestamp - Check what sensor message arrived just before NaN: look at the GPS or IMU message 1-2 steps earlier
- Check covariance values on all sensor topics (see Cause 2 above)
- Check whether outlier rejection thresholds are set (see Cause 3 above)
- If using robot_localization UKF: try switching to EKF first to rule out Cause 1