Benchmark Results
FusionCore vs robot_localization EKF on the NCLT dataset (University of Michigan North Campus Long-Term). Twelve sequences across all seasons, same single config file, no per-sequence tuning.
Summary: 12 sequences, 10 FC wins
| Sequence | Season | Duration | GPS Fixes | Max Blackout | FC ATE (3D) | RL-EKF ATE (3D) | Winner |
|---|---|---|---|---|---|---|---|
| 2012-01-08 | Winter | 92 min | 22,041 | 203s | 18.6 m | 41.2 m | FC +55% |
| 2012-02-04 | Winter | 77 min | 18,808 | 184s | 49.7 m | 265.5 m | FC +81% |
| 2012-03-31 | Spring | 87 min | 20,482 | 262s | 22.0 m | 156.5 m | FC +86% |
| 2012-05-11 | Spring | 84 min | 21,621 | 120s | 9.7 m | 11.5 m | FC +16% |
| 2012-06-15 | Summer | 55 min | 12,399 | 462s | 49.2 m | 18.2 m | RL +63% |
| 2012-08-20 | Summer | 83 min | 20,025 | 228s | 98.3 m | 10.6 m | RL +89% |
| 2012-09-28 | Fall | 77 min | 19,191 | 196s | 10.8 m | 55.7 m | FC +81% |
| 2012-10-28 | Fall | 85 min | 21,060 | 256s | 29.9 m | 60.0 m | FC +50% |
| 2012-11-04 | Fall | 79 min | 17,840 | 400s | 60.1 m | 122.0 m | FC +51% |
| 2012-12-01 | Winter | 75 min | 17,941 | 173s | 21.0 m | 90.7 m | FC +77% |
| 2013-02-23 | Winter | 78 min | 19,333 | 240s | 59.4 m | 82.2 m | FC +28% |
| 2013-04-05 | Spring | 68 min | 16,297 | 275s | 12.1 m | 268.9 m | FC +96% |
ATE = absolute trajectory error, SE(3)-aligned to RTK GPS ground truth. GPS Fixes = mode-3 (3D) fixes only, as published by nclt_player.
RL-UKF: NaN divergence on all twelve sequences (known numerical instability under sim-time playback, confirmed by RL maintainer). Excluded from results.
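The Winner column is not defined on this page, but its percentages are consistent with the relative ATE reduction of the winning filter against the losing one. The sketch below reproduces that reading; the function name `win_margin` is made up for illustration and the formula is an inference from the table, not a documented definition.

```python
def win_margin(fc_ate_m: float, rl_ate_m: float) -> tuple[str, int]:
    """Return (winner, percent ATE reduction relative to the losing filter)."""
    if fc_ate_m < rl_ate_m:
        winner, best, worst = "FC", fc_ate_m, rl_ate_m
    else:
        winner, best, worst = "RL", rl_ate_m, fc_ate_m
    # Assumed definition: how much smaller the winner's ATE is, as a share of the loser's ATE.
    return winner, round(100.0 * (worst - best) / worst)


print(win_margin(18.6, 41.2))  # 2012-01-08 row
print(win_margin(49.2, 18.2))  # 2012-06-15 row
```

Under this reading a margin near 50% means the winner's error is about half the loser's, and a margin near 90% means roughly a tenfold gap.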
Full metrics table
| Sequence | Filter | ATE 3D | ATE XY | Within 5m | Within 10m | Drift (m/km) | RPE@10m |
|---|---|---|---|---|---|---|---|
| 2012-01-08 | FusionCore | 18.6 m | 16.9 m | 26.7% | 74.5% | 2.55 | 22.5 m |
| | RL-EKF | 41.2 m | 41.0 m | 22.8% | 76.5% | 5.64 | 25.3 m |
| 2012-02-04 | FusionCore | 49.7 m | 31.5 m | 6.4% | 33.4% | 5.96 | 30.0 m |
| | RL-EKF | 265.5 m | 265.4 m | 0.0% | 0.1% | 31.84 | 44.3 m |
| 2012-03-31 | FusionCore | 22.0 m | 20.2 m | 19.9% | 67.3% | 2.27 | 21.4 m |
| | RL-EKF | 156.5 m | 156.3 m | 0.2% | 0.6% | 16.16 | 42.7 m |
| 2012-05-11 | FusionCore | 9.7 m | 4.9 m | 45.9% | 82.6% | 1.05 | 19.0 m |
| | RL-EKF | 11.5 m | 9.0 m | 56.2% | 90.1% | 1.25 | 20.2 m |
| 2012-06-15 | FusionCore | 49.2 m | 48.4 m | 2.4% | 20.0% | 8.40 | 22.4 m |
| | RL-EKF | 18.2 m | 17.1 m | 42.8% | 78.4% | 3.11 | 22.3 m |
| 2012-08-20 | FusionCore | 98.3 m | 97.9 m | 0.1% | 13.8% | 13.08 | 53.7 m |
| | RL-EKF | 10.6 m | 9.9 m | 59.4% | 89.3% | 1.40 | 19.1 m |
| 2012-09-28 | FusionCore | 10.8 m | 7.5 m | 31.4% | 76.9% | 1.50 | 23.7 m |
| | RL-EKF | 55.7 m | 55.5 m | 1.7% | 25.1% | 7.73 | 28.0 m |
| 2012-10-28 | FusionCore | 29.9 m | 21.1 m | 19.9% | 59.7% | 3.69 | 40.6 m |
| | RL-EKF | 60.0 m | 59.6 m | 0.1% | 3.6% | 7.40 | 27.8 m |
| 2012-11-04 | FusionCore | 60.1 m | 59.2 m | 3.8% | 29.5% | 9.86 | 32.3 m |
| | RL-EKF | 122.0 m | 121.9 m | 0.0% | 0.0% | 20.02 | 37.0 m |
| 2012-12-01 | FusionCore | 21.0 m | 14.6 m | 24.3% | 65.4% | 2.90 | 32.9 m |
| | RL-EKF | 90.7 m | 90.5 m | 5.3% | 20.6% | 12.53 | 42.1 m |
| 2013-02-23 | FusionCore | 59.4 m | 58.5 m | 1.6% | 16.2% | 6.67 | 24.1 m |
| | RL-EKF | 82.2 m | 81.8 m | 0.0% | 0.6% | 9.23 | 35.0 m |
| 2013-04-05 | FusionCore | 12.1 m | 10.1 m | 32.8% | 81.5% | 2.26 | 30.2 m |
| | RL-EKF | 268.9 m | 268.7 m | 0.0% | 0.0% | 50.11 | 27.3 m |
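The Drift (m/km) column is not defined on this page. The numbers are consistent with ATE 3D divided by the path length in kilometers, because both filters on the same sequence then imply the same distance traveled. The sketch below checks that reading on the 2012-01-08 rows; `implied_path_km` is a name invented here, and the definition is an inference from the table rather than a documented one.

```python
def implied_path_km(ate_3d_m: float, drift_m_per_km: float) -> float:
    """Path length implied by drift = ATE / path length (assumed definition)."""
    return ate_3d_m / drift_m_per_km


# 2012-01-08 rows of the full metrics table
fc_km = implied_path_km(18.6, 2.55)   # FusionCore
rl_km = implied_path_km(41.2, 5.64)   # RL-EKF

# Mean speed over the 92-minute sequence, using the FusionCore-implied path length
mean_speed_mps = fc_km * 1000.0 / (92 * 60)

print(f"implied path: FC {fc_km:.2f} km, RL {rl_km:.2f} km")
print(f"implied mean speed: {mean_speed_mps:.2f} m/s")
```

Both rows imply a path of roughly 7.3 km, which over 92 minutes is a mean speed a little below the 1.5 m/s Segway figure used later on this page.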
Methodology
Dataset: NCLT (University of Michigan, 2012-2013). Wheeled robot (Segway RMP) driving on a large campus over multiple seasons. Raw CSV sensor files replayed at 3x real time via nclt_player.
Sensors used (identical inputs to both filters):
- IMU: Microstrain 3DM-GX3-45 at 100 Hz (raw specific force, no factory gravity removal)
- Wheel odometry: Segway RMP encoders at 100 Hz
- GPS: Novatel SPAN-CPT, ~3m CEP, 5 Hz
Ground truth: RTK GPS (gps_rtk.csv), projected to local ENU via PROJ/WGS84. Evaluation: evo, SE(3)-aligned ATE.
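As a rough illustration of what "SE(3)-aligned ATE" means, the sketch below fits one rigid transform (rotation plus translation, no scale) between estimate and ground truth, then takes the RMSE of the remaining position errors. This is not evo's code: the function names are made up here, the poses are assumed to be already associated by timestamp, and the "within 5 m" figure assumes that column is computed on the aligned per-pose errors.

```python
import numpy as np


def align_rigid(est: np.ndarray, gt: np.ndarray):
    """Least-squares rotation R and translation t mapping est onto gt (Kabsch)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # avoid returning a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t


def ate_rmse(est: np.ndarray, gt: np.ndarray):
    """Return (ATE RMSE in meters, per-pose error norms) after rigid alignment."""
    R, t = align_rigid(est, gt)
    err = np.linalg.norm(est @ R.T + t - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2))), err


# Synthetic check: an estimate that differs from ground truth only by a rigid
# transform should align to (numerically) zero error.
rng = np.random.default_rng(0)
gt = rng.uniform(-100.0, 100.0, size=(200, 3))
yaw = np.deg2rad(30.0)
Rz = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
               [np.sin(yaw),  np.cos(yaw), 0.0],
               [0.0,          0.0,         1.0]])
est = gt @ Rz.T + np.array([5.0, -3.0, 1.0])

rmse, err = ate_rmse(est, gt)
print(f"ATE RMSE after alignment: {rmse:.2e} m")
print(f"within 5 m: {100.0 * np.mean(err <= 5.0):.1f}%")
```

Because the alignment removes any global offset and rotation, ATE measures how well the trajectory shape matches the ground truth, not the absolute datum.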
FusionCore config: Single YAML file (fusioncore_datasets/config/nclt_fusioncore.yaml), identical across all twelve sequences. No per-sequence tuning.
RL-EKF config: two_d_mode: true (flat-terrain Segway assumption), GPS fused via navsat_transform with a fixed datum from the first valid RTK fix. Chi-squared gating thresholds are matched to FusionCore's (odom0_twist_rejection_threshold: 4.03, odom1_pose_rejection_threshold: 3.72).
RL-UKF: Diverged with NaN on all sequences during sim-time playback (rapid timer catchup causes near-zero dt, Cholesky failure). Excluded from results.
What drives the results
Why RL-EKF fails on 10 sequences
The drift rate column tells the story most clearly. RL drift rates of 31.84 m/km (2012-02-04), 50.11 m/km (2013-04-05), and 20.02 m/km (2012-11-04) mean the filter is operating without GPS for large portions of those runs. A Segway at 1.5 m/s accumulating 31 m per kilometer traveled is in pure dead-reckoning almost the entire time.
The cause is consistent across all RL failures: nclt_player publishes position_covariance var_xy=9 (3m sigma), which is the Novatel SPAN-CPT specification under ideal open-sky conditions. Measured against the RTK ground truth, actual GPS noise across all twelve sequences looks like this:
| Sequence | Median error | p95 error | p99 error | RL result |
|---|---|---|---|---|
| 2012-01-08 | 3.7 m | 20.1 m | 49.7 m | 41.2 m |
| 2012-02-04 | 5.6 m | 46.6 m | 234.9 m | 265.5 m |
| 2012-03-31 | 5.7 m | 14.7 m | 32.7 m | 156.5 m |
| 2012-05-11 | 3.3 m | 13.3 m | 47.7 m | 11.5 m |
| 2012-06-15 | 2.6 m | 9.7 m | 21.3 m | 18.2 m (RL wins) |
| 2012-08-20 | 3.4 m | 12.7 m | 55.0 m | 10.6 m (RL wins) |
| 2012-09-28 | 3.5 m | 12.8 m | 43.2 m | 55.7 m |
| 2012-10-28 | 4.6 m | 16.0 m | 48.9 m | 60.0 m |
| 2012-11-04 | 5.7 m | 53.1 m | 79.2 m | 122.0 m |
| 2012-12-01 | 4.7 m | 20.7 m | 80.4 m | 90.7 m |
| 2013-02-23 | 5.4 m | 33.0 m | 73.6 m | 82.2 m |
| 2013-04-05 | 3.7 m | 19.9 m | 87.8 m | 268.9 m |
The driver states a 3 m sigma (var_xy=9). The median actual error is 2.6-5.7 m across all sequences (already close to or above the stated 1-sigma), and p95 ranges from 9.7 m to 53.1 m. RL's gate is calibrated to the stated 3 m; it rejects anything beyond roughly 3x that (Mahalanobis distance above the chi2 threshold). On sequences like 2012-02-04 and 2012-11-04, most GPS fixes are outliers by RL's definition of "outlier."
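The gating arithmetic can be sketched as follows. This is a simplified model, not robot_localization's code: it assumes the 3.72 threshold from the RL-EKF config is applied as a Mahalanobis distance, treats the covariance as isotropic, and ignores the filter's own state uncertainty (the `state_var` argument, zero by default). All function names are invented for illustration.

```python
import math


def mahalanobis_2d(err_m: float, var_xy: float, state_var: float = 0.0) -> float:
    """Mahalanobis distance of a horizontal error under covariance (var_xy + state_var) * I."""
    return err_m / math.sqrt(var_xy + state_var)


def gate_radius_m(var_xy: float, threshold: float = 3.72) -> float:
    """Largest horizontal error that still passes the gate (state uncertainty ignored)."""
    return threshold * math.sqrt(var_xy)


def accepted(err_m: float, var_xy: float = 9.0, threshold: float = 3.72) -> bool:
    return mahalanobis_2d(err_m, var_xy) <= threshold


print(f"gate radius at var_xy=9: {gate_radius_m(9.0):.2f} m")
print("2012-06-15 p95 (9.7 m) accepted:", accepted(9.7))
print("2012-02-04 p95 (46.6 m) accepted:", accepted(46.6))
```

In a running filter the innovation is measured from the current estimate, not from ground truth, so once the estimate has drifted even accurate fixes can fall outside the gate.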
The two sequences where RL wins (2012-06-15 and 2012-08-20) have the cleanest GPS of the set: p95 of 9.7 m and 12.7 m respectively. RL's tight gate works when the actual noise matches the stated noise. It loses everywhere else.
The contrast on 2012-05-11 (RL drift: 1.25 m/km vs 31.84 m/km on 2012-02-04) makes the mechanism concrete. Same robot, same campus, same config. The only difference is GPS data quality on that specific day. When GPS covariance matches actual sensor noise, both filters perform comparably (9.7m vs 11.5m). The advantage opens when the reported covariance is too tight.
FusionCore's adaptive.gnss: true adjusts GPS measurement noise in real time from the innovation sequence. When actual GPS noise is higher than the driver reports, the adaptive window inflates the noise model and keeps chi2 statistics calibrated. RL has no equivalent.
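One generic way to adapt measurement noise from the innovation sequence is windowed covariance matching: the empirical innovation variance minus the predicted state contribution gives an estimate of the true measurement variance. The sketch below shows that textbook scheme for a single axis. It is not FusionCore's implementation of adaptive.gnss, which is not shown on this page; the class name, window length, and the choice to floor the estimate at the driver-reported variance are all assumptions.

```python
from collections import deque


class AdaptiveNoise:
    """Windowed innovation-based estimate of one measurement-noise variance (m^2)."""

    def __init__(self, reported_var: float, window: int = 50):
        self.reported_var = reported_var          # variance claimed by the driver
        self.sq_innov = deque(maxlen=window)      # recent squared innovations

    def update(self, innovation_m: float, predicted_var: float) -> float:
        """innovation_m: measurement minus prediction; predicted_var: H P H^T for this axis."""
        self.sq_innov.append(innovation_m ** 2)
        empirical = sum(self.sq_innov) / len(self.sq_innov)
        # Covariance matching: E[innovation^2] = H P H^T + R, so R ~ empirical - H P H^T.
        # Never go below what the driver reports (assumed design choice).
        return max(self.reported_var, empirical - predicted_var)


noise = AdaptiveNoise(reported_var=9.0)
for _ in range(50):
    r = noise.update(innovation_m=10.0, predicted_var=1.0)
print(f"adapted variance: {r:.1f} m^2 (driver reports 9.0)")
```

With persistent 10 m innovations the estimate settles far above the reported 9 m^2, which widens the gate instead of rejecting every fix.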
What would help RL: Increasing position_covariance var_xy in nclt_player from 9 to 25 (5m sigma, closer to actual NCLT GPS accuracy in urban conditions) would reduce RL's catastrophic losses substantially without per-sequence tuning. This does not require modifying robot_localization itself, only the dataset player. However, RL would still lack adaptive noise, and the calibration burden would remain whenever the dataset or environment changes.
What drives FC performance variation
The single best predictor of FC ATE is the longest GPS blackout in the sequence:
| Max blackout | Sequences | FC ATE range |
|---|---|---|
| ≤ ~200s | 2012-01-08 (203s), 2012-12-01 (173s) | 18-21 m |
| 200-300s | 2012-03-31, 2012-05-11, 2012-09-28, 2012-10-28, 2013-04-05 | 10-30 m |
| 300-480s | 2012-02-04, 2012-06-15, 2012-11-04, 2013-02-23 | 49-60 m |
| Adversarial GPS | 2012-08-20 (228s blackout + 105 corrupt fixes at boundary) | 98.3 m |
FC drift rate is consistent at 1-4 m/km on clean sequences. Values above 6 m/km (2012-06-15, 2012-08-20, 2012-11-04, 2013-02-23) signal heading error accumulated during coast mode. The 2012-08-20 transient is a distinct failure mode: adversarial GPS data at the blackout boundary, not heading drift.
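The page does not define how "max blackout" is measured. A natural reading is the longest gap between consecutive mode-3 fix timestamps, which the sketch below computes. The function name and the synthetic timestamps are illustrative only, and the definition itself is an assumption.

```python
def max_blackout_s(fix_times_s):
    """Longest gap in seconds between consecutive fix timestamps (0.0 if fewer than two)."""
    times = sorted(fix_times_s)
    return max((b - a for a, b in zip(times, times[1:])), default=0.0)


# Synthetic stream: 5 Hz fixes for 10 s, an outage, then 5 Hz fixes again from t = 213 s.
before = [i * 0.2 for i in range(51)]            # 0.0 .. 10.0 s
after = [213.0 + i * 0.2 for i in range(50)]     # 213.0 .. 222.8 s
print(f"max blackout: {max_blackout_s(before + after):.1f} s")
```

Only fixes the filter would actually consume should be passed in, so lower-quality fix modes need to be filtered out first if they are present in the raw file.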
FC performance tiers
Excellent (< 20m ATE): 2012-05-11 (9.7m), 2012-09-28 (10.8m), 2013-04-05 (12.1m), 2012-01-08 (18.6m)
Mostly high GPS fix counts (19k-22k, except 2013-04-05 at 16k) and max blackouts around or under 200s (except 2013-04-05 at 275s), no adversarial GPS. FC operates as intended.
Good (20-35m ATE): 2012-03-31 (22.0m), 2012-12-01 (21.0m), 2012-10-28 (29.9m)
Moderate GPS density, blackouts under 275s, clean GPS at boundaries. Occasional heading drift corrected quickly on GPS return.
Moderate (35-65m ATE): 2012-02-04 (49.7m), 2012-06-15 (49.2m), 2013-02-23 (59.4m), 2012-11-04 (60.1m)
Long blackouts (240-462s) or low GPS density. Heading drift compounds over coast mode duration before correction.
Poor (> 65m ATE): 2012-08-20 (98.3m)
Structurally different failure: adversarial GPS cluster at blackout boundary. Outside the 2-3 minute transient windows, FC tracks at 5-10 m, on par with RL-EKF.
The two FC losses: honest analysis
2012-06-15 (FC 49.2m, RL 18.2m)
The lowest-density GPS sequence in the set: 12,399 mode-3 fixes vs 16,000-22,000 on the others. It contains one GPS blackout of 462 seconds (7.7 minutes).
During the blackout, FC dead-reckons on encoder and IMU. Coast mode inflates Q_position (coast_q_factor=10) and down-weights IMU WZ (coast_imu_wz_scale=500), so encoder WZ dominates heading. The encoder WZ bias (B_EWZ) is calibrated from GPS heading cross-covariance before the blackout and subtracted during it. However, any residual B_EWZ error compounds over 7.7 minutes. At 100 Hz with even a small uncorrected heading rate, lateral position error grows quadratically.
RL-EKF wins here because its 2D mode has a simpler state vector and accumulates less uncertainty over the blackout. This is a structural advantage for RL on GPS-sparse sequences with very long blackouts on flat terrain. See issue #63.
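The quadratic growth mentioned above follows from a constant yaw-rate bias: heading error grows linearly with time, and cross-track error is speed times the integral of heading error, so it grows with the square of the blackout duration. The sketch below evaluates that small-angle approximation for the 462 s blackout. The 1.5 m/s speed is the Segway figure used earlier on this page; the 0.01 deg/s residual bias is a hypothetical value chosen only for illustration, not a measured B_EWZ residual.

```python
import math


def lateral_error_m(speed_mps: float, yaw_rate_bias_dps: float, duration_s: float) -> float:
    """Small-angle cross-track error from a constant, uncorrected yaw-rate bias."""
    bias_rad_s = math.radians(yaw_rate_bias_dps)
    # heading_error(t) = bias * t, so lateral error = speed * integral(bias * t) = 0.5 * v * bias * T^2
    return 0.5 * speed_mps * bias_rad_s * duration_s ** 2


short = lateral_error_m(1.5, 0.01, 231.0)   # half the blackout
full = lateral_error_m(1.5, 0.01, 462.0)    # the 2012-06-15 blackout
print(f"231 s: {short:.1f} m, 462 s: {full:.1f} m, ratio {full / short:.1f}x")
```

Doubling the blackout quadruples the error, which is why a residual that is harmless over a 120 s outage becomes tens of meters over 462 s.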
Path to fixing this:
- Reduce coast_imu_wz_scale from 500 to 50-100 for blackouts exceeding 200s. At 500x down-weighting, the IMU WZ is essentially ignored. Both sensors sharing heading responsibility reduces sensitivity to B_EWZ residual error.
- Magnetometer integration closes the observability gap completely: an absolute heading reference during GPS absence makes B_GZ and B_EWZ irrelevant. This is the architecturally correct fix and is on the roadmap.
- Duration-dependent coast_q_factor: the current fixed 10x multiplier was tuned for the majority of sequences. For blackouts > 300s, a nonlinear ramp may reduce heading drift without sacrificing re-acquisition on short blackouts.
2012-08-20 (FC 98.3m, RL 10.6m)
The raw GPS stream (gps.csv) contains 105 mode-3 fixes that sit 720-840 m off the RTK ground truth. The ground-truth preprocessor excludes them, but they appear as valid mode-3 fixes in the real data stream. They cluster in a 24-second window at the end of the second GPS blackout (211s at t=62.5 min).
This is adversarial for any chi2-based gating scheme. During coast mode recovery, FC relaxes the chi2 gate to accept the first valid returning fix after genuine position drift. A dense cluster of corrupt fixes arriving at exactly the re-acquisition moment exploits this window.
Per-minute error analysis:
| Time | FC error | Status |
|---|---|---|
| 0-42 min | 1-10 m | Normal GPS coverage, both filters tracking well |
| 43-46 min | spike to ~100m, recovers | Blackout 1 (228s): boundary GPS errors up to ~70m |
| 47-62 min | 3-10 m | Full recovery |
| 63-67 min | spike to ~788m, recovers in 2 min | Blackout 2 (211s): 105 adversarial fixes at boundary |
| 68-82 min | 5-10 m | Full recovery, remaining 15 minutes on par with RL |
The 98m ATE RMSE is driven almost entirely by those two transients. RL-EKF wins because its tight Mahalanobis gate (calibrated to the stated 3m sigma, which causes GPS rejection on 10 other sequences) accidentally rejects these outliers too. See issue #64.
Path to fixing this:
- Velocity sanity check: A GPS fix 720 m from the dead-reckoned position implies a near-instantaneous jump, far beyond what a robot moving at about 1.5 m/s can do. A hard max_implied_speed check (e.g., 20 m/s) operating before the chi2 gate rejects this trivially and has zero effect on normal operation.
- Cluster consistency gate: Five consecutive fixes all landing 720-840 m from the predicted position with geometric consistency (tight cluster, not random scatter) is distinguishable from noise. A secondary check on cluster coherence catches this without affecting single-fix rejection behavior.
- Gate hysteresis on recovery: Instead of a step change in chi2 threshold at re-acquisition, a linear ramp from relaxed back to nominal over the first N returned fixes makes it harder for a dense cluster to slip through entirely.
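The gate-hysteresis idea in the last bullet can be sketched as a linear interpolation of the gate threshold over the first N fixes after re-acquisition. Everything below is hypothetical: the function name, the relaxed value of 10.0, and the 25-fix ramp are placeholders, and the nominal value simply reuses the 3.72 threshold from the RL-EKF config for illustration. None of it reflects existing FusionCore parameters.

```python
def recovery_gate(fixes_since_reacquire: int,
                  nominal: float = 3.72,
                  relaxed: float = 10.0,
                  ramp_fixes: int = 25) -> float:
    """Gate threshold that ramps linearly from relaxed to nominal over ramp_fixes fixes."""
    k = min(max(fixes_since_reacquire, 0), ramp_fixes)   # clamp to [0, ramp_fixes]
    return relaxed + (nominal - relaxed) * (k / ramp_fixes)


for k in (0, 5, 12, 25, 100):
    print(f"fix {k:3d} after re-acquisition: gate = {recovery_gate(k):.2f}")
```

With a ramp of this kind only the first few returning fixes see the fully relaxed gate, so a 105-fix cluster would face a steadily tightening threshold instead of a wide-open window.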
Reproduce
# Build
colcon build --packages-select fusioncore_core fusioncore_ros fusioncore_datasets
# Run one sequence (auto-stops on playback complete, ~15-50 min at 3x)
bash benchmarks/run_one.sh 2012-01-08
# Results written to:
# benchmarks/nclt/2012-01-08/results_full/BENCHMARK.md
# Run all 12 sequences sequentially (plan for 6-8 hours total)
bash benchmarks/run_all.sh
Full tooling and configs in benchmarks/.