AMCF-NET: ADAPTIVE MULTI-SCALE CROSS-MODAL FUSION NETWORK FOR UAV-SATELLITE CROSS-VIEW LOCALIZATION
Abstract
Cross-view localization between Unmanned Aerial Vehicle (UAV) and satellite imagery
is crucial for autonomous navigation in GPS-denied environments. However, large domain
gaps, including viewpoint discrepancies, scale variations, and appearance differences, pose
significant challenges. In this paper, we propose the Adaptive Multi-scale Cross-modal Fusion
Network (AMCF-Net), a novel approach that effectively addresses these limitations through a
shared backbone architecture and adaptive fusion mechanisms. Unlike previous dual-backbone
approaches that process UAV and satellite images separately, our method employs a unified
FocalNet-Tiny backbone to extract cross-modal features, followed by an Adaptive Multi-scale
Cross-modal Fusion (AMCF) module that dynamically combines multi-scale similarity maps
using learned adaptive weights. This shared representation learning enables better cross-modal
alignment and significantly reduces computational overhead. Comprehensive experiments on
the UL14 benchmark demonstrate that AMCF-Net achieves state-of-the-art performance, with a
Relative Distance Score (RDS) of 78.12% and meter-level accuracy of 27.25% at 3 m, 50.16%
at 5 m, 84.37% at 10 m, and 88.51% at 20 m. Ablation studies further validate the
effectiveness of the shared backbone and adaptive fusion mechanism, showing consistent
improvements over dual-backbone approaches that process the two views separately.
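To make the fusion idea sketched in the abstract concrete, the snippet below illustrates one plausible form of adaptive multi-scale similarity fusion: per-scale similarity maps are resized to a common resolution and combined with spatially varying, learned softmax weights. This is a minimal sketch under our own assumptions; the class name `AdaptiveFusion`, the `weight_head` layer, and all tensor shapes are illustrative and do not reproduce the authors' implementation.

```python
# Minimal sketch of adaptive multi-scale similarity fusion (illustrative;
# names and shapes are assumptions, not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Fuses per-scale UAV-satellite similarity maps with learned weights."""
    def __init__(self, num_scales: int):
        super().__init__()
        # Small head predicting a per-scale weight at every spatial location.
        self.weight_head = nn.Conv2d(num_scales, num_scales, kernel_size=1)

    def forward(self, sims):  # sims: list of (B, 1, H_i, W_i) similarity maps
        # Resize all similarity maps to the finest resolution.
        h, w = sims[0].shape[-2:]
        stacked = torch.cat(
            [F.interpolate(s, size=(h, w), mode="bilinear", align_corners=False)
             for s in sims], dim=1)                          # (B, S, H, W)
        # Spatially varying softmax weights over the S scales.
        weights = self.weight_head(stacked).softmax(dim=1)   # (B, S, H, W)
        # Weighted sum collapses the scale dimension into one fused map.
        return (weights * stacked).sum(dim=1, keepdim=True)  # (B, 1, H, W)

# Example: three similarity maps at decreasing resolution.
sims = [torch.randn(2, 1, 64, 64), torch.randn(2, 1, 32, 32),
        torch.randn(2, 1, 16, 16)]
fused = AdaptiveFusion(num_scales=3)(sims)
print(fused.shape)  # torch.Size([2, 1, 64, 64])
```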