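• Abstract: Owing to the complex underwater environment, side-scan sonar images commonly suffer from strong noise interference, geometric distortion, and blurred features, which severely limit the performance of conventional detection models. In addition, sonar image datasets are small and their annotation quality is uneven, which makes model training even harder. To address these problems, this paper proposes an improved detection model, YOLOv8-efficient sonar image (YOLOv8-ESI). First, a wavelet convolution layer (WTConv) is introduced into the backbone network to jointly strengthen the extraction of low-frequency structural information and high-frequency detail features. Next, the neck network is restructured into a dynamic feature pyramid network (D-GFPN) architecture that combines an efficient dynamic upsampler (DySample) with a cross-stage partial dual-path lightweight aggregation network module (CSP-DLAN), effectively integrating multi-scale high-level semantics with spatial detail. Finally, the WIoU loss function replaces the original CIoU; by suppressing the harmful gradients produced by low-quality samples, it makes training more robust. Experimental results show that, compared with the original YOLOv8n, YOLOv8-ESI performs markedly better on the side-scan sonar dataset: mAP50 increases by 5.2%, the parameter count decreases by 10%, and computational complexity decreases by 1.1 GFLOPs. YOLOv8-ESI thus achieves accurate detection of targets in side-scan sonar images together with a lightweight design, suiting the real-time detection needs of underwater equipment.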

       

      Abstract:
      Objective Side-scan sonar (SSS) is an indispensable acoustic imaging technology for seabed exploration, underwater archaeology, and search and rescue operations. It provides wide-range, high-resolution images of the seafloor by transmitting acoustic pulses and recording their backscattered echoes. However, because of the complex underwater environment and the inherent characteristics of acoustic imaging, SSS images are typically contaminated by severe speckle noise, geometric distortions, uneven resolution, and blurred target boundaries. These degradations significantly impair the performance of conventional object detection models, which were originally designed for optical images. Furthermore, SSS datasets are often limited in scale and suffer from inconsistent annotation quality, because low resolution and fuzzy features make it difficult to label targets precisely. Such low-quality training samples introduce harmful gradients during optimization, further compromising model robustness. To address these challenges, this paper proposes an improved detection model named YOLOv8-Efficient Sonar Image (YOLOv8-ESI), which aims to achieve accurate, robust, and lightweight underwater target detection that meets the real-time requirements of resource-constrained autonomous underwater vehicles.
      Methods YOLOv8-ESI is built on the YOLOv8n baseline through three complementary modifications tailored to SSS image characteristics. First, a wavelet convolutional layer (WTConv) is integrated into the backbone network. Unlike standard convolutions, which operate solely in the spatial domain, WTConv performs a cascaded two-dimensional Haar wavelet decomposition so that features are extracted in the frequency domain. The low-frequency sub-band (LL) captures global structural information such as the highlight-shadow patterns of sonar targets, while the high-frequency sub-bands (LH, HL, HH) preserve edge details and textures. By applying small depthwise convolutions to these sub-bands, WTConv enlarges the receptive field without introducing excessive parameters. Moreover, the multi-scale wavelet decomposition enables adaptive noise suppression through thresholding in the frequency domain, which improves the model's ability to distinguish true targets from background clutter. Second, the neck network is reconstructed into a dynamic feature pyramid network (D-GFPN). This architecture incorporates an efficient dynamic upsampler (DySample) that employs a multi-level point sampling strategy to preserve fine-grained details when feature resolution is recovered, which is critical for small target detection in SSS images. A cross-stage partial dual-path lightweight aggregation network module (CSP-DLAN) is also designed; it uses parallel 1×1 and 3×3 grouped depthwise convolutions together with channel shuffle to achieve efficient multi-scale feature fusion with minimal computational overhead. This design balances the extraction of spatial details against channel-wise interactions while promoting cross-group information flow. Third, the original CIoU loss is replaced with WIoU, which employs a dynamic non-monotonic focusing mechanism based on outlier degree to assess anchor box quality. By assigning smaller gradient gains to low-quality samples and focusing on medium-quality anchors, WIoU mitigates the harmful effects of the annotation noise and bounding box inaccuracies prevalent in SSS datasets, thereby improving training stability and overall detection accuracy.
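The sub-band split that WTConv relies on can be illustrated with a single-level 2D Haar decomposition. The following is a minimal sketch in plain Python, not the authors' layer: the function name `haar_dwt2`, the orthonormal 1/2 scaling, and the LH/HL naming (which differs between libraries) are assumptions made for illustration, and the depthwise convolutions that WTConv applies to each sub-band are omitted.

```python
def haar_dwt2(image):
    """Single-level 2D Haar decomposition of an even-sized grayscale image.

    `image` is a list of rows of numbers. Returns four half-resolution
    sub-bands (LL, LH, HL, HH) as lists of rows.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            # The four pixels of one 2x2 block.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # block average: coarse structure
            lh_row.append((a + b - d - e) / 2.0)  # top minus bottom: horizontal edges
            hl_row.append((a - b + d - e) / 2.0)  # left minus right: vertical edges
            hh_row.append((a - b - d + e) / 2.0)  # diagonal detail
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


# Hypothetical 4x4 tile: a bright "highlight" on the left, a dark "shadow" on the right.
tile = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
ll1, lh1, hl1, hh1 = haar_dwt2(tile)   # level 1
ll2, lh2, hl2, hh2 = haar_dwt2(ll1)    # level 2: cascade on the LL band
print(ll1, hl1)
print(ll2, hl2)
```

In this toy tile every 2×2 block is constant, so the level-1 detail bands are all zero and the highlight-shadow boundary only appears in the level-2 HL band. This is the sense in which cascading the decomposition lets small kernels respond to progressively larger structures.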
      Results and Discussions Experiments are conducted on a self-constructed SSS dataset (SSSD) that integrates publicly available sources (Seabed Objects-KLSG and SONAR-2019) with additional images collected from underwater experiments and online repositories. The dataset contains 956 original images of three target categories: shipwrecks (602 images), aircraft wrecks (279 images), and human remains (75 images), with resolutions ranging from 300×300 pixels to 800×500 pixels. The dataset is split into training, validation, and test sets with a ratio of 7:2:1. To enhance generalization and prevent overfitting, the training set is augmented through horizontal flipping, random rotation, random brightness adjustment, and Gaussian noise addition, resulting in 2926 training samples. Ablation studies are performed to validate the contribution of each proposed module. Integrating WTConv alone improves mAP50 by 2.6% and mAP50:95 by 2.5% while reducing parameters, demonstrating its effectiveness in frequency-domain feature enhancement and noise suppression. The proposed D-GFPN outperforms mainstream neck structures including PANet, BiFPN, HS-FPN, and RepGFPN, achieving 87.3% mAP50 with only 2.98 M parameters and 7.8 GFLOPs, which corresponds to a 9.1% parameter reduction and an 8.3% FLOPs reduction compared to RepGFPN at the same mAP50. Among the WIoU variants, WIoUv3 yields the best performance; in particular, it improves detection of challenging categories such as human remains by 1.7%, which supports its effectiveness in handling low-quality samples. The complete YOLOv8-ESI model achieves 88.8% mAP50 and 63.4% mAP50:95, surpassing the baseline YOLOv8n by 5.2% and 3.8%, respectively, while reducing parameters by 10% (from 3.01 M to 2.69 M) and computational complexity by 1.1 GFLOPs. Comparative experiments against state-of-the-art detectors including YOLOv3-tiny, YOLOv5s, YOLOv6n, YOLOv9c, YOLOv10n, YOLOv11n, RT-DETR, SCR-YOLOv8, and CSC-YOLO demonstrate that YOLOv8-ESI achieves superior detection accuracy with competitive inference speed (1.3 ms per image) and model compactness. Heatmap visualizations confirm that the model focuses precisely on target regions with suppressed background activation, while detection examples illustrate fewer false positives and improved boundary localization compared to baseline models.
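To make the non-monotonic focusing behind WIoUv3 concrete, the sketch below evaluates the gradient gain as a function of the outlier degree. It is a hedged illustration, not the authors' training code: it assumes the commonly published WIoU v3 form r = β / (δ · α^(β − δ)), where β is a box's IoU loss divided by a running mean of the IoU loss, and it assumes the hyperparameters α = 1.9 and δ = 3. The abstract does not state which values were used, and the function names are hypothetical.

```python
import math


def outlier_degree(iou_loss, mean_iou_loss):
    """Outlier degree beta: a box's IoU loss relative to the running mean loss."""
    return iou_loss / mean_iou_loss


def wiou_v3_gain(beta, alpha=1.9, delta=3.0):
    """Non-monotonic gain r = beta / (delta * alpha ** (beta - delta)).

    alpha and delta are assumed hyperparameters, not values from the paper.
    """
    return beta / (delta * alpha ** (beta - delta))


# The gain peaks at beta = 1 / ln(alpha) and falls off on both sides.
peak_beta = 1.0 / math.log(1.9)

for beta in (0.1, peak_beta, 3.0, 6.0):
    print(f"beta={beta:.2f}  gain={wiou_v3_gain(beta):.3f}")
```

Under these assumptions, well-fitted boxes (small β) and badly fitted or mislabeled boxes (large β) both receive a small gain, while boxes of intermediate quality receive the largest one. This matches the described behavior of down-weighting low-quality samples and focusing on medium-quality anchors.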
      Conclusions This paper presents YOLOv8-ESI, a lightweight and accurate detection model specifically designed for side-scan sonar imagery. By integrating wavelet convolution for frequency-domain feature enhancement, a dynamic feature pyramid network for adaptive multi-scale fusion, and the WIoU loss for robust training against annotation noise, the model addresses the core challenges of noise interference, scale variability, feature blurring, and low-quality annotations inherent in SSS data. Experimental results demonstrate that YOLOv8-ESI outperforms the baseline and the compared state-of-the-art models in accuracy while remaining compact and fast, meeting the real-time detection requirements of underwater platforms. The ablation studies validate the complementary effects of the proposed modules, and the comparative experiments confirm the model's advantage across multiple evaluation metrics. This work provides an effective technical solution for advancing underwater object detection in marine exploration, archaeology, and search and rescue operations. Future work may explore the integration of advanced denoising techniques and the extension of the model to other underwater imaging modalities such as synthetic aperture sonar and multi-beam echo sounders.