Simulated and Unsupervised Learning for the Generation of Realistic Imaging Sonar Datasets

Apr 30, 2024 · Tianxiang Lin

In this project I present a novel Simulated and Unsupervised (S+U) learning approach for the generation of realistic imaging sonar datasets.

Introduction

In this project I present a novel Simulated and Unsupervised (S+U) learning approach for the generation of realistic imaging sonar datasets. Autonomous Underwater Vehicles (AUVs) rely on imaging sonars for underwater localization and mapping, as these sensors perceive broader areas and obtain more reliable information than optical sensors in underwater environments. However, sonar data collection is time-consuming and sometimes dangerous, and underwater simulators have yet to provide realistic simulated data. To enable underwater tasks, and in particular to train learning-based methods with large datasets, I propose a SimGAN-based baseline that transfers a simulated imaging sonar dataset to the domain of an unpaired real-world sonar dataset. In particular, I integrate a self-regularization loss and a cycle consistency loss into the refiner and introduce a masked generative adversarial network (GAN) loss into the discriminator. The masks are generated by a smallest-of cell-averaging constant false alarm rate (SOCA-CFAR) detector. The experimental results demonstrate that our proposed method provides more realistic imaging sonar datasets than existing state-of-the-art GAN methods.

Baseline

The baseline architecture is shown in the figure below. The refiner's task is to transform the noise distribution of the simulated inputs to the domain of the real-world dataset. The discriminator, on the other hand, distinguishes real sonar images from the refined synthetic outputs.

Given an unpaired dataset with synthetic sonar data from the HoloOcean underwater simulator and real-world data from our test tank, our goal is to learn a mapping function between the synthetic and real datasets. Denote the simulated inputs and the unpaired real data as $X=\{x_i\}_{i=1}^N$ and $Y=\{y_j\}_{j=1}^M$ respectively, the data distributions as $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$, and the two mappings as $G: X \to Y$ and $F: Y \to X$. For the refiner, I use a 9-block ResNet architecture. For the discriminator, I use a basic 3-layer PatchGAN classifier with masked inputs, whose masks are obtained from a SOCA-CFAR detector. Denote the foreground masks as $M_X=\{m_x^{(i)}\}_{i=1}^N$ and $M_Y=\{m_y^{(j)}\}_{j=1}^M$, and the background masks as $\overline{M_X}$ and $\overline{M_Y}$. Notably, the background masks are the complements of the foreground masks.
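
To illustrate how such masks can be obtained, below is a minimal sketch of a SOCA-CFAR detector applied along the range axis of a sonar image. The window size, guard band, and threshold scale are illustrative assumptions, not the exact parameters used in this project.

```python
import numpy as np

def soca_cfar_mask(img, train=8, guard=2, scale=1.5):
    """Toy SOCA-CFAR foreground detector (sketch, not the project's exact code).

    For every cell, the leading and lagging training windows along the range
    axis are averaged separately, and the smaller of the two averages
    ("smallest-of") is taken as the local noise estimate. A cell is marked
    foreground if its intensity exceeds the scaled noise estimate.
    """
    rows, cols = img.shape
    mask = np.zeros_like(img, dtype=bool)
    for r in range(rows):                          # one azimuth beam at a time
        for c in range(cols):
            lead = img[r, max(0, c - guard - train):max(0, c - guard)]
            lag = img[r, c + guard + 1:c + guard + 1 + train]
            if lead.size == 0 or lag.size == 0:    # skip cells near the borders
                continue
            noise = min(lead.mean(), lag.mean())   # smallest-of noise estimate
            mask[r, c] = img[r, c] > scale * noise
    return mask

# Foreground mask m_x and its complement (background mask) for one frame:
# m_x = soca_cfar_mask(simulated_frame)   # simulated_frame: 2D NumPy array
# m_x_bar = ~m_x
```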

In this project I modify the original CycleGAN codebase (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix). The overall objective of the networks can be defined as:

$$\mathcal{L}(G,F,D_X,D_Y,X,Y,M_X,M_Y) = \mathcal{L}_{\text{GAN}} + \lambda_{cycle}\mathcal{L}_{cycle} + \lambda_{idt}\mathcal{L}_{idt} + \lambda_{reg}\mathcal{L}_{reg}$$

I aim to solve:

$$G^*,F^* = \arg\min_{G,F}\max_{D_X,D_Y} \mathcal{L}(G,F,D_X,D_Y,X,Y,M_X,M_Y)$$

The overall training loss consists of four parts: the masked adversarial loss, the cycle consistency loss, the identity loss, and the self-regularization loss. The masked adversarial loss can be represented as:

$$\begin{align*} \mathcal{L}_{\text{GAN}}(G,D_Y,X,Y,M_X,M_Y) &= \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(m_y \otimes y)] \\ &+ \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(\overline{m_y} \otimes y)] \\ &+ \mathbb{E}_{x \sim p_{data}(x)}[\log (1-D_Y(m_x \otimes G(x)))] \\ &+ \mathbb{E}_{x \sim p_{data}(x)}[\log (1-D_Y(\overline{m_x} \otimes G(x)))] \end{align*}$$
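
As a rough PyTorch sketch of this term (assuming the masks are broadcastable tensors and `D_Y` is the masked PatchGAN discriminator returning patch-wise logits; the exact implementation in the modified codebase may differ), the discriminator side could be written as:

```python
import torch
import torch.nn.functional as F_nn

def masked_gan_loss_D(D_Y, real_y, fake_y, m_y, m_x):
    """Masked adversarial loss for the discriminator (illustrative sketch).

    real_y : real sonar image y,   shape (B, 1, H, W)
    fake_y : refined output G(x),  shape (B, 1, H, W), detached when updating D_Y
    m_y, m_x : foreground masks of y and x; the background masks are (1 - m).
    Each masked image is scored by D_Y; real crops target 1, fake crops target 0.
    """
    terms = [
        (m_y * real_y, 1.0),        # foreground of real image    -> real
        ((1 - m_y) * real_y, 1.0),  # background of real image    -> real
        (m_x * fake_y, 0.0),        # foreground of refined image -> fake
        ((1 - m_x) * fake_y, 0.0),  # background of refined image -> fake
    ]
    loss = 0.0
    for masked_img, target in terms:
        logits = D_Y(masked_img)
        loss = loss + F_nn.binary_cross_entropy_with_logits(
            logits, torch.full_like(logits, target))
    return loss / len(terms)
```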

The cycle consistency loss can be expressed as:

$$\begin{align*} \mathcal{L}_{cycle}(G,F,X,Y) &= \mathbb{E}_{x \sim p_{data}(x)}||F(G(x))-x||_{L1} \\ &+ \mathbb{E}_{y \sim p_{data}(y)}||G(F(y))-y||_{L1} \end{align*}$$

The identity loss encourages the mappings to behave as identity functions when their inputs already lie in the target domain. It is represented as:

$$\begin{align*} \mathcal{L}_{idt}(G,F,X,Y) &= \mathbb{E}_{y \sim p_{data}(y)}||G(y)-y||_{L1} \\ &+ \mathbb{E}_{x \sim p_{data}(x)}||F(x)-x||_{L1} \end{align*}$$

The self-regularization loss keeps the refined output close to its input so that the scene content of the simulated image is preserved. It is expressed as:

$$\begin{align*} \mathcal{L}_{reg}(G,F,X,Y) &= \mathbb{E}_{x \sim p_{data}(x)}||G(x)-x||_{L1} \\ &+ \mathbb{E}_{y \sim p_{data}(y)}||F(y)-y||_{L1} \end{align*}$$
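
Putting the four terms together, a hedged PyTorch sketch of the generator-side objective might look like the following. The `lam_*` weights are placeholders rather than the values used in the project, and the foreground/background masking of the discriminator inputs is omitted here for brevity.

```python
import torch
import torch.nn.functional as F_nn

def generator_loss(G, F_net, D_X, D_Y, x, y,
                   lam_cycle=10.0, lam_idt=0.5, lam_reg=1.0):
    """Full refiner objective: adversarial + cycle + identity + self-regularization (sketch)."""
    fake_y, fake_x = G(x), F_net(y)          # refined images in both directions

    # Adversarial terms: the refiners try to make the discriminators output "real".
    logits_y, logits_x = D_Y(fake_y), D_X(fake_x)
    adv = (F_nn.binary_cross_entropy_with_logits(logits_y, torch.ones_like(logits_y))
           + F_nn.binary_cross_entropy_with_logits(logits_x, torch.ones_like(logits_x)))

    # Cycle consistency: F(G(x)) should recover x, and G(F(y)) should recover y.
    cyc = F_nn.l1_loss(F_net(fake_y), x) + F_nn.l1_loss(G(fake_x), y)

    # Identity: images already in the target domain should pass through unchanged.
    idt = F_nn.l1_loss(G(y), y) + F_nn.l1_loss(F_net(x), x)

    # Self-regularization: the refined image should stay close to its own input.
    reg = F_nn.l1_loss(fake_y, x) + F_nn.l1_loss(fake_x, y)

    return adv + lam_cycle * cyc + lam_idt * idt + lam_reg * reg
```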

Dataset

Our unpaired dataset comes from two sources. For the synthetic data, HoloOcean, a realistic underwater simulator, provides unlimited access to simulated imaging sonar data. I collected the training and testing datasets from two different simulated underwater scenarios (a submarine and a sunken plane). The following figures show their 3D models. (Figure source: https://github.com/rpl-cmu/neusis)

I also collected the real-world dataset from our test water tank in Newell-Simon Hall, using a Bluefin Hovering Autonomous Underwater Vehicle equipped with a Sound Metrics DIDSON 300m imaging sonar. Two datasets were collected, one for unsupervised training and one for post-training experiments. The following figures show the structures deployed in the water tank during data collection.

Experiment Results

To measure the performance of our proposed baseline, I performed an ablation study over multiple models, including the original SimGAN, CycleGAN, and CycleGAN with only the self-regularization loss or with the masked GAN loss. I trained these networks on our collected datasets and demonstrate the generation results qualitatively. To quantify the performance, I created paired datasets from the sonar images generated by each model and the foreground masks of the original simulated dataset, then trained segmentation models on these paired datasets. The segmentation models were tested on a real-world dataset, and the predicted masks were compared against the ground-truth masks of the real-world dataset using intersection over union (IoU). A higher IoU implies a better transformation performance of the corresponding generation model.
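
For reference, the IoU between a predicted and a ground-truth binary mask can be computed as in the minimal NumPy sketch below; the 0.5 threshold is an assumption for converting soft network outputs to binary masks.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks given as boolean arrays."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

# Example with hypothetical network outputs in [0, 1]:
# iou = mask_iou(predicted_mask > 0.5, groundtruth_mask > 0.5)
```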

The following figures demonstrate the testing results from the different models. The proposed method not only retains the objects of the simulated input but also reduces the vertical noise, making the outputs more realistic and closer to the unpaired real-world dataset.

The following figures show the segmentation results from conditional GANs (pix2pix) trained on the datasets generated by the experimental models. The dataset produced by our proposed model leads to better predicted masks than those of the other models.

The following table shows the quantitative segmentation results. Our proposed method gives the best IoU score among all baselines.

| Model | IoU |
| --- | --- |
| SimGAN | 0.001 |
| CycleGAN | 0.702 |
| CycleGAN (reg) | 0.679 |
| Proposed | 0.734 |

Conclusions and Future Work

The experimental results demonstrate the potential capability of our proposed method to generate realistic sonar datasets from simulation.

The future plans are as follows:

  1. I will generate larger datasets from both the simulator and real-world scenarios. Currently, the datasets only contain about 100 sonar images. Larger datasets covering different scenarios should make the model more robust to sonar images from different underwater environments.
  2. I will try to leverage Stable Diffusion to better transform our simulated images to the real-world domain.
  3. Instead of data from a single type of sonar, I plan to collect imaging sonar datasets from different imaging sonars, for example the Blueprint M1200d imaging sonar, and even from other sonar types, such as side-scan sonar (SSS).
  4. More applications can be used as experimental metrics to test the performance of our proposed baseline, for example 3D NeRF reconstruction and learning-based feature matching of sonar images.