
  1. Department of Electrical and Computer Engineering, Inha University, Incheon, Korea (presto0408@gmail.com, bcsong@inha.ac.kr)



Keywords: Dataset generation, Semantic segmentation, Long-wave infrared, Diffusion models

1. Introduction

Semantic segmentation involves assigning semantic labels to each pixel of an image, making it an important task in computer vision. It is used for purposes such as scene comprehension and object recognition and is applied in real-world applications like autonomous driving [1,2]. However, one of the greatest challenges in advancing semantic segmentation is building a dataset. In the process of dataset construction, we may face privacy regulations starting from the data collection stage [3], and an even bigger issue is the high cost of manual annotation. Pixel-level annotation on collected images is an extremely labor-intensive and time-consuming task.

Fortunately, there is an alternative solution to this problem: constructing synthetic datasets using generative models. Synthetic dataset generation has the advantage of being free from privacy issues and allows data to be created that is tailored to specific needs. It is also highly valued for reducing the human, time, and financial costs associated with manual annotation. In previous research, efforts have been made to automate manual annotation using models based on generative adversarial networks (GANs) [4] or to generate synthetic images [5,6]. However, these methods have limitations in representing complex real-world scenes and lack diversity compared to variational autoencoder (VAE)-based models. VAE-based models, in turn, provide sufficient diversity but suffer from lower image quality [7]. With the recent emergence of diffusion models [8], however, the field of synthetic dataset generation has advanced dramatically. Diffusion-based models, such as large-scale language-image generation (LLIG) models like Stable Diffusion [9], demonstrate overwhelming performance by automating annotations on generated images without additional training [10,11].

Meanwhile, long-wave infrared (LWIR) imagery is captured outside the visible range of the electromagnetic spectrum and can detect objects even when visibility is obscured by fog or smoke. Exploiting this characteristic, we are developing an application to detect exits in indoor environments filled with smoke due to fire. However, currently available LWIR image datasets are mostly captured outdoors, making them unsuitable for training deep learning models that operate in indoor environments. Additionally, capturing indoor LWIR images requires specialized equipment and authorized locations for filming, which poses a significant hurdle for data collection. We attempted to generate LWIR images using LLIG models, which are currently gaining attention. However, popular models such as DALL-E 2 [12] and Stable Diffusion [9] were trained on RGB-based datasets like LAION-5B [13], making them inadequate for generating LWIR images.

In this paper, we propose a novel dataset generation framework called Noise-to-Dataset to address the severe data scarcity problem. The proposed Noise-to-Dataset framework consists of two stages: the first stage generates a semantic mask from noise, and the second stage synthesizes a synthetic image from the semantic mask. The semantic mask generation stage adopts unconditional generation, requiring only Gaussian noise as input. This allows for the unlimited generation of semantic masks as needed. Moreover, by generating semantic masks that do not exist in the original dataset, the subsequent synthetic image synthesis stage can produce data for new scenes. From the perspective of a segmentation model, this approach enables learning from unseen data, which helps prevent overfitting on small datasets and improves generalization performance.

The proposed method is significant in that it overcomes the limitations of conventional dataset construction approaches. Existing methods synthesize images based on semantic masks obtained from real datasets [6,16,21]. However, this approach is inherently limited to generating images that are structurally identical to real data, reducing their potential to enhance performance as training data. To address this limitation, Noise-to-Dataset introduces a novel approach that directly generates semantic masks, thereby mitigating data scarcity issues and reducing the substantial manual annotation effort required. Of course, the proposed method still requires a minimal amount of training data. For this purpose, our research team directly constructed a real LWIR dataset of approximately 2.6k images. We demonstrated the effectiveness of the synthetic dataset by showing that adding synthetic data to the real dataset improves the performance of semantic segmentation models. Furthermore, additional experiments on general RGB datasets such as ADE20K [14] and Cityscapes [15] verified the scalability of the proposed framework.

2. Method

This section describes Noise-to-Dataset, which generates a semantic segmentation dataset from Gaussian noise to address the issue of data scarcity (see Fig. 1). Noise-to-Dataset simultaneously generates additional semantic masks and synthetic images based on learning from the limited target dataset, reducing the high costs of manual annotation while creating diverse data that did not previously exist. The proposed method consists of two stages. First, unconditional generation is performed using denoising diffusion probabilistic models (DDPM) [8] to generate diverse semantic masks from Gaussian noise (cf. Subsection 2.1). Second, conditional generation using the generated semantic masks is performed by the semantic diffusion model (SDM) [16] to create synthetic images (cf. Subsection 2.2). The generated semantic masks and synthetic images are then used as a dataset for training any semantic segmentation model.
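For clarity, the sketch below outlines how the two stages could be orchestrated in practice. The three stage functions are passed in as callables and stand in for the trained DDPM sampler, the color-clustering step, and the trained SDM described in Subsections 2.1 and 2.2; the function and parameter names are ours, not part of the original implementation.

```python
# Minimal sketch of the Noise-to-Dataset pipeline under the assumptions above.
def generate_synthetic_dataset(sample_mask, cluster_colors, synthesize_image, n_pairs):
    """sample_mask(): unconditional DDPM mask generation from Gaussian noise.
    cluster_colors(mask): unify pixel values per class (Subsection 2.1).
    synthesize_image(mask): mask-conditioned SDM image synthesis (Subsection 2.2)."""
    pairs = []
    for _ in range(n_pairs):
        mask = cluster_colors(sample_mask())   # Stage 1: semantic mask generation
        image = synthesize_image(mask)         # Stage 2: synthetic image generation
        pairs.append((mask, image))            # one annotated training pair, no manual labeling
    return pairs
```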

Fig. 1. Overview of our Noise-to-Dataset for synthetic dataset generation.


2.1. Semantic Mask Generation

For semantic mask generation, we utilized DDPM. DDPM generates images through a forward process and a reverse process. The forward process involves adding Gaussian noise to the original image $x_0$ at each timestep $t$ based on a variance schedule ${\beta}_t$. The probability distribution at each timestep is defined as follows:

(1)
$ q(x_t\mid x_{t-1})=\mathcal{N}(x_t;~\sqrt{1-{\beta}_t}x_{t-1},\ {\beta}_tI). $

Here, $x_t$ is the image at $t$, $\mathcal{N}$ denotes the Gaussian distribution, and $I$ represents the identity matrix. Using the notation ${\alpha}_t := \prod^t_{s=1}(1-{\beta}_s)$, we can express the marginal distribution as follows:

(2)
$ q(x_t\mid x_0)=\mathcal{N}(x_t;~\sqrt{{\alpha}_t}x_0,\ (1-{\alpha}_t)I). $
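
As a concrete illustration, the following sketch samples $x_t$ from Eq. (2) in PyTorch, using the paper's notation ${\alpha}_t=\prod^t_{s=1}(1-{\beta}_s)$; the linear variance schedule and $T=1000$ are our own illustrative assumptions, not values reported in the paper.

```python
# Forward (noising) process of Eqs. (1)-(2): a minimal, hedged sketch.
import torch

T = 1000                                        # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)           # variance schedule beta_t (assumed linear)
alphas = torch.cumprod(1.0 - betas, dim=0)      # alpha_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_t) x_0, (1 - alpha_t) I)."""
    eps = torch.randn_like(x0)                  # standard Gaussian noise
    return alphas[t].sqrt() * x0 + (1.0 - alphas[t]).sqrt() * eps
```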

The reverse process is the opposite of the forward process, gradually restoring the original image from the noisy image over the timesteps. The probability distribution $p_{\theta}(x_{t-1}\mid x_t)$ at $t$, based on the mean vector ${\mu}_{\theta}(x_t,\ t)$ and the covariance matrix ${{\Sigma}}_{\theta}(x_t,\ t)$, is defined as follows:

(3)
$ p_{\theta}(x_{t-1}\mid x_t)=\mathcal{N}(x_{t-1};~{\mu}_{\theta}(x_t,\ t),\ {{\Sigma}}_{\theta}(x_t,\ t)) . $

DDPM's unconditional generation creates various semantic masks from random noise, enabling the generation of diverse objects and scenes without limitation. This allows the SDM to generate new scenes not present in the existing dataset. Generating a wide range of scenes is crucial, as it directly improves the performance of semantic segmentation models.
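
A minimal sketch of this unconditional sampling loop is shown below, assuming a trained noise-prediction network `eps_model` (DDPM uses a U-Net) and the standard fixed-variance choice ${\Sigma}_{\theta}={\beta}_t I$; these assumptions are ours, and details may differ from the actual implementation.

```python
# Reverse process of Eq. (3): iterative denoising from pure Gaussian noise to a mask.
import torch

@torch.no_grad()
def sample_mask(eps_model, shape, betas, alphas):
    x = torch.randn(shape)                                  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = eps_model(x, torch.full((shape[0],), t))      # predicted noise at step t
        coef = betas[t] / (1.0 - alphas[t]).sqrt()
        mean = (x - coef * eps) / (1.0 - betas[t]).sqrt()   # mu_theta(x_t, t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                  # sample x_{t-1}
    return x                                                # RGB-coded semantic mask
```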

For unconditional generation, semantic masks are extracted from the target dataset and used to train the DDPM. Since training with grayscale semantic masks limits the model's expressive range, the semantic mask training is conducted in the RGB domain. The use of RGB representation for semantic masks is important because a broader range of representation helps distinguish between classes more clearly during the subsequent color clustering process. This is also related to the pixel value inconsistency problem in generative models, as the semantic masks generated by the model may have non-uniform pixel values, unlike real semantic masks. Therefore, a color clustering process is necessary to unify pixel values for each class. Only after color clustering is completed can the semantic mask be ready for use as a condition in the SDM, and later be combined with synthetic images to form a synthetic dataset.
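
One simple way to realize this color-clustering step is to snap every generated pixel to the nearest color in the class palette, as sketched below; the specific palette colors and the nearest-neighbor assignment are our assumptions, since the paper does not detail the clustering algorithm.

```python
# Unify pixel values per class by snapping each pixel to its nearest palette color.
import torch

def snap_to_palette(mask_rgb, palette):
    """mask_rgb: (H, W, 3) generated RGB mask; palette: (C, 3) per-class colors.
    Returns (H, W) class indices and a cleaned RGB mask with uniform class colors."""
    flat = mask_rgb.reshape(-1, 3).float()
    dists = torch.cdist(flat, palette.float())       # distance to every class color
    labels = dists.argmin(dim=1)                      # nearest class per pixel
    clean = palette[labels].reshape(mask_rgb.shape)   # consistent color per class
    return labels.reshape(mask_rgb.shape[:2]), clean

# Example palette for the LWIR classes (assumed RGB values): background, door, floor.
palette = torch.tensor([[0, 0, 0], [173, 216, 230], [255, 255, 0]])
```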

2.2. Synthetic Image Generation

SDM generates a synthetic image $y_0$ with the semantic mask generated by DDPM as the condition $x$. To achieve this, the SDM is trained using both real images and semantic masks from the target dataset. SDM follows the forward and reverse processes based on the DDPM framework. Here, $y_t$ represents the image at $t$, and the forward process is the same as in DDPM. The reverse process $p_{\theta}(y_{0:T}\mid x)$ proceeds through a Markov chain and is defined as follows:

(4)
$ p_{\theta}(y_{0:T}\mid x)=p(y_T)\prod^T_{t=1}{p_{\theta}(y_{t-1}\mid y_t,x)}, $
(5)
$ p_{\theta}(y_{t-1}\mid y_t,x)=\mathcal{N}(y_{t-1};~{\mu}_{\theta}(y_t,x,t),\ {{\Sigma}}_{\theta}(y_t,x,t)). $

Here, ${\mu}_{\theta}(y_t,x,t)$ and ${{\Sigma}}_{\theta}(y_t,x,t)$ represent the mean vector and covariance matrix at timestep $t$ conditioned on $x$, which are learned through the SDM, similar to DDPM. Note that SDM is a U-Net-based network that predicts noise from a noisy input image. The noisy input image $y_t$ is processed by the encoder, while the semantic mask $x$ is processed by the decoder, maximizing the characteristics of each input. Multi-layer spatially-adaptive normalization operators are employed to further improve image generation quality. Additionally, classifier-free guidance [17] is used during the sampling process to ensure high consistency with the semantic label map.
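
The sketch below shows how classifier-free guidance combines the conditional and unconditional noise predictions at a single sampling step; the guidance scale value and the use of a zeroed-out mask as the null condition are our assumptions for illustration.

```python
# Classifier-free guidance [17] at one SDM sampling step (hedged sketch).
import torch

def guided_eps(eps_model, y_t, mask, t, s=1.5):
    """eps_model(y, x, t): noise predictor conditioned on semantic mask x; s: guidance scale."""
    eps_cond = eps_model(y_t, mask, t)                      # prediction with the semantic mask
    eps_uncond = eps_model(y_t, torch.zeros_like(mask), t)  # prediction with a null condition
    return eps_uncond + s * (eps_cond - eps_uncond)         # amplified mask consistency
```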

Therefore, the synthetic images generated by SDM achieve high consistency with the semantic masks generated by DDPM. This eliminates the need for manual annotation and allows the creation of high-quality images that do not exist in the target dataset. As a result, it can contribute to improving semantic segmentation performance.

3. Experiments

3.1. Datasets

We constructed an LWIR dataset for the experiment by collecting indoor video data at a university and a karaoke room with permission, using LWIR imaging equipment. The videos were captured at a resolution of 960x600 in grayscale. Noisy videos that made it difficult to identify the indoor structure or objects were excluded, and manual annotation was performed on the remaining videos, resulting in a dataset of 2,650 images. Since our goal is to detect exits in smoke-filled indoor environments due to fires, we annotated only three categories: door, floor, and background. In the semantic mask, the door is represented in light blue and the floor in yellow. Of the 2,650 images, 2,400 were used for training and 250 for validation.

The LWIR dataset, with only three classes, is relatively easy to generate. To further validate our approach, we tested on RGB domain datasets with more classes, using Cityscapes [15] and ADE20K [14]. Like the LWIR dataset, we assumed limited data availability, using only fine annotations from Cityscapes, which includes 2,975 training images, 500 validation images, and 19 classes. The ADE20K-Street dataset, a subset focused on street scenes, has 2,038 training images, 203 validation images, and 64 classes.

3.2. Qualitative Results

Fig. 2 presents the qualitative results of Noise-to-Dataset. Looking at Fig. 2(a), as seen in the first and second rows, when only one class is included in the generated mask, synthetic images that effectively represent the texture and color characteristics of LWIR images are well generated. Even in the third row, which shows the results of a more complex generated mask, it can be observed that not only the door and floor but also the background areas are well generated. However, there are also failure cases, which mostly stem from incorrectly generated masks. Inappropriate shapes, positions, and sizes of the door/floor classes in the generated mask lead the SDM to produce unrealistic synthetic images. Moreover, in the case of the LWIR dataset, since there are only three classes, the number of possible masks that can be generated is limited, leading to a lack of diversity in the generated mask patterns.

In the case of Cityscapes and ADE20K-Street, since DDPM learns 19 and 64 classes respectively, the generated masks have far more variations compared to the LWIR dataset. Note that as the number of classes increases, the likelihood of generating masks not present in the original dataset also increases. Upon analysis, we confirmed that most generated masks, such as those in Figs. 2(b) and 2(c), do not exist in the original dataset, resulting in the creation of synthetic images from a wide range of scenes.

Fig. 2. Qualitative results of Noise-to-Dataset: (a) Synthetic LWIR dataset. (b) Synthetic Cityscapes dataset. (c) Synthetic ADE20K-Street dataset.


3.3. Quantitative Results

To quantitatively evaluate the effectiveness of the synthetic dataset, we adopt mean intersection-over-union (mIoU), a metric commonly used for semantic segmentation performance evaluation. For this experiment, we use the SegFormer [18], Segmenter [19], and SegNeXt [20] models. First, we observe how performance changes with the size of the synthetic dataset added to the training set (see Table 1). For the SegFormer model with a MiT-B4 backbone (SegFormer-B4), training with only the 2,400 real images achieves 78.14% mIoU, while training with a total of 5,400 images, including an additional 3,000 synthetic images, achieves 79.63%, a gain of +1.49%. Meanwhile, SegFormer-B2 shows a relatively smaller gain of +0.90%, indicating that the larger the backbone, the greater the improvement. Furthermore, in experiments using the Segmenter model with a ViT-S backbone (Segmenter-S), a maximum gain of +2.05% was obtained.
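
For reference, the snippet below shows how mIoU is typically computed from a confusion matrix accumulated over the validation set; this mirrors the standard definition (as used by MMSegmentation), and the example confusion matrix is purely illustrative.

```python
# mIoU: per-class intersection-over-union averaged over classes (hedged sketch).
import numpy as np

def mean_iou(conf):
    """conf: (C, C) confusion matrix where conf[i, j] counts pixels of class i predicted as j."""
    inter = np.diag(conf).astype(float)                    # per-class intersection
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter    # per-class union
    return np.mean(inter / np.maximum(union, 1))           # average IoU over classes

# Illustrative 3-class (background, door, floor) confusion matrix.
conf = np.array([[900, 30, 20], [25, 450, 10], [15, 5, 545]])
print(f"mIoU = {mean_iou(conf):.4f}")                      # ~0.897 for this toy example
```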

Table 1. The effectiveness of adding synthetic datasets to the LWIR dataset (mIoU, %).

| Model          | Backbone | Real only | +1,000 Synthetic | +2,000 Synthetic | +3,000 Synthetic | Maximum gain |
|----------------|----------|-----------|------------------|------------------|------------------|--------------|
| SegFormer [18] | MiT-B2   | 76.74     | 77.51            | 77.62            | 77.64            | +0.90        |
| SegFormer [18] | MiT-B4   | 78.14     | 78.91            | 79.34            | 79.63            | +1.49        |
| Segmenter [19] | ViT-T    | 76.98     | 77.66            | 77.85            | 77.92            | +0.94        |
| Segmenter [19] | ViT-S    | 78.68     | 79.23            | 80.14            | 80.73            | +2.05        |
| SegNeXt [20]   | MSCAN-T  | 80.03     | 80.59            | 80.96            | 81.05            | +1.02        |
| SegNeXt [20]   | MSCAN-S  | 81.18     | 81.91            | 82.42            | 82.81            | +1.63        |

On the other hand, for SegFormer-B2, comparing the addition of 2,000 synthetic images with 3,000 synthetic images shows a negligible mIoU improvement, i.e., 77.62% vs. 77.64% (+0.02%). A similar phenomenon was observed with the Segmenter-T model. While the size of the backbone may have some influence, this is mainly attributed to the characteristics of our LWIR dataset, which contains only three classes. Indeed, the qualitative results show that, due to the limited variety of classes, the diversity of the generated masks is also limited. As the amount of synthetic data increases, scene patterns saturate and no longer contribute significantly to performance improvement.

Therefore, we conducted experiments on RGB datasets with a larger number of classes. Tables 2 and 3 verify the effectiveness of the synthetic dataset on the Cityscapes dataset and the ADE20K-Street dataset, respectively. Comparing the performance before and after adding the synthetic dataset, we observed a maximum gain of +0.93% with the Segmenter-S model on Cityscapes, and a maximum gain of +1.09% with the Segmenter-S model on ADE20K-Street. This demonstrates that Noise-to-Dataset is also effective on standard RGB datasets.

Table 2. The effectiveness of the synthetic dataset (3,000 pairs) on the Cityscapes dataset (mIoU, %).

| Model     | Backbone | Real only | Real + Synthetic |
|-----------|----------|-----------|------------------|
| Segmenter | ViT-T    | 59.13     | 59.80            |
| Segmenter | ViT-S    | 62.08     | 63.01            |

Table 3. The effectiveness of the synthetic dataset (2,000 pairs) on the ADE20K-Street dataset (mIoU, %).

| Model     | Backbone | Real only | Real + Synthetic |
|-----------|----------|-----------|------------------|
| Segmenter | ViT-T    | 20.61     | 21.11            |
| Segmenter | ViT-S    | 24.03     | 25.12            |

3.4. Ablation Study

Table 4 compares the effectiveness of using real masks from an existing dataset versus generated masks when synthesizing images. For this experiment, the Segmenter-S model was used. Additionally, 2,000 synthetic images generated from real masks and 2,000 synthetic images generated using the proposed mask generation method were used for training, ensuring a fair comparison by maintaining the same data quantity. The experimental results show that training with synthetic images generated from real masks achieved a performance of 79.13%, whereas training with synthetic images generated using the proposed mask generation method achieved 80.14%, resulting in a performance improvement of +1.01%. These results highlight that semantic mask generation not only serves to expand the dataset but also enhances the segmentation model's performance by introducing new variations that do not exist in the original dataset. Furthermore, it demonstrates that the generated masks contain structurally meaningful information beneficial to the segmentation model.

Table 4. Ablation study on the effectiveness of semantic mask generation for the LWIR dataset (mIoU, %).

| Model       | Mask Source | Training Data    | mIoU  |
|-------------|-------------|------------------|-------|
| Segmenter-S | -           | Real only        | 78.68 |
| Segmenter-S | Real        | Real + Synthetic | 79.13 |
| Segmenter-S | Generated   | Real + Synthetic | 80.14 |

3.5. Experimental Details

We evaluated the performance of semantic segmentation using three models from the MMSegmentation framework: SegFormer, Segmenter, and SegNeXt. The experiments were conducted in an environment with Python 3.8 and PyTorch 1.11.0, utilizing CUDA Toolkit 11.3 for GPU acceleration. The implementation was based on MMSegmentation, with dependencies including MMEngine 0.7.3 and MMCV 2.0.1. For computational resources, we used an NVIDIA RTX A6000 GPU and an AMD EPYC 7413 CPU. This setup provided a stable environment for training and evaluating the models while ensuring compatibility with the MMSegmentation framework.

4. Conclusion

In this paper, we propose a novel dataset generation algorithm, Noise-to-Dataset, to address the challenge of limited training data in semantic segmentation tasks for indoor LWIR images. Through experiments, we have verified that utilizing a synthetic dataset improves segmentation performance and demonstrated that the proposed method can be extended to the RGB domain. However, our approach has certain limitations. In datasets with a small number of classes, such as LWIR datasets, the diversity of generated masks tends to saturate as the number of synthetic samples increases. This issue could potentially be mitigated by incorporating various data augmentation techniques during the training process of the semantic mask generation model. Additionally, controlling mask generation to include only specific classes may further enhance diversity. Addressing these challenges will be our primary focus in future work.

ACKNOWLEDGMENTS

This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. IITP-2024-RS-2024-00360227, Leading Generative AI Human Resources Development; RS-2022-00155915, Artificial Intelligence Convergence Innovation Human Resources Development (Inha University); and No. RS-2021-II212052, ITRC), and partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2022R1A2C2010095).

REFERENCES

[1] J. Gao, N. Liu, H. Li, Z. Li, C. Xie, and Y. Gou, ``Reinforcement learning decision-making for autonomous vehicles based on semantic segmentation,'' Applied Sciences, vol. 15, no. 3, 1323, Jan. 2025.
[2] J. Tsai, Y.-T. Chang, Z. Y. Chen, and Z. You, ``Autonomous driving control for passing unsignalized intersections using the semantic segmentation technique,'' Electronics, vol. 13, no. 3, 484, Jan. 2024.
[3] A. Golda, K. Mekonen, A. Pandey, A. Singh, V. Hassija, V. Chamola, and B. Sikdar, ``Privacy and security concerns in generative AI: A comprehensive survey,'' IEEE Access, vol. 12, pp. 48126-48144, 2024.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ``Generative adversarial networks,'' Communications of the ACM, vol. 63, no. 11, pp. 139-144, 2020.
[5] Y. Zhang, H. Ling, J. Gao, K. Yin, J.-F. Lafleche, A. Barriuso, A. Torralba, and S. Fidler, ``DatasetGAN: Efficient labeled data factory with minimal human effort,'' Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10145-10155, 2021.
[6] V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele, and A. Khoreva, ``You only need adversarial supervision for semantic image synthesis,'' arXiv preprint arXiv:2012.04781, 2020.
[7] K. Pandey, A. Mukherjee, P. Rai, and A. Kumar, ``VAEs meet diffusion models: Efficient and high-fidelity generation,'' Proc. of NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[8] J. Ho, A. Jain, and P. Abbeel, ``Denoising diffusion probabilistic models,'' Advances in Neural Information Processing Systems, vol. 33, pp. 6840-6851, 2020.
[9] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, ``High-resolution image synthesis with latent diffusion models,'' Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10674-10685, 2022.
[10] Q. Nguyen, T. Vu, A. Tran, and K. Nguyen, ``Dataset Diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,'' Advances in Neural Information Processing Systems, vol. 36, 2024.
[11] W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen, ``DiffuMask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,'' Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 1206-1217, 2023.
[12] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, ``Hierarchical text-conditional image generation with CLIP latents,'' arXiv preprint arXiv:2204.06125, 2022.
[13] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, ``LAION-5B: An open large-scale dataset for training next generation image-text models,'' Advances in Neural Information Processing Systems, vol. 35, pp. 25278-25294, 2022.
[14] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, ``Scene parsing through ADE20K dataset,'' Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5122-5130, 2017.
[15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, and R. Benenson, ``The Cityscapes dataset for semantic urban scene understanding,'' Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213-3223, 2016.
[16] W. Wang, J. Bao, W. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, ``Semantic image synthesis via diffusion models,'' arXiv preprint arXiv:2207.00050, 2022.
[17] J. Ho and T. Salimans, ``Classifier-free diffusion guidance,'' arXiv preprint arXiv:2207.12598, 2022.
[18] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, ``SegFormer: Simple and efficient design for semantic segmentation with transformers,'' Advances in Neural Information Processing Systems, vol. 34, pp. 12077-12090, 2021.
[19] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, ``Segmenter: Transformer for semantic segmentation,'' Proc. of the IEEE/CVF International Conference on Computer Vision, pp. 7242-7252, 2021.
[20] M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, and S.-M. Hu, ``SegNeXt: Rethinking convolutional attention design for semantic segmentation,'' Advances in Neural Information Processing Systems, vol. 35, pp. 1140-1156, 2022.
[21] Z. Lv, Y. Wei, W. Zuo, and K.-Y. Wong, ``PLACE: Adaptive layout-semantic fusion for semantic image synthesis,'' Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9264-9274, 2024.

Author

Jin Young Choi

Jin Young Choi received his B.S. degree in electronic engineering from Myongji University, Yongin, South Korea, in 2020, and his M.S. degree in electrical and computer engineering from Inha University, Incheon, South Korea, in 2025. His research interests include computer vision and deep learning.

Byung Cheol Song

Byung Cheol Song received his B.S., M.S., and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 1994, 1996, and 2001, respectively. From 2001 to 2008, he was a Senior Engineer with the Digital Media R&D Center, Samsung Electronics Company Ltd., Suwon, South Korea. In 2008, he joined the Department of Electronic Engineering, Inha University, Incheon, South Korea, and is currently a professor. His research interests include the general areas of image processing and computer vision.