Jin Young Choi1
Byung Cheol Song1
1Department of Electrical and Computer Engineering, Inha University, Incheon, Korea
(presto0408@gmail.com, bcsong@inha.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Dataset generation, Semantic segmentation, Long-wave infrared, Diffusion models
1. Introduction
Semantic segmentation involves assigning semantic labels to each pixel of an image,
making it an important task in computer vision. It is used for purposes such as scene
comprehension and object recognition and is applied in real-world applications like
autonomous driving [1,2]. However, one of the greatest challenges in advancing semantic segmentation is building
a dataset. In the process of dataset construction, we may face privacy regulations
starting from the data collection stage [3], and an even bigger issue is the high cost of manual annotation. Pixel-level annotation
on collected images is an extremely labor-intensive and time-consuming task.
Fortunately, there is an alternative solution to this problem: constructing synthetic
datasets using generative models. Synthetic dataset generation has the advantage of
being free from privacy issues and allows data to be created that is tailored to specific
needs. It is also highly valued for reducing the human-resource, time, and financial
costs associated with manual annotation. In previous research, efforts have been made
to automate manual annotation using models based on generative adversarial networks
(GANs) [4] or to generate synthetic images [5,6]. However, these methods have limitations in representing complex real-world scenes
and lack diversity compared with variational autoencoder (VAE)-based models, while
VAE-based models provide sufficient diversity but suffer from lower image quality [7]. With the recent emergence of diffusion models [8], the field of synthetic dataset generation has advanced dramatically. Diffusion-based
models, including large-scale language-image generation (LLIG) models such as Stable Diffusion
[9], demonstrate overwhelming performance by automating annotation of generated images
without additional training [10,11].
Meanwhile, long-wave infrared (LWIR) imagery is captured outside the visible range
of the electromagnetic spectrum, and it has the advantage of being able to detect
objects even in situations where visibility is obscured by fog or smoke. Utilizing
this characteristic, we are developing an application to detect exits in indoor environments
filled with smoke due to fire. However, the currently available LWIR image datasets
are mostly captured outdoors, making them unsuitable for training deep learning models
that operate in indoor environments. Additionally, capturing indoor LWIR images requires
specialized equipment and authorized locations for shooting, which poses a significant
hurdle for data collection. We attempted to generate LWIR images using LLIG models,
which are currently gaining attention. However,
popular models like DALL-E 2 [12] and Stable Diffusion [9] were trained on RGB datasets like LAION-5B [13], making them inadequate for generating LWIR images.
In this paper, we propose a novel dataset generation framework called Noise-to-Dataset
to address the severe data scarcity problem. The proposed Noise-to-Dataset framework
consists of two stages: the first stage generates a semantic mask from noise, and
the second stage synthesizes a synthetic image from the semantic mask. The semantic
mask generation stage adopts unconditional generation, requiring only Gaussian noise
as input. This allows semantic masks to be generated without limit, as needed. Moreover,
by generating semantic masks that do not exist in the original dataset, the subsequent
image synthesis stage can produce data for new scenes. From the perspective
of a segmentation model, this approach enables learning from unseen data, which helps
prevent overfitting on small datasets and improves generalization performance.
The proposed method is significant in that it overcomes the limitations of conventional
dataset construction approaches. Existing methods synthesize images based on semantic
masks obtained from real datasets [6,16,21]. However, this approach is inherently limited to generating images that are structurally
identical to real data, reducing their potential to enhance performance as training
data. To address this limitation, Noise-to-Dataset introduces a novel approach that
directly generates semantic masks, thereby mitigating data scarcity issues and reducing
the substantial manual annotation effort required. Of course, the proposed method
still requires a minimal amount of training data. To this end, our research team directly
constructed a real LWIR dataset of approximately 2.6k images. We demonstrated the
effectiveness of the synthetic dataset by showing that adding synthetic data to the
real dataset improves the performance of semantic segmentation models. Furthermore,
additional experiments on general RGB datasets such as ADE20K [14] and Cityscapes [15] verified the scalability of the proposed framework.
2. Method
This section describes Noise-to-Dataset, which generates a semantic segmentation dataset
from Gaussian noise to address the issue of data scarcity (see Fig. 1). Noise-to-Dataset simultaneously generates additional semantic masks and synthetic
images based on learning from the limited target dataset, reducing the high costs
of manual annotation while creating diverse data that did not previously exist. The
proposed method consists of two stages. First, unconditional generation is performed
using denoising diffusion probabilistic models (DDPM) [8] to generate diverse semantic masks from Gaussian noise (cf. Subsection 2.1). Second,
conditional generation using the generated semantic masks is performed by the semantic
diffusion model (SDM) [16] to create synthetic images (cf. Subsection 2.2). The generated semantic masks and
synthetic images are then used as a dataset for training any semantic segmentation
model.
Fig. 1. Overview of our Noise-to-Dataset for synthetic dataset generation.
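For clarity, the overall flow can be summarized by the short sketch below. This is only a conceptual illustration: the callables standing in for the trained DDPM, the color-clustering step, and the SDM are placeholders, not the actual implementation.

```python
import torch

def build_synthetic_dataset(sample_mask, snap_to_palette, synthesize_image,
                            num_pairs, noise_shape=(1, 3, 256, 256)):
    """Sketch of the two-stage Noise-to-Dataset loop.

    sample_mask:      callable running DDPM's reverse process from Gaussian noise (Sec. 2.1).
    snap_to_palette:  callable unifying generated mask colors per class (Sec. 2.1).
    synthesize_image: callable running SDM conditioned on a semantic mask (Sec. 2.2).
    """
    pairs = []
    for _ in range(num_pairs):
        noise = torch.randn(noise_shape)            # the only required input: Gaussian noise
        mask = snap_to_palette(sample_mask(noise))  # Stage 1: unconditional mask generation
        image = synthesize_image(mask)              # Stage 2: mask-conditioned image synthesis
        pairs.append((image, mask))                 # labeled pair, no manual annotation needed
    return pairs
```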
2.1. Semantic Mask Generation
For semantic mask generation, we utilized DDPM. DDPM generates images through a forward
process and a reverse process. The forward process involves adding Gaussian noise
to the original image $x_0$ at each timestep $t$ based on a variance schedule ${\beta}_t$.
The probability distribution at each timestep is defined as follows:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right).$$
Here, $x_t$ is the image at timestep $t$, $\mathcal{N}$ denotes the Gaussian distribution,
and $I$ represents the identity matrix. Using the notation $\bar{\alpha}_t := \prod_{s=1}^{t}(1-\beta_s)$,
we can express the marginal distribution as follows:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)I\right).$$
The reverse process is the opposite of the forward process, gradually restoring the
original image from the noise-added image at each timestep. The probability
distribution $p_{\theta}(x_{t-1}\mid x_t)$ at timestep $t$, parameterized by the mean vector $\mu_{\theta}(x_t, t)$ and the covariance matrix $\Sigma_{\theta}(x_t, t)$, is defined as follows:

$$p_{\theta}(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_{\theta}(x_t, t),\ \Sigma_{\theta}(x_t, t)\right).$$
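As a concrete illustration, the two processes can be written compactly in PyTorch. The sketch below is a minimal, generic DDPM step under the usual noise-prediction parameterization; the network `model` and its call signature are assumptions, not the implementation used in this work.

```python
import torch

def q_sample(x0, t, alpha_bar):
    """Forward process: sample x_t ~ q(x_t | x_0) in closed form."""
    a = alpha_bar[t]                                       # \bar{alpha}_t for this timestep
    noise = torch.randn_like(x0)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise, noise

@torch.no_grad()
def p_sample_step(model, x_t, t, beta, alpha_bar):
    """One reverse step x_t -> x_{t-1} using predicted noise (variance fixed to beta_t)."""
    eps = model(x_t, t)                                    # network predicts the added noise
    mean = (x_t - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / (1.0 - beta[t]).sqrt()
    if t == 0:
        return mean
    return mean + beta[t].sqrt() * torch.randn_like(x_t)
```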
DDPM's unconditional generation creates various semantic masks from random noise,
enabling the generation of diverse objects and scenes without limitation. This allows
the SDM to generate new scenes not present in the existing dataset. Generating a wide
range of scenes is crucial, as it directly improves the performance of semantic segmentation
models.
For unconditional generation, semantic masks are extracted from the target dataset
and used to train the DDPM. Since training with grayscale semantic masks limits the
model's expressive range, the semantic mask training is conducted in the RGB domain.
The use of RGB representation for semantic masks is important because a broader range
of representation helps distinguish between classes more clearly during the subsequent
color clustering process. This is also related to the pixel value inconsistency problem
in generative models, as the semantic masks generated by the model may have non-uniform
pixel values, unlike real semantic masks. Therefore, a color clustering process is
necessary to unify pixel values for each class. Only after color clustering is completed
can the semantic mask be ready for use as a condition in the SDM, and later be combined
with synthetic images to form a synthetic dataset.
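One simple way to realize this color clustering is to snap every generated pixel to the nearest canonical class color. The sketch below is an illustrative nearest-color assignment; the function name and the palette contents (e.g., the light-blue door, yellow floor, and background colors of our LWIR masks) are assumptions about the concrete setup.

```python
import numpy as np

def snap_to_palette(mask_rgb: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Assign each generated pixel to its nearest class color, unifying per-class values.

    mask_rgb: (H, W, 3) RGB mask produced by the DDPM (possibly with non-uniform values).
    palette:  (K, 3) canonical RGB color per class.
    """
    pixels = mask_rgb.reshape(-1, 3).astype(np.float32)
    dists = np.linalg.norm(pixels[:, None, :] - palette[None, :, :].astype(np.float32), axis=2)
    nearest = dists.argmin(axis=1)                         # closest class color for each pixel
    return palette[nearest].reshape(mask_rgb.shape).astype(np.uint8)
```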
2.2. Synthetic Image Generation
SDM generates a synthetic image $y_0$ with the semantic mask generated by DDPM as
the condition $x$. To achieve this, the SDM is trained using both real images and
semantic masks from the target dataset. SDM follows the forward and reverse processes
based on the DDPM framework. Here, $y_t$ represents the image at $t$, and the forward
process is the same as in DDPM. The reverse process $p_{\theta}(y_{0:T}\mid x)$ proceeds
through a Markov chain and is defined as follows:

$$p_{\theta}(y_{0:T} \mid x) = p(y_T)\prod_{t=1}^{T} p_{\theta}(y_{t-1} \mid y_t, x), \qquad p_{\theta}(y_{t-1} \mid y_t, x) = \mathcal{N}\left(y_{t-1};\ \mu_{\theta}(y_t, x, t),\ \Sigma_{\theta}(y_t, x, t)\right).$$
Here, ${\mu}_{\theta}(y_t,x,t)$ and ${{\Sigma}}_{\theta}(y_t,x,t)$ represent the mean
vector and covariance matrix at timestep $t$ conditioned on $x$, which are learned
through the SDM, similar to DDPM. Note that SDM is a U-Net-based network that predicts
noise from a noisy input image. The noisy input image $y_t$ is processed by the encoder,
while the semantic mask $x$ is injected through the decoder, so that the characteristics
of each input are fully exploited. Multi-layer spatially-adaptive normalization operators are employed
to further improve image generation quality. Additionally, classifier-free guidance
[17] is used during the sampling process to ensure high consistency with the semantic
label map.
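The guidance step can be summarized as below. This is a generic sketch of classifier-free guidance [17] applied to a mask-conditioned noise predictor; the predictor signature, the null condition, and the guidance scale `s` are illustrative assumptions rather than SDM's exact code.

```python
import torch

@torch.no_grad()
def guided_noise(model, y_t, t, mask, null_mask, s=1.5):
    """Classifier-free guidance: amplify the mask-conditioned direction of the prediction."""
    eps_cond = model(y_t, t, mask)                   # prediction conditioned on the semantic mask
    eps_uncond = model(y_t, t, null_mask)            # prediction with a "null" (e.g., all-zero) condition
    return eps_uncond + s * (eps_cond - eps_uncond)  # s > 1 strengthens mask consistency
```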
Therefore, the synthetic images generated by SDM are highly consistent with
the semantic masks generated by DDPM. This eliminates the need for manual annotation
and allows the creation of high-quality images that do not exist in the target dataset.
As a result, this can contribute to improving the performance of semantic segmentation.
3. Experiments
3.1. Datasets
We constructed an LWIR dataset for the experiment by collecting indoor video data
at a university and a karaoke room with permission, using LWIR imaging equipment.
The videos were captured at a resolution of 960×600 in grayscale. Noisy videos that
made it difficult to identify the indoor structure or objects were excluded, and manual
annotation was performed on the remaining videos, resulting in a dataset of 2,650
images. Since our goal is to detect exits in smoke-filled indoor environments due
to fires, we annotated only three categories: door, floor, and background. In the
semantic mask, the door is represented in light blue and the floor in yellow. Of the
2,650 images, 2,400 were used for training and 250 for validation.
The LWIR dataset, with only three classes, is relatively easy to generate. To further
validate our approach, we tested on RGB-domain datasets with more classes, namely Cityscapes
[15] and ADE20K [14]. As with the LWIR dataset, we assumed limited data availability, using only fine annotations
from Cityscapes, which includes 2,975 training images, 500 validation images, and
19 classes. The ADE20K-Street dataset, a subset focused on street scenes, has 2,038
training images, 203 validation images, and 64 classes.
3.2. Qualitative Results
Fig. 2 presents the qualitative results of Noise-to-Dataset. In Fig. 2(a), the first and second rows show that when only one class is included in the generated
mask, synthetic images that effectively represent the texture and color characteristics
of LWIR images are produced. In the third row, which shows the result for a more complex
generated mask, not only the door and floor but also the background areas are well
generated. However, there are also failure cases, most of which stem from incorrectly
generated masks: inappropriate shapes, positions, and sizes of the door/floor classes
in the generated mask lead the SDM to produce unrealistic synthetic images. Moreover,
since the LWIR dataset contains only three classes, the number of possible masks that
can be generated is limited, which reduces the diversity of the generated mask patterns.
In the case of Cityscapes and ADE20K-Street, since DDPM learns 19 and 64 classes respectively,
the generated masks have far more variations compared to the LWIR dataset. Note that
as the number of classes increases, the likelihood of generating masks not present
in the original dataset also increases. Upon analysis, we confirmed that most generated
masks, such as those in Figs. 2(b) and 2(c), do not exist in the original dataset, resulting in the creation of synthetic images
from a wide range of scenes.
Fig. 2. Qualitative results of Noise-to-Dataset: (a) Synthetic LWIR dataset. (b) Synthetic
Cityscapes dataset. (c) Synthetic ADE20K-Street dataset.
3.3. Quantitative Results
To quantitatively evaluate the effectiveness of the synthetic dataset, we adopt mean
intersection-over-union (mIoU), a metric commonly used for semantic segmentation performance
evaluation. For this experiment, we use the SegFormer [18], Segmenter [19], and SegNeXt [20] models. First, we observe how performance changes with the amount of synthetic
data added to the training dataset (see Table 1). For the SegFormer model with a MiT-B4 backbone (SegFormer-B4), training
with only the 2,400 real images achieves 78.14% mIoU, while training with a
total of 5,400 images, including an additional 3,000 synthetic images, reaches
79.63%, a gain of +1.49%. Meanwhile, SegFormer-B2 shows a relatively
smaller gain of +0.90%, indicating that the larger the backbone, the greater the
improvement. Furthermore, in experiments using the Segmenter model with a ViT-S backbone
(Segmenter-S), a maximum gain of +2.05% was obtained.
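For reference, mIoU can be computed from per-class intersections and unions as in the sketch below. This is a simplified, generic implementation (MMSegmentation accumulates statistics over the whole validation set rather than per image), shown only to make the reported numbers concrete.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU: average over classes of |pred ∩ gt| / |pred ∪ gt|, skipping absent classes."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                        # ignore classes absent from both prediction and GT
            ious.append(inter / union)
    return float(np.mean(ious)) * 100.0      # reported as a percentage, as in Tables 1-4
```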
Table 1. The effectiveness of adding synthetic datasets to the LWIR dataset (mIoU, %).

| Model | Backbone | Real only | +1,000 synthetic | +2,000 synthetic | +3,000 synthetic | Maximum gain |
|---|---|---|---|---|---|---|
| SegFormer [18] | MiT-B2 | 76.74 | 77.51 | 77.62 | 77.64 | 0.90 |
| SegFormer [18] | MiT-B4 | 78.14 | 78.91 | 79.34 | 79.63 | 1.49 |
| Segmenter [19] | ViT-T | 76.98 | 77.66 | 77.85 | 77.92 | 0.94 |
| Segmenter [19] | ViT-S | 78.68 | 79.23 | 80.14 | 80.73 | 2.05 |
| SegNeXt [20] | MSCAN-T | 80.03 | 80.59 | 80.96 | 81.05 | 1.02 |
| SegNeXt [20] | MSCAN-S | 81.18 | 81.91 | 82.42 | 82.81 | 1.63 |
On the other hand, in the case of SegFormer-B2, when comparing the addition of 2,000
synthetic data with 3,000 synthetic data, we observe negligible mIoU improvement,
i.e., 77.62% vs. 77.64% (+0.02%). A similar phenomenon was observed with the Segmenter-T
model. While the size of the backbone may have some influence, this is mainly attributed
to the characteristics of our LWIR dataset, which consists of only three classes.
In fact, qualitative results show that due to the limited variety of classes, the
diversity of the generated masks is also insufficient. As the amount of synthetic
data increases, scene patterns saturate, which limits further performance improvement.
Therefore, we conducted experiments on RGB datasets with a larger number of
classes. Tables 2 and 3 verify the effectiveness of the synthetic dataset on the Cityscapes dataset and the
ADE20K-Street dataset, respectively. Comparing the performance before and after including
the synthetic dataset, we observed a maximum gain of +0.93% with the Segmenter-S model
on the Cityscapes dataset, and a maximum gain of +1.09% with the Segmenter-S model
on the ADE20K-Street dataset. This demonstrates that Noise-to-Dataset is also effective
on standard RGB datasets.
Table 2. The effectiveness of the synthetic dataset (3,000 pairs) on the Cityscapes dataset.

| Model | Backbone | Real data | Synthetic data | mIoU (%) |
|---|---|---|---|---|
| Segmenter | ViT-T | ✔ | | 59.13 |
| Segmenter | ViT-T | ✔ | ✔ | 59.80 |
| Segmenter | ViT-S | ✔ | | 62.08 |
| Segmenter | ViT-S | ✔ | ✔ | 63.01 |
Table 3. The effectiveness of the synthetic dataset (2,000 pairs) on the ADE20K-Street dataset.

| Model | Backbone | Real data | Synthetic data | mIoU (%) |
|---|---|---|---|---|
| Segmenter | ViT-T | ✔ | | 20.61 |
| Segmenter | ViT-T | ✔ | ✔ | 21.11 |
| Segmenter | ViT-S | ✔ | | 24.03 |
| Segmenter | ViT-S | ✔ | ✔ | 25.12 |
3.4. Ablation Study
Table 4 compares the effectiveness of using real masks from an existing dataset versus generated
masks when synthesizing images. For this experiment, the Segmenter-S model was used.
Additionally, 2,000 synthetic images generated from real masks and 2,000 synthetic
images generated using the proposed mask generation method were used for training,
ensuring a fair comparison by maintaining the same data quantity. The experimental
results show that training with synthetic images generated from real masks achieved
a performance of 79.13%, whereas training with synthetic images generated using the
proposed mask generation method achieved 80.14%, resulting in a performance improvement
of +1.01%. These results highlight that semantic mask generation not only serves to
expand the dataset but also enhances the segmentation model's performance by introducing
new variations that do not exist in the original dataset. Furthermore, it demonstrates
that the generated masks contain structurally meaningful information beneficial to
the segmentation model.
Table 4. Ablation study on the effectiveness of semantic mask generation for the LWIR dataset.

| Model | Mask source | Real data | Synthetic data | mIoU (%) |
|---|---|---|---|---|
| Segmenter-S | - | ✔ | | 78.68 |
| Segmenter-S | Real | ✔ | ✔ | 79.13 |
| Segmenter-S | Generated | ✔ | ✔ | 80.14 |
3.5. Experimental Details
We evaluated the performance of semantic segmentation using three models from the
MMSegmentation framework: SegFormer, Segmenter, and SegNeXt. The experiments were
conducted in an environment with Python 3.8 and PyTorch 1.11.0, utilizing CUDA Toolkit
11.3 for GPU acceleration. The implementation was based on MMSegmentation, with dependencies
including MMEngine 0.7.3 and MMCV 2.0.1. For computational resources, we used an NVIDIA
RTX A6000 GPU and an AMD EPYC 7413 CPU. This setup provided a stable environment for
training and evaluating the models while ensuring compatibility with the MMSegmentation
framework.
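As an illustration of how the combined training splits in Tables 1-3 could be assembled, the sketch below copies real and synthetic (image, mask) pairs into one folder before training. The directory layout and file naming are assumptions for illustration, not the exact setup used in our experiments.

```python
from pathlib import Path
import shutil

def merge_splits(real_root: Path, synth_root: Path, out_root: Path) -> None:
    """Combine real and synthetic (image, mask) pairs into a single training split."""
    for sub in ("images", "masks"):
        (out_root / sub).mkdir(parents=True, exist_ok=True)
        for src_root, tag in ((real_root, "real"), (synth_root, "synth")):
            for f in sorted((src_root / sub).glob("*.png")):
                shutil.copy(f, out_root / sub / f"{tag}_{f.name}")  # prefix avoids name clashes

# Example: merge_splits(Path("lwir/train_real"), Path("lwir/synth_3000"), Path("lwir/train_plus_synth"))
```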
4. Conclusion
In this paper, we propose a novel dataset generation algorithm, Noise-to-Dataset,
to address the challenge of limited training data in semantic segmentation tasks for
indoor LWIR images. Through experiments, we have verified that utilizing a synthetic
dataset improves segmentation performance and demonstrated that the proposed method
can be extended to the RGB domain. However, our approach has certain limitations.
In datasets with a small number of classes, such as LWIR datasets, the diversity of
generated masks tends to saturate as the number of synthetic samples increases. This
issue could potentially be mitigated by incorporating various data augmentation techniques
during the training process of the semantic mask generation model. Additionally, controlling
mask generation to include only specific classes may further enhance diversity. Addressing
these challenges will be our primary focus in future work.
ACKNOWLEDGMENTS
This work was partly supported by the Institute of Information & communications Technology
Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. IITP-2024-RS-2024-00360227
(Leading Generative AI Human Resources Development), No. RS-2022-00155915 (Artificial
Intelligence Convergence Innovation Human Resources Development (Inha University)),
and No. RS-2021-II212052 (ITRC)), and partly supported by the National
Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.
2022R1A2C2010095).
REFERENCES
J. Gao, N. Liu, H. Li, Z. Li, C. Xie, and Y. Gou, ``Reinforcement learning decision-making
for autonomous vehicles based on semantic segmentation,'' Applied Sciences, vol. 15,
no. 3, 1323, Jan. 2025.

J. Tsai, Y.‑T. Chang, Z. Y. Chen, and Z. You, ``Autonomous driving control for passing
unsignalized intersections using the semantic segmentation technique,'' Electronics,
vol. 13, no. 3, 484, Jan. 2024.

A. Golda, K. Mekonen, A. Pandey, A. Singh, V. Hassija, V. Chamola, and B. Sikdar,
``Privacy and security concerns in generative AI: A comprehensive survey,'' IEEE Access,
vol. 12, pp. 48126-48144, 2024.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, ``Generative adversarial networks,'' Communications of the ACM, vol.
63, no. 11, pp. 139-144, 2020.

Y. Zhang, H. Ling, J. Gao, K. Yin, J.-F. Lafleche, A. Barriuso, A. Torralba, and S.
Fidler, ``DatasetGAN: Efficient labeled data factory with minimal human effort,''
Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10145-10155,
2021.

V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele, and A. Khoreva, ``You only
need adversarial supervision for semantic image synthesis,'' arXiv preprint arXiv:2012.04781,
2020.

K. Pandey, A. Mukherjee, P. Rai, and A. Kumar, ``VAEs meet diffusion models: Efficient
and high-fidelity generation,'' Proc. of NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications. 2021.

J. Ho, A. Jain, and P. Abbeel, ``Denoising diffusion probabilistic models,'' Advances
in Neural Information Processing Systems, vol. 33, pp. 6840-6851, 2020.

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, ``High-resolution image
synthesis with latent diffusion models,'' Proc. of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 10674-10685, 2022.

Q. Nguyen, T. Vu, A. Tran, and K. Nguyen, ``Dataset diffusion: Diffusion-based synthetic
data generation for pixel-level semantic segmentation,'' Advances in Neural Information
Processing Systems, vol. 36, 2024.

W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen, ``Diffumask: Synthesizing images
with pixel-level annotations for semantic segmentation using diffusion models,'' Proc.
of the IEEE/CVF International Conference on Computer Vision, pp. 1206-1217, 2023.

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, ``Hierarchical text-conditional
image generation with CLIP latents,'' arXiv preprint arXiv:2204.06125, 2022.

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes,
A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt,
R. Kaczmarczyk, and J. Jitsev, ``LAION-5B: An open large-scale dataset for training
next generation image-text models,'' Advances in Neural Information Processing Systems,
vol. 35, pp. 25278-25294, 2022.

B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, ``Scene parsing
through ADE20K dataset,'' Proc. of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 5122-5130, 2017.

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, ``The cityscapes
dataset for semantic urban scene understanding,'' Proc. of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 3213-3223, 2016.

W. Wang, J. Bao, W. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, ``Semantic image
synthesis via diffusion models,'' arXiv preprint arXiv:2207.00050, 2022.

J. Ho and T. Salimans, ``Classifier-free diffusion guidance,'' arXiv preprint arXiv:2207.12598,
2022.

E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, ``SegFormer: Simple
and efficient design for semantic segmentation with transformers,'' Advances in Neural
Information Processing Systems, vol. 34, pp. 12077-12090, 2021.

R. Strudel, R. Garcia, I. Laptev, and C. Schmid, ``Segmenter: Transformer for semantic
segmentation,'' Proc. of the IEEE/CVF International Conference on Computer Vision,
pp. 7242-7252, 2021.

M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, and S.-M. Hu, ``SegNeXt: Rethinking
convolutional attention design for semantic segmentation,'' Advances in Neural Information
Processing Systems, vol. 35, pp. 1140-1156, 2022.

Z. Lv, Y. Wei, W. Zuo, and K.-Y. K. Wong, ``PLACE: Adaptive layout-semantic fusion for
semantic image synthesis,'' Proc. of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 9264-9274, 2024.

Author
Jin Young Choi received his B.S. degree in electronic engineering from Myongji University,
Yongin, South Korea, in 2020, and received his M.S. degree in electrical and computer
engineering, Inha University, Incheon, South Korea in 2025. His research interests
include computer vision and deep learning.
Byung Cheol Song received his B.S., M.S., and Ph.D. degrees in electrical engineering
from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea,
in 1994, 1996, and 2001, respectively. From 2001 to 2008, he was a Senior Engineer
with the Digital Media R&D Center, Samsung Electronics Company Ltd., Suwon, South
Korea. In 2008, he joined the Department of Electronic Engineering, Inha University,
Incheon, South Korea, and is currently a professor. His research interests include
the general areas of image processing and computer vision.