1. Introduction
Classical neural-network-based machine learning has been reborn under the name of
deep learning, driven by advances in computing resources, Internet-scale data, and
algorithms. In 2012, AlexNet significantly outperformed other computer vision algorithms
in the ImageNet Challenge [43]. Since then, deep learning has made remarkable progress and can even surpass human
ability in image recognition. Since deep learning became prevalent, it has outperformed
conventional methodologies in various academic and industrial fields, particularly
in computer vision.
One of the key reasons behind the impressive performance of deep learning is its ability
to jointly train and combine multiple learnable modules, also known as end-to-end learning.
Conventional methods rely on features hand-crafted by human experts through data analysis
and on shallow machine learning techniques, such as linear discriminators, that take
these crude features as input. In contrast, deep learning integrates the feature extractor
and the discriminator into a single deep neural network, and both are optimized simply
by training that network. Given a large volume of data, this enables learning feature
extraction and discrimination jointly, and it allows gradient descent to find strong
combinations that human designers might overlook.
To realize these advantages, deep learning models require a substantial amount of data.
Owing to their deep cascade structure, deep neural networks have many learnable parameters,
which demand large amounts of data to train without overfitting.
Moreover, most current successes are based on supervised learning,
which requires a label for every input sample.
These requirements pose a variety of challenges to achieving high performance with
deep learning models, and designing models that work in limited-label scenarios has
emerged as a key challenge. This paper reviews label-efficient learning techniques
and their applications to computer vision and graphics tasks, illustrated through
case studies within the authors' main areas of focus.
2. Background
Manpower issues for data collection and labeling. Deep learning models based on supervised
learning require the collection of both input and label data. High-quality labels
are needed to train a model effectively, so paired input and label data are obtained
through manual effort by annotators skilled at the target task, e.g., [1]. Thus, high
labor costs can be a significant issue (see Fig. 1 for an example).
Errors and noise caused by human factors. Even if the cost problem is addressed through
funding, other problems arise; it is impractical to recruit enough annotators offline
to build large-scale datasets, so crowdsourcing platforms such as Amazon Mechanical
Turk are generally used. Most dataset designers expect data quality like that in Fig. 1(a),
but the actual labels often exhibit low quality, as shown in Fig. 1(b). This problem
is caused by human factors and should be approached comprehensively, e.g., by improving
labeling tools and human-factor management. Some studies have aimed to address the
noisy-label problem algorithmically [2-7].
Efficiency and cost issues due to the absence of labeling-efficient tools. Many crowdsourcing
businesses are emerging to professionally address errors and noise caused by human
factors. For some tasks, companies provide services built on efficient labeling tools,
e.g., [8,34], that offer convenience and robustness against annotator-induced errors. However,
developing tools tailored to every target task is challenging and may not improve
efficiency in all cases.
Privacy and copyright issues. Large-scale data are usually acquired through crawling,
which may include personal information and raise privacy issues during the labeling
process. Obtaining individual permission from the original owners is necessary but
can be challenging due to the numerous web-data creators and corresponding licenses.
Additionally, the relevant laws differ between countries, so arbitrary use of data
could lead to legal problems.
Difficulty in labeling according to the problem definition and efficiency. Defining
a problem and determining the form of its labels pose several challenges. For instance,
medical data labeling requires hiring experts with medical knowledge, which drives
up time and cost far beyond general labeling and makes it difficult to build large-scale
datasets. In other cases, it is unclear which parts should be annotated and what label
form would be effective and learnable. For instance, the output of a problem is undefined
when constructing a dataset to identify the cause of a phenomenon whose cause has
not yet been figured out.
Another challenge arises when the labeling target exceeds human capability.
As a simple example, when labeling pixel correspondences between two image
frames, e.g., optical flow [29], it is almost impossible to manually label the motion of every pixel in real-scene
data. Specialized equipment can sometimes help, but such equipment cannot be expected
to exist for every problem.
Illusion of data omnipotence. It cannot be assumed that large datasets or additional
data will always be available for every problem. Therefore, when only a small amount
of labeled data is given, it is more practical to pursue learning techniques that
overcome the limitations of small data than to rely only on supervised learning.
Fig. 1. An example of semantic segmentation label data, one of the representative computer vision problems [1]. Semantic segmentation labels delineate all object boundaries in the form of polygons. Labeling a segmentation map (a) takes about 1.5 hours per image when done by experts, and at least hundreds of thousands of such images need to be labeled. If a commonly used crowdsourcing platform is employed, most of the labels will be of low quality (b). This is a fundamental problem caused by the human factor.
3. Label-efficient Training
In this section, we discuss representative approaches that can train deep learning
models with a small amount of labeled data and give examples of each method: leveraging
synthetic data (Sec. 3.1), prior knowledge (Sec. 3.2), heterogeneous datasets (Sec.
3.3), multi-modal data (Sec. 3.4), domain adaptation and normalization (Sec. 3.5),
self-/semi-supervised learning (Sec. 3.6), leveraging classical algorithms (Sec. 3.7),
and meta-learning (Sec. 3.8).
3.1 Synthetic Data
In cases where input-output paired data for supervised learning cannot be obtained,
or labeling is not possible, data can be synthesized by simulating the input-output
relationship. Although it is challenging to perfectly replicate an actual relationship,
the model can be designed to learn the input-output relationship through the synthetic
data.
For example, one study proposed a deep model that generates a motion-magnified video
from an input video with subtle motion [9]. To train the model, the authors synthesized the motion-magnified output videos
through a simple first-order motion vector model. Directly modifying an already-rasterized
image is difficult, but by compositing segmented objects and synthesizing their magnified
motions, training videos could be generated from which the deep learning model learns
to extract and manipulate motion. A similar approach is used to synthesize optical
flow datasets [29,44].
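As a simple illustration of this kind of data synthesis, the sketch below builds a training triplet from a single image using a first-order (linear) motion model: an input pair with a small sub-pixel shift and a target frame in which the same shift is amplified. The displacement values, magnification factor, and variable names are our own assumptions for illustration, not those of [9].

```python
import numpy as np
from scipy.ndimage import shift as subpixel_shift

def synthesize_magnification_triplet(image, dx=0.3, dy=0.2, alpha=10.0):
    """Create (frame_a, frame_b, magnified_target) from one image.

    First-order motion model: frame_b is the image translated by a small
    sub-pixel displacement (dx, dy); the target is the image translated by
    the amplified displacement (alpha*dx, alpha*dy).
    """
    frame_a = image.astype(np.float32)
    frame_b = subpixel_shift(frame_a, shift=(dy, dx), order=1, mode="nearest")
    target = subpixel_shift(frame_a, shift=(alpha * dy, alpha * dx),
                            order=1, mode="nearest")
    return frame_a, frame_b, target

# Toy usage: a random image stands in for a segmented object pasted on a background.
rng = np.random.default_rng(0)
img = rng.random((64, 64)).astype(np.float32)
a, b, t = synthesize_magnification_triplet(img)
```

In practice, as described above, such triplets are assembled by compositing segmented foreground objects onto background images before applying the synthetic motions.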
As another example, a study proposed a reverse-knitting method that scans the
pattern of knitted fabric and converts it into a knitting program map [10]. Knitting patterns and corresponding program maps are needed as training data.
However, it is difficult to photograph knitting patterns with uniform yarn tension
and texture, so constructing a large-scale dataset is limited. To overcome
this problem, simulation data were synthesized by rendering program maps with a graphics
engine. The model was successfully trained using both real and synthetic
data, which mitigated the data shortage and yielded a significant performance improvement.
In another study, a supervised learning dataset was constructed using a simple image-based
rendering technique [11]. The authors synthesized human images under various lighting angles to represent
different lighting environments. DFlow [29] uses a differentiable optical-flow-data-generation pipeline to efficiently synthesize
a dataset that is effective for a target domain without cumbersome trial and error.
Recently, generative models have enabled the generation of virtual data of a quality
indistinguishable from real data. However, this direction is still in the research
stage and shows good performance only for limited data types, such as long-tailed
class distributions [45] and faces. Nevertheless, data generation through generative models is likely to become
a useful data-efficient learning technique.
3.2 Prior Knowledge
Human knowledge can be seen as an aggregation of data accumulated over a lifetime.
Similarly, a physical formula can be regarded as a compressed expression of data found
from observations accumulated over hundreds or thousands of years. Thus, we might
consider using such prior knowledge to replace the data.
For video motion magnification, a deep learning model was trained with synthetic input-output
data pairs [9]. However, due to the gap between real and synthetic data and the difficulty of learning
from motion data alone, learning perfect motion extraction is difficult. To overcome this, the
architecture of the deep learning model can be designed to follow a simple physical constraint
on velocity, as shown in Fig. 2. Such a physics-based model design can encode knowledge
that cannot be derived from the data alone [9]. This was not achievable with a plain convolutional neural network (CNN), but it was
with the proposed design, which induces the physical relationship, i.e., velocity. With these
developments, deep video motion magnification enables extensive and critical non-contact
applications, such as health monitoring for heart rate and blood pressure estimation
[42], infrastructure safety monitoring systems, etc.
Fig. 2. (Left) Motion magnification comparison results (x-t slices); (Right) Motion magnification architecture.
Fig. 3. (a) Speech2Face [20]; (b) Inverse Neural Knitting [10]; (c) Semantic Soft Segmentation [27].
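The toy sketch below illustrates this design principle: the feature difference between two frames is treated as a learned velocity and scaled linearly by the magnification factor before decoding. It is our simplification of the idea, with assumed layer sizes, and not the actual architecture of [9].

```python
import torch
import torch.nn as nn

class LinearMagnifier(nn.Module):
    """Toy magnification model: the feature difference between two frames acts
    as a learned 'velocity', which is scaled by alpha before decoding.
    A hypothetical simplification of the design principle, not the model of [9]."""

    def __init__(self, channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, frame_a, frame_b, alpha):
        feat_a = self.encoder(frame_a)
        feat_b = self.encoder(frame_b)
        # Physical prior: magnified state = reference + alpha * (velocity)
        magnified = feat_a + alpha * (feat_b - feat_a)
        return self.decoder(magnified)

model = LinearMagnifier()
out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), alpha=10.0)
```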
Similarly, when designing a deep learning model, we can make some modules interpretable
or physically meaningful rather than black boxes, thereby inducing an algorithmic bias
as a form of inductive bias. Such designs are often called modular neural networks [46].
A modular network is one large differentiable structure built by combining multiple
deep learning models that already have interpretable output stages. For instance,
HDR-Plenoxels employs a differentiable tone-mapping module parameterized to comply
with the digital in-camera imaging pipeline [32], which enables disentangling radiometric effects.
Some works exploit the prior knowledge embedded in large-scale pre-trained models.
CLIP-Actor recommends a motion sequence and optimizes mesh style attributes based
on a text prompt by exploiting the large-scale language-image pre-trained model CLIP
[31]. Another method handles the challenge of generalization in audio captioning, caused by
a lack of audio-text paired data, by leveraging a pre-trained language model to deal
with small-scale datasets [38]. Prior knowledge can also be injected with simple operations: FastMETRO imposes prior
knowledge of the human body’s morphological relationships via attention masking and
mesh up-sampling operations, yielding faster convergence and higher accuracy
[33].
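As a hedged sketch of injecting prior knowledge via attention masking, the snippet below adds a connectivity-based mask to the attention scores so that unrelated joints cannot attend to each other. The four-joint kinematic chain and dimensions are toy assumptions of ours, not the actual body prior used in [33].

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-joint skeleton; edges encode which joints may attend to each other.
num_joints = 4
edges = [(0, 1), (1, 2), (2, 3)]  # a simple kinematic chain

mask = torch.full((num_joints, num_joints), float("-inf"))
mask.fill_diagonal_(0.0)
for i, j in edges:
    mask[i, j] = 0.0
    mask[j, i] = 0.0

def masked_self_attention(q, k, v, mask):
    """Scaled dot-product attention where non-adjacent joints are masked out,
    injecting skeletal connectivity as a prior (a sketch of the idea in [33])."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores + mask          # -inf blocks attention between unrelated joints
    return F.softmax(scores, dim=-1) @ v

tokens = torch.randn(1, num_joints, 16)   # per-joint query/key/value features
out = masked_self_attention(tokens, tokens, tokens, mask)
```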
3.3 Heterogeneous Datasets
In the same way that synthetic data can compensate for a lack of data diversity, existing
heterogeneous datasets can be combined and leveraged. Each existing dataset was designed
for its own independent target task; thus, the problems to solve and the inputs/outputs
often differ. Such datasets can be adapted to a similar task by manipulating the input
and output with simple heuristics, or combined into a new task dataset that reflects
the characteristics of each label.
One study defined a dense relation captioning task describing the relationships between
objects in an image and proposed a suitable model for it
[12,50]. Since the task was newly proposed, no suitable large-scale dataset existed.
To address this, the authors constructed a new dataset by combining
the Visual Genome (VG) relationship dataset with the VG attribute dataset. The VG
relationship dataset provides image-relationship graphs with simple subject-verb-object
labels (e.g., ``building-has-window''), while the VG attribute dataset describes
the characteristics of each object in detail.
Transfer learning learns prior knowledge from similar data and then transfers
it to the target-task data; it is commonly used to compensate for a small amount of
data [13]. However, na{\"{i}}ve transfer learning can cause catastrophic forgetting
[15], i.e., the loss of previously learned knowledge. To prevent this,
progressive-learning-based transfer learning was used to adapt knowledge from a video
captioning dataset to a text-diary-based video-summary task [14], where transfer learning helps bridge the domain differences between the text
sentences.
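For concreteness, the sketch below shows a minimal transfer-learning recipe in PyTorch. It uses naive fine-tuning with frozen lower layers rather than the progressive-learning scheme of [14]; the backbone choice and the target-task size are assumptions made only for illustration.

```python
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pre-trained backbone, replace the head for the target
# task, and fine-tune only the later layers to limit forgetting of prior knowledge.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in backbone.parameters():       # freeze the pre-trained features
    param.requires_grad = False

num_target_classes = 10                   # hypothetical target-task size
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)  # new trainable head

# Optionally unfreeze the last block for a gentler adaptation.
for param in backbone.layer4.parameters():
    param.requires_grad = True
```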
In general, there are many cases in which each input sample in a dataset is labeled
for multiple different tasks with different label types. If the target tasks are fundamentally
related, they can be used for complementary learning, assuming that knowledge learned
from one dataset applies to the other, and vice versa. This is called multi-task learning.
Multi-task learning in neural networks branches the output head of the model, allowing
the lower layers to learn knowledge common to the tasks and each task head to learn
task-specific knowledge. This improves data efficiency because, thanks to the shared
layers, a task-specific head does not require much data to reach a certain level
of performance. Furthermore, in some cases there are regularization effects that
generalize better than single-task models [41]. However, it is uncommon for the same image to carry every type of label,
a setup called partial semi-supervised learning or disjoint multi-task learning
[16]. In this setup, the model must be trained by alternating over the labels available
for each sample, which is known to be inefficient due to catastrophic forgetting
[15].
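The sketch below illustrates the shared-trunk/multi-head structure and the alternating update over whichever label type is available in each mini-batch. The task names, dimensions, and data are hypothetical and do not correspond to any specific method discussed here.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk with task-specific heads (a generic sketch, not the model of [16])."""
    def __init__(self, feat_dim=128, num_classes=10, caption_vocab=1000):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU())
        self.cls_head = nn.Linear(feat_dim, num_classes)        # e.g., action recognition
        self.caption_head = nn.Linear(feat_dim, caption_vocab)  # e.g., next-word logits

    def forward(self, x, task):
        feat = self.trunk(x)
        return self.cls_head(feat) if task == "cls" else self.caption_head(feat)

model = MultiTaskNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

# Disjoint setup: each mini-batch carries labels for only one task, so we
# alternate updates using whichever label type is available.
batches = [(torch.randn(8, 512), torch.randint(0, 10, (8,)), "cls"),
           (torch.randn(8, 512), torch.randint(0, 1000, (8,)), "cap")]
for x, y, task in batches:
    loss = ce(model(x, task), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```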
A study took advantage of the fact that video captions usually try to describe human
actions [16]. To leverage this observation, an action recognition task dataset and a video captioning
dataset were exploited with knowledge distillation-based multi-task learning, which
prevents catastrophic forgetting. The same approach is applied to human and animal
3D reconstruction with a single model [47]. In addition, there is the federated learning problem, which leverages locally stored
heterogeneous private data that cannot be shared directly. FedPara uses a communication-efficient
parameterization to alleviate the communication bottleneck caused by transmitting
model parameters [30].
3.4 Multi-modal Learning
We can use different data modalities together in a training process. For example, a video
typically includes sound events that co-occur with visual events. This co-occurrence can
be used as a valuable signal in place of supervised labels. One study handled diverse
audio-visual relationships in video data by designing a multi-head model with
event-specific layers to enable audio-visual fusion [35].
Other works exploit the co-occurrence of audio-visual signals.
Arda et al. [17,18] localized the sound source in video frames by maximizing the shared
information between image and sound pairs drawn from the video, without any special labels.
They showed that useful knowledge can be learned by maximizing correlation from co-occurrence
in an unsupervised manner. Applications include using co-occurrence to
improve the performance and speed of action recognition in long videos [19] and visualizing whether a computer can imagine a person's face from their voice
[20] or a scene from ambient sound [36].
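The sketch below gives a hedged view of learning from audio-visual co-occurrence: a simple batch-wise contrastive objective pulls together embeddings of an image and its co-occurring audio clip while pushing apart mismatched pairs. The encoders, data shapes, and the exact loss are our own toy choices, standing in for the correspondence objectives of [17,18].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 64 * 16, 128))

def contrastive_loss(img_emb, aud_emb, temperature=0.07):
    """Matching (image, audio) pairs lie on the diagonal of the similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    aud_emb = F.normalize(aud_emb, dim=-1)
    logits = img_emb @ aud_emb.t() / temperature   # similarity of every pair in the batch
    targets = torch.arange(len(logits))
    return F.cross_entropy(logits, targets)

images = torch.randn(16, 3, 32, 32)    # video frames
audio = torch.randn(16, 1, 64, 16)     # co-occurring spectrogram clips
loss = contrastive_loss(image_encoder(images), audio_encoder(audio))
```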
In addition to video and sound, useful signals can be obtained by combining various
sensors. Using a vehicle's location and heading acquired along with dashcam video,
one method learns visuomotor perception in an unsupervised manner and predicts future
frames according to the vehicle's direction [21]. To complement lidar sensors, whose data points are sparse, another
study produced a dense 3D depth map using color features from camera images [22].
With the recent emergence of large language models, Ye-Bin et al. [48] leverage embeddings of text attributes obtained from a pre-trained large language
model as informative augmentation perturbations to enhance visual representation
learning. This is a notable result in that, although the language model never saw
any visual data during training, its understanding of visually descriptive text attributes
transfers effectively to visual representation learning.
3.5 Domain Adaptation and Normalization
When utilizing heterogeneous datasets, there are cases where only the style and characteristics
of the input data differ while the output labels remain the same. Multi-modal data
are one example; another is when datasets of the same modality come from different
domains. A domain difference refers to a difference in the characteristics of the
data content while the data format stays the same, e.g., real vs. simulated data,
or sketches vs. real images.
One study proposed a training method that uses two datasets from different domains given
sparse labels [23]. By defining a self-supervised objective function that accounts for domain differences,
the study provides a general pre-training recipe that leads to improved final performance.
A similar pre-training method is proposed by Park et al. [49].
The study on inverse knitting [10] used both real and simulated images. Real data contain many real-world factors
of variation, such as lighting, noise, shadow, tension, non-uniformity, and color,
while the rendered images produced by a simplified rendering pipeline have monotonous
characteristics. The idea was to transform real data so that it looks like synthetic
rendered data, i.e., to make the real data monotonous and thereby reduce the domain gap.
Normalizing to a monotonous image form makes it easy to match the data distributions of the two
domains because it induces many-to-one correspondences. This is the
opposite of the common trend in other methods, which try to make rendered images look
as real as possible; because learning such one-to-many mappings is difficult, their
image conversion quality and performance often deteriorate.
Another key contribution was a theoretical argument that this many-to-one property is
advantageous for generalization because it relates to minimizing an upper bound of the
generalization error [10].
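The sketch below captures this real-to-synthetic normalization idea at a high level: a translator maps real photographs toward the monotonous synthetic style before they are fed to a task network supervised on rendered data. It is a minimal sketch of the idea with toy networks and dimensions of our own, not the actual pipeline of [10], which involves additional losses and architectural details.

```python
import torch
import torch.nn as nn

# Translator T: real image -> synthetic-looking (monotonous) image.
translator = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(16, 3, 3, padding=1))
# Task network trained on rendered images with known program-map labels.
task_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, 17))   # hypothetical number of knit-program symbols

synthetic_img = torch.randn(4, 3, 64, 64)     # rendered from program maps (labels known)
real_img = torch.randn(4, 3, 64, 64)          # photographs of knitted fabric

logits_synth = task_net(synthetic_img)        # supervised branch on synthetic data
logits_real = task_net(translator(real_img))  # real data are normalized first, then recognized
```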
When complex data are transferred into a normalized space, the level of labeling required
for training is lower. A favorable normalized space differs across target tasks, so
expert knowledge about the problem is advantageous. One study [11] normalized facial images with various poses and expressions into a canonical
space, called UV space, through 3D model estimation, and showed that expressions unseen
during training can be generalized. Based on a similar hypothesis that real-world
variations in general images, e.g., viewpoint and color changes, degrade recognition
performance, another study showed strong generalization in a few-shot learning scenario
by normalizing and iconizing these reality elements [24]; the method was applied to
few-shot recognition of unseen logos, signs, icons, etc.
3.6 Self- and Semi-supervised Learning
Obtaining unlabeled data is much easier than obtaining labeled data. Self-supervised
learning aims to learn good feature representations for general tasks from large-scale
unlabeled data and has recently drawn significant attention. However, in most cases,
self-supervised learning alone cannot solve a specific task of interest, and it has
pitfalls that are easy to overlook, as follows.
An example appears in the unsupervised version of sound source localization [17,18]. Every time the model hears a car engine, it also sees video of a road.
Compared to the diversity of cars and their sizes in video frames, asphalt
roads have a much simpler and more uniform texture and shape, and they occupy a much
larger portion of the image. That is, the correlation with the engine
sound is much higher for the road region than for the car region. Consequently, when the model
is trained with unsupervised learning, it incorrectly learns that the engine sound
comes from the road. This problem occurs not only in machine learning but also in the learning
process of animals, including humans, where it is known as the pigeon superstition
phenomenon, often cited in animal learning theory; Arda et al. [17,18] show that it also applies to machines. The phenomenon illustrates the misjudgment
that arises when correlation is mistaken for causality.
According to the ``no free lunch'' theorem in machine learning, such
bias is impossible to avoid without prior knowledge.
Besides encoding prior knowledge in the model architecture, the most direct way to provide
it is to use at least a small amount of labeled data. Combining the aforementioned
unsupervised method with a small amount of labeled data forms a semi-supervised learning
setting. Arda et al. show that even a small amount of labeled data can resolve the
causality misjudgment and yields much better performance than a supervised counterpart
trained with more labeled data.
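A generic semi-supervised objective of this kind can be written as a weighted sum of a supervised loss on the few labeled samples and an unsupervised term on the unlabeled samples. The sketch below uses a simple consistency-regularization term purely for illustration; the model, weighting, and unsupervised term are assumptions of ours, not the exact loss of [17,18].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 5))

x_labeled, y_labeled = torch.randn(8, 64), torch.randint(0, 5, (8,))   # small labeled set
x_unlabeled = torch.randn(32, 64)                                      # large unlabeled set
lambda_u = 0.5                                                         # unsupervised weight

# Supervised term on the few labeled samples.
sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

# Unsupervised consistency term: predictions should be stable under small input noise.
logits_clean = model(x_unlabeled)
logits_noisy = model(x_unlabeled + 0.1 * torch.randn_like(x_unlabeled))
unsup_loss = F.mse_loss(logits_noisy, logits_clean.detach())

loss = sup_loss + lambda_u * unsup_loss
```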
In addition to semi-supervised learning that combines unsupervised and supervised objective
functions, transductive learning propagates information from a small amount of labeled
data to similar unlabeled data. In one study, a video captioning model was trained in
a label-efficient manner using three datasets: a small video caption dataset, a large
unlabeled image dataset, and a separate large unpaired caption-text dataset [25].
We can train a model in a weakly supervised manner when there is no label corresponding
directly to the target task but there is a sub-level (or simpler) label of a form that
can be derived from the target task's label. A weakly supervised
image retargeting method was proposed [26], whose goal is to learn image retargeting that minimizes distortion
of the main content of the image. Labels of the same form as the final system's output
were not used for training; instead, the model was trained to sequentially recommend
unnecessary spatial regions of an image, determined by objectness scores obtained
from a pre-trained visual recognition model.
A similar data-generation scheme was suggested for denoising [40]. Recently, label-efficient methods such as self- and weakly supervised learning have
received much attention and are developing rapidly. These alternatives can be easily
merged with supervised learning, so many practical possibilities are expected in the future.
3.7 Fusion of Classical Algorithms with Learning-based Ones
When a new problem is defined, there is usually no large-scale dataset for the task.
To side-step this challenge, some works combine classical algorithms with learning-based
ones. One study proposed a semantic soft segmentation method [27] in which, unlike other soft segmentation methods, each segment follows the semantic
boundary of an object. The study used an existing classical soft segmentation algorithm
that does not rely on a neural network but combined it with a pre-trained semantic
segmentation network to provide semantic information. This simple combination solved
the problem of lacking data.
During training, compatibility between the input and output data types can be maintained
through an objective function designed to be compatible with the classical algorithm,
allowing labels to be substituted with other forms that are easier to obtain. In this
way, data that are difficult to acquire can be replaced with data that are easy to
acquire. Such developments may also improve other segmentation-based practical applications,
e.g., [39].
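A hedged sketch of this fusion idea is given below. It substitutes classical k-means clustering for the spectral-matting step of [27] (and therefore yields hard rather than soft segments), using per-pixel features from a pre-trained semantic segmentation network as the semantic cue; the network choice and cluster count are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

# Pre-trained segmentation network supplies semantic per-pixel features without extra labels.
net = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT).eval()
image = torch.randn(1, 3, 128, 128)   # stand-in for a real photograph

with torch.no_grad():
    feat = net.backbone(image)["out"]  # deep features of shape (1, C, h, w)

c, h, w = feat.shape[1:]
pixels = feat.squeeze(0).permute(1, 2, 0).reshape(-1, c).numpy()

# Classical (non-neural) grouping over the deep features.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(pixels)
segments = labels.reshape(h, w)        # coarse semantic segments
```

Replacing the clustering step with a soft, matting-style formulation would move this sketch closer to the approach of [27].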
3.8 Meta-learning (Few-shot Learning)
Few-shot learning uses only a few labeled samples per class, mirroring during training
the few-shot scenario expected at test time. Constructing training episodes that resemble
the test phase is called episodic learning. Among the many few-shot learning methods,
metric learning-based methods have been widely used. These methods aim to train a
well-generalized feature space so that, at test time, the model can effectively determine
the nearest category through a simple nearest-neighbor search in the learned feature
space. Accordingly, the key to metric-based methods is how to design and learn the
feature space.
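A minimal episodic, metric-based sketch (prototypical-network style) is shown below to illustrate this nearest-neighbor recipe; the encoder, episode sizes, and data are toy choices of ours, not the quadruplet method of [28].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 64))

n_way, k_shot, n_query = 5, 1, 3
support = torch.randn(n_way, k_shot, 3, 28, 28)          # few labeled examples per class
query = torch.randn(n_way * n_query, 3, 28, 28)
query_labels = torch.arange(n_way).repeat_interleave(n_query)

# Class prototypes are the mean support embedding of each class.
prototypes = encoder(support.view(-1, 3, 28, 28)).view(n_way, k_shot, -1).mean(dim=1)

# Classification = nearest prototype in the learned feature space.
dists = torch.cdist(encoder(query), prototypes)          # (n_way*n_query, n_way)
loss = F.cross_entropy(-dists, query_labels)             # trains the embedding episodically
```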
In one study, a metric learning technique was derived from relationships among
quadruplets of samples, which effectively induce cluster learning [28]. Another study induced an understanding of content through a normalization
process [24]. In addition, model-agnostic meta-learning (MAML)-based methods
are also actively studied; they enable fast, label-efficient adaptation to a given
task via gradient-based updates, as sketched below. There are also works that handle
the segmentation task with few-shot learning [34,37]. As mentioned above, a segmentation mask label is more expensive than a class
or bounding-box label, so segmenting a target object from only a few examples would
be a promising direction for efficient annotation tools or for the segmentation task itself.
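For completeness, a compact MAML-style inner/outer loop is sketched below. It is a generic second-order illustration with toy tasks and dimensions, not a specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)                     # toy meta-learned model
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.1

def adapted_forward(x, params):
    """Forward pass with an explicit parameter dictionary (for adapted weights)."""
    return F.linear(x, params["weight"], params["bias"])

for task in range(4):                        # each task has its own few labeled samples
    x_s, y_s = torch.randn(5, 10), torch.randint(0, 2, (5,))   # support set
    x_q, y_q = torch.randn(5, 10), torch.randint(0, 2, (5,))   # query set

    # Inner loop: one gradient step adapts the shared initialization to the task.
    params = dict(model.named_parameters())
    inner_loss = F.cross_entropy(adapted_forward(x_s, params), y_s)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    adapted = {k: v - inner_lr * g for (k, v), g in zip(params.items(), grads)}

    # Outer loop: the query loss of the adapted model updates the initialization.
    outer_loss = F.cross_entropy(adapted_forward(x_q, adapted), y_q)
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
```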