1. Introduction
Classical neural-network-based machine learning has been reborn under the name of
deep learning, driven by advances in computing resources, Internet-scale data, and
algorithms. In 2012, AlexNet significantly outperformed other computer vision algorithms
in the ImageNet Challenge [43]. Since then, deep learning has made remarkable progress and can even surpass human
ability in image recognition. Since deep learning became prevalent, it has outperformed
conventional methodologies in various academic and industrial fields, particularly
in computer vision.
One of the key reasons behind the impressive performance of deep learning is its ability
to jointly train and combine multiple learnable modules, also known as end-to-end learning.
Conventional methods rely on features hand-crafted by human experts through data analysis
and on shallow machine learning techniques, such as linear discriminators, that take
these crude features as input. In contrast, deep learning integrates the feature extractor
and the discriminator into a single deep neural network, and both are optimized simply
by training that network. Given a large volume of data, this enables learning feature
extraction and discrimination jointly, and it allows gradient descent to find strong
combinations that human designers might overlook.
To realize these advantages, deep learning models require a substantial amount of data.
Owing to their deep cascade structure, deep neural networks have many learnable parameters,
which demand large amounts of data to train without overfitting.
Moreover, most current successes are based on supervised learning,
which requires a label for every input sample.
These requirements pose a variety of challenges to achieving high performance with
deep learning models, and designing models that work in limited-label scenarios has
emerged as a key challenge. This paper reviews label-efficient learning techniques
and their applications to computer vision and graphics tasks, illustrated through
case studies within the authors' main areas of focus.
2. Background
Manpower issues for data collection and labeling. Deep learning models based on supervised
learning require the collection of both input and label data. High-quality labels
are needed to train a model effectively, so paired input and label data are obtained
through manual effort by annotators skilled at the target task, e.g., [1]. Thus, high
labor costs can be a significant issue (see Fig. 1 for an example).
Errors and noise caused by human factors. Even if the cost problem is addressed through
funding, other problems arise; it is impractical to recruit enough annotators offline
to build large-scale datasets, so crowdsourcing platforms such as Amazon Mechanical
Turk are generally used. Most dataset designers expect data quality like that in Fig. 1(a),
but the actual labels often exhibit low quality, as shown in Fig. 1(b). This problem
is caused by human factors and should be approached comprehensively, e.g., by improving
labeling tools and human-factor management. Some studies have aimed to address the
noisy-label problem algorithmically [2-7].
Efficiency and cost issues due to the absence of labeling-efficient tools. Many crowdsourcing
businesses are emerging to professionally address errors and noise caused by human
factors. For some tasks, companies provide services built on efficient labeling tools,
e.g., [8,34], that offer convenience and robustness against annotator-induced errors. However,
developing tools tailored to every target task is challenging and may not improve
efficiency in all cases.
Privacy and copyright issues. Large-scale data are usually acquired through crawling,
which may include personal information and raise privacy issues during the labeling
process. Obtaining individual permission from the original owners is necessary but
can be challenging due to the numerous web-data creators and corresponding licenses.
Additionally, the relevant laws differ between countries, so arbitrary use of data
could lead to legal problems.
Difficulty in labeling according to the problem definition and efficiency. Defining
a problem and determining the form of its labels pose several challenges. For instance,
medical data labeling requires hiring experts with medical knowledge, which drives
up time and cost far beyond general labeling and makes it difficult to build large-scale
datasets. In other cases, it is unclear which parts should be annotated and what label
form would be effective and learnable. For instance, the output of a problem is undefined
when constructing a dataset to identify the cause of a phenomenon whose cause has
not yet been figured out.
Another challenge arises when the labeling target exceeds human capability.
As a simple example, when labeling pixel correspondences between two image
frames, e.g., optical flow [29], it is almost impossible to manually label the motion of every pixel in real-scene
data. Specialized equipment can sometimes help, but such equipment cannot be expected
to exist for every problem.
Illusion of data omnipotence. It cannot be assumed that large datasets or additional
data will always be available for every problem. Therefore, when only a small amount
of labeled data is given, it is more practical to pursue learning techniques that
overcome the limitations of small data than to rely only on supervised learning.
Fig. 1. An example of semantic segmentation label data, one of the representative computer vision problems [1]. Semantic segmentation labels delineate all object boundaries in the form of polygons. Labeling a segmentation map (a) takes about 1.5 hours per image when done by experts, and at least hundreds of thousands of such images need to be labeled. If a commonly used crowdsourcing platform is employed, most of the labels will be of low quality (b). This is a fundamental problem caused by the human factor.
3. Label-efficient Training
In this section, we discuss representative approaches that can train deep learning
models with a small amount of labeled data and give examples of each method: leveraging
synthetic data (Sec. 3.1), prior knowledge (Sec. 3.2), heterogeneous datasets (Sec.
3.3), multi-modal data (Sec. 3.4), domain adaptation and normalization (Sec. 3.5),
self-/semi-supervised learning (Sec. 3.6), leveraging classical algorithms (Sec. 3.7),
and meta-learning (Sec. 3.8).
3.1 Synthetic Data
In cases where input-output paired data for supervised learning cannot be obtained,
or labeling is not possible, data can be synthesized by simulating the input-output
relationship. Although it is challenging to perfectly replicate an actual relationship,
the model can be designed to learn the input-output relationship through the synthetic
data.
For example, one study proposed a deep model that generates a motion-magnified video
from an input video with subtle motion [9]. To train the model, the authors synthesized the motion-magnified output videos
through a simple first-order motion vector model. Directly modifying an already-rasterized
image is difficult, but by compositing segmented objects and synthesizing their magnified
motions, training videos could be generated from which the deep learning model learns
to extract and manipulate motion. A similar approach is used to synthesize optical
flow datasets [29,44].
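As a simple illustration of this kind of data synthesis, the sketch below builds a training triplet from a single image using a first-order (linear) motion model: an input pair with a small sub-pixel shift and a target frame in which the same shift is amplified. The displacement values, magnification factor, and variable names are our own assumptions for illustration, not those of [9].

```python
import numpy as np
from scipy.ndimage import shift as subpixel_shift

def synthesize_magnification_triplet(image, dx=0.3, dy=0.2, alpha=10.0):
    """Create (frame_a, frame_b, magnified_target) from one image.

    First-order motion model: frame_b is the image translated by a small
    sub-pixel displacement (dx, dy); the target is the image translated by
    the amplified displacement (alpha*dx, alpha*dy).
    """
    frame_a = image.astype(np.float32)
    frame_b = subpixel_shift(frame_a, shift=(dy, dx), order=1, mode="nearest")
    target = subpixel_shift(frame_a, shift=(alpha * dy, alpha * dx),
                            order=1, mode="nearest")
    return frame_a, frame_b, target

# Toy usage: a random image stands in for a segmented object pasted on a background.
rng = np.random.default_rng(0)
img = rng.random((64, 64)).astype(np.float32)
a, b, t = synthesize_magnification_triplet(img)
```

In practice, as described above, such triplets are assembled by compositing segmented foreground objects onto background images before applying the synthetic motions.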
As another example, a study proposed a reverse-knitting method that scans the
pattern of knitted fabric and converts it into a knitting program map [10]. Knitting patterns and corresponding program maps are needed as training data.
However, it is difficult to photograph knitting patterns with uniform yarn tension
and texture, so constructing a large-scale dataset is limited. To overcome
this problem, simulation data were synthesized by rendering program maps with a graphics
engine. The model was successfully trained using both real and synthetic
data, which mitigated the data shortage and yielded a significant performance improvement.
In another study, a supervised learning dataset was constructed using a simple image-based
rendering technique [11]. The authors synthesized human images under various lighting angles to represent
different lighting environments. DFlow [29] uses a differentiable optical-flow-data-generation pipeline to efficiently synthesize
a dataset that is effective for a target domain without cumbersome trial and error.
Recently, generative models have enabled the generation of virtual data of a quality
indistinguishable from real data. However, this direction is still in the research
stage and shows good performance only for limited data types, such as long-tailed
class distributions [45] and faces. Nevertheless, data generation through generative models is likely to become
a useful data-efficient learning technique.
3.2 Prior Knowledge
Human knowledge can be seen as an aggregation of data accumulated over a lifetime.
Similarly, a physical formula can be regarded as a compressed expression of data found
from observations accumulated over hundreds or thousands of years. Thus, we might
consider using such prior knowledge to replace the data.
For video motion magnification, a deep learning model was trained with synthetic input-output
data pairs [9]. However, due to the gap between real and synthetic data and the difficulty of learning
from motion data alone, learning perfect motion extraction is difficult. To overcome this, the
architecture of the deep learning model can be designed to follow a simple physical constraint
on velocity, as shown in Fig. 2. Such a physics-based model design can encode knowledge
that cannot be derived from the data alone [9]. This was not achievable with a plain convolutional neural network (CNN), but it was
with the proposed design, which induces the physical relationship, i.e., velocity. With these
developments, deep video motion magnification enables extensive and critical non-contact
applications, such as health monitoring for heart rate and blood pressure estimation
[42], infrastructure safety monitoring systems, etc.
Fig. 2. (Left) Motion magnification comparison results (x-t slices); (Right) Motion magnification architecture.
Fig. 3. (a) Speech2Face [20]; (b) Inverse Neural Knitting [10]; (c) Semantic Soft Segmentation [27].
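The toy sketch below illustrates this design principle: the feature difference between two frames is treated as a learned velocity and scaled linearly by the magnification factor before decoding. It is our simplification of the idea, with assumed layer sizes, and not the actual architecture of [9].

```python
import torch
import torch.nn as nn

class LinearMagnifier(nn.Module):
    """Toy magnification model: the feature difference between two frames acts
    as a learned 'velocity', which is scaled by alpha before decoding.
    A hypothetical simplification of the design principle, not the model of [9]."""

    def __init__(self, channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, frame_a, frame_b, alpha):
        feat_a = self.encoder(frame_a)
        feat_b = self.encoder(frame_b)
        # Physical prior: magnified state = reference + alpha * (velocity)
        magnified = feat_a + alpha * (feat_b - feat_a)
        return self.decoder(magnified)

model = LinearMagnifier()
out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), alpha=10.0)
```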
Similarly, when designing a deep learning model, we can make some modules interpretable
or physically meaningful rather than black boxes, thereby inducing an algorithmic bias
as a form of inductive bias. Such designs are often called modular neural networks [46].
A modular network is one large differentiable structure built by combining multiple
deep learning models that already have interpretable output stages. For instance,
HDR-Plenoxels employs a differentiable tone-mapping module parameterized to comply
with the digital in-camera imaging pipeline [32], which enables disentangling radiometric effects.
Some works exploit the prior knowledge embedded in large-scale pre-trained models.
CLIP-Actor recommends a motion sequence and optimizes mesh style attributes based
on a text prompt by exploiting the large-scale language-image pre-trained model CLIP
[31]. Another method handles the challenge of generalization in audio captioning, caused by
a lack of audio-text paired data, by leveraging a pre-trained language model to deal
with small-scale datasets [38]. Prior knowledge can also be injected with simple operations: FastMETRO imposes prior
knowledge of the human body’s morphological relationships via attention masking and
mesh up-sampling operations, yielding faster convergence and higher accuracy
[33].
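As a hedged sketch of injecting prior knowledge via attention masking, the snippet below adds a connectivity-based mask to the attention scores so that unrelated joints cannot attend to each other. The four-joint kinematic chain and dimensions are toy assumptions of ours, not the actual body prior used in [33].

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-joint skeleton; edges encode which joints may attend to each other.
num_joints = 4
edges = [(0, 1), (1, 2), (2, 3)]  # a simple kinematic chain

mask = torch.full((num_joints, num_joints), float("-inf"))
mask.fill_diagonal_(0.0)
for i, j in edges:
    mask[i, j] = 0.0
    mask[j, i] = 0.0

def masked_self_attention(q, k, v, mask):
    """Scaled dot-product attention where non-adjacent joints are masked out,
    injecting skeletal connectivity as a prior (a sketch of the idea in [33])."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores + mask          # -inf blocks attention between unrelated joints
    return F.softmax(scores, dim=-1) @ v

tokens = torch.randn(1, num_joints, 16)   # per-joint query/key/value features
out = masked_self_attention(tokens, tokens, tokens, mask)
```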
3.3 Heterogeneous Datasets
In the same way that synthetic data can compensate for a lack of data diversity, existing
heterogeneous datasets can be combined and leveraged. Each existing dataset was designed
for its own independent target task; thus, the problems to solve and the inputs/outputs
often differ. Such datasets can be adapted to a similar task by manipulating the input
and output with simple heuristics, or combined into a new task dataset that reflects
the characteristics of each label.
One study defined a dense relation captioning task describing the relationships between
objects in an image and proposed a suitable model for it
[12,50]. Since the task was newly proposed, no suitable large-scale dataset existed.
To address this, the authors constructed a new dataset by combining
the Visual Genome (VG) relationship dataset with the VG attribute dataset. The VG
relationship dataset provides image-relationship graphs with simple subject-verb-object
labels (e.g., ``building-has-window''), while the VG attribute dataset describes
the characteristics of each object in detail.
Transfer learning learns prior knowledge from similar data and then transfers
it to the target-task data; it is commonly used to compensate for a small amount of
data [13]. However, na{\"{i}}ve transfer learning can cause catastrophic forgetting
[15], i.e., the loss of previously learned knowledge. To prevent this,
progressive-learning-based transfer learning was used to adapt knowledge from a video
captioning dataset to a text-diary-based video-summary task [14], where transfer learning helps bridge the domain differences between the text
sentences.
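For concreteness, the sketch below shows a minimal transfer-learning recipe in PyTorch. It uses naive fine-tuning with frozen lower layers rather than the progressive-learning scheme of [14]; the backbone choice and the target-task size are assumptions made only for illustration.

```python
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pre-trained backbone, replace the head for the target
# task, and fine-tune only the later layers to limit forgetting of prior knowledge.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in backbone.parameters():       # freeze the pre-trained features
    param.requires_grad = False

num_target_classes = 10                   # hypothetical target-task size
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)  # new trainable head

# Optionally unfreeze the last block for a gentler adaptation.
for param in backbone.layer4.parameters():
    param.requires_grad = True
```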
In general, there are many cases in which each input sample in a dataset is labeled
for multiple different tasks with different label types. If the target tasks are fundamentally
related, they can be used for complementary learning, assuming that knowledge learned
from one dataset applies to the other, and vice versa. This is called multi-task learning.
Multi-task learning in neural networks branches the output head of the model, allowing
the lower layers to learn knowledge common to the tasks and each task head to learn
task-specific knowledge. This improves data efficiency because, thanks to the shared
layers, a task-specific head does not require much data to reach a certain level
of performance. Furthermore, in some cases there are regularization effects that
generalize better than single-task models [41]. However, it is uncommon for the same image to carry every type of label,
a setup called partial semi-supervised learning or disjoint multi-task learning
[16]. In this setup, the model must be trained by alternating over the labels available
for each sample, which is known to be inefficient due to catastrophic forgetting
[15].
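The sketch below illustrates the shared-trunk/multi-head structure and the alternating update over whichever label type is available in each mini-batch. The task names, dimensions, and data are hypothetical and do not correspond to any specific method discussed here.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk with task-specific heads (a generic sketch, not the model of [16])."""
    def __init__(self, feat_dim=128, num_classes=10, caption_vocab=1000):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU())
        self.cls_head = nn.Linear(feat_dim, num_classes)        # e.g., action recognition
        self.caption_head = nn.Linear(feat_dim, caption_vocab)  # e.g., next-word logits

    def forward(self, x, task):
        feat = self.trunk(x)
        return self.cls_head(feat) if task == "cls" else self.caption_head(feat)

model = MultiTaskNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

# Disjoint setup: each mini-batch carries labels for only one task, so we
# alternate updates using whichever label type is available.
batches = [(torch.randn(8, 512), torch.randint(0, 10, (8,)), "cls"),
           (torch.randn(8, 512), torch.randint(0, 1000, (8,)), "cap")]
for x, y, task in batches:
    loss = ce(model(x, task), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```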
A study took advantage of the fact that video captions usually try to describe human
actions [16]. To leverage this observation, an action recognition task dataset and a video captioning
dataset were exploited with knowledge distillation-based multi-task learning, which
prevents catastrophic forgetting. The same approach is applied to human and animal
3D reconstruction with a single model [47]. In addition, there is the federated learning problem, which leverages locally stored
heterogeneous private data that cannot be shared directly. FedPara uses a communication-efficient
parameterization to alleviate the communication bottleneck caused by transmitting
model parameters [30].
3.4 Multi-modal Learning
We can use different data modalities together in a training process. For example, a video
typically includes sound events that co-occur with visual events. This co-occurrence can
be used as a valuable signal in place of supervised labels. One study handled diverse
audio-visual relationships in video data by designing a multi-head model with
event-specific layers to enable audio-visual fusion [35].
Other works exploit the co-occurrence of audio-visual signals.
Arda et al. [17,18] localized the sound source in video frames by maximizing the shared
information between image and sound pairs drawn from the video, without any special labels.
They showed that useful knowledge can be learned by maximizing correlation from co-occurrence
in an unsupervised manner. Applications include using co-occurrence to
improve the performance and speed of action recognition in long videos [19] and visualizing whether a computer can imagine a person's face from their voice
[20] or a scene from ambient sound [36].
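The sketch below gives a hedged view of learning from audio-visual co-occurrence: a simple batch-wise contrastive objective pulls together embeddings of an image and its co-occurring audio clip while pushing apart mismatched pairs. The encoders, data shapes, and the exact loss are our own toy choices, standing in for the correspondence objectives of [17,18].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 64 * 16, 128))

def contrastive_loss(img_emb, aud_emb, temperature=0.07):
    """Matching (image, audio) pairs lie on the diagonal of the similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    aud_emb = F.normalize(aud_emb, dim=-1)
    logits = img_emb @ aud_emb.t() / temperature   # similarity of every pair in the batch
    targets = torch.arange(len(logits))
    return F.cross_entropy(logits, targets)

images = torch.randn(16, 3, 32, 32)    # video frames
audio = torch.randn(16, 1, 64, 16)     # co-occurring spectrogram clips
loss = contrastive_loss(image_encoder(images), audio_encoder(audio))
```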
In addition to video and sound, useful signals can be obtained by combining various
sensors. Using a vehicle's location and heading acquired along with dashcam video,
one method learns visuomotor perception in an unsupervised manner and predicts future
frames according to the vehicle's direction [21]. To complement lidar sensors, whose data points are sparse, another
study produced a dense 3D depth map using color features from camera images [22].
With the recent emergence of large language models, Ye-Bin et al. [48] leverage embeddings of text attributes obtained from a pre-trained large language
model as informative augmentation perturbations to enhance visual representation
learning. This is a notable result in that, although the language model never saw
any visual data during training, its understanding of visually descriptive text attributes
transfers effectively to visual representation learning.
3.5 Domain Adaptation and Normalization
When utilizing heterogeneous datasets, there are cases where only the style and characteristics
of the input data differ while the output labels remain the same. Multi-modal data
are one example; another is when datasets of the same modality come from different
domains. A domain difference refers to a difference in the characteristics of the
data content while the data format stays the same, e.g., real vs. simulated data,
or sketches vs. real images.
One study proposed a training method that uses two datasets from different domains given
sparse labels [23]. By defining a self-supervised objective function that accounts for domain differences,
the study provides a general pre-training recipe that leads to improved final performance.
A similar pre-training method is proposed by Park et al. [49].
The study on inverse knitting [10] used both real and simulated images. Real data contain many real-world factors
of variation, such as lighting, noise, shadow, tension, non-uniformity, and color,
while the rendered images produced by a simplified rendering pipeline have monotonous
characteristics. The idea was to transform real data so that it looks like synthetic
rendered data, i.e., to make the real data monotonous and thereby reduce the domain gap.
Normalizing to a monotonous image form makes it easy to match the data distributions of the two
domains because it induces many-to-one correspondences. This is the
opposite of the common trend in other methods, which try to make rendered images look
as real as possible; because learning such one-to-many mappings is difficult, their
image conversion quality and performance often deteriorate.
Another key contribution was a theoretical argument that this many-to-one property is
advantageous for generalization because it relates to minimizing an upper bound of the
generalization error [10].
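The sketch below captures this real-to-synthetic normalization idea at a high level: a translator maps real photographs toward the monotonous synthetic style before they are fed to a task network supervised on rendered data. It is a minimal sketch of the idea with toy networks and dimensions of our own, not the actual pipeline of [10], which involves additional losses and architectural details.

```python
import torch
import torch.nn as nn

# Translator T: real image -> synthetic-looking (monotonous) image.
translator = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(16, 3, 3, padding=1))
# Task network trained on rendered images with known program-map labels.
task_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, 17))   # hypothetical number of knit-program symbols

synthetic_img = torch.randn(4, 3, 64, 64)     # rendered from program maps (labels known)
real_img = torch.randn(4, 3, 64, 64)          # photographs of knitted fabric

logits_synth = task_net(synthetic_img)        # supervised branch on synthetic data
logits_real = task_net(translator(real_img))  # real data are normalized first, then recognized
```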
When complex data are transferred into a normalized space, the level of labeling required
for training is lower. A favorable normalized space differs across target tasks, so
expert knowledge about the problem is advantageous. One study [11] normalized facial images with various poses and expressions into a canonical
space, called UV space, through 3D model estimation, and showed that expressions unseen
during training can be generalized. Based on a similar hypothesis that real-world
variations in general images, e.g., viewpoint and color changes, degrade recognition
performance, another study showed strong generalization in a few-shot learning scenario
by normalizing and iconizing these reality elements [24]; the method was applied to
few-shot recognition of unseen logos, signs, icons, etc.
3.6 Self- and Semi-supervised Learning
Obtaining unlabeled data is much easier than obtaining labeled data. Self-supervised
learning aims to learn good feature representations for general tasks from large-scale
unlabeled data and has recently drawn significant attention. However, in most cases,
self-supervised learning alone cannot solve a specific task of interest, and it has
pitfalls that are easy to overlook, as follows.
An example appears in the unsupervised version of sound source localization [17,18]. Every time the model hears a car engine, it also sees video of a road.
Compared to the diversity of cars and their sizes in video frames, asphalt
roads have a much simpler and more uniform texture and shape, and they occupy a much
larger portion of the image. That is, the correlation with the engine
sound is much higher for the road region than for the car region. Consequently, when the model
is trained with unsupervised learning, it incorrectly learns that the engine sound
comes from the road. This problem occurs not only in machine learning but also in the learning
process of animals, including humans, where it is known as the pigeon superstition
phenomenon, often cited in animal learning theory; Arda et al. [17,18] show that it also applies to machines. The phenomenon illustrates the misjudgment
that arises when correlation is mistaken for causality.
According to the ``no free lunch'' theorem in machine learning, such
bias is impossible to avoid without prior knowledge.
Besides encoding prior knowledge in the model architecture, the most direct way to provide
it is to use at least a small amount of labeled data. Combining the aforementioned
unsupervised method with a small amount of labeled data forms a semi-supervised learning
setting. Arda et al. show that even a small amount of labeled data can resolve the
causality misjudgment and yields much better performance than a supervised counterpart
trained with more labeled data.
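A generic semi-supervised objective of this kind can be written as a weighted sum of a supervised loss on the few labeled samples and an unsupervised term on the unlabeled samples. The sketch below uses a simple consistency-regularization term purely for illustration; the model, weighting, and unsupervised term are assumptions of ours, not the exact loss of [17,18].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 5))

x_labeled, y_labeled = torch.randn(8, 64), torch.randint(0, 5, (8,))   # small labeled set
x_unlabeled = torch.randn(32, 64)                                      # large unlabeled set
lambda_u = 0.5                                                         # unsupervised weight

# Supervised term on the few labeled samples.
sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

# Unsupervised consistency term: predictions should be stable under small input noise.
logits_clean = model(x_unlabeled)
logits_noisy = model(x_unlabeled + 0.1 * torch.randn_like(x_unlabeled))
unsup_loss = F.mse_loss(logits_noisy, logits_clean.detach())

loss = sup_loss + lambda_u * unsup_loss
```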
In addition to semi-supervised learning that combines unsupervised and supervised objective
functions, transductive learning propagates information from a small amount of labeled
data to similar unlabeled data. In one study, a video captioning model was trained in
a label-efficient manner using three datasets: a small video caption dataset, a large
unlabeled image dataset, and a separate large unpaired caption-text dataset [25].
We can train a model in a weakly supervised manner when there is no label corresponding
directly to the target task but there is a sub-level (or simpler) label of a form that
can be derived from the target task's label. A weakly supervised
image retargeting method was proposed [26], whose goal is to learn image retargeting that minimizes distortion
of the main content of the image. Labels of the same form as the final system's output
were not used for training; instead, the model was trained to sequentially recommend
unnecessary spatial regions of an image, determined by objectness scores obtained
from a pre-trained visual recognition model.
A similar data-generation scheme was suggested for denoising [40]. Recently, label-efficient methods such as self- and weakly supervised learning have
received much attention and are developing rapidly. These alternatives can be easily
merged with supervised learning, so many practical possibilities are expected in the future.
3.7 Fusion of Classical Algorithms with Learning-based Ones
When a new problem is defined, there is usually no large-scale dataset for the task.
To side-step this challenge, some works combine classical algorithms with learning-based
ones. One study proposed a semantic soft segmentation method [27] in which, unlike other soft segmentation methods, each segment follows the semantic
boundary of an object. The study used an existing classical soft segmentation algorithm
that does not rely on a neural network but combined it with a pre-trained semantic
segmentation network to provide semantic information. This simple combination solved
the problem of lacking data.
During training, compatibility between the input and output data types can be maintained
through an objective function designed to be compatible with the classical algorithm,
allowing labels to be substituted with other forms that are easier to obtain. In this
way, data that are difficult to acquire can be replaced with data that are easy to
acquire. Such developments may also improve other segmentation-based practical applications,
e.g., [39].
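A hedged sketch of this fusion idea is given below. It substitutes classical k-means clustering for the spectral-matting step of [27] (and therefore yields hard rather than soft segments), using per-pixel features from a pre-trained semantic segmentation network as the semantic cue; the network choice and cluster count are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

# Pre-trained segmentation network supplies semantic per-pixel features without extra labels.
net = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT).eval()
image = torch.randn(1, 3, 128, 128)   # stand-in for a real photograph

with torch.no_grad():
    feat = net.backbone(image)["out"]  # deep features of shape (1, C, h, w)

c, h, w = feat.shape[1:]
pixels = feat.squeeze(0).permute(1, 2, 0).reshape(-1, c).numpy()

# Classical (non-neural) grouping over the deep features.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(pixels)
segments = labels.reshape(h, w)        # coarse semantic segments
```

Replacing the clustering step with a soft, matting-style formulation would move this sketch closer to the approach of [27].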
3.8 Meta-learning (Few-shot Learning)
Few-shot learning uses only a few labeled samples per class, mirroring during training
the few-shot scenario expected at test time. Constructing training episodes that resemble
the test phase is called episodic learning. Among the many few-shot learning methods,
metric learning-based methods have been widely used. These methods aim to train a
well-generalized feature space so that, at test time, the model can effectively determine
the nearest category through a simple nearest-neighbor search in the learned feature
space. Accordingly, the key to metric-based methods is how to design and learn the
feature space.
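A minimal episodic, metric-based sketch (prototypical-network style) is shown below to illustrate this nearest-neighbor recipe; the encoder, episode sizes, and data are toy choices of ours, not the quadruplet method of [28].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 64))

n_way, k_shot, n_query = 5, 1, 3
support = torch.randn(n_way, k_shot, 3, 28, 28)          # few labeled examples per class
query = torch.randn(n_way * n_query, 3, 28, 28)
query_labels = torch.arange(n_way).repeat_interleave(n_query)

# Class prototypes are the mean support embedding of each class.
prototypes = encoder(support.view(-1, 3, 28, 28)).view(n_way, k_shot, -1).mean(dim=1)

# Classification = nearest prototype in the learned feature space.
dists = torch.cdist(encoder(query), prototypes)          # (n_way*n_query, n_way)
loss = F.cross_entropy(-dists, query_labels)             # trains the embedding episodically
```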
In one study, a metric learning technique was derived from relationships among
quadruplets of samples, which effectively induce cluster learning [28]. Another study induced an understanding of content through a normalization
process [24]. In addition, model-agnostic meta-learning (MAML)-based methods
are also actively studied; they enable fast, label-efficient adaptation to a given
task via gradient-based updates, as sketched below. There are also works that handle
the segmentation task with few-shot learning [34,37]. As mentioned above, a segmentation mask label is more expensive than a class
or bounding-box label, so segmenting a target object from only a few examples would
be a promising direction for efficient annotation tools or for the segmentation task itself.
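For completeness, a compact MAML-style inner/outer loop is sketched below. It is a generic second-order illustration with toy tasks and dimensions, not a specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)                     # toy meta-learned model
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.1

def adapted_forward(x, params):
    """Forward pass with an explicit parameter dictionary (for adapted weights)."""
    return F.linear(x, params["weight"], params["bias"])

for task in range(4):                        # each task has its own few labeled samples
    x_s, y_s = torch.randn(5, 10), torch.randint(0, 2, (5,))   # support set
    x_q, y_q = torch.randn(5, 10), torch.randint(0, 2, (5,))   # query set

    # Inner loop: one gradient step adapts the shared initialization to the task.
    params = dict(model.named_parameters())
    inner_loss = F.cross_entropy(adapted_forward(x_s, params), y_s)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    adapted = {k: v - inner_lr * g for (k, v), g in zip(params.items(), grads)}

    # Outer loop: the query loss of the adapted model updates the initialization.
    outer_loss = F.cross_entropy(adapted_forward(x_q, adapted), y_q)
    meta_opt.zero_grad()
    outer_loss.backward()
    meta_opt.step()
```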