Abstract
High production costs have been a key factor in delaying the widespread deployment of UHD broadcast offerings. Only a few special events tend to be produced and broadcast in UHD, with most true 4K content coming from streaming providers such as Netflix, Amazon Video or Disney+, and even in those cases content availability remains significantly more limited than for their non-4K assets. As a result, the potential of UHD displays is not fully exploited, with the final picture representation relying on the viewing device’s upscaling capabilities, which are usually highly constrained by computational and power consumption limitations.
High-quality and reliable up-conversion can present a viable solution to accelerate UHD availability, allowing content providers to significantly reduce costs by complementing their offerings with high-quality content upscaled from existing HD libraries and by leveraging their current production pipelines all the way up to the final up-conversion stage, while retaining control over how content is rendered on UHD screens. Widely investigated deep learning-based methods are perfect candidates for such applications, greatly outperforming traditional techniques and being particularly well suited to cloud deployments, where GPU acceleration can help provide high-throughput inference.
This paper provides a comprehensive overview of state-of-the-art deep learning-based super-resolution methods and their respective advantages and drawbacks, focusing on how they can be tailored for practical deployments in the cloud to mitigate their typical limitations.
Introduction
Super-resolution (SR) [1] refers to the process of generating high-resolution images or videos from low-resolution inputs. Such techniques have been an important topic of research for several decades, with early SR methods relying on spatial interpolation techniques [2,3]. While those methods were simple and effective, the quality of the upscaled images was constrained by their inability to generate high-frequency details. Some progress was made over the years with the introduction of more complex approaches, including statistical, prediction-based, patch-based or edge-based methods [4-16]. The most significant advances were, however, delivered by emerging deep learning techniques [17,18] and particularly convolutional neural networks (CNNs). Although CNNs have been around since the 1980s, it wasn’t until the mid-1990s that they started to gain widespread attention in the research community [20], mainly due to the lack of hardware suited to training and running sizeable networks. CNNs have since undergone numerous improvements and have become one of the most powerful and widely used deep learning techniques for image analysis and processing tasks. In recent years, CNNs have achieved state-of-the-art performance in tasks ranging from image classification [21,22] and object detection [23] to semantic segmentation [24], among many others [25].
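To make the baseline concrete, the sketch below shows what a traditional, interpolation-based upscale looks like in practice; it uses bicubic resampling via PyTorch, with the scale factor and frame size chosen purely for illustration. Such an upscale enlarges the image smoothly but cannot synthesise the high-frequency detail that learned SR methods aim to recover.

```python
# Minimal sketch of a traditional, interpolation-based upscale (the baseline that
# learned SR methods aim to beat). Frame size and scale factor are illustrative.
import torch
import torch.nn.functional as F

def bicubic_upscale(lr: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Upscale a batch of images (N, C, H, W) by bicubic interpolation.

    No high-frequency detail is generated; the result is a smooth
    enlargement of the low-resolution input.
    """
    return F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)

if __name__ == "__main__":
    lr_frame = torch.rand(1, 3, 540, 960)          # dummy 960x540 RGB frame
    hd_frame = bicubic_upscale(lr_frame, scale=2)  # -> (1, 3, 1080, 1920)
    print(hd_frame.shape)
```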
The first CNN-based super-resolution method is generally attributed to Dong et al., who proposed SRCNN (Super-Resolution Convolutional Neural Network) in their 2015 paper “Image super-resolution using deep convolutional networks” [26]. The authors developed a three-layer CNN architecture able to learn the mapping from low-resolution to high-resolution images using a large training dataset. Numerous CNN-based super-resolution methods followed, each improving on areas such as the data mapping, network architecture and size, optimization function or computational efficiency, with many of those methods achieving state-of-the-art performance on various benchmark datasets over the years [27-31].
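The following is a compact PyTorch sketch of a three-layer SRCNN-style network in the spirit of [26]. It assumes the input frame has already been brought to the target resolution by bicubic interpolation, with the network learning to restore detail; the kernel sizes and channel counts (9-1-5, 64/32 filters) follow the commonly cited configuration and are illustrative rather than taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    """Three-layer SRCNN-style network operating on a pre-upscaled input."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.patch_extraction = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
        self.nonlinear_mapping = nn.Conv2d(64, 32, kernel_size=1)
        self.reconstruction = nn.Conv2d(32, channels, kernel_size=5, padding=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.patch_extraction(x))   # extract overlapping patches as features
        x = F.relu(self.nonlinear_mapping(x))  # map LR features to HR features
        return self.reconstruction(x)          # aggregate features into the output image

# Training reduces to minimising a pixel-wise loss (e.g. MSE) between the network
# output and the ground-truth high-resolution frame over a large paired dataset.
model = SRCNN(channels=1)
restored_y = model(torch.rand(1, 1, 1080, 1920))
```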
Another crucial development came with the inception of Residual Networks [32]. In a traditional deep neural network, as the number of layers increases, gradients become weaker and weaker during training as they are propagated back through the network. Some of these gradients may vanish or explode, causing instability or preventing the learning process from converging, which made it increasingly challenging to train very deep networks. The ResNet architecture tackles this issue by introducing the concept of residual connections, where the output of a layer can bypass others and be added directly to the input of a subsequent layer. This allows the network to learn residual mappings rather than full mappings, making it possible to train significantly deeper networks that can often reach hundreds of layers. As a result, the ResNet architecture became highly popular for many computer vision tasks, including super-resolution.
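A minimal sketch of such a residual (skip-connection) block is shown below: the block’s input bypasses its convolutional layers and is added back to their output, so the layers only need to learn a residual correction and gradients can flow directly through the addition. The channel count, kernel size and block depth are illustrative, not a specific published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified residual block: two convolutions plus a skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.conv2(F.relu(self.conv1(x)))
        return x + residual  # skip connection: identity path eases gradient flow

# Deep SR networks are typically built by stacking many such blocks.
body = nn.Sequential(*[ResidualBlock(64) for _ in range(16)])
features = body(torch.rand(1, 64, 270, 480))
```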
Building on those innovations and on the increase in hardware capabilities to train and run larger and more complex networks, the super-resolution field has evolved very quickly over the past few years. Advances in generative models such as auto-encoders and Generative Adversarial Networks (GANs) opened new possibilities, providing high-quality upscales that match the underlying distribution of high-resolution images even in cases where the input data is noisy or incomplete. Newer trends such as transformer and diffusion models are pushing the boundaries of what can be achieved even further.
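To illustrate the GAN idea in the SR setting, the schematic training step below pairs a generator that upscales the frame with a discriminator that learns to tell real high-resolution frames from generated ones, pushing the generator towards the distribution of real HR content. The `generator` and `discriminator` arguments stand in for any suitable networks, and the loss mix and weighting are illustrative assumptions rather than a specific published recipe.

```python
import torch
import torch.nn as nn

def gan_sr_step(generator, discriminator, g_opt, d_opt, lr_batch, hr_batch,
                adv_weight: float = 1e-3):
    """One schematic GAN training step for super-resolution (discriminator outputs logits)."""
    bce = nn.BCEWithLogitsLoss()
    l1 = nn.L1Loss()

    # Discriminator update: distinguish real HR frames from generated upscales.
    fake_hr = generator(lr_batch).detach()
    real_logits = discriminator(hr_batch)
    fake_logits = discriminator(fake_hr)
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: pixel fidelity plus an adversarial term that rewards
    # outputs the discriminator judges as real.
    sr = generator(lr_batch)
    sr_logits = discriminator(sr)
    g_loss = l1(sr, hr_batch) + adv_weight * bce(sr_logits, torch.ones_like(sr_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```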
However, each network architecture comes with its own advantages and drawbacks, so it is of great importance to tailor each solution to its target application, especially since the balance between computational complexity and performance is often the most important constraint in a practical system’s design.