ABSTRACT

In recent years there has been a significant appetite to move away from channel-based, one-size-fits-all audio broadcasting towards more adaptive, scene-based audio capture and rendering, commonly referred to as Next Generation Audio (NGA).

NGA aims to retain as much of the audio scene's information as possible throughout the signal chain, so that the end-user has all of the audio components at their disposal, with the final rendering performed on their AV Receiver (AVR) to match their specific requirements.

In recent years sports broadcasts have begun to benefit from this paradigm, predominantly through more immersive rendering matched to the audio reproduction system at the user end.

In this paper we propose an enhanced system in which objects are more granular in time and space, with metadata descriptors that allow interaction with, and personalisation of, the content at the user end. To achieve this, new capture and source-extraction schemes are needed; these are presented in this paper along with a proposal for metadata extraction.

INTRODUCTION

The emergence of so-called Next Generation Audio (NGA), and object-based audio (OBA), requires a new set of production tools to enable audio sources in each scene to be captured as separate ‘objects’ and tagged with metadata. For the cinema, where sources are generally captured separately or created in post-production, the creation of audio objects is achieved predominantly with non-live workflows and is relatively straightforward.

Applying NGA principles to live contexts, however, introduces several logistical difficulties: in a live scene it is usually not possible to have a discrete microphone for each source (sources are often transient and unpredictable in nature), which makes source separation and the generation of metadata describing the scene difficult.

Currently, for a standard sports broadcast, the sound engineer will typically mix the signals from the available microphones around the field-of-play dynamically to produce a channel-based mix intended for a target reproduction setup. This approach does not allow the separation of sound sources and provides little or no metadata/information about the captured content.

New capture and production methodologies therefore need to be employed to allow the sound scene to be split into its individual components, each tagged with the necessary metadata (duration, location, event type etc.) for manipulation later on. In a complex scene such as a football match this is a non-trivial task.
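As a concrete illustration of what such tagging might look like, each extracted component could be represented as a record along these lines. This is a minimal sketch only; the field names are hypothetical rather than the paper's own schema, and NGA standards such as the Audio Definition Model define their own metadata models.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AudioObject:
    """Hypothetical metadata record for one extracted audio object."""
    object_id: str                    # unique identifier within the scene
    event_type: str                   # e.g. "ball_kick", "whistle", "crowd"
    start_time_s: float               # onset time on the programme timeline
    duration_s: float                 # length of the extracted segment
    position_m: Tuple[float, float]   # estimated (x, y) location on the field of play
    gain_db: float = 0.0              # default level; renderer or user may override

# Example: a detected referee whistle near the centre circle
whistle = AudioObject("obj-0042", "whistle", 1834.2, 0.6, (0.0, 5.0))
```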

In this paper, we present a solution for live sports production for NGA broadcast. Our approach utilises novel audio pattern-matching algorithms to automatically detect and extract audio objects from the scene. Using only the microphones present in a standard broadcast environment, the SALSA system automatically picks out only content that matches the audio signatures of events considered salient in the given context (ball kicks, racquet hits, referee whistle blows etc.).
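To give a rough idea of how signature-based event detection can work in principle, the sketch below scans a microphone feed for segments that correlate strongly with a stored event template. This is a generic sliding-window normalised cross-correlation illustration with assumed function and parameter names, not the SALSA algorithm itself; a production system would likely use more robust features and detectors.

```python
import numpy as np

def detect_events(mic_signal, template, fs, threshold=0.6):
    """Return (time_s, score) pairs where mic_signal resembles the template.

    Generic sliding-window normalised cross-correlation; illustrative only.
    """
    n = len(template)
    t_norm = (template - template.mean()) / (template.std() + 1e-12)
    hop = n // 2                                    # 50% overlap between windows
    hits = []
    for start in range(0, len(mic_signal) - n, hop):
        frame = mic_signal[start:start + n]
        f_norm = (frame - frame.mean()) / (frame.std() + 1e-12)
        score = float(np.dot(f_norm, t_norm)) / n   # correlation score in [-1, 1]
        if score > threshold:
            hits.append((start / fs, score))        # detection time in seconds, confidence
    return hits
```

Detections above the threshold would then be passed downstream as candidate audio objects, carrying metadata along the lines of the record sketched above.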

Download the full technical paper below