IBC2023: This Technical Paper explores how machine learning can be used to extract the dominant emotion of a video segment and generate appropriate background music for it.

Abstract

Background music scores are an integral part of the movies and shows that we enjoy. They amplify the impact and aesthetic appeal of dialogue and scenes. With exponential growth in content production, what would be an agile, scalable and cost-effective mechanism for attaching the best-fit background score to a scene? Relying on a musician’s creativity for every scene is costly and slow.

Social media platforms pre-acquire music rights and provide an asset library for content creators to choose from. What if the same approach could be extended to TV show production? For an intense visual scene, what would be the best background music: Beethoven’s 5th Symphony or an Indian classical raga clip? Can we match music to a scene by determining its emotion with ML?

In this solution, we analyze the dominant emotion of a video scene through the artists’ facial expressions, lighting conditions and dialogue. The dialogue is analyzed for transcript text, for tonal features such as pitch, loudness and pauses, and for mid-level features such as spectrogram, MFCC and chroma. The emotions are classified using the circumplex model of emotion, in a 2D space of valence and arousal. Similarly, pre-acquired music tracks are analyzed for high-, mid- and low-level features to infer the emotion each track most likely depicts. Classical music (for example, Indian ragas) has well-documented literature outlining the principal mood each piece evokes. Mood labels taken from such literature are deterministic and are expected to be more accurate. Once a music track is chosen based on commonality of emotions, it may be possible to programmatically alter its tempo and pitch so that it blends aesthetically into the video scene. Even if a musician has to play the same tune on a different instrument (different timbre) or at a different pitch and tempo, this still significantly reduces production effort, time and cost.
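To make the matching step concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that each modality (face, lighting, dialogue) yields a valence-arousal estimate in the range -1 to +1, that these estimates are fused by a weighted average, and that the best-fit track is the one closest to the fused scene emotion by Euclidean distance. The abstract does not specify the fusion rule or the distance metric, and every name, weight and coordinate below is hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Emotion:
    """A point in the valence-arousal plane (both axes assumed in [-1, 1])."""
    valence: float  # -1 = unpleasant, +1 = pleasant
    arousal: float  # -1 = calm, +1 = excited


def fuse(estimates, weights):
    """Weighted average of per-modality estimates (an assumed fusion rule)."""
    total = sum(weights[name] for name in estimates)
    valence = sum(e.valence * weights[name] for name, e in estimates.items()) / total
    arousal = sum(e.arousal * weights[name] for name, e in estimates.items()) / total
    return Emotion(valence, arousal)


def best_track(scene, library):
    """Return the title of the track whose emotion is nearest to the scene's."""
    def distance(title):
        track = library[title]
        return hypot(scene.valence - track.valence, scene.arousal - track.arousal)
    return min(library, key=distance)


# Hypothetical per-modality estimates for one intense scene.
estimates = {
    "face": Emotion(-0.6, 0.8),
    "lighting": Emotion(-0.4, 0.4),
    "dialogue": Emotion(-0.8, 0.8),
}
# Hypothetical modality weights.
weights = {"face": 0.5, "lighting": 0.2, "dialogue": 0.3}

# Hypothetical pre-labelled music library.
library = {
    "tense_strings": Emotion(-0.7, 0.8),
    "calm_piano": Emotion(0.5, -0.6),
    "upbeat_pop": Emotion(0.8, 0.7),
}

scene = fuse(estimates, weights)
choice = best_track(scene, library)
print(choice)  # tense_strings
```

With these illustrative numbers the fused scene emotion lies at roughly (-0.62, 0.72), so the negative-valence, high-arousal track is selected. In a real pipeline the per-modality estimates would come from trained classifiers, and the track labels from feature analysis or documented mood literature.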

Introduction

Background music in video content is an essential component for boosting user experience and engagement. With the prolific growth in the rate of content production, the cost of producing the associated background music also increases, and time and effort have to be budgeted for it separately. In this paper, we explore how machine learning can be used to extract the dominant emotion of a video segment and generate appropriate background music for it.
