A Star Is Trained
Did AI get its first hit single?
In the last 48 hours, a new song featuring Drake and The Weeknd has been going viral on every platform: on YouTube, on Spotify, on TikTok.
What makes this interesting is that it was not released by an artist or a label. The user who posted it everywhere goes by the alias 'ghostwriter'. The account released a short video on TikTok claiming they made the song by first writing and performing it themselves, and then using AI to convert their voice into Drake and The Weeknd!
The audio is so good that some are even doubting this claim, wondering whether Drake and The Weeknd actually did drop this single and the ghostwriter gimmick is a marketing stunt. The fact that we can't tell for sure which of the two is true means the music industry is soon going to grapple with the same disruption the visual art industry is undergoing right now.
One piece of the puzzle is figuring out what technology could generate audio of this quality. Internet sleuths have narrowed it down to Singing Voice Conversion (SVC), a voice conversion project built on GANs. The story is barely a day old and still developing, so I am sure we will be hearing more about it. Today I want to take a deeper look and understand the intuition behind SVC's underlying model.
Intuition 1: Multi-Scale Convolution Speak
Audio synthesis has a long history, most of which we are going to skip. WaveNet (2016) was a monumental breakthrough in audio synthesis. WaveNet introduced many interesting innovations such as dilated convolutions, sequential (autoregressive) inference and conditional generation. In subsequent years, most of these were found not to be essential and were replaced by more efficient methods. However, one idea that has stood the test of time is that, unlike text generation, which was dominated by RNNs and later Transformers, audio generation is really enabled by CNNs, especially CNNs operating at multiple time scales.
The reason for this CNN preference seems obvious in hindsight. Audio is sampled at a much higher rate than text or images. If we assume each spoken word (i.e. 1 text datum) takes 1 second to speak, it corresponds to somewhere between 16k and 44k audio samples depending on the sampling rate, i.e. a data ratio of at least 1:16,000. A 128×128 RGB image contains only about 49k floats, which corresponds to just 1-3 seconds of speech. Thus audio is in a way harder than both text and images combined. Audio such as music contains both shorter-range and longer-range temporal patterns than text. At the same time, for the same amount of overall conceptual information, audio is extremely oversampled with granular detail, even more so than images.
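To make the numbers concrete, here is a quick back-of-the-envelope check in Python, assuming the same rough figures as above (1 word ≈ 1 second, common sampling rates of 16 kHz and 44.1 kHz):

```python
# Back-of-the-envelope data rates: text vs. audio vs. a small image.
SAMPLE_RATES = [16_000, 44_100]   # common audio sampling rates (Hz), assumed values
SECONDS_PER_WORD = 1.0            # rough assumption: one spoken word ~ one second

for sr in SAMPLE_RATES:
    print(f"{sr} Hz: ~{int(sr * SECONDS_PER_WORD):,} audio samples per spoken word")

image_floats = 128 * 128 * 3      # a 128x128 RGB image, ~49k floats
for sr in SAMPLE_RATES:
    print(f"a 128x128x3 image ({image_floats:,} floats) "
          f"~ {image_floats / sr:.1f} s of audio at {sr} Hz")
```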
Hence multiscale convolutional layers are suited to this task. A CNN's parameter count does not grow with the length of its input, so CNNs can run efficiently on data with high sampling rates. Multiple scales are necessary to capture the structure of audio both at the extremely short range of milliseconds and at the extremely long range of seconds, minutes or hours. WaveNet achieved the multi-scale effect with stacks of dilated convolutional layers whose dilation factors double from layer to layer (1, 2, 4, 8, 16, ...).
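To make this concrete, here is a minimal PyTorch sketch (my own illustration, not WaveNet's actual architecture or code) of a stack of dilated 1-D convolutions. The dilation doubles at each layer, so the receptive field grows exponentially while the per-layer parameter count stays constant:

```python
import torch
import torch.nn as nn

# Minimal sketch of a WaveNet-style dilated stack (illustrative only).
class DilatedStack(nn.Module):
    def __init__(self, channels=32, kernel_size=2, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=d, padding=d * (kernel_size - 1))
            for d in dilations
        ])

    def forward(self, x):                 # x: (batch, channels, time)
        for conv in self.layers:
            # Crop back to the input length so the convolution stays causal.
            x = torch.relu(conv(x))[..., :x.shape[-1]]
        return x

out = DilatedStack()(torch.randn(1, 32, 16000))   # 1 second of 16 kHz "audio"
print(out.shape)

# Receptive field of the stack: 1 + sum(d * (k - 1)) samples.
dilations, k = (1, 2, 4, 8, 16), 2
print("receptive field:", 1 + sum(d * (k - 1) for d in dilations))  # -> 32 samples
```

Doubling the dilation five times already covers 32 samples with just five tiny kernels; the real WaveNet repeats such stacks to reach receptive fields of thousands of samples.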
Intuition 2: Quadruple Down on Multiscale
The HiFiGAN paper was published at NeurIPS in 2020. GANs for audio generation were not new, yet HiFiGAN beat both autoregressive models like WaveNet and existing GAN models, producing audio with much higher fidelity. How did they do it? While the details of the implementation can be complicated, there is one extremely simple through-line intuition that can be summarized as "throw in as many multiscale CNNs as possible".
They add the following modules -
Multi-Period Discriminator (MPD)
Each block of the MPD has a fixed period p and splits the audio into chunks of length p. The discriminator then sees the audio stream as a sequence of chunks and uses the features extracted this way to judge the quality of the generated audio. The MPD has many such blocks for different periods. Crucially, the authors chose the periods to be primes (2, 3, 5, 7, 11) instead of (2, 4, 8, 16) to minimize overlap between the blocks.
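Here is a rough PyTorch sketch of how I understand the core trick: the waveform is folded into a 2-D array of width p (the period) and processed with 2-D convolutions, so each sub-discriminator only ever compares samples that are exactly p steps apart. The layer sizes below are placeholders; the real HiFiGAN sub-discriminators are much deeper stacks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified stand-in for one MPD sub-discriminator.
class PeriodDiscriminator(nn.Module):
    def __init__(self, period, channels=32):
        super().__init__()
        self.period = period
        self.convs = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels, 1, kernel_size=(3, 1), padding=(1, 0)),
        )

    def forward(self, audio):                     # audio: (batch, 1, time)
        b, c, t = audio.shape
        if t % self.period:                       # pad so time divides the period
            pad = self.period - (t % self.period)
            audio = F.pad(audio, (0, pad), mode="reflect")
            t = t + pad
        # Fold the waveform into chunks of length `period`: (batch, 1, t/period, period).
        x = audio.view(b, c, t // self.period, self.period)
        return self.convs(x)                      # per-chunk real/fake scores

# One sub-discriminator per prime period, as in the paper.
discriminators = [PeriodDiscriminator(p) for p in (2, 3, 5, 7, 11)]
scores = [d(torch.randn(1, 1, 16000)) for d in discriminators]
print([s.shape for s in scores])
```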
Multi-Scale Discriminator (MSD)
The MSD has three discriminators at three scales: raw audio, audio average-pooled by a factor of 2, and audio average-pooled by a factor of 4. (Pooling here fuses adjacent data points, i.e. creates a shorter audio stream where each point summarizes a small neighbourhood of the original.) The most heavily pooled discriminator therefore looks at roughly 4x the span of time that the raw-audio discriminator does when judging the quality of generated audio.
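A minimal sketch of the MSD's input side, assuming average pooling with stride 2 as in the paper; the sub-discriminators themselves (deep stacks of 1-D convolutions in the real model) are omitted here:

```python
import torch
import torch.nn as nn

# The same waveform at three resolutions, one per MSD sub-discriminator.
raw = torch.randn(1, 1, 16000)              # 1 second of 16 kHz audio
pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=2)

x2 = pool(raw)                              # ~2x downsampled
x4 = pool(x2)                               # ~4x downsampled

# A fixed-size receptive field on x4 spans roughly 4x as much wall-clock
# time as the same receptive field on the raw waveform.
for name, x in [("raw", raw), ("x2", x2), ("x4", x4)]:
    print(name, x.shape)
```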
Multi-Receptive Field Fusion (MRF)
Instead of plain ResNet blocks, the generator is built from MRF blocks. An MRF block is simply a parallel set of residual blocks, each with a different kernel size and dilation, whose outputs are fused together. When audio is generated, the different branches act at different scales: some create long-range patterns and some focus on short-range patterns.
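Here is a stripped-down sketch of an MRF-style block: parallel residual branches with different kernel sizes and dilations process the same input, and their outputs are averaged. The kernel sizes and dilations roughly follow the paper's defaults, but the block is heavily simplified (the real generator interleaves these with transposed convolutions for upsampling):

```python
import torch
import torch.nn as nn

# Stripped-down sketch of a Multi-Receptive Field Fusion (MRF) block.
class ResBranch(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2      # keep the length unchanged
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=pad)

    def forward(self, x):
        return x + self.conv(torch.relu(x))          # residual connection

class MRF(nn.Module):
    """Parallel residual branches with different receptive fields, fused by averaging."""
    def __init__(self, channels=64, kernel_sizes=(3, 7, 11), dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            ResBranch(channels, k, d) for k in kernel_sizes for d in dilations
        )

    def forward(self, x):                            # x: (batch, channels, time)
        return sum(branch(x) for branch in self.branches) / len(self.branches)

out = MRF()(torch.randn(1, 64, 8000))
print(out.shape)                                     # torch.Size([1, 64, 8000])
```

The small-kernel, low-dilation branches shape local waveform detail, while the large-kernel, high-dilation branches see hundreds of samples at once and shape longer-range structure.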
(Figure: the MSD and MPD discriminators.)
(Figure: layout of the MRF blocks.)
None of these are really breakthrough ideas on their own. The key insight, which the authors discuss in the "experiments" section of the paper, is that stacking more and more multiscale networks, in both the generator and the discriminator, kept increasing performance. So they just continued stacking more of them.
What does this mean? A conjecture.
To me this indicates that we have not yet found an efficient way to map audio to concepts. The HiFiGAN route of stacking more and more constraints was also common in image generation before text conditioning and diffusion models blew the lid off the field and made those models imaginative and steerable (see my posts on imagination and diffusion). It is possible that describing audio is not as easy as describing images, so text conditioning might not be a good way to define audio concepts that can guide these models. Without reliable guidance, audio generation cannot cross the uncanny valley.
One reliable way to constrain the model is to mimic another speaker. This is what HiFiGAN does very well: by stacking constraints in the discriminator, it forces the model to produce audio nearly indistinguishable from the speech it was trained to mimic. This approach obviously sacrifices all imagination for fidelity. So if 'Heart on My Sleeve' by 'ghostwriter' is what it claims to be, then audio generation, while not at sci-fi levels yet, has at least caught up to Hollywood spy-movie tech.
————————————————————————————————————
There you have it, my intuitions on the technology behind the latest viral sensation in audio generation. For more such intuitions on AI/ML, subscribe and follow me on Twitter. You can also check out my other projects on nirsd.com.