

Automatic music generation dates back more than half a century. A prominent approach is to generate music symbolically in the form of a piano roll, which specifies the timing, pitch, velocity, and instrument of each note to be played. This has led to impressive results like producing Bach chorales, polyphonic music with multiple instruments, as well as minute-long musical pieces.

But symbolic generators have limitations: they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music. A different approach is to model music directly as raw audio. Generating music at the audio level is challenging since the sequences are very long. A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high-level semantics of music, a model would have to deal with extremely long-range dependencies.

One way of addressing the long-input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.
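To make these lengths concrete, here is a back-of-the-envelope sketch in Python. The 8x/32x/128x factors are the VQ-VAE compression ratios described below; the song length is illustrative.

```python
# Rough sequence-length arithmetic for a 4-minute song at CD quality.
SAMPLE_RATE = 44_100            # samples per second
SONG_SECONDS = 4 * 60

raw_timesteps = SAMPLE_RATE * SONG_SECONDS
print(f"raw audio timesteps: {raw_timesteps:,}")    # 10,584,000 -> over 10 million

# Compressing the audio shrinks the sequence a generative model must handle.
# The factors below match the three VQ-VAE levels described later in this post.
for factor in (8, 32, 128):
    print(f"{factor:>3}x compression -> {raw_timesteps // factor:,} discrete codes")
```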

We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now, in raw audio, our models must learn to tackle high diversity as well as very long-range structure, and the raw audio domain is particularly unforgiving of errors in short, medium, or long-term timing.

Approach

Compressing Music to Discrete Codes

Jukebox's autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments; however, they suffer from hierarchy collapse due to the use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-2 avoids these issues by using feedforward encoders and decoders only, and it shows impressive results at generating high-fidelity images.

We draw inspiration from VQ-VAE-2 and apply its approach to music, modifying the architecture in three ways. To alleviate codebook collapse common to VQ-VAE models, we use random restarts, where we randomly reset a codebook vector to one of the encoded hidden states whenever its usage falls below a threshold. To maximize the use of the upper levels, we use separate decoders and independently reconstruct the input from the codes of each level. To allow the model to reconstruct higher frequencies easily, we add a spectral loss that penalizes the norm of the difference of input and reconstructed spectrograms.
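As a rough illustration of these modifications, the sketch below combines nearest-neighbour quantization, a random-restart step that re-seeds rarely used codebook vectors from encoder hidden states, and a spectral loss on STFT magnitudes. This is a minimal sketch rather than Jukebox's actual implementation: the usage tracking, restart threshold, FFT parameters, and the standard VQ-VAE straight-through and commitment terms shown here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def quantize(h, codebook, usage, restart_threshold=1.0):
    """Nearest-neighbour quantization with random restarts for unused codes.

    h:        (N, D) encoder hidden states
    codebook: (K, D) codebook vectors
    usage:    (K,)   decayed running count of how often each code is chosen
    """
    # Assign each hidden state to its closest codebook vector.
    codes = torch.cdist(h, codebook).argmin(dim=1)        # (N,)
    z_q = codebook[codes]                                 # (N, D)

    # Update usage counts; reset any codebook vector whose usage fell below
    # the threshold to a randomly chosen encoder hidden state.
    usage.mul_(0.99).index_add_(0, codes, torch.ones_like(codes, dtype=usage.dtype))
    dead = usage < restart_threshold
    if dead.any():
        codebook.data[dead] = h[torch.randint(0, h.shape[0], (int(dead.sum()),))]
        usage[dead] = 1.0

    # Standard VQ-VAE plumbing: straight-through estimator and commitment loss.
    z_q = h + (z_q - h).detach()
    commit_loss = F.mse_loss(h, z_q.detach())
    return z_q, codes, commit_loss

def spectral_loss(x, x_hat, n_fft=1024, hop=256):
    """Penalize the norm of the difference of input and reconstructed spectrograms."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = lambda a: torch.stft(a, n_fft, hop, window=window, return_complex=True).abs()
    return torch.norm(spec(x) - spec(x_hat))
```

In the full model the encoders, the per-level decoders, and the loss weighting are all trained jointly; the functions above only isolate the quantization and spectral-loss ideas.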

We use three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level. This downsampling loses much of the audio detail, and sounds noticeably noisy as we go further down the levels. However, it retains essential information about the pitch, timbre, and volume of the audio.

[Figure: raw audio at 44.1k samples per second is compressed into discrete codes at each of the three levels.]

Next, we train the prior models, whose goal is to learn the distribution of music codes encoded by the VQ-VAE and to generate music in this compressed discrete space.
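To show what learning the distribution of these codes can look like, here is a minimal autoregressive prior over one level of code tokens, trained with a next-token objective. The tiny Transformer and all hyperparameters below are placeholders, not the architecture Jukebox actually uses; only the codebook size of 2048 comes from the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 2048  # per-level codebook size mentioned above

class CodePrior(nn.Module):
    """Toy autoregressive prior over one level of VQ-VAE code tokens."""
    def __init__(self, vocab=CODEBOOK_SIZE, dim=512, layers=4, heads=8, ctx=8192):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(ctx, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, codes):                        # codes: (B, T) integer tokens
        T = codes.shape[1]
        x = self.embed(codes) + self.pos(torch.arange(T, device=codes.device))
        # Causal mask: each position may only attend to earlier codes.
        causal = torch.full((T, T), float("-inf"), device=codes.device).triu(1)
        x = self.blocks(x, mask=causal)
        return self.head(x)                          # (B, T, vocab) next-code logits

# One training step: predict code t+1 from codes up to t with cross-entropy.
model = CodePrior()
codes = torch.randint(0, CODEBOOK_SIZE, (2, 1024))   # stand-in for real VQ-VAE codes
logits = model(codes[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, CODEBOOK_SIZE), codes[:, 1:].reshape(-1))
loss.backward()
```

Sampling from such a prior and decoding the resulting codes with the VQ-VAE decoder is what turns the compressed discrete space back into audio.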
