MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

Main paper accepted by ICLR 2022

https://openreview.net/pdf?id=UseMOjWENv

Code | Blog | Colab Notebook | Huggingface Space | Command-line MIDI Synthesis

Contents

Overview

Fine Grained Control (Figure 2)

Full End-to-End Generation (Figure 6)

Dataset

Performance Synthesis & Adjustment

MIDI Synthesis

Bach Quartet Synthesis

Note Expression Control (Figure 4, 8, 9, and more)

Listening Test (Figure 5, 7)

An Example User Experience

Improvement of GAN (Figure 12)

Pitch Variants in Synthesis

Training Loss Ablation

Note Expression Ablation

Overview

In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. We infer explicit, interpretable features at the note, performance, and synthesis levels, and we build models at each level that generate audio as a hierarchy. Here we present audio examples from MIDI-DDSP. For details about the system and architecture, please refer to the paper linked above.
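
As a conceptual illustration of the hierarchy, the sketch below traces how each level's interpretable features feed the next. Every function here is a hypothetical placeholder standing in for a trained network, not the released API.

    import numpy as np

    # Hypothetical placeholders for the three levels of the hierarchy.

    def expression_generator(notes):
        # Note level -> performance level: one vector of expression controls
        # (volume, vibrato, brightness, attack noise, ...) per note.
        return np.zeros((len(notes), 6))

    def synthesis_generator(notes, note_expression):
        # Performance level -> synthesis level: frame-wise DDSP parameters
        # (f0, amplitudes, harmonic distribution, noise magnitudes).
        n_frames = 100 * len(notes)
        return {'f0': np.zeros(n_frames), 'amplitudes': np.zeros(n_frames)}

    def ddsp_synthesize(synth_params):
        # Synthesis level -> audio, via a DDSP harmonic-plus-noise synthesizer.
        return np.zeros(len(synth_params['f0']) * 64)

    notes = [(60, 0.5), (62, 0.5), (64, 1.0)]  # (MIDI pitch, duration in seconds)
    audio = ddsp_synthesize(synthesis_generator(notes, expression_generator(notes)))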

Model Architecture

Fine Grained Control

The given violin performance (top) is manipulated at all three levels (notes, expression, synthesis parameters) to create a new, realistic yet personalized performance (bottom). An initial performance is generated automatically by the full MIDI-DDSP model (top), and an expert musician then adjusted notes (blue), performance attributes (green), and low-level synthesis parameters (yellow) to craft a personalized expression of the piece (bottom).

Automatic Generation | Adjusted by Human Expert

Full End-to-End Generation

MIDI-DDSP can take note input from CocoNet, a score-level Bach composition model, and automatically synthesize a Bach quartet by generating explicit latents for each level in the hierarchy.



CocoNet re-harmonization of Ode to Joy for the woodwind set.

Mix
Soprano - Flute | Alto - Oboe | Tenor - Clarinet | Bass - Bassoon


Thanks to the wide range of instrument recordings provided by the URMP dataset, we can also render the CocoNet re-harmonization of Ode to Joy for the string set (an octave higher).

Mix
Soprano - Violin | Alto - Violin | Tenor - Viola | Bass - Cello

Dataset

To demonstrate modeling a variety of instruments, we use the URMP dataset, a publicly available audio dataset containing monophonic solo performances on a variety of instruments. Here we present a selection of the training data we used. Note that the recordings in the URMP dataset are played by students, so the performance quality is substantially lower than that of the virtuoso datasets (e.g., MAESTRO) used in other work. In particular, some recordings have poor intonation, which causes MIDI-DDSP to inherit intonation problems in synthesis. We also present some excerpts of the training data that are out of tune.

URMP Dataset


Training Set:

Training-set excerpts played out of tune:

Performance Synthesis & Adjustment

MIDI-DDSP enables detailed control, especially at the performance level. Here we show how, given a violin MIDI score, an expert violinist adjusts the automatic generation to match the way an expert would perform the piece.

Adjustment Process

Viva La Vida:

Automatic Generation | After Adjustment

Phantom:

Automatic Generation | After Adjustment

We can also turn a performance of a score into a different interpretation:

Pirates of the Caribbean:

Automatic Generation | After Adjustment | Play as Staccato

Automatic MIDI Synthesis


Here we present automatic score-to-audio synthesis by MIDI-DDSP on generic MIDI input. Having learned to perform from data, MIDI-DDSP can generate idiomatic details of human performance; for example, in some syntheses, breathing sounds are generated between notes.

Game of Thrones - Cello | La Vie En Rose - Trumpet | Viva La Vida - Violin
Beauty and the Beast - Flute | Canon in D - Cello | Phantom - Violin
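
For programmatic use, the repository linked above also provides a Python API. Below is a minimal sketch based on the repository's README; the function names and output dictionary key reflect the repository at the time of writing and may change across versions.

    from midi_ddsp import load_pretrained_model, synthesize_midi

    # Load the pretrained expression generator and synthesis generator.
    synthesis_generator, expression_generator = load_pretrained_model()

    # Synthesize a MIDI file end to end ('viva_la_vida.mid' is a stand-in path).
    output = synthesize_midi(synthesis_generator, expression_generator,
                             'viva_la_vida.mid')
    audio = output['mix_audio']  # mix of all synthesized tracks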

Bach Quartet Synthesis

Here we present automatic Bach quartet synthesis by MIDI-DDSP, taking note input from Bach chorales. We present three types of instrument ensemble: a string set, a brasswind set, and a woodwind set. The four parts are mixed with different gains (a sketch of the mixing arithmetic follows the list):
String set: S +2dB, A +2dB, T -1dB, and B -1dB
Woodwind set: S +2dB, A +2dB, T +5.6dB, and B 2.9dB
Brasswind set: S +1.5dB, A +1.2dB, T 0dB, and B 1.8dB
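
For reference, a decibel offset maps to a linear amplitude gain via 10^(dB/20). The sketch below applies the string-set gains above to four placeholder stems; only the gain values come from the list, the rest is illustrative.

    import numpy as np

    def db_to_gain(db):
        # Convert a decibel offset to a linear amplitude gain.
        return 10.0 ** (db / 20.0)

    # String-set gains from the list above (soprano, alto, tenor, bass).
    string_set_db = {'soprano': 2.0, 'alto': 2.0, 'tenor': -1.0, 'bass': -1.0}

    def mix(stems, gains_db):
        # stems: dict of equal-length mono float arrays, one per part.
        return sum(db_to_gain(gains_db[part]) * stems[part] for part in stems)

    stems = {part: np.zeros(16000) for part in string_set_db}  # placeholder audio
    mix_audio = mix(stems, string_set_db)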

BWV 227.1 - String Set

Ensemble
Soprano - Violin | Alto - Violin | Tenor - Cello | Bass - Double Bass

BWV 227.1 - Brasswind Set

Ensemble
Soprano - Trumpet | Alto - French Horn | Tenor - Trombone | Bass - Tuba

BWV 227.1 - Woodwind Set

Ensemble
Soprano - Flute | Alto - Oboe | Tenor - Clarinet | Bass - Bassoon

More Bach Quartets

BWV 165.6 - String Set | BWV 178.7 - String Set | BWV 180.7 - String Set | BWV 245.14 - String Set

Note Expression Control


In MIDI-DDSP, manipulating note-level expression effectively changes the synthesis-level quantities. Taking a test-set sample (middle row) and adjusting each expression control to its lowest (bottom row) and highest (top row) value, we show how each synthesis quantity (rightmost legend) changes. The dashed gray line in each plot indicates the note boundaries.

Original Sample
Volume - Low | Volume - High | Vol. Fluc. - Low | Vol. Fluc. - High
Vol. Peak Pos. - Low | Vol. Peak Pos. - High | Vibrato - Low | Vibrato - High
Brightness - Low | Brightness - High | Attack Noise - Low | Attack Noise - High

More details about note expression controls, including Figures 8, 9, and more, can be found here.
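
A minimal sketch of this low/high sweep, assuming note expression is stored as a (num_notes × num_controls) array with values in [0, 1] and the six controls shown above; the array layout and control names are assumptions for illustration, not the released API.

    import numpy as np

    CONTROLS = ['volume', 'vol_fluc', 'vol_peak_pos',
                'vibrato', 'brightness', 'attack_noise']

    def sweep_control(note_expression, control, value):
        # Clamp one expression control to the same value for every note.
        swept = note_expression.copy()
        swept[:, CONTROLS.index(control)] = value  # 0.0 = lowest, 1.0 = highest
        return swept

    note_expression = np.random.rand(8, len(CONTROLS))  # stand-in test-set sample
    vibrato_low = sweep_control(note_expression, 'vibrato', 0.0)
    vibrato_high = sweep_control(note_expression, 'vibrato', 1.0)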

Listening Test

Listening test results on test-set MIDI synthesis. All the audio samples below are generated with only test-set MIDI as input. MIDI-DDSP synthesizes more realistic audio than prior work (MIDI2Params) and two concatenative samplers.

Listening Test Bar Plot
Ground-truth | DDSP Inference | MIDI-DDSP
Ableton | MIDI2Params | FluidSynth

For access to all the samples used in the listening test, please check: https://github.com/MIDI-DDSP/MIDI-DDSP.github.io/blob/master/violin_listening_test.zip

For test-set generation of all other instruments, please check: https://github.com/MIDI-DDSP/MIDI-DDSP.github.io/blob/master/test_set_all_other_instruments.zip

An Example User Experience

This section presents the step-by-step experience of an expert using and adjusting MIDI-DDSP to obtain a satisfying musical performance. Specifically, the human expert, a violinist, iterates on adjusting the note expression controls, starting from what the expression generator predicts for a given piece. The figure below shows the original music score of "Viva La Vida" used as input to MIDI-DDSP.

1. Design and Adjust Articulations

The expert first designs the articulations of the piece based on the automatic generation, marking up/down bows and ties between notes to indicate how the notes connect. The results are shown in the figure below, where a tie indicates that the connected notes should be played legato (consecutively) and should have less attack noise. The expert adjusts the attack noise according to the designed articulations; here, the expert lowers the attack noise of the legato notes. The expert first applies these adjustments to the notes, then fine-tunes the attack noise based on the synthesis result of the earlier adjustment.

2. Design and Adjust Fingering

The expert designs the fingering based on their knowledge and experience of violin performance. The results are shown below. The expert designs the first two measures of the piece to be played on the E string (the "E- - - -..." mark) and assigns the fingering of each note (the numbers above the notes). The choice of string and fingering affects the brightness of the timbre: playing on the E string yields a brighter timbre, and playing on an open string (fingering "0") yields a brighter timbre than pressing the string (fingering other than "0"). The expert adjusts the notes' brightness according to this principle, first applying an initial adjustment and then fine-tuning the numbers according to the synthesis result.

3. Adjust Vibrato

The expert adjusts the vibrato of each note. Longer notes are given more vibrato, matching how the expert would play the violin. As before, the expert first adjusts each note's vibrato starting from the model prediction and then fine-tunes the numbers based on the synthesis result. The synthesized audio is shown in the figure below.

4. Design Special Articulations

The expert designs special articulations for artistic purposes. In this example, the expert marks some notes to be played staccato. The results are shown in the figure below, where a purple dot above a note indicates that it should be played staccato. To make a note staccato, the volume fluctuation, volume peak position, and attack noise need to be adjusted from their original values: specifically, the expert increases the volume fluctuation, decreases the volume peak position, and increases the attack noise. As before, the expert first applies the adjustments and then fine-tunes the numbers based on the synthesis result. In this step, the expert reports that decreasing vibrato also helps generate better staccato notes.
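
A hypothetical sketch of this staccato recipe, reusing the (num_notes × num_controls) layout from the note expression sweep above; the offset values are illustrative, not the expert's actual numbers.

    import numpy as np

    CONTROLS = ['volume', 'vol_fluc', 'vol_peak_pos',
                'vibrato', 'brightness', 'attack_noise']
    IDX = {name: i for i, name in enumerate(CONTROLS)}

    def make_staccato(note_expression, note_indices, step=0.2):
        # Apply the expert's staccato adjustments to the selected notes.
        adjusted = note_expression.copy()
        for i in note_indices:
            adjusted[i, IDX['vol_fluc']] += step       # more volume fluctuation
            adjusted[i, IDX['vol_peak_pos']] -= step   # earlier volume peak
            adjusted[i, IDX['attack_noise']] += step   # stronger attack noise
            adjusted[i, IDX['vibrato']] -= step        # less vibrato helps staccato
        return np.clip(adjusted, 0.0, 1.0)

    note_expression = np.random.rand(8, len(CONTROLS))  # stand-in sample
    staccato = make_staccato(note_expression, note_indices=[2, 3])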

5. Fine-tune the Piece

After all the designs and adjustments are made, the expert, if needed, makes a final revision of the piece by fine-tuning the expression controls of the notes. In this example, the expert does not fine-tune the piece at the end, as the audio synthesized in the previous step is already good enough.

Improvement of GAN

Please see here for the improvement in synthesis quality brought by the GAN objective.

Pitch Variants in Synthesis

Due to intonation problems in some recordings in the dataset, MIDI-DDSP also suffers from inaccurate intonation in synthesized audio. Moreover, because pitch generation in MIDI-DDSP is an autoregressive sampling process, the pitch can differ on each run, and MIDI-DDSP can faithfully recreate the intonation problems present in the dataset. Here we present some samples with varying pitch accuracy. We plan to address this problem in future work.

CocoNet Generation | La Vie En Rose | Phantom

Training Loss Ablation

Here we present ablation experiment results on the training loss of the synthesis generator. We compare the original MIDI-DDSP, trained with a cross-entropy loss on f0, a multi-scale spectral loss, and a GAN objective, against two variants: MIDI-DDSP params loss and MIDI-DDSP without GAN. MIDI-DDSP params loss is trained with an L1 loss on the synthesis parameters other than f0 (amplitudes, harmonic distribution, and noise magnitudes), the cross-entropy loss on f0, and the GAN objective. MIDI-DDSP without GAN is trained with only the multi-scale spectral loss and the cross-entropy loss on f0, without the GAN objective. In conclusion, MIDI-DDSP params loss has sound quality similar to MIDI-DDSP, while MIDI-DDSP without GAN has notably worse sound quality.
Full audio samples can be found here: https://github.com/MIDI-DDSP/MIDI-DDSP.github.io/blob/master/eval_set_sample_for_loss_ablation.zip

MIDI-DDSP | MIDI-DDSP params loss | MIDI-DDSP without GAN
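
For reference, below is a minimal sketch of a multi-scale spectral loss in the style of DDSP: L1 distances between linear and log magnitude spectrograms at several FFT sizes. The FFT sizes, overlap, and weighting here are assumptions, not the paper's exact settings.

    import numpy as np
    from scipy.signal import stft

    FFT_SIZES = (2048, 1024, 512, 256, 128, 64)  # assumed scales

    def multi_scale_spectral_loss(audio_pred, audio_true, sr=16000, alpha=1.0):
        loss = 0.0
        for n_fft in FFT_SIZES:
            _, _, s_pred = stft(audio_pred, fs=sr, nperseg=n_fft,
                                noverlap=n_fft * 3 // 4)
            _, _, s_true = stft(audio_true, fs=sr, nperseg=n_fft,
                                noverlap=n_fft * 3 // 4)
            mag_pred, mag_true = np.abs(s_pred), np.abs(s_true)
            # L1 on linear magnitudes plus weighted L1 on log magnitudes.
            loss += np.mean(np.abs(mag_pred - mag_true))
            loss += alpha * np.mean(np.abs(np.log(mag_pred + 1e-6)
                                           - np.log(mag_true + 1e-6)))
        return loss

    t = np.arange(16000) / 16000.0  # one second of audio at 16 kHz
    loss = multi_scale_spectral_loss(np.sin(2 * np.pi * 440 * t),
                                     np.sin(2 * np.pi * 441 * t))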

Note Expression Ablation

Here we present ablation experiment results on note expression, to determine whether the performance quality improvement over MIDI2Params comes from a better/bigger model or from the additional stage. For this purpose, we trained a synthesis generator without the note expression stage, using the same loss, GAN objective, and hyperparameters as MIDI-DDSP. An audio sample of MIDI synthesis using this model is presented below. In conclusion, a synthesis generator alone, without note expression, can generate performance audio almost as good as MIDI-DDSP. However, the model occasionally generates "dead-pan" notes without any expression (e.g., the note starting at 6s in the sample below). Thus, the performance improvement of MIDI-DDSP largely comes from using a better/bigger model, namely the synthesis generator; however, the additional stage (note expression modeling) makes the audio synthesis more robust and, more importantly, enables user editing at the performance level.
Full audio samples can be found here: https://github.com/MIDI-DDSP/MIDI-DDSP.github.io/blob/master/no_expression_experiment.zip

Open Source Image Attribution

The icons used throughout the paper and webpages are used under the Creative Commons license via the Noun Project. We gratefully acknowledge the following creators of these images:
Equalizer by fahmionline from the Noun Project
composer by Magicon from the Noun Project
Music Note by Randomhero from the Noun Project
composer by Amethyst Studio from the Noun Project
Violine by Symbolon from the Noun Project
Trumpet by Symbolon from the Noun Project
Audio by cataicon from the Noun Project
bassoon by Symbolon from the Noun Project
Clarinet by Symbolon from the Noun Project
composer by Pham Duy Phuong Hung from the Noun Project
Flute by Symbolon from the Noun Project
Neural Network by Ian Rahmadi Kurniawan from the Noun Project
oboe by Symbolon from the Noun Project
Synthesizer by Jino from the Noun Project
Violin by Olena Panasovska from the Noun Project
Violinist by Luis Prado from the Noun Project.