Goal: Blind estimation of guitar AFX (audio effects) chains, including electronics settings, using DDSP
DDSP autoencoder
Specifications/Research:
Given the ubiquity of audio effects in the creation and production of music, there is a real need to efficiently estimate which effects were used in order to recreate a sound or tone. While traditional machine learning techniques have produced some promising results, they require a massive amount of properly labeled data, and their extrapolation to unseen configurations still leaves something to be desired. By using DDSP, a developing field which integrates differentiable modules built from traditional DSP techniques into neural networks, both the complexity of the models and the amount of required data can be reduced appreciably. Additionally, the nature of DDSP is such that it can be used for a large variety of tasks, ranging from classification to timbre transfer between datasets. This study explores the ability of DDSP to estimate an entire guitar effects chain, starting with the pickup(s) used and the electronics settings, and continuing to varying numbers of cascaded effects.
Progress:
Thankfully, I got the opportunity to prepare for the summer during a quarter of ECE 199 at UCSD. Coming from an audio signal processing background, I wanted to research something that fused existing digital signal processing (DSP) techniques with more modern machine learning (ML) ones. I used the quarter to narrow down my area of research, reading numerous research papers to survey where work has and has not been done. One area I considered during this time was effects estimation, where a system identifies the effects, such as reverb or equalization, used to create a given audio sample. It wasn’t until I came across Google’s DDSP project that I really saw a clear path forward. DDSP is a set of differentiable digital synthesizers and effects whose parameters can become trainable features inside a neural network. It was this ability to combine a trainable source instrument, in the form of an autoencoder, with differentiable effects that finally inspired me to choose the estimation of an entire guitar signal chain as my area of research. In essence, the sounds that a guitar’s different pickups produce can be thought of as distinct instruments: while playing the same note, the presence and distribution of the harmonics are the prime factors differentiating the pickups.

The DDSP autoencoder takes audio and learns how to recreate it as the sum of a harmonic (additive) synthesizer and a filtered noise synthesizer. From my experience, the harmonic synth captures the majority of the harmonic content of notes, while the noise synth captures the more cacophonous and unpredictable moments, such as the very start of a plucked note. By default, the autoencoder trains with a loss that minimizes spectral magnitude and log-magnitude error.
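To make the harmonic-plus-noise structure concrete, here is a minimal sketch using the ddsp library's synthesizer modules. The module names and argument shapes follow DDSP's public tutorials, but treat the exact API as an assumption (in older versions the harmonic synth is named Additive rather than Harmonic), and the control curves here are hand-made placeholders rather than decoder outputs.

```python
import numpy as np
import ddsp

sample_rate = 16000
n_frames, hop_size = 250, 64
n_samples = n_frames * hop_size          # ~1 second of audio at 16 kHz

# Frame-rate controls with shape (batch, n_frames, ...); in the autoencoder
# these come from the decoder network, here they are placeholders.
f0_hz = 196.0 * np.ones([1, n_frames, 1])                # a single sustained pitch
amps = np.linspace(1.0, 0.2, n_frames)[None, :, None]    # decaying overall amplitude
harm_dist = np.ones([1, n_frames, 20]) / 20.0            # 20 equally weighted harmonics
noise_mags = -4.0 * np.ones([1, n_frames, 65])           # quiet broadband noise filter

# Harmonic (additive) synth plus filtered-noise synth, summed -- the same
# harmonic-plus-noise structure the DDSP autoencoder decodes into.
harmonic_synth = ddsp.synths.Harmonic(n_samples=n_samples, sample_rate=sample_rate)
noise_synth = ddsp.synths.FilteredNoise(n_samples=n_samples)

audio = harmonic_synth(amps, harm_dist, f0_hz) + noise_synth(noise_mags)  # (1, n_samples)
```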
Outputs from the DDSP encoder: loudness (top), fundamental frequency/F0 (middle), and fundamental frequency confidence (bottom)
Before I could start training, I needed to record some sample audio. I used the USB out on my Boss Katana to send the audio to my computer, where it was recorded using REAPER. The amp was left on the clean channel with the EQ and gain controls set at 12 o’clock. Though the Fender Stratocaster I used to record the samples has five “positions”, I only recorded tracks for the individual pickup positions (1, 3, 5). This is because positions 2 and 4 are simply sums of the adjacent positions (2 is the sum of 1 and 3, while 4 is the sum of 3 and 5), so these could be created after the fact from samples of the three distinct pickup positions. For my first training attempt, I recorded a little over 20 minutes of audio for each pickup position, trying my best to play the same thing, melodically and dynamically, for all three pickups.
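Since only the three single-pickup positions were recorded, the in-between positions can be approximated afterward by summing the aligned takes. A minimal sketch of that idea (the file names are placeholders, and it assumes the three takes are the same performance and already sample-aligned; the 0.5 scaling is just for headroom):

```python
import numpy as np
import soundfile as sf

# Placeholder file names for the three recorded single-pickup takes.
pos1, sr = sf.read("position1.wav")
pos3, _ = sf.read("position3.wav")
pos5, _ = sf.read("position5.wav")

n = min(len(pos1), len(pos3), len(pos5))   # trim to a common length

# Positions 2 and 4 approximated as sums of the adjacent single-pickup takes.
pos2 = 0.5 * (pos1[:n] + pos3[:n])
pos4 = 0.5 * (pos3[:n] + pos5[:n])

sf.write("position2_synth.wav", pos2, sr)
sf.write("position4_synth.wav", pos4, sr)
```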
When first trying to use DDSP, I started off on Google Colab, trying to run DDSP’s sample notebooks. To my surprise, these didn’t run at all. I eventually found a workaround in the form of specifying certain Python packages to be installed before running them. This worked, but the model training was now being done without a GPU, meaning it would take days to generate each model. After some digging, I found that the issue was an incompatibility between Google Colab’s current version of Linux (Ubuntu 22.04) and the NVIDIA CUDA toolkit, so I decided to try training the models through a Linux WSL on my personal PC with an RTX 3080. It was during this time that I gained familiarity with basic Linux commands as well as the underlying structure of Python and how its package system works. After much trial and error, I finally found the winning combination of Linux, CUDA, Python, and Python package versions to start training models. Now, with my CUDA-enabled desktop, I was able to train models in 1-2 hours instead of the 1-2 days on Google Colab. I evaluated the number-of-training-steps hyper-parameter by looking at the spectral loss, the magnitude of the noise synthesis (one of the most direct signs of overfitting for this model), and by simply listening to the resynthesized audio. This led me to the conclusion that, for my dataset, around 5,000 training steps was optimal to minimize both loss and overfitting.
Sample from training data (left) and resynthesis after training (right)
To evaluate the models, I used each of the three trained models to resynthesize a never-before-seen (not in the training data) audio clip; the recreation most similar to the original sample would then be the guess. Initially, I used a simple magnitude or log-magnitude spectral error as the means of comparison but found this to be inconsistent. The magnitude error tended to pick whichever recreation had the closest average spectral power to the original, while the log-magnitude error was too insensitive to the more minute spectral differences between the recreations and the original sample. My first thought was to use a spectral threshold so that the frequencies between harmonics, which contain little to no information, were ignored in the spectral comparisons. While initially promising with hand-selected threshold values, this ended up working poorly, because automating the threshold with a statistic such as the spectral average proved inconsistent at best. I also noticed there were times in the original sample where the harmonics would die off in a way that was not reflected in the recreations; with a threshold based on the original, these moments where the models were clearly failing would be ignored. In other words, some spots where all spectra were near zero were ignored as intended, but areas where the spectra did differ were also being ignored unintentionally.
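For reference, the frame-wise comparison I started with looks roughly like the sketch below. It uses librosa's STFT rather than DDSP's internals, and the recreation dictionary is a placeholder for the three models' resynthesized clips.

```python
import numpy as np
import librosa

def spectral_errors(original, recreation, n_fft=2048, hop=512, eps=1e-6):
    """Mean magnitude and log-magnitude STFT error between two clips."""
    S_o = np.abs(librosa.stft(original, n_fft=n_fft, hop_length=hop))
    S_r = np.abs(librosa.stft(recreation, n_fft=n_fft, hop_length=hop))
    n = min(S_o.shape[1], S_r.shape[1])            # align frame counts
    S_o, S_r = S_o[:, :n], S_r[:, :n]
    mag_err = np.mean(np.abs(S_o - S_r))
    logmag_err = np.mean(np.abs(np.log(S_o + eps) - np.log(S_r + eps)))
    return mag_err, logmag_err

# recreations = {"position1": audio1, "position3": audio3, "position5": audio5}
# guess = min(recreations, key=lambda k: spectral_errors(original, recreations[k])[0])
```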
Next, I decided to threshold the spectral comparisons not by magnitude, but by the fundamental frequency confidence provided by the DDSP encoder. I did this because the start of a note, which is largely represented by the noise synthesis, is not useful information for comparing the models, as the original and the recreations match almost exactly there. Valid comparisons could only be made after the string vibration settled into a fundamental frequency and its harmonics, and thankfully this corresponds almost exactly to when the F0 confidence from the encoder reaches a certain level (0.8-0.95 out of a maximum of 1 in practice); a sketch of this gating idea follows this paragraph. Again, this seemed effective at first but had its own pitfalls. While more accurate than the initial mag and log-mag comparisons, the F0-confidence-gated comparisons still tended either to focus too much on average spectral energy or to ignore smaller spectral differences. At this point, I went back and recorded a second set of training and validation samples, this time focusing more on actual guitar playing instead of just playing every note on the fretboard at different levels of loudness. While this did improve the models, the pickup positions were still too similar for accurate differentiation using this method, and the erroneous non-decaying harmonics in the recreations were still present. Thankfully, it was around this time that I got the opportunity to present my project at UCSD’s SRC 2024 and to my PI Tara Javidi and her research students. This gave me a chance to step back a little and get some valuable feedback. It was in the meeting with my PI that one of her students recommended using the DDSP losses used for model training instead of my own.
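The gating itself is simple: keep only the STFT frames where the encoder is confident a settled pitch is present, then compare spectra on those frames alone. A rough sketch, assuming the confidence track has already been resampled to the STFT frame rate (the 0.9 threshold is illustrative):

```python
import numpy as np

def gated_spectral_error(S_orig, S_recon, f0_confidence, threshold=0.9):
    """Magnitude-spectrogram error computed only on high-confidence frames.

    S_orig, S_recon: magnitude spectrograms, shape (n_bins, n_frames).
    f0_confidence:   per-frame confidence from the DDSP encoder, shape (n_frames,).
    """
    n = min(S_orig.shape[1], S_recon.shape[1], len(f0_confidence))
    mask = f0_confidence[:n] >= threshold     # frames after the pluck has settled
    if not np.any(mask):
        return np.inf                         # no usable frames to compare
    return float(np.mean(np.abs(S_orig[:, :n][:, mask] - S_recon[:, :n][:, mask])))
```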
For training, the DDSP autoencoder uses a loss that is an evenly weighted linear combination of the spectral magnitude loss and the spectral log-magnitude loss. As soon as I moved to this, my guessing accuracy became more consistent. Not previously mentioned is the fact that, for each recreation, both the overall pitch and loudness (high-level abstractions of the mag and log-mag content) could be adjusted manually to tune the results; previously, I sometimes had to adjust these by hand to make the comparisons between the recreations more valid. To prevent these adjustable parameters from interfering with the guessing, I developed a system that only makes a recreation once its pitch is aligned with the original sample and its loudness error against the original is minimized. With this optimized resynthesis, I tested the guessing with the various losses provided by DDSP: magnitude, log magnitude, loudness, delta time, delta frequency, and a cumulative sum of frequencies. Using the optimized resynthesis, fourteen of the fifteen 45-second samples (5 for each pickup) were guessed correctly by at least one of these metrics, with the cumulative sum of frequencies being the most individually accurate metric, correct on nine of the fifteen samples.
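Each of those metrics corresponds to one weight argument of the ddsp library's SpectralLoss, so a single-term loss can be built by zeroing out the others. A sketch of the per-metric guessing under that assumption (the keyword names follow the published ddsp source, but verify them against your installed version):

```python
import ddsp

ALL_ZERO = dict(mag_weight=0.0, logmag_weight=0.0, loudness_weight=0.0,
                delta_time_weight=0.0, delta_freq_weight=0.0, cumsum_freq_weight=0.0)
METRICS = {
    "magnitude":   "mag_weight",
    "log_mag":     "logmag_weight",
    "loudness":    "loudness_weight",
    "delta_time":  "delta_time_weight",
    "delta_freq":  "delta_freq_weight",
    "cumsum_freq": "cumsum_freq_weight",
}

def guess_pickup(original, recreations, metric):
    """Pick the pickup whose resynthesis minimizes one spectral-loss term.

    original:    (1, n_samples) float array/tensor of the unseen clip.
    recreations: dict of pickup name -> (1, n_samples) resynthesized audio.
    """
    weights = dict(ALL_ZERO, **{METRICS[metric]: 1.0})
    loss_fn = ddsp.losses.SpectralLoss(**weights)
    losses = {name: float(loss_fn(original, audio))
              for name, audio in recreations.items()}
    return min(losses, key=losses.get)

# e.g. guesses = {m: guess_pickup(original, recreations, m) for m in METRICS}
```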

Pitch- and loudness-optimized sample resynthesis using a trained DDSP instrument model: note-on mask (left), loudness (center), and pitch/fundamental frequency (right)
Loss-testing results using the pitch- and L2-loudness-optimized resynthesis, where a 1 represents a correct guess using only that metric
While I would have loved to continue on to the effects estimation, the summer is over, and I am happy that I still made good progress on what I see as the most novel part of the project. If this work were to be continued, I estimate that the 16 kHz sample rate is the single biggest bottleneck to performance. The guitar, especially with effects, can output frequencies right up to the limit of human perception, and the 16 kHz rate (which only captures content up to 8 kHz) doubtless cuts out a significant portion of the signal's information in order to integrate it more efficiently into the neural network. Another issue with recreating audio using this method was the non-decaying harmonics: the tendency for the harmonics in the recreations not to decay the way they do in the original samples, sometimes not at all. Whether this was a problem with the model or with my implementation, I was never able to determine. In either case, I believe a system could be implemented to clip the non-decaying harmonics using the loudness signal, as there is a strong correlation between the decay of the loudness and the decay of the harmonics; a rough sketch of that idea follows.
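This is only a sketch of that proposed fix, not something I implemented: the original clip's loudness envelope gates the resynthesis so it falls silent once the source note has decayed. The frame rate and threshold are placeholders, and a short fade would be preferable to the hard gate shown here.

```python
import numpy as np

def gate_by_loudness(recon_audio, loudness_db, sr=16000, frame_rate=250,
                     floor_db=-70.0):
    """Mute the resynthesis wherever the original clip's loudness has decayed away.

    recon_audio: resynthesized waveform, shape (n_samples,).
    loudness_db: per-frame loudness of the ORIGINAL clip, in dB, shape (n_frames,).
    """
    hop = sr // frame_rate                          # samples per loudness frame
    gate = (loudness_db > floor_db).astype(float)   # 1 while the note is still sounding
    gate = np.repeat(gate, hop)[: len(recon_audio)]
    if len(gate) < len(recon_audio):                # pad if the track runs short
        gate = np.pad(gate, (0, len(recon_audio) - len(gate)))
    return recon_audio * gate
```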
Conclusion:
Though I was unable to attain my original goal of complete effects estimation, spending my summer on this project under UCSD’s SRIP was as rewarding as it was challenging. Before this summer, I had never used Python, Linux, or a professional DAW. Throughout the project, I used Python and TensorFlow to train instrument models on my own CUDA-enabled desktop through a Linux WSL. For the training and testing data, I personally recorded hours of sample audio using REAPER. To analyze the accuracy of these models, I developed a system which normalized the loudness and pitch of the different models’ recreations so that they could be compared spectrally to an original sample. While this method has its disadvantages, this project has confirmed the potential for instrument models to be integrated into effects estimation systems in order to reduce the amount of required training data. At this point, I would like to thank Tara Javidi at UCSD for sponsoring my research under the SRIP and for continually providing insightful feedback.