Larynx microphones (LMs) provide a practical way to obtain crosstalk-free recordings of the human voice by picking up vibrations directly from the throat. This can be useful in many music information retrieval scenarios related to singing, e.g., the analysis of individual voices recorded in environments with strong interfering noise. However, LMs have a limited frequency range and barely capture the effects of the vocal tract, which makes the recorded signal unsuitable for downstream tasks that require high-quality recordings. We publish a dataset with over 3.5 hours of popular music, recorded with four amateur singers accompanied by a guitar, where both LM and clean close-up microphone signals are available. We then use this dataset to train a data-driven baseline approach for singing voice reconstruction from LM signals, using differentiable signal processing inspired by a source-filter model to emulate the missing vocal tract effects.
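To make the reconstruction idea concrete, below is a minimal PyTorch sketch of one plausible differentiable source-filter setup: the LM waveform serves as the source, and a small network predicts a time-varying FIR filter that stands in for the missing vocal tract response. The class name, layer sizes, and feature choices (mel frames, hop size) are hypothetical illustrations, not the actual baseline architecture described in the abstract.

```python
import torch
import torch.nn as nn

class SourceFilterReconstructor(nn.Module):
    """Toy source-filter reconstructor: the LM waveform acts as the glottal
    source, and a small network predicts per-frame FIR taps that emulate the
    missing vocal tract response. All sizes below are hypothetical."""

    def __init__(self, n_mels=80, n_taps=128, hidden=256):
        super().__init__()
        self.filter_net = nn.Sequential(     # per-frame filter predictor
            nn.Linear(n_mels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_taps),
        )

    def forward(self, lm_audio, lm_mels, hop=256):
        # lm_audio: (batch, samples); lm_mels: (batch, frames, n_mels)
        taps = self.filter_net(lm_mels)              # (B, F, n_taps)
        frames = lm_audio.unfold(-1, hop, hop)       # (B, F', hop)
        n = min(taps.shape[1], frames.shape[1])
        n_fft = hop + taps.shape[-1] - 1
        # Fast (FFT-based) convolution of each frame with its own filter.
        spec = torch.fft.rfft(frames[:, :n], n=n_fft) * \
               torch.fft.rfft(taps[:, :n], n=n_fft)
        out = torch.fft.irfft(spec, n=n_fft)[..., :hop]  # filter tails dropped
        return out.reshape(out.shape[0], -1)             # for brevity (no OLA)
```

Because every step is differentiable, such a model could be trained end-to-end against the paired clean close-up recordings, e.g., with a spectral loss; a real system would also use overlap-add rather than truncating the filter tails.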
We introduce and compare two methods to adaptively modify the partials of simultaneously sounding synthesized tones to minimize roughness. By changing their amplitude and/or frequency over time, it is possible to dynamically control the timbre of a polyphonic sound in real time. This introduces an additional parameter for sound synthesis that may allow for changing the roughness of a sound without modifying other perceptual attributes of the individual tones, like their fundamental frequency (F0) or loudness. We draw inspiration from choir singers, who may not only dynamically adapt their pitch, but also control their vocal formants (i.e., the prevalence of certain partials) as an additional means to facilitate intonation and voice blending between musicians.
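As a concrete illustration of the underlying objective, the sketch below scores the combined partials of two tones with a Plomp-Levelt-style roughness curve (using Sethares' parametric fit and constants) and then nudges one tone's partial frequencies by gradient descent. The amplitude weighting, partial counts, and optimization setup are illustrative choices and not necessarily those of the two methods compared in the paper.

```python
import torch

def pair_roughness(f1, f2, a1, a2):
    """Roughness of two partials via Sethares' parametric fit to the
    Plomp-Levelt curves; the product weighting a1*a2 is one common
    choice among several."""
    dstar, s1, s2, b1, b2 = 0.24, 0.0207, 18.96, 3.51, 5.75
    s = dstar / (s1 * torch.minimum(f1, f2) + s2)
    d = torch.abs(f2 - f1)
    return a1 * a2 * (torch.exp(-b1 * s * d) - torch.exp(-b2 * s * d))

def total_roughness(freqs, amps):
    """Sum of pairwise roughness over all partials of all tones."""
    i, j = torch.triu_indices(len(freqs), len(freqs), offset=1)
    return pair_roughness(freqs[i], freqs[j], amps[i], amps[j]).sum()

# Toy example: retune the partials of one tone against a fixed second tone.
partials_a = torch.tensor([220.0 * k for k in range(1, 7)], requires_grad=True)
partials_b = torch.tensor([227.0 * k for k in range(1, 7)])
amps = torch.ones(12)

opt = torch.optim.Adam([partials_a], lr=0.5)
for _ in range(200):
    opt.zero_grad()
    loss = total_roughness(torch.cat([partials_a, partials_b]), amps)
    loss.backward()
    opt.step()
```

An amplitude-based variant would follow the same pattern, optimizing the amplitudes (or both quantities jointly) instead of the partial frequencies.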
Intonation is the process of choosing an appropriate pitch for a given note in a musical performance. Particularly in polyphonic singing, where all musicians can continuously adapt their pitch, this leads to complex interactions. We formulate intonation adaptation as a cost minimization problem and introduce a differentiable cost measure by adapting and combining existing principles for measuring intonation. In particular, our measure consists of two terms, representing a tonal aspect (the proximity to a tonal grid) and a harmonic aspect (the perceptual dissonance between salient frequencies). Our measure can flexibly account for different artistic intents while allowing for robust and joint processing of multiple voices in real time, which we demonstrate for the task of intonation adaptation of amateur choral music using recordings from a publicly available multitrack dataset.
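A minimal sketch of such a differentiable two-term cost could look as follows, assuming an equal-tempered tonal grid and a Sethares-style roughness curve for the harmonic term; both are stand-ins for the measures actually combined in the paper, and the pitch values, trade-off weight, and partial model are hypothetical.

```python
import torch

def grid_term(p):
    """Smooth proximity to an equal-tempered grid (MIDI semitones):
    zero at integer semitones, maximal halfway between. A stand-in
    for the tonal term."""
    return ((1 - torch.cos(2 * torch.pi * p)) / 2).sum()

def dissonance_term(p, n_partials=6):
    """Pairwise Plomp-Levelt/Sethares roughness between the partials of
    all voices; a stand-in for the harmonic term."""
    f0 = 440.0 * 2 ** ((p - 69.0) / 12.0)                  # MIDI -> Hz
    freqs = (f0[:, None] * torch.arange(1.0, n_partials + 1)).flatten()
    i, j = torch.triu_indices(len(freqs), len(freqs), offset=1)
    s = 0.24 / (0.0207 * torch.minimum(freqs[i], freqs[j]) + 18.96)
    d = torch.abs(freqs[i] - freqs[j])
    return (torch.exp(-3.51 * s * d) - torch.exp(-5.75 * s * d)).sum()

# Jointly retune four slightly off voices (hypothetical MIDI pitches).
pitches = torch.tensor([59.9, 64.15, 67.05, 71.8], requires_grad=True)
lam = 0.05                                # tonal/harmonic trade-off weight
opt = torch.optim.Adam([pitches], lr=0.01)
for _ in range(500):
    opt.zero_grad()
    cost = grid_term(pitches) + lam * dissonance_term(pitches)
    cost.backward()
    opt.step()
```

Because both terms are smooth in the voices' pitches, all voices can be adapted jointly by gradient descent, and the weight between the two terms offers one knob for encoding different artistic intents.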