Reformulating Soft Dynamic Time Warping: Insights into Target Artifacts and Prediction Quality

Johannes Zeitler and Meinard Müller

In: Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR), Daejeon, South Korea, 2025

Abstract

Training deep neural networks for music information retrieval (MIR) often relies on strongly aligned data, where each frame has a precisely annotated target label. To reduce this dependency, soft dynamic time warping (SDTW) enables training with weakly aligned data by replacing hard decisions with weighted sums, allowing for gradient-based learning while aligning feature sequences to shorter, often binary, target sequences. However, SDTW introduces gradient artifacts that can cause blurring and degrade predictions, impacting the learning process. In this work, we analyze the sources and effects of these artifacts and propose a reformulation of SDTW that expresses its gradient in terms of an equivalent strongly aligned target representation. This reformulation provides an intuitive interpretation of learned representations and insights into the impact of SDTW hyperparameters on the prediction quality. Using multi-pitch estimation as a case study, we systematically investigate these modified targets and demonstrate their potential for improving training stability, interpretability, and alignment quality in MIR tasks.

Visualizations

Sonifications

Links and Resources

Acknowledgments

DFG Logo

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 500643750 (MU 2686/15-1) and under Grant No. 521420645 (MU 2686/17-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.