Background & Literature Review

What singing voice synthesis is, and what the last few years changed

Most of the first week was reading. This page collects what I learned getting oriented in singing voice synthesis (SVS), organized around the three layers of a modern system and the open questions at each.

What an SVS system is

A singing voice synthesizer takes a musical score (phonemes, note pitches, and durations, the kind of thing you draw in a piano-roll editor) and produces a waveform of someone singing it. Almost every modern system splits this into two stages:

  1. An acoustic model maps the score to an intermediate acoustic representation, almost always a mel-spectrogram.
  2. A vocoder turns that spectrogram back into an audio waveform.
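To make the input side of stage 1 concrete, here is a minimal sketch of how a score might be expanded into the frame-level sequence an acoustic model consumes. The `Note` class, the function names, and the 44.1 kHz / 512-sample hop settings are illustrative assumptions, not taken from DiffSinger or any particular codebase. Real systems typically also predict phoneme durations and a continuous pitch curve instead of copying the note pitch onto every frame.

```python
from dataclasses import dataclass
from typing import List, Tuple

SAMPLE_RATE = 44100  # assumed audio sample rate (Hz)
HOP_SIZE = 512       # assumed hop between mel-spectrogram frames (samples)


@dataclass
class Note:
    """One score event: a phoneme held at a pitch for a duration (hypothetical schema)."""
    phoneme: str
    midi_pitch: int
    duration_s: float


def midi_to_hz(midi_pitch: int) -> float:
    """Standard equal-temperament conversion with A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def score_to_frames(notes: List[Note]) -> List[Tuple[str, float]]:
    """Expand notes into one (phoneme, f0_hz) pair per spectrogram frame."""
    frames_per_second = SAMPLE_RATE / HOP_SIZE
    frames: List[Tuple[str, float]] = []
    elapsed_s = 0.0
    for note in notes:
        elapsed_s += note.duration_s
        # Round the cumulative end time so rounding error does not accumulate.
        end_frame = round(elapsed_s * frames_per_second)
        f0_hz = midi_to_hz(note.midi_pitch)
        frames.extend([(note.phoneme, f0_hz)] * (end_frame - len(frames)))
    return frames
```

The acoustic model would then map a sequence like this to mel-spectrogram frames of the same length, and the vocoder would map those frames to `HOP_SIZE` audio samples each.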

The reference point for the acoustic model is DiffSinger (Liu et al., 2021), which paired a FastSpeech-style encoder with a diffusion decoder; Sinsy and NNSVS came before it. The reference vocoder is HiFi-GAN (Kong et al., 2020). Both are good, and both are old by the standards of this field: DiffSinger predates the flow-matching and modern-backbone era. A question I kept returning to this month was which of these pieces a 2026 practitioner would still choose, and which survive only on inertia.

I replicated the OpenVPI community fork of DiffSinger and worked from it. The fork is well-engineered and already implements a flow-matching training objective, which gave me a clean baseline to build on instead of re-deriving one.

Layer 1: Data

The biggest finding of my literature review is that the bottleneck in open SVS is data, not modeling. Singing models need phoneme-level temporal alignment, which is laborious to produce, so the field recycles a small set of academic datasets (OpenCpop, M4Singer, PopCS, GTSinger, and the OpenVPI-curated collections in en/ja/zh). Two points made this concrete:

  • Quality is an axis orthogonal to scale. You can train on an arbitrary amount of 128 kbps audio and the model's output will still carry compression artifacts, because that is the distribution the model learned. Scaling the data does not fix fidelity; restoration does.
  • For a dataset, internal consistency matters more than correctness. My mentor pointed this out and the experiments confirmed it: a dataset with consistently wrong labels still trains a usable model, because the label distribution is stable. The trouble starts when you mix datasets with different labeling conventions, which is exactly what scale requires.

This reframed the project. To use web-scale data, you have to manufacture the alignment and normalize the conventions yourself. That motivated nearly everything on the deliverables page: vocal separation (mel-band-roformer, with FCPE for f0), restoration (a REAPER FX chain inspired by SingNet), quality filtering with SingMOS (South-Twilight), forced alignment adapted from STARS (arXiv:2507.06670), and re-aligning drifted .lrc lyric files via voice-activity detection. That last problem is one the anime-subtitling world already has tools for, such as alass and ffsubsync.
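The .lrc re-alignment step can be illustrated in its simplest form: a single constant drift. The sketch below parses `[mm:ss.xx]` timestamps and grid-searches for the shift that brings lyric line starts closest to voiced onsets. It assumes the onsets come from some VAD run on the separated vocal. The function names and default parameters are hypothetical, and this is a simplification of the idea, not how alass or ffsubsync are implemented.

```python
import re
from typing import List, Tuple

# Matches lines such as "[01:23.45]some lyric text".
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text: str) -> List[Tuple[float, str]]:
    """Return (start_time_in_seconds, lyric) for every timestamped line."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


def estimate_offset(
    lyric_times: List[float],
    voiced_onsets: List[float],
    max_shift_s: float = 5.0,  # assumed search range
    step_s: float = 0.05,      # assumed search resolution
    cap_s: float = 0.5,        # per-line cost cap, so one bad line cannot dominate
) -> float:
    """Grid-search the constant shift that best matches lyric starts to onsets."""
    if not lyric_times or not voiced_onsets:
        return 0.0
    steps = int(round(max_shift_s / step_s))
    best_offset, best_cost = 0.0, float("inf")
    for k in range(-steps, steps + 1):
        offset = k * step_s
        # Cost: capped distance from each shifted lyric start to its nearest onset.
        cost = sum(
            min(min(abs(t + offset - onset) for onset in voiced_onsets), cap_s)
            for t in lyric_times
        )
        if cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset
```

Adding the returned offset to every timestamp corrects a uniform drift. Files whose drift changes mid-song (for example after an edited intro) would need a piecewise version of the same search.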

Breath and silence

A sub-problem that ran deeper than expected was handling <AP> (audible breath) and <SP> (silent pause) markers. Datasets annotate these very differently, and labels auto-generated by MFA (the Montreal Forced Aligner) can contain pathological patterns such as <SP><AP><SP> runs. These markers turn out to matter a lot for downstream quality. The design I converged on with my mentor treats them as acoustic events recovered from a side-channel, instead of as lyric phonemes:

audio  -> vocal/non-vocal + breath/silence side-channel
lyrics -> G2P phones without AP/SP
STARS logits + side-channel -> pause-aware decoder

Detecting silence is tractable with a VAD (Silero); detecting breath is hard and has its own small literature (Nakano et al. and others). The right output is a continuous confidence score, which the downstream aligner can condition on far more easily than a hard binary label.
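The "G2P phones without AP/SP" step in the diagram above, and the cleanup of runs like <SP><AP><SP>, can be sketched in a few lines. The token spellings follow this page. The collapse policy (a run containing any breath becomes a single <AP>, otherwise a single <SP>) is an illustrative assumption, not the rule used in the project.

```python
from typing import List, Optional

PAUSE_TOKENS = {"<AP>", "<SP>"}


def strip_pauses(phones: List[str]) -> List[str]:
    """Drop pause markers so the lyric side carries only G2P phones."""
    return [p for p in phones if p not in PAUSE_TOKENS]


def collapse_pause_runs(phones: List[str]) -> List[str]:
    """Replace each run of consecutive pause markers with one marker.

    Assumed policy: a run that contains a breath becomes "<AP>",
    a run of pure silence becomes "<SP>".
    """
    out: List[str] = []
    run: List[str] = []
    sequence: List[Optional[str]] = list(phones) + [None]  # None flushes the last run
    for p in sequence:
        if p in PAUSE_TOKENS:
            run.append(p)
            continue
        if run:
            out.append("<AP>" if "<AP>" in run else "<SP>")
            run = []
        if p is not None:
            out.append(p)
    return out
```

For example, `collapse_pause_runs(["n", "i", "<SP>", "<AP>", "<SP>", "h", "ao"])` yields `["n", "i", "<AP>", "h", "ao"]`, while `strip_pauses` on the same input yields `["n", "i", "h", "ao"]`.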

Layer 2: Sequence modeling

The most enjoyable reading was on the modeling side. The key realization:

Flow matching and diffusion are two views of the same object: diffusion comes from score-based modeling, flow matching from continuous normalizing flows and ODEs. Of the two, the rectified flow formulation is much simpler to understand and implement.
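To show how little machinery the rectified flow objective needs, here is a minimal dependency-free sketch. It assumes the common convention that t = 0 is Gaussian noise and t = 1 is data, with a straight-line path between them. The function names and the `velocity_fn(x, t)` interface are hypothetical stand-ins for a neural network, and this is not the OpenVPI implementation.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """The regression target is constant along the path: v = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn: VelocityFn, x1: Vector, rng: random.Random) -> float:
    """One-sample loss: mean squared error between predicted and target velocity."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise endpoint
    t = rng.random()                        # time drawn uniformly from [0, 1)
    predicted = velocity_fn(interpolate(x0, x1, t), t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(x1)


def euler_sample(velocity_fn: VelocityFn, x0: Vector, num_steps: int = 10) -> Vector:
    """Generate by integrating dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

As a sanity check, if the data distribution is a single point x1, the ideal velocity field is (x1 - x) / (1 - t). Plugging that in gives a loss of zero up to floating-point error, and `euler_sample` lands on x1 from any starting noise.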

I read the Rectified Flow line of work and the broader flow-matching literature, then confirmed on the OpenVPI baseline that a flow-matching objective matches diffusion on PopCS and M4Singer while being far cleaner to train. The open question is the backbone. DiffSinger’s non-causal WaveNet is, improbably, still the thing to beat: swapping in a naive Transformer dropped SingMOS from about 4.0 to 3.22. Why convolutions are such a good inductive bias for audio, and what should replace them, is the question I am most interested in carrying forward.

Layer 3: Vocoders and the frontend

On vocoders I mostly surveyed, again leaning on the OpenVPI ecosystem: the Neural Homomorphic Vocoder (Liu et al., 2020, used in Synthesizer V), lightweight DDSP vocoders, and HiFi-GAN derivatives. I have handed this layer to a mentee, on the premise that there should be something better than HiFi-GAN in 2026.

Finally, the frontend: the editor a person uses to compose. I spent time inside VOCALOID reproducing songs as informal market research on controllability and UX pain points (written up in the deliverables). The takeaway is that the parameters are hard to reason about without hands-on trial, and that vocal timbre transfer is the feature people most want.

The hardest open problem: evaluation

A theme across all three layers is that evaluation is unsolved. SingMOS can tell random noise (about 2.0) from real singing (about 4.0), which makes it a useful sanity check, but it cannot separate a 4.03 generated sample from a 4.02 ground-truth one. Without a metric you trust, the Pareto frontier in my third essential question is hard to even measure, which is itself one of the more important findings of the month.
