Deliverables & Artifacts | Christian Y. Zhou-Zheng


A month is short for a field this deep, so I aimed at durable, reusable artifacts rather than a flashy demo. Each item below is something that outlasts the ISP: released code, an upstream contribution, a replicated baseline, or a team that keeps building.

1. `svs_datasets`: a unified loader for singing data [released]

github.com/Neolyre/svs_datasets

The main code deliverable. Public singing datasets each ship their own phoneset, directory layout, and labeling quirks, which is what makes mixing them for scale so painful. The package provides:

A set of canonical phonesets for English, Japanese, and Mandarin, informed by the Synthesizer V phonesets, with explicit, documented definitions.
Adapters that map each public dataset (OpenCpop, M4Singer, PopCS, GTSinger, the OpenVPI collections, and more) into the canonical representation.
A single clean entry point, `load_dataset(...)`, which returns a `CanonicalExample`, so a user never has to touch individual adapters, plus a deliberately small export surface.

It was released mid-project so my first SOAR mentee could build on it, and it has already absorbed real bug fixes found in use. The most memorable was the PopCS empty-string-as-<SP> issue, which had been quietly corrupting Mandarin alignment for days.
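The adapter idea can be sketched in a few lines. This is an illustration under stated assumptions, not the package's actual code: `canonicalize`, the toy mapping, and the choice to treat an empty label as silence (one plausible reading of the PopCS issue above) are all hypothetical.

```python
# Hypothetical canonical silence token; the real package defines its own phonesets.
SP = "<SP>"


def canonicalize(raw_phonemes, mapping):
    """Map one dataset's phoneme labels into a canonical phoneset.

    `mapping` is a dataset-specific dict (illustrative, not from svs_datasets).
    Empty labels are treated as silence instead of being dropped or passed
    through, and unmapped symbols fail loudly rather than leaking into training.
    """
    out = []
    for ph in raw_phonemes:
        ph = ph.strip()
        if ph == "":
            out.append(SP)  # assumption: an empty label means silence
        elif ph in mapping:
            out.append(mapping[ph])
        else:
            raise KeyError(f"unmapped phoneme: {ph!r}")
    return out


if __name__ == "__main__":
    toy_mapping = {"n": "n", "i": "i", "h": "h", "ao": "ao"}
    print(canonicalize(["n", "i", "", "h", "ao"], toy_mapping))
```

Raising on unmapped symbols is a deliberate choice in this sketch: a silent pass-through is the kind of failure that can corrupt alignment without anyone noticing.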

2. OpenUTAU command-line batch rendering [upstream PR]

stakira/OpenUtau#2162

A contribution to the widely used open-source OpenUTAU editor that adds headless command-line batch rendering. The motivation is data: combining a corpus of singing project files (UST, VSQX, SVP) with a library of voicebanks lets you generate large quantities of audio that come with perfect phoneme annotations straight from the project file, a clean synthetic-data source that sidesteps the alignment problem entirely. Doing it properly meant picking up enough C#, and the change was tested extensively before submission.
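To illustrate why project files give annotations for free, here is a minimal sketch that turns note lengths into start and end times. It makes several assumptions that are not from the PR: a constant tempo, 480 ticks per quarter note, and a hypothetical `Note` type. It covers note-level timing only; phoneme boundaries inside a note depend on the phonemizer and voicebank.

```python
from dataclasses import dataclass

TICKS_PER_QUARTER = 480  # assumed resolution; check the project format in use


@dataclass
class Note:
    """Illustrative note record, not an OpenUTAU type."""
    lyric: str
    length_ticks: int


def note_timings(notes, bpm):
    """Return (lyric, start_sec, end_sec) per note, assuming constant tempo."""
    sec_per_tick = 60.0 / (bpm * TICKS_PER_QUARTER)
    t = 0.0
    out = []
    for note in notes:
        dur = note.length_ticks * sec_per_tick
        out.append((note.lyric, t, t + dur))
        t += dur
    return out


if __name__ == "__main__":
    for row in note_timings([Note("ke", 480), Note("se", 240)], bpm=120):
        print(row)
```

Because the timings are computed from the score, not estimated from audio, there is no alignment error to correct afterwards.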

3. A replicated DiffSinger baseline [experiments]

I read, annotated, and then replicated the DiffSinger paper on PopCS (and M4Singer for a harder test) using the OpenVPI fork, with a flow-matching training objective. This is the baseline the rest of the project intends to beat. Alongside it I set up SingMOS-based evaluation:

Sample	Mean SingMOS
Ground-truth PopCS	4.02
DiffSinger (flow matching)	4.03
Naive Transformer backbone	~3.22
Random noise (control)	~2.0

Two things stand out. First, flow matching matches diffusion while being much simpler. Second, SingMOS can separate garbage from music (random noise scores about 2.0) but cannot resolve differences near the top: the baseline's 4.03 is indistinguishable from the ground truth's 4.02, which makes “better than baseline” hard to measure. See the background page for the modeling discussion.
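For readers unfamiliar with the objective, the simplicity of flow matching shows in how little is needed to build a training target. The sketch below assumes the common linear-interpolation (rectified-flow style) path between noise and data; whether the OpenVPI fork uses exactly this form is an assumption, and the function names are illustrative.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return (x_t, target_velocity) on a straight path from noise x0 to data x1.

    x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0,
    which is constant along the path.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    data = [0.3, -1.2, 0.8]                   # stand-in for a mel frame
    noise = [random.gauss(0.0, 1.0) for _ in data]
    t = random.random()                       # one uniform time sample per example
    x_t, target = flow_matching_pair(noise, data, t)
    guess = [0.0] * len(data)                 # a real model would predict from (x_t, t)
    print(mse(guess, target))
```

There is no noise schedule or variance bookkeeping here: the model regresses a velocity at a random time, and sampling integrates that velocity from noise to data.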

4. A web-scale data pipeline [in progress]

The pipeline that runs from raw, unlabeled web audio to aligned, training-ready singing data. Components built or assembled this month:

Vocal separation: mel-band-roformer lead-vocal extraction plus FCPE for f0.
Restoration via a SingNet-inspired REAPER FX chain, run as a batch job.
Quality filtering with SingMOS as a sanity gate.
Lyric re-alignment for drifted .lrc files, using voice-activity detection to recover onsets.
Forced alignment adapted from STARS, with a Viterbi decoder over the canonical phoneset.
A breath/silence side-channel (Silero VAD plus breath detection) producing continuous <AP>/<SP> confidence scores in place of binary labels.
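As one concrete example from the list above, lyric re-alignment can be sketched as snapping each .lrc timestamp to the nearest detected vocal onset. This is a simplified illustration, not the pipeline's code: `parse_lrc`, `snap_to_onsets`, and the `max_shift` tolerance are hypothetical, and the onsets are assumed to come from a separate voice-activity detector.

```python
import re
from bisect import bisect_left

# Matches lines such as "[01:23.45]some lyric"
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text):
    """Parse .lrc text into a list of (time_in_seconds, lyric) pairs."""
    entries = []
    for line in text.splitlines():
        m = LRC_LINE.match(line.strip())
        if m:
            minutes, seconds, lyric = m.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    return entries


def snap_to_onsets(entries, onsets, max_shift=1.5):
    """Move each timestamp to the nearest onset if it lies within max_shift seconds."""
    onsets = sorted(onsets)
    snapped = []
    for t, lyric in entries:
        i = bisect_left(onsets, t)
        candidates = onsets[max(i - 1, 0): i + 1]  # neighbours on either side of t
        best = min(candidates, key=lambda o: abs(o - t), default=None)
        if best is not None and abs(best - t) <= max_shift:
            t = best
        snapped.append((t, lyric))
    return snapped


if __name__ == "__main__":
    lrc = "[00:10.00]hello\n[00:15.50]world"
    print(snap_to_onsets(parse_lrc(lrc), onsets=[10.4, 20.0], max_shift=1.0))
```

Lines with no onset inside the tolerance keep their original timestamp, which leaves them available for flagging or for a later forced-alignment pass.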

5. VOCALOID UX study [market research]

To understand what a frontend for SVS should do, I spent time inside a commercial editor (VOCALOID) reproducing songs note by note and recording the pain points: informal product research for the editor layer of the project.

[Image: VOCALOID arrangement view with stacked vocal phrases]

The arrangement view: layered vocal phrases across a full song, reconstructed by hand to feel out the workflow.

[Image: VOCALOID piano-roll showing per-note phonemes and pitch curves]

The piano-roll: each note carries phonemes (e.g. ke [k e], se [s e]) and an editable pitch curve. Drawing these by hand is exactly the labeling structure our synthetic-data path recovers for free.

Findings: most vocal parameters are hard to reason about without hands-on trial, with pitch bend the exception; getting a voice to sound the way you imagine is hard; and the single most-wanted capability is vocal timbre transfer, which points clearly at where controllability work should go.

6. Mentoring and team-building: SOAR project A-4, Neolyre [ongoing]

eleuther.ai/soar · project A-4, Neolyre

I registered as a mentor in the EleutherAI Summer of Open AI Research (SOAR) to grow this from a solo effort into a team. As an organizer of the program, I also designed application questions, built frontend pieces, and processed admissions. Over the month I onboarded a small team with clear ownership:

Mentee 1: replicating the DiffSinger baseline (16×V100).
Mentee 2: the vocoder, on the premise that there should be something better than HiFi-GAN.
Mentee 3: data and annotation models.
Mentee 4: forced alignment.

The point of this deliverable is that the project does not end with the ISP: it now has the people and the released groundwork to keep going through the summer and beyond.