Development Log | Christian Y. Zhou-Zheng


This is the working log I kept through the project. It covers the month as it happened: long stretches of data wrangling, some dead ends, and the occasional clean result. If one sentence sums it up, it is the one I wrote on May 18: “this entire project has just been holy-crap-web-scale-data-is-horrible-to-work-with.”

Week 1: Getting the data in the door

Thu, May 7. Downloaded 100,000 songs with synced lyric metadata from the Spotify dump (347.7 GiB at 128 kbps) and, working from the OpenVPI list, collected every major public phoneme-annotated singing dataset in en/ja/zh (another 33.6 GiB, mostly at 44.1 kHz). Finalized canonical phonesets, drawing on the Synthesizer V phonesets, and wrote adapters mapping each dataset into one unified phoneset for joint training. Adapted STARS to phoneme alignment and trained a few first models.

Takeaway: audio quality is a different axis than scale. You can train on all the 128 kbps data you want and the output still sounds compressed, because that is the distribution it learned.
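
For readers who have not seen a phoneset adapter, here is a minimal sketch of the idea from the May 7 entry: each dataset gets a lookup table into the unified phoneset, and unmapped labels fail loudly instead of passing through. The two tables, the `lang/phone` naming, and `make_adapter` are hypothetical illustrations, not the contents of the released package.

```python
from typing import Callable, Dict, List

# Hypothetical label tables for two made-up datasets; the real phonesets
# and dataset conventions are not reproduced in this log.
DATASET_A = {"sil": "<SP>", "br": "<AP>", "a": "zh/a", "sh": "zh/sh"}
DATASET_B = {"": "<SP>", "breath": "<AP>", "AA": "en/aa", "SH": "en/sh"}


def make_adapter(table: Dict[str, str]) -> Callable[[List[str]], List[str]]:
    """Build a function that maps one dataset's labels to the unified set."""

    def adapt(labels: List[str]) -> List[str]:
        # Collect every unmapped label so the error lists all of them at once.
        unknown = sorted({lab for lab in labels if lab not in table})
        if unknown:
            raise KeyError(f"unmapped labels: {unknown}")
        return [table[lab] for lab in labels]

    return adapt


adapt_a = make_adapter(DATASET_A)
adapt_b = make_adapter(DATASET_B)
```

Raising on unknown labels is a deliberate choice in this sketch: a silent pass-through would let one dataset's private symbols leak into the joint vocabulary.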

Fri, May 8. Squashed two nasty bugs from day one: vocab entry 1 was being dropped by the Viterbi decoder, which killed <AP> pauses and tanked English/Japanese performance, and GTSinger sample names were colliding in the binarized metadata. Stood up a TPUv4-32 JAX environment for the eventual acoustic model, and fought the Google Cloud multislice docs. Surveyed vocoder formats and began setting up the vocal-extraction pipeline (mel-band-roformer + FCPE for f0), plus a SingNet-style cleaning FX chain in REAPER.

Takeaway: data preprocessing takes forever without the right equipment.

Mon, May 11. Decided how to handle <AP>/<SP> breath and silence: a VAD model filtered by energy. Implemented <SP> detection with Silero VAD plus a 2007 breath-detection paper for <AP>. Found real pathologies in MFA-annotated datasets (M4Singer, OpenCpop), especially the lengths of <SP> markers around <AP>, and began normalizing them.

Takeaway: data work will be the end of me.
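
One possible reading of "a VAD model filtered by energy" is sketched below: take the non-speech gaps a VAD reports and keep as <SP> only those whose RMS energy is very low, leaving louder gaps for a separate breath detector. The gap list stands in for real VAD output, and the function names, the "non-silent" label, and the -50 dB threshold are assumptions made for illustration.

```python
import numpy as np


def rms_db(x: np.ndarray) -> float:
    """RMS level in dB relative to full scale, floored to avoid log(0)."""
    x = np.asarray(x, dtype=np.float64)
    rms = float(np.sqrt(np.mean(np.square(x))))
    return 20.0 * np.log10(max(rms, 1e-10))


def label_gaps(audio, sr, gaps, silence_db=-50.0):
    """Label each (start, end) gap in seconds as <SP> or 'non-silent'.

    `gaps` is assumed to come from a VAD; `silence_db` is a hypothetical
    threshold that would need tuning per corpus.
    """
    out = []
    for start, end in gaps:
        seg = audio[int(start * sr):int(end * sr)]
        if seg.size == 0:
            continue  # gap lies outside the audio
        label = "<SP>" if rms_db(seg) < silence_db else "non-silent"
        out.append((start, end, label))
    return out
```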

Tue, May 12. Learned the central lesson from my mentor: datasets care more about intra-dataset consistency than accuracy; consistently wrong labels still train fine, but mixing inconsistent ones does not. Kicked off vocal extraction on the raw song corpus, a roughly 4.5-day job that hogs the 4090. Discovered that most of my time-aligned .lrc files are not aligned to the YouTube audio, which is fixable by deriving onsets from vocal activity. Settled on REAPER batch FX processing and chose SingMOS for vocal-segment scoring.

Week 2: Cleaning, scoring, and the breath problem

Wed, May 13 · Thu, May 14. (Academic Awards in the morning, half-day on campus.) Ran the truncated SingNet REAPER pipeline at about 0.2 files/sec, roughly the same rate as the vocal extractor, so the two could run in parallel. Tested SingMOS on cheap versus expensive cleaning configs; the scorer itself is expensive. Started a dedicated repository of “the scripts that work,” to grow into a one-stop data-prep pipeline as I pass checkpoints.

Fri, May 15. Realized breath/silence detection needs a continuous output instead of binary classification targets, since confidence scores are much easier for the decoder to condition on. This opens questions about calibration and about training the downstream aligner on synthetic data, since labeling conventions differ so widely across datasets.

Takeaway: I now understand why frontier labs pay so much for data work, and why Kanru Hua just paid annotators himself instead of sourcing open data.

Week 3: Baselines, mentees, and DiffSinger

Mon, May 18. Vocal separation finished over the weekend; kicked off REAPER processing on the English set. Settled a clean .lrc re-alignment recipe with my mentor: strip leading silence, anchor the first non-empty timestamp to 0s, verify with Whisper and Levenshtein distance. Opened applications for the EleutherAI Summer of Open AI Research, which I am counting as ISP work, since I am mentoring project A-4, Neolyre, and did the frontend and application-design work as an organizer.
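
The lyric side of the .lrc recipe can be sketched as follows, assuming the common `[mm:ss.xx]text` line format. It does not trim audio or call Whisper: the transcript is assumed to arrive as a plain string, and `parse_lrc`, `anchor_to_zero`, and `mismatch` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Matches "[mm:ss.xx]text"; metadata tags such as "[ar:...]" do not match.
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\](.*)")


def parse_lrc(text: str) -> List[Tuple[float, str]]:
    """Return (seconds, lyric) pairs for every timestamped line."""
    rows = []
    for line in text.splitlines():
        m = LRC_LINE.match(line.strip())
        if m:
            seconds = int(m.group(1)) * 60 + float(m.group(2))
            rows.append((seconds, m.group(3).strip()))
    return rows


def anchor_to_zero(rows: List[Tuple[float, str]]) -> List[Tuple[float, str]]:
    """Shift timestamps so the first non-empty lyric line starts at 0 s."""
    first = next((t for t, s in rows if s), None)
    if first is None:
        return []
    return [(round(t - first, 3), s) for t, s in rows if t >= first]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def mismatch(lyric: str, transcript: str) -> float:
    """Length-normalized edit distance; 0.0 means an exact match."""
    return levenshtein(lyric, transcript) / max(len(lyric), len(transcript), 1)
```

A line whose `mismatch` against the transcript of its audio window exceeds some tolerance would be flagged for review; the tolerance itself is a tuning decision not recorded in this log.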

Tue, May 19. Accepted my first SOAR mentee. Packaged the phoneset and adapter code and released it publicly: github.com/Neolyre/svs_datasets. Took a break from data to read and annotate the DiffSinger literature.

Takeaway: flow matching and diffusion are two sides of the same coin; rectified flow matching is easy to implement.
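
To make "easy to implement" concrete, here is a minimal NumPy sketch of the rectified flow training loss: sample noise, interpolate linearly toward the data, and regress the model's predicted velocity onto the constant direction from noise to data. Conditioning inputs are omitted, and `model` is a hypothetical callable taking `(x_t, t)`; this is not the OpenVPI implementation.

```python
import numpy as np


def rectified_flow_loss(model, x1: np.ndarray, rng: np.random.Generator) -> float:
    """Mean squared error between predicted and straight-line velocity.

    x1 is a batch of data samples with the batch on axis 0.
    """
    x0 = rng.standard_normal(x1.shape)            # noise endpoint
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)
    t = rng.uniform(size=t_shape)                 # one time in [0, 1) per sample
    xt = (1.0 - t) * x0 + t * x1                  # point on the straight path
    target = x1 - x0                              # velocity along that path
    pred = model(xt, t)
    return float(np.mean((pred - target) ** 2))
```

In a real training loop the same three lines (sample, interpolate, regress) would sit inside an autodiff framework so the loss can be differentiated with respect to the model's parameters.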

Wed, May 20. Onboarded the first mentee to replicate DiffSinger as our baseline; he has 16×V100, so compute will not be his problem. SingNet post-processing finished on English and moved to Chinese. The 44.1 kHz intermediates are eating my NAS.

Takeaway: at web scale the I/O-bound regime is as real as the compute- or memory-bound ones.

Thu, May 21. Replicated the DiffSinger paper on PopCS locally; OpenVPI already implements flow matching as an objective. Set up SingMOS scoring: generated audio scores 4.03 versus 4.02 for ground-truth PopCS, while random noise scores around 2.0, so the metric is good enough for sanity checks but not for fine distinctions. Retrained on M4Singer to make it an actual challenge.

Takeaway: SVS models are small; evaluation is really, really hard.

Fri, May 22. Pointed my mentee to better signal-processing resources. Explored OpenUTAU as a way to generate synthetic data with perfect labels from UST project files, though it needs programmatic rendering, which does not exist yet. Found that swapping DiffSinger’s non-causal WaveNet for a naive Transformer drops SingMOS to about 3.22.

Takeaway: I do not yet have good priors on what makes a good inductive bias for audio. Why convolutions?

Week 4: Synthetic data, the team, and the editor

Tue, May 26. Fixed empty-string handling in PopCS in the svs_datasets package: PopCS uses empty strings as <SP>, and mishandling them had been silently corrupting interval offsets and hurting Mandarin alignment for days. Retrained the Mandarin aligners with the fix and saw much better performance. More SOAR application processing and group logistics.

Takeaway: LOOK AT YOUR DATA. And trust, but verify, the output of AI coding tools.
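
The sketch below shows one way this class of bug can arise; it is not necessarily the actual bug in svs_datasets. If empty labels are skipped before durations are accumulated, every later interval shifts earlier by the dropped silence. The labels, durations, and function names are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]


def to_intervals(labels: List[str], durations: List[float]) -> List[Interval]:
    """Convert parallel label/duration lists to (start, end, label) intervals,
    treating an empty label as <SP> so its duration still advances time."""
    if len(labels) != len(durations):
        raise ValueError("labels and durations must have the same length")
    out, t = [], 0.0
    for lab, dur in zip(labels, durations):
        out.append((t, t + dur, lab if lab else "<SP>"))
        t += dur
    return out


def to_intervals_buggy(labels: List[str], durations: List[float]) -> List[Interval]:
    """Illustrative failure mode: skipping empty labels also skips their
    durations, so later offsets are silently wrong."""
    out, t = [], 0.0
    for lab, dur in zip(labels, durations):
        if not lab:
            continue  # the silence's duration is lost here
        out.append((t, t + dur, lab))
        t += dur
    return out
```

With labels `["n", "i", "", "h", "ao"]` and durations `[0.25, 0.5, 0.5, 0.25, 0.5]`, the correct version starts "h" at 1.25 s while the buggy one starts it at 0.75 s, and nothing raises an error in either case.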

Wed, May 27. Wrote a PR to OpenUTAU for command-line batch rendering (stakira/OpenUtau#2162), picking up some C# to do it, so we can mass-generate audio with perfect annotations from voicebank and project-file combinations. Concluded that silence handling across datasets is messy enough that it is cleaner to re-derive all breath/silence annotations ourselves with one unified method, which also lets the annotator train on the whole corpus.

Takeaway: design good standards so you do not end up like xkcd 927.

Thu, May 28. Onboarded another mentee, on the vocoder, with more arriving across data, annotation, and forced alignment. Tested and submitted the OpenUTAU PR. Spent hours in VOCALOID reproducing songs as UX and controllability market research.

Takeaway: vocal parameters are nearly impossible to understand without trying them yourself; vocal timbre transfer is the feature people actually want.

Fri, May 29. Onboarded the last mentee, on forced alignment. Met with another SOAR organizer about automated application review, did some manual filtering, and traded ideas with a music-tech startup founder who reached out. A day of back-to-back meetings with training runs in the background.

Takeaway: music tech is well-regarded but easy to sell badly; product-market fit is hard to measure, because people do not tell you what they actually want (faster horses).

Where things stand

By the end of the month the data layer has real, released components and a working, if compute-bound, pipeline; the acoustic-model baseline is replicated and understood; the vocoder and frontend are scoped and handed to mentees; and the project has a team to carry it past June. The full deliverables (code, the upstream contribution, figures, and the market-research artifacts) are on the deliverables page.
