Neural Approaches to Vocal Synthesis
Independent Senior Project 2026
Christian Zhou-Zheng | Mentor: Ronald McClellan Jr.
Abstract
Singing voice synthesis (SVS) is the task of generating a human singing voice from digital input. It is the technology behind Hatsune Miku, Kasane Teto, and other VOCALOIDs and “virtual singers.” State-of-the-art SVS models are either closed-source (VOCALOID, SynthV) or use older architectures and train on limited data (DiffSinger, NNSVS). This project aims to bring open-source SVS into the modern era by scaling up data pipelines and applying modern architectures and modeling paradigms. Its planned releases are a full data pipeline, a pretrained acoustic model, and a pretrained vocoder.
For my ISP I spent the month attacking this problem end to end, as the capstone of a longer-running project I run under the codename Neolyre. The thesis is simple: if we can build a pipeline that turns web-scale unlabeled audio into aligned, restoration-cleaned training data, the bitter lesson (scaling compute and data beats all else) can finally be brought to bear on singing voice. From there, every classical component of an SVS stack (data, acoustic model, vocoder, and the editor a human actually uses) becomes a place to ask: what have the last few years of sequence modeling made obsolete?
The work spans four layers: data pipelines, the acoustic model, the vocoder, and the frontend. For the latter part of the program, work was conducted in parallel with project A-4 “Neolyre” of the EleutherAI Summer of Open AI Research, where I serve as a mentor and organizer, and where the project will continue after ISPs.
Essential questions
- How can we meaningfully characterize, encode, and represent the natural structure of human speech patterns?
- How can recent advancements in sequence modeling be applied to speech and/or singing voice synthesis?
- Where is the Pareto frontier of speech and/or singing voice synthesis?
What’s on this site
Background & Lit Review
What SVS is, why DiffSinger and HiFi-GAN are the things to beat, and the literature I read to get oriented: flow matching, vocoders, alignment, evaluation.
Development Log
A near-daily account of the month: 100k songs downloaded, datasets unified, aligners trained, breath detectors tuned, and a great deal of fighting with data.
Deliverables
Released code (the svs_datasets package), an upstream OpenUTAU contribution, a replicated DiffSinger baseline, a breath/silence detector, and mentoring.
Hosted on GitHub Pages as the deliverable for Pingry ISP 2026. The broader project continues at github.com/Neolyre.