Neural Approaches to Vocal Synthesis | Christian Y. Zhou-Zheng

Independent Senior Project · The Pingry School · May 2026
Christian Zhou-Zheng | Mentor: Ronald McClellan Jr.

Abstract

Singing voice synthesis (SVS) is the task of generating a human singing voice from digital input. It is the technology behind Hatsune Miku, Kasane Teto, and other VOCALOIDs and “virtual singers.” State-of-the-art SVS models are either closed-source (VOCALOID, SynthV) or use older architectures and train on limited data (DiffSinger, NNSVS). This project aims to bring open-source SVS into the modern era by scaling up data pipelines and applying recent architectures and modeling paradigms. Its planned releases are a full data pipeline, a pretrained acoustic model, and a pretrained vocoder.

For my ISP I spent the month attacking this problem end to end, as the capstone of a longer-running project I run under the codename Neolyre. The thesis is simple: if we can build a pipeline that turns web-scale unlabeled audio into aligned, restoration-cleaned training data, the bitter lesson (scaling compute and data beats all else) can finally be brought to bear on singing voice. From there, every classical component of an SVS stack (data, acoustic model, vocoder, and the editor a human actually uses) becomes a place to ask: what have the last few years of sequence modeling made obsolete?

The work spans four layers: data pipelines, the acoustic model, the vocoder, and the frontend. For the latter part of the program, work was conducted in parallel with project A-4, “Neolyre,” of the EleutherAI Summer of Open AI Research, where I serve as a mentor and organizer and where the project will continue after ISPs end.

Essential questions

How can we meaningfully characterize, encode, and represent the natural structure of human speech patterns?
How can recent advancements in sequence modeling be applied to speech and/or singing voice synthesis?
Where is the Pareto frontier of speech and/or singing voice synthesis?

What’s on this site

Background & Lit Review

Development Log

Deliverables