Neural Approaches to Vocal Synthesis

Independent Senior Project · The Pingry School · May 2026
Christian Zhou-Zheng  |  Mentor: Ronald McClellan Jr.
A data pipeline for web-scale singing voice synthesis

Abstract

Singing voice synthesis (SVS) is the task of generating a human singing voice from digital input. It is the technology behind Hatsune Miku, Kasane Teto, and other VOCALOIDs and “virtual singers.” State-of-the-art SVS models are either closed-source (VOCALOID, SynthV) or use older architectures and train on limited data (DiffSinger, NNSVS). This project aims to bring open-source SVS into the modern era by scaling up data pipelines and applying current architectures and modeling paradigms. Its planned releases are a full data pipeline, a pretrained acoustic model, and a pretrained vocoder.

For my ISP I spent the month attacking this problem end to end, as the capstone of a longer-running project codenamed Neolyre. The thesis is simple: if we can build a pipeline that turns web-scale unlabeled audio into aligned, restoration-cleaned training data, the bitter lesson (scaling compute and data beats all else) can finally be brought to bear on singing voice. From there, every classical component of an SVS stack (data, acoustic model, vocoder, and the editor a human actually uses) becomes a place to ask: what have the last few years of sequence modeling made obsolete?

The work spans four layers: data pipelines, the acoustic model, the vocoder, and the frontend. For the latter part of the program, work was conducted in parallel with project A-4, “Neolyre,” of the EleutherAI Summer of Open AI Research, where I serve as a mentor and organizer, and where the project will continue after the ISP period ends.

Essential questions

  1. How can we meaningfully characterize, encode, and represent the natural structure of human speech patterns?
  2. How can recent advancements in sequence modeling be applied to speech and/or singing voice synthesis?
  3. Where is the Pareto frontier of speech and/or singing voice synthesis?

What’s on this site

Background & Lit Review

What SVS is, why DiffSinger and HiFi-GAN are the things to beat, and the literature I read to get oriented: flow matching, vocoders, alignment, evaluation.

Development Log

A near-daily account of the month: 100k songs downloaded, datasets unified, aligners trained, breath detectors tuned, and a great deal of fighting with data.

Deliverables

Released code (the svs_datasets package), an upstream OpenUTAU contribution, a replicated DiffSinger baseline, a breath/silence detector, and mentoring.


Hosted on GitHub Pages as the deliverable for Pingry ISP 2026. The broader project continues at github.com/Neolyre.