In a significant step to broaden access to classic literature, Project Gutenberg has partnered with the Massachusetts Institute of Technology (MIT) and Microsoft to create a vast collection of audiobooks using AI. The project has released thousands of free audiobooks on major platforms such as Spotify, Apple Podcasts, and Google Podcasts.
The project leverages recent advances in human-like neural text-to-speech to bring thousands of beloved books to life in a new, accessible audio format. It can even read books in a user’s own voice given only 5 seconds of audio.
This initiative, led by Mark Hamilton (MIT) and Brendan Walsh (Microsoft), along with supervising professor William T. Freeman (MIT), seeks to democratize access to literature for individuals with visual impairments, language learners, children, and those who simply prefer to listen to their books.
Harnessing AI to Scale Audiobook Production
No matter whether you are learning to read, looking for inclusive reading technology, or about to head out on a long drive, audiobooks can be a great resource. However, creating audiobooks isn’t quite as easy as pressing play. Recording professional human readers can be time-consuming and costly, requiring hundreds of hours of reading time per book.
With book publication rates on the rise, creators are hunting for faster solutions. Automated audiobook production offers a promising alternative, but it has historically been plagued by clunky, robotic narration. Furthermore, it’s hard for algorithms to work out which parts of an e-book should be read aloud. Humans know to skip page numbers, tables of contents, and footnotes, but algorithms must be clever to avoid these pitfalls.
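To make the filtering problem concrete, here is a minimal Python sketch of dropping lines a listener would not want to hear. It is an illustration only, not the team's parser: the three patterns and the function name `narratable_lines` are assumptions chosen for this example, and real e-books would need far more careful rules.

```python
import re

# Hypothetical patterns for text that should not be read aloud.
# A bare number, optionally prefixed by "Page", e.g. "12" or "Page 13".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)
# A footnote marker at the start of a line, e.g. "[Footnote 1: ...]" or "[3]".
FOOTNOTE = re.compile(r"^\s*\[(?:footnote\s*)?\d+[:\]]", re.IGNORECASE)
# A table-of-contents entry: a title, dot leaders, then a page number.
TOC_ENTRY = re.compile(r"^\s*.+?\s*\.{3,}\s*\d+\s*$")


def narratable_lines(lines):
    """Return the lines that look like narrative text, stripped of whitespace."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if PAGE_NUMBER.match(line) or FOOTNOTE.match(line) or TOC_ENTRY.match(line):
            continue  # skip page numbers, footnotes, and contents entries
        kept.append(line.strip())
    return kept


sample = [
    "Chapter I ........ 3",
    "12",
    "It was a bright cold day.",
    "[Footnote 1: See the appendix.]",
    "",
    "Page 13",
    "The clocks were striking.",
]
print(narratable_lines(sample))
```

Simple rules like these also show why the task is hard: a line containing only a year would be mistaken for a page number, which is the kind of pitfall the article describes.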
Project Gutenberg, the oldest online e-book library with over 60,000 works, is acutely aware of these challenges. “We had tried to make audiobooks in the past, but the quality just wasn’t very good so we abandoned the effort,” says Project Gutenberg CEO Greg Newby. “With this new technology, our partners were able to create audiobooks of vastly better quality much faster than ever before.”
The project uses new advances in neural text-to-speech to create lifelike voices that sound similar to native human speakers. The approach uses a deep network that’s trained to mimic the quality and tone of native speakers, can speak a variety of languages, and can even identify and stylize the reading of emotional text.
Judging Books by Their Structure
With a high-quality text-to-speech model in hand, the team set out to narrate as many of Project Gutenberg’s 60,000+ books as possible. Mark Hamilton, one of the project leads, shares that this was the toughest part. “It’s difficult to find even two books in Project Gutenberg that have exactly the same structure. Though the books display nicely for online readers, they contain all sorts of text you wouldn’t want to hear in your audiobook. It became more of an art than a science to find what users would want to hear in a given book.”
To address this, the team searched the collection for large groups of books with a similar look and file format. This made it possible to create targeted parsers that could adapt to each book’s idiosyncrasies. In the end, the team identified over 5,000 books which could be parsed with reasonable accuracy.
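One way to picture the grouping step is as a fingerprinting pass over the collection. The Python sketch below is a hedged illustration, not the team's method: the `structure_signature` fingerprint (file format plus which structural keywords appear), the keyword list, and the book records are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical structural keywords used to fingerprint a book.
MARKERS = ("contents", "chapter", "footnote", "illustration")


def structure_signature(book):
    """Return a crude fingerprint: the file format plus the markers present."""
    text = book["text"].lower()
    present = tuple(sorted(m for m in MARKERS if m in text))
    return (book["format"], present)


def group_by_structure(books, min_group_size=2):
    """Group book titles by fingerprint, keeping only groups that are large
    enough to justify writing a dedicated parser for them."""
    groups = defaultdict(list)
    for book in books:
        groups[structure_signature(book)].append(book["title"])
    return {sig: titles for sig, titles in groups.items()
            if len(titles) >= min_group_size}


# Made-up records standing in for entries in the collection.
books = [
    {"title": "A", "format": "html", "text": "CONTENTS ... Chapter 1 ..."},
    {"title": "B", "format": "html", "text": "Contents. Chapter I."},
    {"title": "C", "format": "txt", "text": "Footnote 1 ..."},
]
print(group_by_structure(books))
```

In this toy run, books A and B share a fingerprint and would get one shared parser, while C is left out because no other book resembles it.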
Speaking Millions of Sentences
The next challenge the team faced was how to efficiently speak the millions of sentences extracted from the five thousand books. Ordinarily, this would take a long time even for a computer. To make sure these algorithms could scale, the team used the SynapseML distributed computing library [GitHub, Paper] to orchestrate millions of model inference calls across hundreds of machines. This allowed the researchers to quickly use modern text-to-speech services such as VALL-E and Microsoft AI to create over 35,000 hours of audiobooks in a little more than two hours, at no cost to the Project Gutenberg nonprofit.
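The core idea, many independent inference calls running at once, can be sketched on a single machine with Python's standard library. This is not SynapseML's API and not the team's pipeline: `synthesize` is a hypothetical stand-in for a remote text-to-speech call, and a thread pool replaces the cluster of machines described above.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize(sentence):
    """Hypothetical stand-in for a remote text-to-speech request.
    Returns placeholder bytes so the sketch runs offline."""
    return sentence.encode("utf-8")


def synthesize_all(sentences, max_workers=8):
    """Run independent synthesis calls concurrently.
    pool.map returns results in input order, so the audio stays in sequence."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(synthesize, sentences))


def realtime_factor(audio_hours, wall_clock_hours):
    """Hours of audio produced per hour of wall-clock time."""
    return audio_hours / wall_clock_hours


clips = synthesize_all(["Call me Ishmael.", "It was the best of times."])
print(len(clips))
print(realtime_factor(35_000, 2))
```

Because each sentence can be synthesized independently, the work divides cleanly across workers. Using the figures reported above, 35,000 hours of audio in about two hours works out to roughly 17,500 hours of audio per wall-clock hour, or slightly less since the run took a little more than two hours.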
For interested audiophiles, the complete collection of audiobooks can be streamed for free on most major podcast platforms including Spotify, Google Podcasts, Apple Podcasts, and the Internet Archive.
Creating Audiobooks in Your Own Voice
After donating 5,000 books back to the public domain, the team demonstrated an application that could make an entire audiobook in someone's own voice, using only 5 seconds of example audio. This demonstration, Large-Scale Automatic Audiobook Creation, which was showcased at the Interspeech 2023 conference, illustrated how the latest advancements in generative speech could be quickly used to make custom audiobooks for anyone with a microphone. The team hopes to explore whether this technology can help create more inclusive audiobooks that foster a more personal connection between the listeners and their favorite works.
Bringing Classic Literature to a Global Audience
Thanks to the partnership, Project Gutenberg has broadened its audiobook collection by nearly 5,000 titles, which are now available on popular platforms such as Spotify and Apple Podcasts. Newby views this as a milestone in Project Gutenberg's journey, expressing optimism that “our library is more accessible than ever.”