Publishers need to embrace the exciting potential of AI narration for audiobooks.
Check out my first comment piece for The Bookseller in full below, or the original by following this link. It is great to be back writing about publishing.
If you want to listen to the excellent Matt Jamie reading my book N-4 Down: The Hunt for the Arctic Airship Italia, then click on this link.
Since my comment piece was published, The Guardian has published this comment piece: Why AI audiobook narrators could win over some authors and readers, despite the vocal bumps.
The first review last year of my non-fiction book’s audiobook complimented narrator Matt Jamie on his confident British accent and engaging delivery. Will an AI ever be as good as Jamie is?
While panels at FutureBook discussed the future of audiobooks, with the occasional polite mention of “artificial voice”, at lunchtime, Google Play Books demonstrated to a half-empty auditorium its text-to-speech (TTS) technology, which allows publishers to create high-quality audiobooks at little cost.
From TikTok to the New Scientist, text to ever-more-convincing speech is everywhere – except publishing. This silence says to me that, despite five years of conversations with AI start-ups, publishing isn’t an industry ready to grapple with the inevitable – the way artificial voice will revolutionise audiobooks – nor to face the ethical dilemmas it will present, such as using the voice of a beloved deceased actor to narrate an audiobook.
Of course, publishers will talk about consumer demand for human narrators. Yet, despite the caution of publishers, the arguments for using AI to produce audiobooks are indisputable and will only become stronger. I click play and am astonished by the quality of the artificial speech samples I am listening to. My 12-year-old son listens with me and says simply: “Audiobooks are going to be free.”
For over 200 years scientists have attempted to generate human speech through mechanical means by mimicking various organs used by humans to produce speech, such as bellows for the lungs and a tube for the vocal tract. Now computer models such as DeepMind’s WaveNet and Google’s Tacotron have achieved that virtually, using technologies like neural networks that mimic the way the human brain works to continuously improve speed and accuracy.
Startups use these models as the base on which to build their own applications, each with its own unique selling point. For UK-based DeepZen this is to capture emotion, and it is credited with producing the world’s first AI-generated audiobook in 2021, the 350-page psychological thriller She Chose Me by Tracey Emerson.
The argument that a human narrator is intrinsically better is flawed and subjective – I have pressed stop many times because I thought an AI could do a better job. Artificial voice applications use human voices to learn from, and use cloned human voice replicas to deliver realistic tone and emotion.
The production of a traditional audiobook is an expensive and time-consuming business. Each needs roles including actor, editor and proofer; a ten-hour audiobook can take 60 hours to complete, over a couple of months. The use of artificial voice can cut the cost of production down from roughly $2,500 for a standard-length piece of fiction to $400, and reduce the time required to a matter of days.
“It is a no-brainer,” DeepZen co-founder and c.e.o. Taylan Kamis says. There are, he tells me, over 100 million print books in the world and 20–25 million e-books, but only half a million audiobooks, 90% of which are in English and half of which were produced in the last four to five years. The cost and time it can take to produce an audiobook is the “main bottleneck”, particularly when it comes to non-English markets. For a publisher, the opportunity is vast.
Then there is the democratisation that artificial voice can deliver. Small, independent publishers whose books might not usually be turned into audiobooks can now consider producing them with artificial voice. Similarly, publishers of books in minority languages will be able to create new audiobooks for their communities. TTS can give all those books a chance to find a voice, literally: new titles that aren’t licensed for audio because of the cost of production, overlooked backlists, dry academic tomes and books in minority languages.
Yes, artificial voices may never be good enough in every language, but they are getting better and better in their accents, pacing and intonation. “Every now and then there is one that blows my mind,” says Nathan Hull, chief strategy officer, Beat Technology AS.
There may be a clash between publishers, who can be as dedicated to the production of an audiobook and the human voice as they are to a print book, and the “tech bros” who want to solve the “problem” of all those unrecorded books. However, there will be room for both. The market in the future is likely to resemble a pyramid, with the mass of texts at the base produced cheaply by TTS and a smaller number at the top given high-quality human-voice productions.
For now, artificial voice might work better in business or academic texts, with the buyers of those books tending to buy them for information rather than storytelling. That doesn’t mean that artificial voice won’t work well for fiction in the end.
One issue will be distribution. The world’s largest distributor and creator of audiobooks, Audible, currently only distributes audiobooks narrated by humans, but this will change. “As publishers, we have a responsibility to our authors,” Jon Watt, audio director, Bonnier Books UK, tells me – and for now that means human narrators.
The market for audiobooks – and publishers’ attitudes to them – has changed radically over the past five years. Audiobooks have gone from an afterthought in a contract negotiation to a format that is considered on acquisition, and Spotify is now distributing audiobooks to millions of users who haven’t listened to them before.
Publishers’ attitudes will continue to evolve as the technology does and Generation Z grows up. Over the next five years the growing demand for audio content, the low cost of producing artificially voiced audiobooks and the demand for democratisation will break the dam. TTS may even create a completely new art form.
I tracked down Matt Jamie, the human narrator of my book, to ask what he thinks about artificial voice. He tells me that he has considered licensing his voice, and that at least one agency has considered creating a separate section for artificial voice, but for now he is not losing any sleep. Perhaps he should.