‘Low latency critical for enterprise-grade voice AI’: Gnani.ai CEO Ganesh Gopalan | Technology News

With AI-powered voice assistants set to reshape customer support, Indian AI startups are racing to build foundational voice AI models that go beyond simple text-to-speech translation. By focusing on phonetics, prosody, semantics, and intent, these homegrown conversational AI tools aim to generate speech that preserves tone, emotion, pacing and pauses in order to make these interactions more natural and human-like.
One such speech-to-speech AI model was released as a research preview during the ongoing AI Impact Summit hosted by India at Bharat Mandapam in New Delhi. The model, called InyaVoice OS, was developed by Gnani.ai, which is among the cohort of Indian AI startups selected under the Centre’s Rs 10,372-crore IndiaAI Mission to build sovereign AI models and strengthen India’s position in the global AI race.
Ganesh Gopalan, CEO and co-founder of Gnani.ai, spoke to The Indian Express about various issues regarding voice AI on the sidelines of the AI Impact Summit. He discussed why it is important to build voice AI models to scale and the challenges that come with sourcing Indic language voice datasets for AI training. The Bengaluru-based AI startup, which is backed by tech giants such as Samsung and InfoEdge, is also reportedly in talks with investors for fresh funding amid rising demand for enterprise voice automation.
Q: What are the advantages of voice as an interface for interacting with AI systems? Does integrating voice AI capabilities make it easier for applications to onboard users?
Gopalan: Voice is the most natural form of communication in the world. If humans love talking to humans using voice, and humans talk to machines, why should it be any different? Voice AI capabilities that are multilingual make the most sense because it is similar to how we have colloquial conversations. And whatever is true for human conversations is true for human-machine conversations. That’s why we believe that voice AI will be the game-changer.
Q: How important is it for voice AI systems to support multiple languages?
Gopalan: The ability to hold a multilingual conversation is a basic necessity for voice AI systems. It is also critical for India to focus on developing speech-to-speech AI models rather than models that extract the meaning from text and play it back as voice, which is what most startups are building.
This approach is flawed because when I’m talking to you, even if you don’t see me, your ears are listening to not only what I say, but also the emotion behind what I’m saying. We’re trying to bring emotions to human-machine conversations with our AI models.
We are also looking to reduce the risk of hallucinations in our AI models by having fewer layers. Low latency is also critical for voice AI. If you are having a telephony conversation with a voice AI system and it hesitates for even a second, you’ll slam the phone down.
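The point about fewer layers and latency can be made concrete with a back-of-the-envelope latency budget. The sketch below contrasts a cascaded pipeline (speech-to-text, then a language model, then text-to-speech) with a single end-to-end speech-to-speech model; all timings are hypothetical illustrations, not figures from Gnani.ai.

```python
# Illustrative latency budget for one conversational turn.
# Every number here is a made-up example for the sake of argument.

cascaded_ms = {
    "asr (speech -> text)": 300,
    "llm (text -> reply text)": 700,
    "tts (text -> speech)": 250,
    "inter-stage overhead": 150,
}

speech_to_speech_ms = {
    "end-to-end model": 600,
}

def total(stages):
    """Sum the per-stage latencies for one response."""
    return sum(stages.values())

print(f"cascaded total:         {total(cascaded_ms)} ms")
print(f"speech-to-speech total: {total(speech_to_speech_ms)} ms")
```

A telephony caller notices a pause well under a second, so every stage removed from the pipeline directly loosens the response budget, and each removed hand-off is also one fewer place for errors to compound.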
Q: Could you walk us through the development cycle of InyaVoice OS? Where do you see it being deployed first, and are you targeting any specific sectors or use cases?
Gopalan: One of the most important elements of AI is the data you collect and use for training purposes. A few years back, we started collecting voice data to build our earlier voice AI models. We have the largest annotated voice dataset for Indian languages and it is proprietary.
Q: What do you mean by proprietary data?
Gopalan: Essentially, when you develop an AI system, two or three components go into it. To train the model, you need a lot of data, and we have been collecting that data since 2017. We had a map of India in our office, and every day we would mark it when we got data from a new part of the country. We were never satisfied until we had covered every district of India.
Q: To what extent did you rely on Indic language datasets made available under government initiatives such as Bhashini and AIKosh for AI training purposes?
Gopalan: We used some datasets from AIKosh; the repository is now being enhanced with data from Doordarshan and other sources. A lot of our training data was proprietary, and some of it was publicly available. For certain needs, we relied on synthetic datasets. It depends on what you need the data for, and it is always a mix.
Q: How do you scale voice-based language models? Does it require more or less compute than frontier, multimodal AI models?
Gopalan: It’s a great question. It’s so important, especially in voice AI, to have control over each and every component of your AI pipeline. If you say, “I’ll call global APIs and make things work,” you can give great demos and win some customers, but you won’t survive, because in production at scale those things don’t work.
What happens at scale is that your pricing and cost structures get tested, and the accuracy of your voice AI models gets tested too, because we are talking about real-time systems. Response latencies are also put to the test.
Unless you own each and every element of your software ecosystem and AI pipeline, you’re going to face problems. That’s why we are proud to be working on developing foundational AI models as opposed to wrappers, where a company says they do AI but are actually just calling APIs of foundational models developed by others. I don’t think wrappers stand a chance, at least considering the state of AI today.
Q: Given the crowded and fast-commoditising AI landscape, how important is it to develop voice-first AI devices? And are you concerned about voice AI becoming absorbed into larger smartphone or OS ecosystems?
Gopalan: I think everything has a purpose. A B2C use case is very different from a B2B use case. OpenAI is not necessarily meant for the kinds of systems we build. For example, we use SLMs to solve industry-specific problems. We’re also working with smartphone OEMs, helping them with deep tech capabilities. The market is huge. There’s room for many companies.
Take smartphones, for instance. It may be somewhat tangential, but except for Gen Z, most of us still primarily use touch. Gen Z users tend to speak into their devices most of the time. Maybe that behavior will shift things. But until user behavior changes more broadly, companies may not heavily invest in voice-first approaches. That said, everyone will have a role to play.
Q: Compute is often cited as a major barrier for AI startups in India. Under the IndiaAI Mission, was compute made available at a more affordable rate compared to sourcing it independently or through global cloud providers?
Gopalan: Compute is definitely one of the biggest barriers right now. Even today, for example, we don’t have Nvidia’s H100 GPUs in India. They’re starting to come in now, but they’ve been available in the US for a while. So access to cutting-edge compute is still an issue.
That said, making compute available is one thing the government has done well. We’ve benefited from it and do make use of it. They’re offering it at a fairly low price. If you look at the IndiaAI Mission website and compare the pricing with global rates, you’ll see it’s probably among the lowest in the world.
They’ve done a good job, without a doubt. It’s all transparent. You can check the pricing directly on the IndiaAI Mission website. I’m not sure of the latest numbers, but I think it’s a little over a dollar per hour for an H100. Some private providers are still charging Rs 500–600 per hour. So overall, it’s quite reasonable.
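The price gap Gopalan describes can be worked out roughly from the figures quoted above. The exchange rate below is an assumption for illustration only; the quoted rates themselves come from the interview and may not reflect current pricing.

```python
# Rough H100 GPU-hour cost comparison, using the figures quoted in the
# interview. The rupee-dollar rate is an assumed round number.

indiaai_usd_per_hour = 1.2           # "a little over a dollar per hour"
private_inr_per_hour = (500, 600)    # "Rs 500-600 per hour"
inr_per_usd = 88.0                   # assumed exchange rate, illustrative

private_usd = tuple(round(x / inr_per_usd, 2) for x in private_inr_per_hour)
print(f"IndiaAI Mission:   ~${indiaai_usd_per_hour}/hr")
print(f"Private providers: ~${private_usd[0]}-${private_usd[1]}/hr")
```

Under these assumptions the subsidised rate works out to roughly a fifth of the lower end of the private range, which is the scale of difference that makes the government offering notable.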
Q: On the regulatory front, one of the major concerns is the misuse of voice cloning for scams and fraud. While there has been significant focus on provenance and watermarking for AI-generated text and images, are there emerging technical standards or frameworks for identifying and labelling AI-generated audio and voice?
Gopalan: We’re one company that has both technologies, and we have two separate teams that effectively compete with each other. One team works on voice cloning, which is important for personalisation and similar use cases. This product is called Inya Assist, which does voice cloning and text-to-speech.
We also have a product called Inya Shield that handles voice biometrics and voice authentication. So we build both capabilities. One system aims to create cloning technology that’s highly sophisticated, while the other works to detect any kind of cloning or spoofing. It’s an ongoing battle, and that dynamic will exist in any system.
There are multiple ways to address voice AI-enabled identity theft beyond basic guardrails and analysing voice. You can also look at behavioral patterns and use dynamic passphrases. For example, older voice AI systems relied on a single passphrase to access your bank account. You would say, “My voice is my password”. But the newer systems we’re building for many banks use dynamic passphrases that change every time.
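The shift from a static voice password to a dynamic passphrase can be sketched in a few lines. The word list and protocol below are illustrative assumptions, not Gnani.ai's implementation: the key property is that the phrase is freshly generated per session, so a replayed or cloned recording of an earlier call no longer matches the challenge.

```python
# Minimal sketch of a dynamic-passphrase challenge, as opposed to the
# old static "My voice is my password" prompt. Word list is hypothetical.
import secrets

WORDS = ["river", "amber", "falcon", "cedar",
         "lotus", "granite", "mango", "orbit"]

def new_challenge(n_words=4):
    """Issue a fresh phrase the caller must speak aloud. Because it is
    never reused, a recording of a previous session fails the check."""
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))

challenge = new_challenge()
print("Please repeat:", challenge)
# The server then verifies two things independently: that the voiceprint
# matches the enrolled speaker, AND that the spoken words match this
# one-time phrase.
```

In a real deployment the transcript check would run through speech recognition and the voiceprint through a speaker-verification model; the sketch only shows the challenge-generation side.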