Imagine having a voice assistant that doesn’t just respond from predefined answers but actually retrieves relevant context from your own local documents and responds using your own cloned voice.
Welcome to the era of real-time voice-based Retrieval-Augmented Generation (RAG) agents!
In this post, I’ll walk you through:
How I built a real-time voice-based RAG agent
How I integrated AssemblyAI for transcription
How the agent uses LiveKit for voice streaming
How it responds in my cloned voice using Cartesia AI
Practical real-life use cases that make this more than a fun demo
Here’s what’s happening under the hood:
🧠 LLM: Google’s Gemma 3 via LlamaIndex
📄 Indexing: Local PDFs indexed with LlamaIndex
🗣️ STT: AssemblyAI
🧏‍♂️ TTS: Cartesia AI
🎤 VAD: Silero
📡 Orchestration: LiveKit Agents (tested in the agents playground)
First, install the dependencies:
pip install -r requirements.txt
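If you don't already have a requirements.txt, mine contains roughly the following; the exact package names are inferred from the imports in the code below, so treat this list as an assumption and verify it against your environment:
livekit-agents
livekit-plugins-assemblyai
livekit-plugins-cartesia
livekit-plugins-silero
livekit-plugins-llama-index
llama-index
llama-index-llms-ollama
llama-index-embeddings-huggingface
python-dotenv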
Create a .env file with:
OPENAI_API_KEY=your_openai_api_key
CARTESIA_API_KEY=your_cartesia_api_key
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_api_key
Place your PDFs in a docs/ folder and run:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
reader = SimpleDirectoryReader("docs")
docs = reader.load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir="chat-engine-storage")
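To confirm the index actually picked up your PDFs, you can run a quick one-off query before wiring it into the voice agent. This is optional, and the question string below is just a placeholder:
# optional sanity check: ask the freshly built index a question
query_engine = index.as_query_engine()
response = query_engine.query("What are these documents about?")
print(response)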
Use the VoicePipelineAgent from LiveKit. Key configurations:
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=assemblyai.STT(),
    llm=llama_index.LLM(chat_engine=chat_engine),
    tts=cartesia.TTS(
        model="sonic-2",
        voice="your-cloned-voice-id",
    ),
    chat_ctx=chat_context,
)
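The chat_engine passed to llama_index.LLM is a LlamaIndex chat engine built over the index persisted in the previous step. In the full code below it's created with the CONTEXT chat mode, which retrieves relevant document chunks on every turn:
from llama_index.core.chat_engine.types import ChatMode

# retrieval-backed chat engine over the persisted index
chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)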
Then start the agent:
python voice_agent.py start
Now head to agents-playground.livekit.io, and click Connect. Speak your question out loud—something relevant to the documents you indexed in Step 2.
🧠 The agent will detect your voice, transcribe it using AssemblyAI, fetch context from your documents, generate a response using Gemma 3, and speak it back using Cartesia’s voice synthesis engine.
By default, your agent uses a robotic voice provided by Cartesia. But you can replace that with your own cloned voice.
Here’s how:
Visit Cartesia.ai
Click on Instant Voice Clone
Provide a name and a short 5-second audio sample (e.g., “Hey, this is Aditya and this is my voice for cloning”)
Cartesia will return a voice_id
Replace the default voice ID in your .env file:
CARTESIA_VOICE_ID=your_cloned_voice_id
Restart your agent to use your own voice!
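The full code below hard-codes the Cartesia voice ID. If you'd rather have the agent pick up the CARTESIA_VOICE_ID you just added to .env, a small tweak (my assumption, not part of the original snippet) to the TTS setup does it:
import os

# read the cloned voice ID from .env, falling back to the stock voice
voice_id = os.getenv("CARTESIA_VOICE_ID", "794f9389-aac1-45b6-b726-9d9369183238")

tts = cartesia.TTS(
    model="sonic-2",
    voice=voice_id,
)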
Below is the full Python code I used to build this agent:
import logging
import os
from dotenv import load_dotenv
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.agents.job import AutoSubscribe
from livekit.agents.llm import (
    ChatContext,
)
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, silero, llama_index, assemblyai
load_dotenv()
logger = logging.getLogger("voice-assistant")
from llama_index.llms.ollama import Ollama
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from llama_index.core.chat_engine.types import ChatMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Ollama(model="gemma3", request_timeout=120.0)
Settings.llm = llm
Settings.embed_model = embed_model
# check if storage already exists
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
def prewarm(proc: JobProcess):
    # load the Silero VAD model once per worker process so each job can reuse it
    proc.userdata["vad"] = silero.VAD.load()
async def entrypoint(ctx: JobContext):
    # system prompt that sets the assistant's persona
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. Avoid using unpronounceable punctuation or emojis."
        ),
    )
    # retrieval-backed chat engine over the indexed documents
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)

    logger.info(f"Connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()
    logger.info(f"Starting voice assistant for participant {participant.identity}")

    stt_impl = assemblyai.STT()

    # wire VAD, STT, the RAG-backed LLM, and TTS into a single voice pipeline
    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=stt_impl,
        llm=llama_index.LLM(chat_engine=chat_engine),
        tts=cartesia.TTS(
            model="sonic-2",
            voice="794f9389-aac1-45b6-b726-9d9369183238",
        ),
        chat_ctx=chat_context,
    )
    agent.start(ctx.room, participant)
    await agent.say(
        "Hey there! How can I help you today?",
        allow_interruptions=True,
    )
if __name__ == "__main__":
    print("Starting voice agent...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
Here’s where this gets seriously powerful:
Ask questions about research papers just by speaking. No UI. No ChatGPT. Just voice.
Replace dumb IVRs. Let users talk to internal wikis, HR docs, SOPs.
Make documentation searchable and explorable without screens.
Let engineers talk to API docs, CLI tools, changelogs—all via voice.
Build your own Jarvis-style assistant trained on your workflows.
Stay tuned for more experiments in real-time AI agents 🚀