Imagine having a voice assistant that doesn’t just respond from predefined answers but actually retrieves relevant context from your own local documents and responds using your own cloned voice.
Welcome to the era of real-time voice-based Retrieval-Augmented Generation (RAG) agents!
In this post, I’ll walk you through:
How I built a real-time voice-based RAG agent
How I integrated AssemblyAI for transcription
How the agent uses LiveKit for voice streaming
How it responds in my cloned voice using Cartesia AI
Practical real-life use cases that make this more than a fun demo
Here’s what’s happening under the hood:
🧠 LLM: Google’s Gemma 3 via LlamaIndex
📄 Indexing: Local PDFs indexed with LlamaIndex
🗣️ STT: AssemblyAI
🧏‍♂️ TTS: Cartesia AI
🎤 VAD: Silero
📡 Orchestration: LiveKit Agents (tested in the agents playground)
First, install the dependencies:
pip install -r requirements.txt
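If you don't already have a requirements.txt, mine contains roughly the following; the exact package names are inferred from the imports in the code below, so treat this list as an assumption and verify it against your environment:
livekit-agents
livekit-plugins-assemblyai
livekit-plugins-cartesia
livekit-plugins-silero
livekit-plugins-llama-index
llama-index
llama-index-llms-ollama
llama-index-embeddings-huggingface
python-dotenv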
Create a .env file with:
OPENAI_API_KEY=your_openai_api_key
CARTESIA_API_KEY=your_cartesia_api_key
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_api_key
Place your PDFs in a docs/ folder and run:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
reader = SimpleDirectoryReader("docs")
docs = reader.load_data()
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir="chat-engine-storage")
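To confirm the index actually picked up your PDFs, you can run a quick one-off query before wiring it into the voice agent. This is optional, and the question string below is just a placeholder:
# optional sanity check: ask the freshly built index a question
query_engine = index.as_query_engine()
response = query_engine.query("What are these documents about?")
print(response)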
Use the VoicePipelineAgent from LiveKit. Key configurations:
agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=assemblyai.STT(),
    llm=llama_index.LLM(chat_engine=chat_engine),
    tts=cartesia.TTS(
        model="sonic-2",
        voice="your-cloned-voice-id",
    ),
    chat_ctx=chat_context,
)
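The chat_engine passed to llama_index.LLM is a LlamaIndex chat engine built over the index persisted in the previous step. In the full code below it's created with the CONTEXT chat mode, which retrieves relevant document chunks on every turn:
from llama_index.core.chat_engine.types import ChatMode

# retrieval-backed chat engine over the persisted index
chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)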
Then start the agent:
python voice_agent.py start
Now head to agents-playground.livekit.io, and click Connect. Speak your question out loud—something relevant to the documents you indexed in Step 2.
🧠 The agent will detect your voice, transcribe it using AssemblyAI, fetch context from your documents, generate a response using Gemma 3, and speak it back using Cartesia’s voice synthesis engine.
By default, your agent uses a robotic voice provided by Cartesia. But you can replace that with your own cloned voice.
Here’s how:
Visit Cartesia.ai
Click on Instant Voice Clone
Provide a name and a short 5-second audio sample (e.g., “Hey, this is Aditya and this is my voice for cloning”)
Cartesia will return a voice_id
Replace the default voice ID in your .env file:
CARTESIA_VOICE_ID=your_cloned_voice_id
Restart your agent to use your own voice!
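The full code below hard-codes the Cartesia voice ID. If you'd rather have the agent pick up the CARTESIA_VOICE_ID you just added to .env, a small tweak (my assumption, not part of the original snippet) to the TTS setup does it:
import os

# read the cloned voice ID from .env, falling back to the stock voice
voice_id = os.getenv("CARTESIA_VOICE_ID", "794f9389-aac1-45b6-b726-9d9369183238")

tts = cartesia.TTS(
    model="sonic-2",
    voice=voice_id,
)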
Below is the full Python code I used to build this agent:
import logging
import os
from dotenv import load_dotenv
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.agents.job import AutoSubscribe
from livekit.agents.llm import (
    ChatContext,
)
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import cartesia, silero, llama_index, assemblyai
load_dotenv()
logger = logging.getLogger("voice-assistant")
from llama_index.llms.ollama import Ollama
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
    Settings,
)
from llama_index.core.chat_engine.types import ChatMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
llm = Ollama(model="gemma3", request_timeout=120.0)
Settings.llm = llm
Settings.embed_model = embed_model
# check if storage already exists
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
def prewarm(proc: JobProcess):
    # load the Silero VAD model once per worker process so each job can reuse it
    proc.userdata["vad"] = silero.VAD.load()
async def entrypoint(ctx: JobContext):
    # system prompt that sets the assistant's persona
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. Avoid using unpronounceable punctuation or emojis."
        ),
    )
    # retrieval-backed chat engine over the indexed documents
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)

    logger.info(f"Connecting to room {ctx.room.name}")
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    participant = await ctx.wait_for_participant()
    logger.info(f"Starting voice assistant for participant {participant.identity}")

    stt_impl = assemblyai.STT()

    # wire VAD, STT, the RAG-backed LLM, and TTS into a single voice pipeline
    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=stt_impl,
        llm=llama_index.LLM(chat_engine=chat_engine),
        tts=cartesia.TTS(
            model="sonic-2",
            voice="794f9389-aac1-45b6-b726-9d9369183238",
        ),
        chat_ctx=chat_context,
    )
    agent.start(ctx.room, participant)
    await agent.say(
        "Hey there! How can I help you today?",
        allow_interruptions=True,
    )
if __name__ == "__main__":
    print("Starting voice agent...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
Here’s where this gets seriously powerful:
Ask questions about research papers just by speaking. No UI. No ChatGPT. Just voice.
Replace dumb IVRs. Let users talk to internal wikis, HR docs, SOPs.
Make documentation searchable and explorable without screens.
Let engineers talk to API docs, CLI tools, changelogs—all via voice.
Build your own Jarvis-style assistant trained on your workflows.
Stay tuned for more experiments in real-time AI agents 🚀