How I Built a Talking AI Agent That Sounds Like Me (in 100 Lines of Code)

Imagine having a voice assistant that doesn’t just respond from predefined answers but actually retrieves relevant context from your own local documents and responds using your own cloned voice.

Welcome to the era of real-time voice-based Retrieval-Augmented Generation (RAG) agents!

In this post, I’ll walk you through:

  • How I built a real-time voice-based RAG agent

  • How I integrated AssemblyAI for transcription

  • How the agent uses LiveKit for voice streaming

  • How it responds in my cloned voice using Cartesia AI

  • Practical real-life use cases that make this more than a fun demo

⚙️ Architecture Overview

Here’s what’s happening under the hood:

Technologies Used:

  • 🧠 LLM: Google’s Gemma 3, served locally via Ollama (wired in through LlamaIndex)

  • 📄 Indexing: Local PDFs indexed with LlamaIndex

  • 🗣️ STT: AssemblyAI

  • 🧏‍♂️ TTS: Cartesia AI

  • 🎤 VAD: Silero

  • 📡 Orchestration: LiveKit agents playground
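
Roughly, each question makes a round trip through the stack:

mic → LiveKit room → Silero VAD → AssemblyAI STT → LlamaIndex retrieval + Gemma 3 → Cartesia TTS → LiveKit room → speakers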

🛠️ Building the Agent

1. Set up your environment

pip install -r requirements.txt
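
If you're recreating requirements.txt yourself, the imports in the final code suggest roughly this set of packages (verify the exact names and pin versions that work together for your setup):

livekit-agents
livekit-plugins-silero
livekit-plugins-cartesia
livekit-plugins-assemblyai
livekit-plugins-llama-index
llama-index
llama-index-llms-ollama
llama-index-embeddings-huggingface
python-dotenv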

Create a .env file with:

CARTESIA_API_KEY=your_cartesia_api_key
LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_api_key

(The stack runs Gemma 3 locally through Ollama, so no OpenAI key is needed.)

2. Index your documents

Place your PDFs in a docs/ folder and run:

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# embed with the same local model the agent uses, so the persisted
# index can be reloaded and queried later
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

reader = SimpleDirectoryReader("docs")
docs = reader.load_data()
index = VectorStoreIndex.from_documents(docs)

# persist the index so it doesn't have to be rebuilt on every run
index.storage_context.persist(persist_dir="chat-engine-storage")
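
Before adding voice on top, it's worth sanity-checking the index with a plain text query. Here's a minimal check, using the same local models as the agent (see the full listing below); the question string is just an example:

from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# point LlamaIndex at the same local LLM and embedding model the agent uses
Settings.llm = Ollama(model="gemma3", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# reload the persisted index and ask a throwaway question
storage_context = StorageContext.from_defaults(persist_dir="chat-engine-storage")
index = load_index_from_storage(storage_context)
print(index.as_query_engine().query("What are these documents about?"))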

3. Configure your agent (Python)

Use the VoicePipelineAgent from LiveKit. Key configuration:

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),                         # Silero voice activity detection
    stt=assemblyai.STT(),                          # AssemblyAI speech-to-text
    llm=llama_index.LLM(chat_engine=chat_engine),  # RAG over your indexed docs
    tts=cartesia.TTS(
        model="sonic-2",
        voice="your-cloned-voice-id",
    ),
    chat_ctx=chat_context,
)
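
The chat_engine and chat_context referenced above come from LlamaIndex (see the full listing below). ChatMode.CONTEXT retrieves relevant chunks from the index on every user turn and injects them into the prompt, which is what makes this a RAG agent rather than a plain chatbot:

chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)
chat_context = ChatContext().append(
    role="system",
    text="You are a funny, witty assistant. Respond with short and concise answers.",
)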

4. Run the agent

python voice_agent.py start
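
While iterating locally, the LiveKit agents CLI also ships a dev mode with auto-reload (subcommands can vary by version; python voice_agent.py --help lists what your install supports):

python voice_agent.py dev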

Now head to agents-playground.livekit.io, and click Connect. Speak your question out loud—something relevant to the documents you indexed in Step 2.

🧠 The agent will detect your voice, transcribe it using AssemblyAI, fetch context from your documents, generate a response using Gemma 3, and speak it back using Cartesia’s voice synthesis engine.

5. 🎧 Clone Your Voice with Cartesia

By default, your agent uses a robotic voice provided by Cartesia, but you can replace that with your own cloned voice.

Here's how:

1. Visit Cartesia.ai

2. Click on Instant Voice Clone

3. Provide a name and a short 5-second audio sample (e.g., “Hey, this is Aditya and this is my voice for cloning”)

4. Cartesia will return a voice_id

5. Replace the default voice ID in your .env file:

CARTESIA_VOICE_ID=your_cloned_voice_id

6. Restart your agent to use your own voice!
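
For the .env value to actually take effect, the TTS plugin has to read it; in the full listing below I pass it through os.getenv, with the stock voice ID as a fallback:

tts=cartesia.TTS(
    model="sonic-2",
    voice=os.getenv("CARTESIA_VOICE_ID", "794f9389-aac1-45b6-b726-9d9369183238"),
),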


Below is the full Python code I used to build this agent:

import logging
import os

from dotenv import load_dotenv
from livekit.agents import JobContext, JobProcess, WorkerOptions, cli
from livekit.agents.job import AutoSubscribe
from livekit.agents.llm import ChatContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import assemblyai, cartesia, llama_index, silero
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.chat_engine.types import ChatMode
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

load_dotenv()

logger = logging.getLogger("voice-assistant")

# Gemma 3 runs locally via Ollama; embeddings come from a small local model
Settings.llm = Ollama(model="gemma3", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# build the index on the first run, reload it from disk afterwards
PERSIST_DIR = "./chat-engine-storage"
if not os.path.exists(PERSIST_DIR):
    documents = SimpleDirectoryReader("docs").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)


def prewarm(proc: JobProcess):
    # load the Silero VAD model once per worker process
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    chat_context = ChatContext().append(
        role="system",
        text=(
            "You are a funny, witty assistant. "
            "Respond with short and concise answers. "
            "Avoid using unpronounceable punctuation or emojis."
        ),
    )

    # CONTEXT mode retrieves relevant chunks from the index on every turn
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.CONTEXT)
    logger.info(f"Connecting to room {ctx.room.name}")

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    logger.info(f"Starting voice assistant for participant {participant.identity}")

    agent = VoicePipelineAgent(
        vad=ctx.proc.userdata["vad"],
        stt=assemblyai.STT(),
        llm=llama_index.LLM(chat_engine=chat_engine),
        tts=cartesia.TTS(
            model="sonic-2",
            # use the cloned voice from .env if set, else the stock voice
            voice=os.getenv("CARTESIA_VOICE_ID", "794f9389-aac1-45b6-b726-9d9369183238"),
        ),
        chat_ctx=chat_context,
    )

    agent.start(ctx.room, participant)

    await agent.say(
        "Hey there! How can I help you today?",
        allow_interruptions=True,
    )


if __name__ == "__main__":
    print("Starting voice agent...")
    cli.run_app(
        WorkerOptions(
            entrypoint_fnc=entrypoint,
            prewarm_fnc=prewarm,
        ),
    )
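
Two small design notes on the listing above. The prewarm hook loads the Silero VAD model once per worker process, before any job is assigned, so a connecting participant never waits on model load. And since the index is built or reloaded at import time, the first spoken question doesn't trigger a full re-indexing of your PDFs.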

🧠 Real-Life Use Cases

Here’s where this gets seriously powerful:

1. 👩‍🏫 Education & Research

Ask your research papers questions simply by speaking. No UI. No ChatGPT. Just voice.

2. 🏢 Enterprise Knowledge Assistants

Replace dumb IVRs. Let users talk to internal wikis, HR docs, SOPs.

3. 🦳 Elderly or Visually Impaired Users

Make documentation searchable and explorable without screens.

4. 🔧 Developer Agents

Let engineers talk to API docs, CLI tools, changelogs—all via voice.

5. 🤖 Personalized Voice Agents

Build your own Jarvis-style assistant trained on your workflows.

Stay tuned for more experiments in real-time AI agents 🚀
