Reverb
Conversational AI Podcast

Audio content like audiobooks and podcasts often leaves us passive, unable to engage or take notes. Being unable to search or dig deeper into what we are hearing while driving, cooking, or exercising creates a situational disability. I am trying to make audio content more interactive and conversational using AI.

For the scope of my thesis, I focused on creating interactive podcasts and on how AI conversations could increase knowledge retention. But it’s not just about innovation; it’s about ethics. The thesis explores how emerging technologies could create new interactions for electracy (electronic literacy).

Duration
4 Months
(Master Thesis)
Mentors

Eric Paulos

Hugh Dubberly

Yoon Bahk

Hila Mor

Fields explored

Generative Artificial Intelligence

Retrieval Augmented Generation

User Experience Design

Human Interface Design

Interaction Design

Full Stack Development

UX for AI

Tools

Figma

Visual Studio Code

Whisper AI

ChatGPT

LlamaIndex

ElevenLabs

Adobe Suite

HTML, JavaScript, Python, C++

User Research
  • Difficulty in Note-Taking: Listeners struggle to capture information efficiently while actively consuming content, especially for complex topics

  • Knowledge Retention: Notes lack context, making it difficult for listeners to recall the information later

  • Multitasking Inefficiency: Constantly switching between apps for research, note-taking, and audio playback creates disruption and cognitive overload

  • Accessibility Challenges: Visually impaired individuals often face difficulties navigating complex interfaces, finding specific information, and using features that lack auditory cues

Who is this product for?
  • Individuals with Learning Differences: People with ADHD, dyslexia, or other learning disabilities may find it challenging to focus, remember information, or switch between apps quickly

  • Multitaskers: People who listen to audio while driving, cooking, working out, or doing other activities often struggle to take notes or research information efficiently

  • Accessibility Needs: Visually impaired individuals can benefit from audio descriptions

  • Active listeners: Individuals who want to deepen their understanding of topics

Competitive analysis and market gap

Siri, Google Assistant, and Alexa lack contextual understanding and conversational engagement specific to audio media consumption. They are good at task utilities but don't know what podcast you're listening to or the speaker's context. Some platforms offer transcripts but lack interactive features or conversational AI. Platforms like Curio.io, Snipd, and Huberman Lab provide high-quality audio content and summarization but lack real-time conversational features.

Why now?

For detailed information, read the full thesis here. 

Design Goal
Metamorphosis of one-way audio into dynamic and interactive conversations through Conversational AI.
Product Breakdown

The final design is anchored in 3 primary pillars, each addressing key user pain points identified through research.

Turn up the volume to hear the demos!
Contextual

Conversational: Enhances our capacity to articulate thoughts, resolve problems, and rapidly assimilate information through interactive dialogues

Personalization: Tailored to user preferences, enabling listeners to customize their experience.


Easy Note-taking: Retain knowledge by automatically saving podcast segments on command.

Transcript Access: Easy access to transcripts for better comprehension.

Easy Access Information

Catch-me Up: Catches users up on missed information when they get distracted, ensuring a smooth listening experience.

Wrap-Up/Summarize: Provides a quick overview of key information and insights.
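The Catch-me Up idea above can be sketched in a few lines: collect the transcript text between the moment the listener got distracted and the current playback position, so it can be handed to the language model for summarization. The segment format and function name are illustrative assumptions, not the thesis's actual implementation.

```python
# Hypothetical sketch of "Catch-me Up": gather transcript text between
# the distraction timestamp and the current playback position.
# Segment format and names are illustrative, not from the thesis.

def catch_me_up(segments, distracted_at, now):
    """Return the transcript text spoken between two playback timestamps.

    segments: list of dicts with 'start' (seconds) and 'text' keys,
              ordered by start time, as a caption track would provide.
    """
    missed = [s["text"] for s in segments if distracted_at <= s["start"] < now]
    return " ".join(missed)

segments = [
    {"start": 0.0,  "text": "Welcome back to the show."},
    {"start": 12.5, "text": "Today we discuss sleep and memory."},
    {"start": 30.0, "text": "Deep sleep consolidates new information."},
]

print(catch_me_up(segments, distracted_at=10.0, now=35.0))
```

The returned text would then be summarized by the model rather than read back verbatim.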

Interlinks

Cross-referencing: Enriches the listener’s journey through interconnected narratives and bridges the gap in discovering similar content by weaving a web of references across podcasts.

Mental Model and Interactions
Leveraging existing mental models
  • A familiar, Spotify-like interface makes onboarding seamless.

  • Intuitive AI-powered features with no steep learning curve, building on familiar mental models for audio content consumption.

Personalized options for interaction

Multiple interaction modes like text, voice, and quick action buttons contribute to the shift in user perception of AI from a "tool" to a "collaborator."

  • Knowledge of Users: AI needs to "know" the user's intent. "Reverb" allows users to express their intent in a manner that suits their preferences and context

  • Acting on Behalf of Users: AI should act based on the user's goals. By providing multiple interaction modes, it empowers users to direct the AI's actions in a way that aligns with their individual needs. 

Tech Stack
Extraction of key data from podcast and the user flow

For this thesis, I focused on common podcast platforms like YouTube. To embed a video player, I used YouTube’s iframe API, which provides access to podcast data such as audio, timestamps, descriptions, and captions via JavaScript. I developed the frontend in JavaScript and HTML, with backend processing in Python.
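On the backend side, one small but essential step is mapping the playback timestamp reported by the frontend player onto the caption track. A minimal sketch of that lookup, assuming captions arrive as (start time, text) pairs (the data format and function name are my illustration, not the thesis's code):

```python
# Illustrative backend helper: given a caption track sorted by start time,
# find which caption is being spoken at the current playback timestamp
# reported by the embedded player.
import bisect

def active_caption(captions, timestamp):
    """captions: list of (start_seconds, text) tuples sorted by start."""
    starts = [start for start, _ in captions]
    i = bisect.bisect_right(starts, timestamp) - 1
    return captions[i][1] if i >= 0 else None

captions = [(0.0, "Intro music"), (8.0, "Guest introduction"), (21.0, "Main topic")]
print(active_caption(captions, 10.5))  # → "Guest introduction"
```

This timestamp-to-caption alignment is what lets the AI answer questions about "what was just said" rather than about the episode in general.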

User prompt analysis for relevant information generation and retrieval

In this thesis, prompt design, a crucial aspect of conversational AI, is meticulously addressed to ensure natural, relevant, and meaningful AI interactions. Commands include General Queries for Q&A with the AI podcaster, “Save” to save podcast segments, “Summarize” for timestamped summaries, and “Recommend” to suggest podcasts based on interactions with Reverb. The prompt engineering is tailored to the podcast listening context, ensuring the AI understands the exact timestamp and the context of conversations happening at that moment.
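The command set above can be sketched as a simple router that classifies an utterance and attaches the current playback timestamp as context. The keyword-matching rule here is an assumption for illustration; the actual system relies on prompt engineering with the language model rather than string matching.

```python
# Hedged sketch of command routing: classify a user utterance into one of
# the thesis's commands (save, summarize, recommend, or a general query)
# and attach the playback timestamp so the AI has conversational context.
# Keyword matching is a stand-in for the real prompt-based classification.

def route_prompt(utterance, timestamp):
    text = utterance.lower()
    if "save" in text:
        intent = "save_segment"
    elif "summarize" in text or "wrap up" in text:
        intent = "summarize"
    elif "recommend" in text:
        intent = "recommend"
    else:
        intent = "general_query"
    return {"intent": intent, "timestamp": timestamp, "utterance": utterance}

print(route_prompt("Save that last bit", 754.2))
```

Anything that does not match a command falls through to a general Q&A with the AI podcaster, grounded in the transcript around the given timestamp.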


ℹ️ Voice cloning, done with the podcaster's permission to ensure ethical standards, creates an immersive experience. If permission is not granted, the default voice assistant is used.

ℹ️ The tech stack flow is based on technologies available as of September 2023. Since then, advancements like GPT-4, Gemini 1.5 Pro, Claude 3, Hume.ai, and Deepgram have emerged, which I plan to explore and integrate in future works.

Timestamp-aware calculations
Tailored information fetching
Geolocation-based response
Intelligent speaker diarization
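One post-processing step behind speaker diarization can be sketched simply: merging consecutive transcript segments from the same speaker into labeled turns, the form shown in the interface mockups. The input format is an illustrative assumption; the actual diarization model is not shown here.

```python
# Hedged sketch of diarization post-processing: merge consecutive
# transcript segments from the same speaker into labeled turns.
# Input format is illustrative, not the thesis's actual pipeline.

def merge_turns(segments):
    """segments: list of (speaker, text) pairs in playback order."""
    turns = []
    for speaker, text in segments:
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous segment: extend that turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns

segments = [
    ("Host", "Welcome back."),
    ("Host", "Today's guest studies sleep."),
    ("Guest", "Thanks for having me."),
]
print(merge_turns(segments))
```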

Since the thesis is yet to be published, I would be happy to show an in-person demo.

Contact: shikha.shah@berkeley.edu 

Future enhancements

My thesis was ahead of the curve, anticipating several technological advancements. Since then, technologies like GPT-4, Gemini 1.5 Pro, Claude 3, Hume.ai, and Deepgram have emerged, which I plan to integrate into future works. These will significantly advance my goal of making podcasts more conversational using AI. With Gemini 1.5 Pro, I'll expand video podcast comprehension for a richer, more interactive experience. Hume.ai will interpret speaker emotions, adding nuance to AI responses. Deepgram will enhance speaker diarization for better clarity and engagement. Leveraging these technologies, I aim to create a more immersive and conversational podcast experience.
