A voice-activated gaming assistant that captures screenshots and provides real-time AI-powered analysis using Gemini Vision.
- Voice Activation: Wake word detection using "Hey Specta"
- Screenshot Analysis: Automatic screenshot capture and AI analysis
- Real-time Conversation: Natural voice conversations with context retention
- Gaming-Focused: Specialized prompts for gaming assistance
- Professional Audio: High-quality speech-to-text and text-to-speech
SPECTA VOICE GAMING ASSISTANT
System Architecture
┌─────────────┐ ┌─────────────┐ ┌──────────────────────────────┐
│ Microphone │───▶│ Wake Word │───▶│ PIPECAT PIPELINE │
│ Audio │ │ Detection │ │ │
└─────────────┘ │ (Picovoice) │ │ ┌─────────────────────────┐ │
└─────────────┘ │ │ LocalAudioTransport │ │
│ │ + Silero VAD │ │
│ └──────────┬──────────────┘ │
│ │ │
│ ┌──────────▼──────────────┐ │
│ │ STT Mute Filter │ │
│ └──────────┬──────────────┘ │
│ │ │
│ ┌──────────▼──────────────┐ │
│ │ Whisper STT Service │ │
│ └──────────┬──────────────┘ │
│ │ │
│ ┌──────────▼──────────────┐ │
┌─────────────┐ │ │ First Query Handler │ │
│ Screenshot │◀──────────────────────│ │ • Screenshot Capture │ │
│ Storage │ │ │ • Gemini Vision API │ │
└─────────────┘ │ │ • Response Parsing │ │
│ └──────────┬──────────────┘ │
│ │ │
│ ┌──────────▼──────────────┐ │
│ │ Context Aggregator │ │
│ │ (User) │ │
│ └──────────┬──────────────┘ │
│ │ │
│ ┌──────────▼──────────────┐ │
│ │ Gemini LLM Service │ │
│ │ (Follow-up queries) │ │
│ └──────────┬──────────────┘ │
│ │ │
│ ┌──────────▼──────────────┐ │
│ │ Deepgram TTS │ │
│ └──────────┬──────────────┘ │
│ │ │
│ ┌──────────▼──────────────┐ │
│ │ Audio Output │ │
┌─────────────┐ │ └──────────┬──────────────┘ │
│ Speakers/ │◀──────────────────────│ │ │
│ Headphones │ │ ┌──────────▼──────────────┐ │
└─────────────┘ │ │ Context Aggregator │ │
│ │ (Assistant) │ │
│ └─────────────────────────┘ │
└──────────────────────────────┘
Flow: Audio → "Hey Specta" → STT → Screenshot+Vision → Context → LLM → TTS → Audio
Key Components:
- Wake word detection (Picovoice)
- Speech-to-Text (Whisper)
- Screenshot capture (PIL)
- AI analysis (Gemini 2.5 Flash)
- Text-to-Speech (Deepgram)
- Context management (OpenAI-compatible)
- Python 3.8+
- API Keys:
GEMINI_API_KEY- Google Gemini AIDEEPGRAM_API_KEY- Deepgram speech servicesPICOVOICE_ACCESS_KEY- Picovoice wake word detection
- Clone the repository:
git clone https://github.com/yourusername/Specta.git
cd Specta- Install dependencies:
pip install -r requirements.txt- Set up environment variables:
cp .env.example .env
# Edit .env with your API keys- Run the assistant:
python specta.py- Say "Hey Specta" to activate
- Ask gaming-related questions
- The assistant will capture screenshots and provide contextual help
- Wake Word: Customizable via
hey_specta.ppnfile - Screenshots: Saved to
screenshots/directory - Audio: 16kHz sample rate with VAD
pipecat-ai- Voice pipeline frameworkgoogle-generativeai- Gemini AI integrationpvporcupine- Wake word detectiondeepgram-sdk- Speech serviceswhisper- Speech-to-textPIL- Screenshot capturesounddevice- Audio I/O
MIT License - see LICENSE file for details.
Pull requests welcome! Please ensure:
- Clean, professional code
- No hardcoded secrets
- Proper error handling
- Documentation updates