The system is straightforward:

- You describe your goal — "I want to set up two-factor authentication on my Google account" or "Help me configure my Git SSH keys"
- You share your screen — The app uses your browser's built-in screen sharing (the same tech used for video calls)
- AI analyzes what it sees — Vision language models look at your screen and figure out the current state
- You get one instruction at a time — No information overload. Just "Click the blue Settings button in the top right" or "Scroll down to find Security"
- Automatic progress detection — When you complete a step, Screen Vision notices the screen changed and automatically gives you the next instruction
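The flow above can be sketched as a loop. This is an illustrative sketch only — the function names are stand-ins for the real screen-capture and AI calls, not the project's actual code:

```python
def guidance_loop(goal, capture_frame, get_next_step, step_completed, poll=lambda: None):
    """One-instruction-at-a-time loop: show a step, wait for the screen
    to change, then fetch the next step.

    All callables here are hypothetical stand-ins for the real
    screen-capture and AI calls.
    """
    shown = []
    before = capture_frame()
    instruction = get_next_step(goal, before)
    while instruction is not None:
        shown.append(instruction)          # surface exactly one instruction
        poll()                             # wait for the user to act
        after = capture_frame()
        if step_completed(before, after):  # screen changed: advance
            before = after
            instruction = get_next_step(goal, before)
    return shown
```

The key design point this captures: the user never sees a full checklist, only the current step, and progress is inferred from screen changes rather than manual confirmation.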
| Model | Provider | Purpose |
|---|---|---|
| GPT-5.2 | OpenAI | Primary reasoning: generates step-by-step instructions and answers follow-up questions |
| Gemini 3 Flash | Google AI Studio | Step verification: compares before/after screenshots to confirm action completion |
| Qwen3-VL 30B | Fireworks AI | Coordinate detection: locates specific UI elements on screen |
Screen Vision is designed to process your data securely without retaining it.
- Zero Data Retention: No images or screen recordings are stored on the server. All processing happens in real-time, and data is discarded immediately after analysis.
- Secure AI Processing: Screenshots are sent to trusted LLM providers (OpenAI and Fireworks AI) solely for analysis. These providers adhere to strict data handling policies and do not store or use your data to train their models.
- Frontend: Next.js 13, React 18, Tailwind CSS, Zustand
- Backend: FastAPI, Python
- AI: OpenAI GPT models, Qwen-VL (via OpenRouter)
- UI: Radix primitives, Framer Motion, Lucide icons
Frontend (Next.js + React)
- Handles screen capture via the MediaDevices API
- Runs change detection by comparing scaled-down frames
- Manages the PiP window for always-on-top instructions
- Masks its own window from screenshots (so the AI doesn't see itself)
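The change-detection idea — compare scaled-down frames rather than full-resolution screenshots — is simple enough to sketch. The actual app does this in the browser; the Python below is for illustration only, and the function names are mine, not the project's:

```python
def downscale(pixels, width, height, factor):
    """Average `factor` x `factor` blocks of a grayscale frame
    (row-major list of pixel values) into a much smaller frame."""
    out = []
    for by in range(0, height - factor + 1, factor):
        for bx in range(0, width - factor + 1, factor):
            block = [pixels[(by + y) * width + (bx + x)]
                     for y in range(factor) for x in range(factor)]
            out.append(sum(block) / len(block))
    return out

def frames_differ(a, b, threshold=10.0):
    """Mean absolute difference between two downscaled frames;
    above the threshold counts as 'the screen changed'."""
    mad = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return mad > threshold
```

Downscaling first makes the comparison cheap enough to run on every captured frame, and averaging blocks also smooths out noise like cursor blinks.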
Backend (FastAPI + Python)
- `/api/step` — Given a goal and screenshot, returns the next single instruction
- `/api/check` — Compares before/after screenshots to verify if a step was completed
- `/api/help` — Answers follow-up questions about what's on screen
- `/api/coordinates` — Locates specific UI elements when needed
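The request/response shapes aren't documented here, but a `/api/step` exchange plausibly carries the goal plus the current screenshot and returns a single instruction. Every field name below is an assumption, not the documented contract:

```python
import json

# Hypothetical POST /api/step request body (field names are guesses).
step_request = {
    "goal": "Configure my Git SSH keys",
    "screenshot_b64": "<base64-encoded PNG of the shared screen>",
}

# Hypothetical response: exactly one instruction, never a full list.
step_response = {
    "instruction": "Open a terminal and run ssh-keygen",
    "done": False,
}

# The frontend would serialize the request and render only `instruction`.
payload = json.dumps(step_request)
```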
- Node.js 18+
- Python 3.10+
- pnpm (or npm/yarn)
Clone the repo and install dependencies:
```bash
git clone https://github.com/r-muresan/screen.vision.git
cd screen.vision

# Frontend
pnpm install

# Backend
pip install -r requirements.txt
```

Create a `.env.local` file in the root directory:

```
# Required - powers the main step-by-step logic
OPENAI_API_KEY=sk-...

# Required - used for verification and coordinate detection (Qwen models)
OPENROUTER_API_KEY=sk-or-...
```

The app uses OpenAI for primary reasoning and OpenRouter to access Qwen-VL models for specific tasks like step verification. You can swap these out by modifying `api/index.py` if you prefer different providers.
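Swapping providers presumably comes down to where the backend reads these variables and constructs its clients. A hedged sketch of that pattern — the variable names come from this README, but the helper itself is illustrative and not taken from `api/index.py`:

```python
import os

def get_provider_config():
    """Read the API keys the backend expects and fail fast if one is
    missing. Illustrative helper only, not the project's actual code."""
    config = {
        "openai_api_key": os.environ.get("OPENAI_API_KEY"),
        "openrouter_api_key": os.environ.get("OPENROUTER_API_KEY"),
    }
    missing = [name for name, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing required keys: {', '.join(missing)}")
    return config
```

Failing fast at startup is generally preferable to discovering a missing key on the first AI call mid-session.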
Start both the frontend and backend with a single command:
```bash
npm run dev
```

This runs:

- Next.js dev server on `http://localhost:3000`
- FastAPI server on `http://localhost:8000`
Open your browser to http://localhost:3000 and you're good to go.
For production deployments:
```bash
# Build the frontend
npm run build

# Start the frontend
npm run start

# Run the API separately
uvicorn api.index:app --host 0.0.0.0 --port 8000
```

Or use the included Procfile for platforms like Railway or Heroku.
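The Procfile's contents aren't shown in this README; for a Next.js + FastAPI split like this one, it would typically declare a process per service, along these lines (an illustrative guess, not the repo's actual file — check the Procfile in the repo for the real process definitions):

```
web: npm run start
api: uvicorn api.index:app --host 0.0.0.0 --port $PORT
```

Note that platforms differ in how they treat non-`web` process types, so the second entry may need adapting to your host.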
