Project
Local Media Processor API
Flask API wrapper combining yt-dlp for video extraction and Whisper for transcription, replacing expensive third-party services.
The Problem
Third-party transcription APIs are expensive. Services like Deepgram, AssemblyAI, or AWS Transcribe charge per minute of audio. For heavy users processing hours of content daily, costs balloon quickly.
Additionally, these services require sending your data to external servers, which isn't ideal for sensitive content.
The Solution
A self-hosted Flask API that provides video downloading and transcription as a service. It combines two powerful open-source tools:
- yt-dlp - Downloads videos from 1000+ sites
- Whisper - OpenAI's speech recognition model (runs locally)
The API accepts video URLs, downloads them, extracts audio, transcribes with Whisper, and returns timestamped transcripts—all locally, no external APIs required.
Architecture
API Endpoints
- POST /download - Downloads video, returns file path
- POST /transcribe - Transcribes local audio file
- POST /process - Combined download + transcribe
- GET /status/{job_id} - Check processing status
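To make the request/response flow concrete, here is a minimal Flask sketch of the combined endpoint and the status endpoint. It is an illustration, not the project's actual code: the `url` request field, the JSON response shape, and the in-memory `JOBS` table are assumptions standing in for the real queue and database.

```python
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory job table, used here only to keep the sketch
# self-contained; a real deployment would hand jobs to a queue instead.
JOBS = {}


def create_job(kind, payload):
    """Register a job as queued and return its ID."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"kind": kind, "payload": payload,
                    "status": "queued", "result": None}
    return job_id


@app.route("/process", methods=["POST"])
def process():
    body = request.get_json(silent=True) or {}
    url = body.get("url")  # assumed field name
    if not url:
        return jsonify(error="missing 'url'"), 400
    job_id = create_job("process", {"url": url})
    # 202 Accepted: the work is done later by a background worker.
    return jsonify(job_id=job_id, status="queued"), 202


@app.route("/status/<job_id>", methods=["GET"])
def status(job_id):
    job = JOBS.get(job_id)
    if job is None:
        return jsonify(error="unknown job"), 404
    return jsonify(job_id=job_id, status=job["status"], result=job["result"])
```

A client would POST a JSON body such as `{"url": "https://example.com/video"}` to `/process`, keep the returned `job_id`, and poll `/status/<job_id>` until the status changes.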
Processing Pipeline
Jobs are queued using Redis and processed asynchronously by Celery workers. This allows the API to handle multiple concurrent requests without blocking.
- Client submits video URL
- API creates job, returns job ID
- Worker downloads video with yt-dlp
- ffmpeg extracts audio to WAV format
- Whisper transcribes audio (GPU-accelerated if available)
- Transcript stored in PostgreSQL with timestamps
- Client polls status endpoint for results
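The worker's download, extract, and transcribe steps might look roughly like the sketch below. The function names and the `downloads` directory are hypothetical; the yt-dlp and Whisper calls use those libraries' Python APIs, and the ffmpeg flags produce the mono 16 kHz WAV that Whisper resamples to internally. Queueing and database storage are left out.

```python
import subprocess
from pathlib import Path


def download_video(url, out_dir):
    """Download a video with yt-dlp and return the local file path."""
    import yt_dlp  # imported lazily so the module loads without it

    opts = {"outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"), "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        # Note: if yt-dlp merges formats, the final extension can differ.
        return ydl.prepare_filename(info)


def ffmpeg_wav_command(video_path, wav_path):
    """Build the ffmpeg call: drop video, mono, 16 kHz, 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", str(video_path), "-vn",
            "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", str(wav_path)]


def extract_audio(video_path):
    """Run ffmpeg and return the path of the extracted WAV file."""
    wav_path = Path(video_path).with_suffix(".wav")
    subprocess.run(ffmpeg_wav_command(video_path, wav_path), check=True)
    return wav_path


def transcribe(wav_path, model_name="medium"):
    """Transcribe with Whisper and return timestamped segments."""
    import whisper  # the openai-whisper package

    model = whisper.load_model(model_name)  # picks CUDA when available
    result = model.transcribe(str(wav_path))
    return [{"start": s["start"], "end": s["end"], "text": s["text"].strip()}
            for s in result["segments"]]


def process(url, work_dir="downloads"):
    """Download, extract audio, and transcribe in sequence."""
    video = download_video(url, work_dir)
    wav = extract_audio(video)
    return transcribe(wav)
```

In a queued setup, `process` would be the body of the background task, with its return value written to the database for the status endpoint to serve.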
Technical Details
Model Selection
Whisper offers multiple model sizes (tiny, base, small, medium, large). I use the medium model: it's accurate enough for most content while being fast enough to keep pace with real-time audio on consumer GPUs.
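For other hardware, the choice can be automated. The sketch below picks the largest model that fits in the available GPU memory, using the approximate VRAM figures from the Whisper README; `pick_model` is a hypothetical helper, and the figures are rough guides rather than hard limits.

```python
# Approximate VRAM needs in GB, per the Whisper README (rough guides only).
MODEL_VRAM_GB = [
    ("large", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(available_vram_gb):
    """Return the largest model whose approximate VRAM need fits."""
    for name, needed in MODEL_VRAM_GB:
        if available_vram_gb >= needed:
            return name
    # Fall back to the smallest model (for example, when running on CPU).
    return "tiny"
```

With this table, an 8 GB consumer card lands on the medium model, which matches the choice described above.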
Performance Optimizations
- GPU acceleration via CUDA (roughly 10x faster transcription than on CPU)
- Batch processing for multiple files
- Audio preprocessing to remove silence
- Caching of previously processed videos
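The caching step can be sketched as a lookup keyed on the video URL and model name, so a repeat request skips both the download and the transcription. This is one possible approach, assuming a JSON file cache on disk; `cache_key`, `get_or_transcribe`, and the `transcribe_fn` callback are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


def cache_key(url, model_name):
    """Derive a stable key from the model name and the URL."""
    raw = f"{model_name}|{url.strip()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def get_or_transcribe(url, model_name, cache_dir, transcribe_fn):
    """Return a cached transcript, or compute and store a new one."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(url, model_name)}.json"
    if path.exists():
        # Cache hit: no download or transcription needed.
        return json.loads(path.read_text(encoding="utf-8"))
    result = transcribe_fn(url)  # must return JSON-serializable data
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Including the model name in the key keeps transcripts from different model sizes separate.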
Cost Comparison
Commercial API (e.g., Deepgram):
- $0.0125/minute
- 100 hours/month = 6,000 minutes = $75/month
Self-Hosted Solution:
- VPS with GPU: $50/month
- No per-minute fees; volume limited only by hardware throughput
- Full data control
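Working the per-minute rate through makes the comparison easy to rerun for other volumes. The sketch below uses only the two prices quoted in this section; the function names are hypothetical.

```python
API_RATE_PER_MINUTE = 0.0125   # USD per audio minute, as quoted above
SERVER_COST_PER_MONTH = 50.0   # USD per month for the GPU VPS, as quoted above


def api_cost(hours_per_month, rate_per_minute=API_RATE_PER_MINUTE):
    """Monthly commercial API cost for a given number of audio hours."""
    return hours_per_month * 60 * rate_per_minute


def break_even_hours(server_cost=SERVER_COST_PER_MONTH,
                     rate_per_minute=API_RATE_PER_MINUTE):
    """Audio hours per month at which the API costs as much as the server."""
    return server_cost / (rate_per_minute * 60)


print(f"API cost for 100 h/month: ${api_cost(100):.2f}")    # $75.00
print(f"Break-even volume: {break_even_hours():.1f} h/month")  # 66.7
```

Above roughly 67 hours of audio per month, the flat-rate server is the cheaper option at these prices, and the gap grows linearly with volume.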
Use Cases
I use this API in several automation workflows:
- Transcribing podcast episodes for searchable archives
- Processing educational videos for study notes
- Generating subtitles for social media content
- Extracting quotes from conference talks
The key advantage: pay a flat monthly rate for infrastructure and process as much content as the hardware can handle.