Local Media Processor API

Tags: Flask, yt-dlp, Whisper, Python

A Flask API wrapper that combines yt-dlp for video extraction with Whisper for transcription, built as a self-hosted replacement for expensive third-party transcription services.

The Problem

Third-party transcription APIs are expensive. Services like Deepgram, AssemblyAI, or AWS Transcribe charge per minute of audio. For heavy users processing hours of content daily, costs balloon quickly.

Additionally, these services require sending your data to external servers, which isn't ideal for sensitive content.

The Solution

A self-hosted Flask API that provides video downloading and transcription as a service. It combines two powerful open-source tools:

yt-dlp - Downloads videos from 1000+ sites
Whisper - OpenAI's speech recognition model (runs locally)

The API accepts video URLs, downloads them, extracts the audio, transcribes it with Whisper, and returns timestamped transcripts. Everything runs locally, with no external APIs required.

Architecture

API Endpoints

POST /download - Downloads video, returns file path
POST /transcribe - Transcribes local audio file
POST /process - Combined download + transcribe
GET /status/{job_id} - Check processing status
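
As a rough illustration of how the job-based endpoints could be wired up, here is a minimal sketch covering only /process and /status. The in-memory JOBS dict, the helper names, and the "url" request field are assumptions made for the example; they are not taken from the actual implementation, which keeps its state in Redis and PostgreSQL.

```python
import uuid

# In-memory job table: a hypothetical stand-in for the Redis/PostgreSQL
# state described in this write-up, used here for illustration only.
JOBS = {}


def create_job(url):
    """Register a new job for `url` and return its ID."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "url": url, "transcript": None}
    return job_id


def job_status(job_id):
    """Return the stored job record, or None if the ID is unknown."""
    return JOBS.get(job_id)


def create_app():
    """Build the Flask app. Flask is imported lazily so the helpers
    above can be used without it installed."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.post("/process")
    def process():
        payload = request.get_json(silent=True) or {}
        url = payload.get("url")  # assumed request field name
        if not url:
            return jsonify(error="missing 'url'"), 400
        # A real deployment would enqueue a worker task here.
        return jsonify(job_id=create_job(url)), 202

    @app.get("/status/<job_id>")
    def status(job_id):
        job = job_status(job_id)
        if job is None:
            return jsonify(error="unknown job"), 404
        return jsonify(job)

    return app
```

Returning 202 with a job ID, rather than blocking until the transcript is ready, is what lets the client poll the status endpoint later.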

Processing Pipeline

Jobs are queued using Redis and processed asynchronously by Celery workers. This allows the API to handle multiple concurrent requests without blocking.

Client submits video URL
API creates job, returns job ID
Worker downloads video with yt-dlp
ffmpeg extracts audio to WAV format
Whisper transcribes audio (GPU-accelerated if available)
Transcript stored in PostgreSQL with timestamps
Client polls status endpoint for results
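
The download, audio extraction, and transcription steps above can be sketched roughly as follows. Function names, the work directory, and the 16 kHz mono output settings are assumptions for the example (Whisper resamples audio to 16 kHz internally, so extracting at that rate is a common choice). Queue wiring and PostgreSQL storage are left out.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst):
    """ffmpeg arguments to extract 16 kHz mono 16-bit PCM WAV audio."""
    return [
        "ffmpeg", "-y",        # overwrite dst if it already exists
        "-i", str(src),
        "-vn",                 # drop the video stream
        "-ac", "1",            # mono
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit PCM
        str(dst),
    ]


def download_media(url, out_dir):
    """Download `url` with yt-dlp and return the local file path."""
    import yt_dlp  # lazy import: only the worker process needs it

    opts = {
        "format": "bestaudio/best",
        "outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"),
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return Path(ydl.prepare_filename(info))


def transcribe(wav_path, model_name="medium"):
    """Run Whisper and return a list of (start, end, text) segments."""
    import whisper  # the openai-whisper package

    # Loading the model on every call is slow; a long-running worker
    # would normally load it once and reuse it.
    model = whisper.load_model(model_name)
    result = model.transcribe(str(wav_path))
    return [(s["start"], s["end"], s["text"].strip())
            for s in result["segments"]]


def process_video(url, work_dir="/tmp/media"):
    """Download, extract audio, and transcribe one URL."""
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    media = download_media(url, work_dir)
    wav = media.with_name(media.stem + ".16k.wav")
    subprocess.run(build_ffmpeg_cmd(media, wav), check=True)
    return transcribe(wav)
```

In a setup like the one described here, process_video would be registered as a Celery task so that the API process never runs it directly.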

Technical Details

Model Selection

Whisper offers multiple model sizes (tiny, base, small, medium, large). I use the medium model: it is accurate enough for most content while being fast enough to keep up with real-time audio on consumer GPUs.
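
One way to make that choice automatic is to fall back to a smaller model when the GPU has too little memory. The sketch below is an assumption-laden illustration: the VRAM figures are the approximate values listed in the Whisper README, and pick_model and available_vram_gb are hypothetical helpers, not part of Whisper.

```python
# Approximate VRAM requirements in GB, per the Whisper README.
# Ordered from smallest to largest model.
APPROX_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_gb, preferred="medium"):
    """Return `preferred` if it fits, else the largest model that does.

    Falls back to "tiny" when nothing fits (e.g. CPU-only machines).
    """
    needs = dict(APPROX_VRAM_GB)
    if needs[preferred] <= available_gb:
        return preferred
    fitting = [name for name, gb in APPROX_VRAM_GB if gb <= available_gb]
    return fitting[-1] if fitting else "tiny"


def available_vram_gb():
    """Total memory of the first CUDA device in GB, or 0 without CUDA."""
    import torch  # lazy import: only needed on the worker

    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3
```

For example, pick_model(8) keeps the medium model, while pick_model(4) drops to small.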

Performance Optimizations

GPU acceleration via CUDA (roughly 10x faster transcription than on CPU)
Batch processing for multiple files
Audio preprocessing to remove silence
Caching of previously processed videos
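
The caching idea can be illustrated with a small file-based sketch. The key scheme (SHA-256 of model name plus URL) and the JSON-on-disk layout are assumptions for the example; the actual system may well cache in Redis or PostgreSQL instead.

```python
import hashlib
import json
from pathlib import Path


def cache_key(url, model_name="medium"):
    """Stable key for a (model, URL) pair.

    The model name is part of the key because different Whisper models
    produce different transcripts for the same audio.
    """
    raw = f"{model_name}|{url.strip()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def load_cached(cache_dir, url, model_name="medium"):
    """Return the cached transcript for `url`, or None on a miss."""
    path = Path(cache_dir) / f"{cache_key(url, model_name)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return None


def store_cached(cache_dir, url, transcript, model_name="medium"):
    """Write `transcript` (any JSON-serializable value) to the cache."""
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    path = Path(cache_dir) / f"{cache_key(url, model_name)}.json"
    path.write_text(json.dumps(transcript), encoding="utf-8")
    return path
```

A worker would call load_cached before downloading anything and skip the whole pipeline on a hit.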

Cost Comparison

Commercial API (e.g., Deepgram):

$0.0125/minute
100 hours/month = 6,000 minutes = $75/month

Self-Hosted Solution:

VPS with GPU: $50/month
No per-minute fees; volume limited only by hardware throughput
Full data control
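
The comparison is easy to check with a few lines of arithmetic, using the $0.0125/minute rate and the $50/month server cost quoted in this section. The function names are illustrative only.

```python
def api_cost(hours, rate_per_minute=0.0125):
    """Monthly cost in dollars of a per-minute API for `hours` of audio."""
    return hours * 60 * rate_per_minute


def break_even_hours(monthly_server_cost=50.0, rate_per_minute=0.0125):
    """Hours of audio per month at which the flat-rate server
    costs the same as the per-minute API."""
    return monthly_server_cost / rate_per_minute / 60


if __name__ == "__main__":
    for hours in (50, 100, 500, 1000):
        print(f"{hours:>5} h/month: ${api_cost(hours):,.2f}")
    print(f"break-even: {break_even_hours():.1f} h/month")
```

At these prices, 100 hours of audio costs $75 through the API, and the $50 server breaks even at roughly 67 hours per month. Above that volume the self-hosted option is cheaper, and the gap widens linearly with usage.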

Use Cases

I use this API in several automation workflows:

Transcribing podcast episodes for searchable archives
Processing educational videos for study notes
Generating subtitles for social media content
Extracting quotes from conference talks
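
For the subtitle use case, timestamped segments map directly onto the SRT format. The sketch below assumes segments arrive as (start, end, text) tuples with times in seconds, which is an assumption about the transcript shape, not a documented output format of this API.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) segments as an SRT document.

    Each cue is: index, time range, text; cues are separated by a
    blank line.
    """
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(blocks)
```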

The key advantage: pay a flat monthly rate for infrastructure and process as much content as the hardware can handle.
