Symbolic Capital

Local Media Processor API

Tags: Flask, yt-dlp, Whisper, Python

A Flask API wrapper that combines yt-dlp for video extraction with Whisper for transcription, replacing per-minute third-party transcription services.

The Problem

Third-party transcription APIs are expensive at volume. Services like Deepgram, AssemblyAI, or AWS Transcribe charge per minute of audio, so for heavy users processing hours of content daily the bill grows in step with every extra hour.

Additionally, these services require sending your data to external servers, which isn't ideal for sensitive content.

The Solution

A self-hosted Flask API that provides video downloading and transcription as a service. It combines two powerful open-source tools:

  • yt-dlp - Downloads videos from 1000+ sites
  • Whisper - OpenAI's speech recognition model (runs locally)

The API accepts video URLs, downloads them, extracts audio, transcribes with Whisper, and returns timestamped transcripts. Everything runs locally, with no external APIs required.

Architecture

API Endpoints

  • POST /download - Downloads video, returns file path
  • POST /transcribe - Transcribes local audio file
  • POST /process - Combined download + transcribe
  • GET /status/{job_id} - Check processing status
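
A minimal sketch of how the job-based endpoints might fit together, assuming Flask 2.x. The in-memory `JOBS` dict, the `submit_job` and `job_status` helpers, and the JSON field names (`url`, `job_id`, `status`, `transcript`) are illustrative assumptions rather than the project's actual code; the dict only stands in for the Redis-backed queue. Only /process and /status are shown.

```python
import uuid

# Hypothetical in-memory job store, used here purely for illustration.
# The service described in this write-up queues jobs through Redis instead.
JOBS = {}


def submit_job(url):
    """Register a new job for `url` and return its ID."""
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) URL")
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"url": url, "status": "queued", "transcript": None}
    return job_id


def job_status(job_id):
    """Return the stored job record, or None if the ID is unknown."""
    return JOBS.get(job_id)


def create_app():
    """Build the Flask app. Flask is imported lazily so the helpers above
    can be exercised without it installed."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.post("/process")
    def process():
        payload = request.get_json(silent=True)
        url = payload.get("url") if isinstance(payload, dict) else None
        try:
            job_id = submit_job(url)
        except ValueError as exc:
            return jsonify(error=str(exc)), 400
        # 202 Accepted: the work itself happens in the background.
        return jsonify(job_id=job_id), 202

    @app.get("/status/<job_id>")
    def status(job_id):
        job = job_status(job_id)
        if job is None:
            return jsonify(error="unknown job"), 404
        return jsonify(job)

    return app
```

Returning 202 with a job ID keeps the request short; the client then polls the status route until the job record reports completion.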

Processing Pipeline

Jobs are queued using Redis and processed asynchronously by Celery workers. This allows the API to handle multiple concurrent requests without blocking.

  1. Client submits video URL
  2. API creates job, returns job ID
  3. Worker downloads video with yt-dlp
  4. ffmpeg extracts audio to WAV format
  5. Whisper transcribes audio (GPU-accelerated if available)
  6. Transcript stored in PostgreSQL with timestamps
  7. Client polls status endpoint for results
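
Steps 3 to 5 above can be sketched roughly as follows, assuming the `yt_dlp` and `openai-whisper` Python packages and an `ffmpeg` binary on the PATH. The function names, the output template, and the 16 kHz mono WAV settings are my assumptions for illustration; the Celery task wiring and the PostgreSQL write are left out.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(video_path, wav_path):
    """Return ffmpeg arguments that extract mono 16 kHz PCM WAV audio."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", str(video_path),
        "-vn",                 # drop the video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # 16 kHz, the rate Whisper resamples to anyway
        "-c:a", "pcm_s16le",   # 16-bit PCM
        str(wav_path),
    ]


def download_video(url, out_dir):
    """Step 3: download with yt-dlp and return the local file path."""
    import yt_dlp  # third-party; imported lazily

    opts = {"outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"), "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        # Note: the final extension can differ if post-processing remuxes.
        return Path(ydl.prepare_filename(info))


def extract_audio(video_path):
    """Step 4: run ffmpeg and return the path of the WAV file."""
    wav_path = Path(video_path).with_suffix(".wav")
    subprocess.run(build_ffmpeg_cmd(video_path, wav_path), check=True)
    return wav_path


def transcribe(wav_path, model_name="medium"):
    """Step 5: transcribe with Whisper and keep only timestamped segments."""
    import whisper  # third-party; imported lazily

    model = whisper.load_model(model_name)
    result = model.transcribe(str(wav_path))
    return [
        {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
        for s in result["segments"]
    ]


def run_pipeline(url, work_dir):
    """Chain download, audio extraction, and transcription for one URL."""
    video = download_video(url, work_dir)
    wav = extract_audio(video)
    return transcribe(wav)
```

In the setup described here, something like `run_pipeline` would be the body of the worker task, with the returned segments written to the database afterwards.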

Technical Details

Model Selection

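Whisper offers multiple model sizes (tiny, base, small, medium, large). I use the medium model: it is accurate enough for most content while staying fast enough for real-time processing on consumer GPUs, and it needs roughly 5 GB of VRAM (per the Whisper README) compared with about 10 GB for large.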
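
A small sketch of how the model size could be chosen from available GPU memory. The VRAM figures are the approximate values listed in the Whisper README; the `pick_model` helper and its fallback behavior are my own assumptions, not part of Whisper.

```python
# Approximate VRAM needs in GB as listed in the Whisper README,
# ordered from the smallest model to the largest.
APPROX_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_vram_gb, preferred="medium"):
    """Return `preferred` if it fits, otherwise the largest model that does.

    Falls back to "tiny" when nothing fits (for example on a CPU-only box).
    """
    sizes = dict(APPROX_VRAM_GB)
    if sizes[preferred] <= available_vram_gb:
        return preferred
    fitting = [name for name, gb in APPROX_VRAM_GB if gb <= available_vram_gb]
    return fitting[-1] if fitting else "tiny"
```

The chosen name can then be passed to `whisper.load_model`, for example `whisper.load_model(pick_model(8))`.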

Performance Optimizations

  • GPU acceleration via CUDA (roughly 10x faster transcription than on CPU)
  • Batch processing for multiple files
  • Audio preprocessing to remove silence
  • Caching of previously processed videos
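
As one possible take on the caching item, here is a sketch that stores transcripts as JSON files keyed by a hash of the URL and model name. The file layout and the `cache_key`, `load_cached`, and `save_cached` names are hypothetical; the real service may cache differently (for example in PostgreSQL or Redis).

```python
import hashlib
import json
from pathlib import Path


def cache_key(url, model_name="medium"):
    """Stable key: the same URL and model always map to the same file."""
    raw = f"{model_name}|{url.strip()}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def load_cached(cache_dir, url, model_name="medium"):
    """Return cached segments for this URL and model, or None on a miss."""
    path = Path(cache_dir) / f"{cache_key(url, model_name)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return None


def save_cached(cache_dir, url, segments, model_name="medium"):
    """Write segments to the cache and return the file path."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(url, model_name)}.json"
    path.write_text(json.dumps(segments), encoding="utf-8")
    return path
```

Including the model name in the key means a transcript produced by a smaller model is not silently reused when a larger one is requested.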

Cost Comparison

Commercial API (e.g., Deepgram):

  • $0.0125/minute
  • 100 hours/month = 6,000 minutes = $75/month

Self-Hosted Solution:

  • VPS with GPU: $50/month
  • No per-minute fees (throughput limited only by the hardware)
  • Full data control
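
The per-minute arithmetic can be checked with a short sketch that uses only the two prices listed above ($0.0125 per minute and $50 per month). The function names are mine, and `Decimal` is used to avoid floating-point rounding on currency values.

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.0125")   # commercial per-minute rate quoted above
SELF_HOSTED_MONTHLY = Decimal("50")   # flat GPU VPS price quoted above


def api_cost(hours_per_month):
    """Monthly commercial API cost in dollars for the given hours of audio."""
    minutes = Decimal(hours_per_month) * 60
    return minutes * RATE_PER_MINUTE


def break_even_hours():
    """Hours of audio per month at which the API bill equals the VPS price."""
    return SELF_HOSTED_MONTHLY / RATE_PER_MINUTE / 60
```

With these figures, `api_cost(100)` comes to 75 dollars, and self-hosting breaks even at about 66.7 hours of audio per month; above that volume the flat-rate server is the cheaper option.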

Use Cases

I use this API in several automation workflows:

  • Transcribing podcast episodes for searchable archives
  • Processing educational videos for study notes
  • Generating subtitles for social media content
  • Extracting quotes from conference talks
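
For the subtitle use case, timestamped segments can be turned into an SRT file with a few lines of code. This sketch assumes each segment is a dict with `start` and `end` in seconds plus `text`, which matches the shape of Whisper's `segments` output; the helper names are my own.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before the millis)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

The resulting string can be written straight to a `.srt` file and loaded by most video editors and players.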

The key advantage: pay a flat monthly rate for infrastructure, then process as much content as the hardware can handle.
