CalHacks Video Analysis Project

A comprehensive video analysis system that combines audio transcription, face recognition, and lip movement detection to generate accurate speaker-attributed transcripts.

Features

🎤 Audio Pipeline

Extract audio from video files
Speaker diarization (identify when different speakers talk)
Speech-to-text transcription with timestamps
Reference-based speaker identification

👤 Video Pipeline

Real-time face detection and recognition
One-shot learning face recognition using dlib
Face database management
Unknown face capture and tracking

🎬 Unified Pipeline (DISCO) (NEW!)

Combines audio and video processing
Audio transcription with speaker diarization
Face detection and recognition in video
Combined output with recognized faces and timestamped transcripts
Automatic database storage - Saves interactions to PostgreSQL
Located in disco/ folder

Quick Start

Installation

# Install unified pipeline dependencies (includes both audio and video)
cd disco
./setup.sh

Or install manually:

cd disco
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Setup

1. Add faces to database: Place images in video/public/images/ (e.g., Neal.png, Jay.png)
2. Configure audio: Create audio/.env with your Hugging Face token
3. Run unified pipeline: Process videos with automatic speaker identification

Usage

Unified Pipeline (Recommended)

cd disco
source venv/bin/activate
python example.py ../public/IMG_4308.mov

This will:

Extract and transcribe audio with speaker diarization
Detect and recognize all faces in the video
Generate combined output with transcript and face information
Save interactions to PostgreSQL database

See disco/README.md for detailed documentation.

View saved interactions:

cd disco
python view_interactions.py

Audio Pipeline Only

cd audio
python audio.py

See audio/readme.md for details.

Video Pipeline Only

cd video
python process_video.py path/to/video.mov

See video/readme.md for details.

Project Structure

calhacks/
├── disco/                       # Unified pipeline (DISCO)
│   ├── unified_pipeline.py      # Main coordinator
│   ├── database.py              # PostgreSQL interaction storage
│   ├── view_interactions.py     # View saved interactions
│   ├── example.py               # Example usage script
│   ├── README.md                # Detailed documentation
│   ├── QUICKSTART.md            # Quick start guide
│   ├── DATABASE.md              # Database integration docs
│   ├── requirements.txt         # Combined dependencies
│   └── setup.sh                 # Setup script
│
├── audio/                       # Audio processing pipeline
│   ├── audio.py                 # Main audio pipeline
│   ├── config.py                # Configuration
│   └── output/                  # Generated transcripts
│
├── video/                       # Video processing pipeline
│   ├── functions.py             # Face recognition utilities
│   ├── process_video.py         # Video processing script
│   ├── public/images/           # Face database images
│   └── encodings/               # Face encodings database
│
└── public/                      # Test videos and reference audio
    ├── IMG_4308.mov
    └── neal-voice.m4a

Output Examples

Combined JSON Output

{
  "video_info": {
    "filename": "IMG_4308.mov",
    "duration": 120.5,
    "fps": 30.0
  },
  "recognized_faces": [
    { "name": "Neal", "first_seen": 0.5, "confidence": 0.891 },
    { "name": "Jay", "first_seen": 2.1, "confidence": 0.923 }
  ],
  "transcript": [
    {
      "speaker": "Neal",
      "start": 0.5,
      "end": 3.2,
      "text": "Hello, how are you doing today?"
    }
  ]
}
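
As a rough illustration of consuming this output, the sketch below parses a JSON string of the shape shown above and summarizes it. The function name summarize_output and the per-speaker talk-time summary are illustrative assumptions, not part of the project.

```python
import json


def summarize_output(raw_json: str) -> dict:
    """Summarize a combined-output JSON string (hypothetical helper).

    Returns the recognized face names and the total speaking time per speaker.
    """
    data = json.loads(raw_json)
    faces = [face["name"] for face in data.get("recognized_faces", [])]
    talk_time = {}
    for segment in data.get("transcript", []):
        duration = segment["end"] - segment["start"]
        talk_time[segment["speaker"]] = talk_time.get(segment["speaker"], 0.0) + duration
    return {"faces": faces, "talk_time": talk_time}


# Sample input copied from the combined JSON example above.
sample = """
{
  "video_info": {"filename": "IMG_4308.mov", "duration": 120.5, "fps": 30.0},
  "recognized_faces": [
    {"name": "Neal", "first_seen": 0.5, "confidence": 0.891},
    {"name": "Jay", "first_seen": 2.1, "confidence": 0.923}
  ],
  "transcript": [
    {"speaker": "Neal", "start": 0.5, "end": 3.2,
     "text": "Hello, how are you doing today?"}
  ]
}
"""

summary = summarize_output(sample)
print(summary["faces"])
print(summary["talk_time"])
```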

Combined Text Output

================================================================================
UNIFIED VIDEO TRANSCRIPT
================================================================================

Video: IMG_4308.mov
Duration: 120.50 seconds
Recognized Faces: Neal, Jay

================================================================================

Neal [00:00:00.500 -> 00:00:03.200]
Hello, how are you doing today?

Jay [00:00:03.500 -> 00:00:05.800]
I'm doing great, thanks for asking!
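
The timestamp layout above (HH:MM:SS.mmm) can be produced from float seconds as in the following sketch. The helper names format_timestamp and format_segment are assumptions made for illustration; the project's own formatting code is not shown here.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def format_segment(speaker: str, start: float, end: float, text: str) -> str:
    """Render one transcript entry in the two-line layout shown above."""
    return f"{speaker} [{format_timestamp(start)} -> {format_timestamp(end)}]\n{text}"


print(format_segment("Neal", 0.5, 3.2, "Hello, how are you doing today?"))
```

Working in integer milliseconds avoids rounding surprises when a float such as 3.2 is split into hours, minutes, and seconds.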

How It Works

The unified pipeline processes videos in 5 steps:

1. Video Metadata Extraction - Get FPS, duration, resolution
2. Audio Processing - Transcription + speaker diarization
3. Face Detection - Scan video for recognized faces
4. Speaker Processing - Keep speaker labels from audio diarization
5. Output Generation - Combine transcript with face recognition results
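
The five steps can be pictured as a small coordinator function. This is only a sketch: the real coordinator lives in disco/unified_pipeline.py and its interface is not documented here, so run_pipeline and the three stage callables (probe_video, process_audio, scan_faces) are hypothetical names, with the stages stubbed out so the example runs on its own.

```python
from typing import Callable, Dict, List


def run_pipeline(
    video_path: str,
    probe_video: Callable[[str], Dict],
    process_audio: Callable[[str], List[Dict]],
    scan_faces: Callable[[str], List[Dict]],
) -> Dict:
    """Run the five steps in order and return the combined result (hypothetical)."""
    video_info = probe_video(video_path)        # 1. metadata: FPS, duration, resolution
    segments = process_audio(video_path)        # 2. transcription + diarization
    faces = scan_faces(video_path)              # 3. face detection and recognition
    transcript = [dict(s) for s in segments]    # 4. keep diarization speaker labels as-is
    return {                                    # 5. combine into a single output
        "video_info": video_info,
        "recognized_faces": faces,
        "transcript": transcript,
    }


# Stub stages standing in for the real audio and video pipelines.
result = run_pipeline(
    "IMG_4308.mov",
    probe_video=lambda path: {"filename": path, "duration": 120.5, "fps": 30.0},
    process_audio=lambda path: [
        {"speaker": "Speaker 1", "start": 0.5, "end": 3.2, "text": "Hello"}
    ],
    scan_faces=lambda path: [{"name": "Neal", "first_seen": 0.5, "confidence": 0.891}],
)
print(sorted(result))
```

Passing the stages in as callables keeps each one testable in isolation, which mirrors how the audio and video pipelines can also be run separately.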

Audio Diarization

The system uses pyannote.audio for diarization, together with Faster Whisper for transcription, to:

Identify when different speakers are talking
Label speakers as "Speaker 1", "Speaker 2", etc.
Transcribe speech using Faster Whisper
Provide timestamps for each speaker turn
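
One common way to attach diarization labels to transcribed segments is to give each segment the speaker whose turn overlaps it the most. The sketch below shows that idea on plain dictionaries; it is an assumption about how such a merge can work, not a description of what audio/audio.py does, and assign_speakers is a hypothetical name.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker turn that overlaps it most.

    segments: dicts with "start", "end", "text" (transcription output)
    turns:    dicts with "speaker", "start", "end" (diarization output)
    Segments that overlap no turn are labeled "Unknown".
    """
    labeled = []
    for segment in segments:
        best_label, best_overlap = "Unknown", 0.0
        for turn in turns:
            shared = overlap(segment["start"], segment["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_label, best_overlap = turn["speaker"], shared
        labeled.append({**segment, "speaker": best_label})
    return labeled


turns = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 3.4},
    {"speaker": "Speaker 2", "start": 3.4, "end": 6.0},
]
segments = [
    {"start": 0.5, "end": 3.2, "text": "Hello, how are you doing today?"},
    {"start": 3.5, "end": 5.8, "text": "I'm doing great, thanks for asking!"},
]
labeled = assign_speakers(segments, turns)
for entry in labeled:
    print(entry["speaker"], entry["text"])
```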

Face Recognition

The system scans the video to:

Detect all faces throughout the video
Match faces against the known database (using dlib)
Track when each person first appears
Calculate confidence scores for matches
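
Tracking first appearances can be sketched as a reduction over per-frame matches. The input format (frame index, name, confidence) and the choice to report the highest confidence seen for each person are assumptions made for this example; summarize_faces is a hypothetical helper.

```python
def summarize_faces(detections, fps: float):
    """Collapse per-frame matches into one entry per recognized person.

    detections: iterable of (frame_index, name, confidence) tuples
    fps:        frames per second of the source video
    Returns entries sorted by first appearance, each holding the earliest
    timestamp and the highest confidence observed for that name.
    """
    summary = {}
    for frame_index, name, confidence in detections:
        timestamp = frame_index / fps
        entry = summary.get(name)
        if entry is None:
            summary[name] = {"name": name, "first_seen": timestamp, "confidence": confidence}
        else:
            entry["first_seen"] = min(entry["first_seen"], timestamp)
            entry["confidence"] = max(entry["confidence"], confidence)
    return sorted(summary.values(), key=lambda item: item["first_seen"])


detections = [(63, "Jay", 0.90), (15, "Neal", 0.891), (90, "Jay", 0.923)]
faces = summarize_faces(detections, fps=30.0)
for face in faces:
    print(face)
```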

Requirements

Python 3.8+
OpenCV (video processing)
dlib (face recognition)
PyTorch (audio models)
Faster Whisper (speech recognition)
pyannote.audio (speaker diarization)
Hugging Face account (for diarization models)

Configuration

Face Recognition

Adjust in video/functions.py:

threshold: Face matching sensitivity (default: 0.5)
metrix: Distance metric ("euclidean" or "cosine")
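
To show how the threshold and the distance metric interact, here is a minimal matching sketch in plain Python. It assumes that a smaller distance means a closer match and that a match is accepted when the distance is at or below the threshold. The names best_match and metric are illustrative and the two-dimensional vectors are toy data (dlib face encodings have 128 dimensions); the actual logic in video/functions.py may differ.

```python
import math


def euclidean_distance(a, b) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def best_match(encoding, known, threshold=0.5, metric="euclidean"):
    """Return (name, distance) of the closest known face, or (None, distance)
    when even the closest face is farther away than the threshold."""
    distance_fn = {"euclidean": euclidean_distance, "cosine": cosine_distance}[metric]
    best_name, best_distance = None, float("inf")
    for name, reference in known.items():
        distance = distance_fn(encoding, reference)
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance <= threshold:
        return best_name, best_distance
    return None, best_distance


known = {"Neal": [0.0, 0.0], "Jay": [1.0, 0.0]}  # toy 2-D "encodings"
print(best_match([0.1, 0.0], known))
print(best_match([0.5, 2.0], known))
print(best_match([2.0, 0.0], known, metric="cosine"))
```

Under these assumptions, lowering the threshold makes matching stricter (fewer false matches, more unknown faces), while raising it does the opposite.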

Audio Processing

Adjust in audio/config.py:

WHISPER_MODEL: Model size (tiny, base, small, medium, large)
DIARIZATION_NUM_SPEAKERS: Expected number of speakers
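
A settings loader for these two options might look like the sketch below. The environment-variable override, the "base" default, and the function name load_audio_settings are all assumptions for illustration; audio/config.py may simply define module-level constants.

```python
import os

# Model sizes listed in the configuration notes above.
VALID_WHISPER_MODELS = ("tiny", "base", "small", "medium", "large")


def load_audio_settings(env=None) -> dict:
    """Read and validate audio settings from a mapping (hypothetical helper).

    env defaults to os.environ; pass a plain dict to make testing easy.
    """
    env = os.environ if env is None else env
    model = env.get("WHISPER_MODEL", "base")  # "base" default is an assumption
    if model not in VALID_WHISPER_MODELS:
        raise ValueError(f"Unknown WHISPER_MODEL: {model!r}")
    raw_speakers = env.get("DIARIZATION_NUM_SPEAKERS")
    num_speakers = int(raw_speakers) if raw_speakers else None  # None = not specified
    if num_speakers is not None and num_speakers < 1:
        raise ValueError("DIARIZATION_NUM_SPEAKERS must be at least 1")
    return {"WHISPER_MODEL": model, "DIARIZATION_NUM_SPEAKERS": num_speakers}


settings = load_audio_settings({"WHISPER_MODEL": "small", "DIARIZATION_NUM_SPEAKERS": "2"})
print(settings)
```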

Performance

Processing Speed: 2-5x real-time (depends on hardware)
Accuracy: High for clear videos with visible faces
Memory Usage: 2-4GB RAM for typical videos

Troubleshooting

No faces recognized?

Add images to video/public/images/ folder
Check face visibility in training images
Adjust threshold in video/functions.py

Slow processing?

Use smaller Whisper model
Reduce video resolution
Increase frame sampling interval
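
The effect of the frame sampling interval is easy to see with a little arithmetic: analysing one frame every N seconds instead of every frame cuts the face-recognition workload roughly by a factor of fps × N. The helper below is a hypothetical sketch of that calculation, not the project's code.

```python
def sampled_frame_indices(total_frames: int, fps: float, interval_seconds: float):
    """Indices of the frames to analyse when sampling one frame per interval."""
    step = max(1, int(round(fps * interval_seconds)))  # never skip below 1 frame
    return list(range(0, total_frames, step))


# A 120.5 s clip at 30 fps has 3615 frames.
every_frame = sampled_frame_indices(3615, 30.0, 1 / 30)
half_second = sampled_frame_indices(3615, 30.0, 0.5)
print(len(every_frame), len(half_second))
```

A larger interval speeds up processing but can miss faces that appear only briefly, so the reported first-seen times become coarser.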

Documentation

disco/README.md - Complete unified pipeline documentation
disco/QUICKSTART.md - Quick start guide
disco/DATABASE.md - Database integration guide
audio/readme.md - Audio pipeline details
video/readme.md - Video pipeline details

Credits

Built with:

pyannote.audio - Speaker diarization
Faster Whisper - Speech recognition
dlib - Face recognition
OpenCV - Video processing

License

MIT License - See LICENSE file for details