CalHacks Video Analysis Project
A comprehensive video analysis system that combines audio transcription, face recognition, and lip movement detection to generate accurate speaker-attributed transcripts.
Features
🎤 Audio Pipeline
- Extract audio from video files
- Speaker diarization (identify when different speakers talk)
- Speech-to-text transcription with timestamps
- Reference-based speaker identification
👤 Video Pipeline
- Real-time face detection and recognition
- One-shot learning face recognition using dlib
- Face database management
- Unknown face capture and tracking
🎬 Unified Pipeline (DISCO) (NEW!)
- Combines audio and video processing
- Audio transcription with speaker diarization
- Face detection and recognition in video
- Combined output with recognized faces and timestamped transcripts
- Automatic database storage - Saves interactions to PostgreSQL
- Located in the disco/ folder
Quick Start
Installation
# Install unified pipeline dependencies (includes both audio and video)
cd disco
./setup.sh
Or install manually:
cd disco
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Setup
- Add faces to database: Place images in video/public/images/ (e.g., Neal.png, Jay.png)
- Configure audio: Create audio/.env with your Hugging Face token
- Run unified pipeline: Process videos with automatic speaker identification
Usage
Unified Pipeline (Recommended)
cd disco
source venv/bin/activate
python example.py ../public/IMG_4308.mov
This will:
- Extract and transcribe audio with speaker diarization
- Detect and recognize all faces in the video
- Generate combined output with transcript and face information
- Save interactions to PostgreSQL database
See disco/README.md for detailed documentation.
View saved interactions:
cd disco
python view_interactions.py
Audio Pipeline Only
cd audio
python audio.py
See audio/readme.md for details.
Video Pipeline Only
cd video
python process_video.py path/to/video.mov
See video/readme.md for details.
Project Structure
calhacks/
├── disco/                    # Unified pipeline (DISCO)
│   ├── unified_pipeline.py   # Main coordinator
│   ├── database.py           # PostgreSQL interaction storage
│   ├── view_interactions.py  # View saved interactions
│   ├── example.py            # Example usage script
│   ├── README.md             # Detailed documentation
│   ├── QUICKSTART.md         # Quick start guide
│   ├── DATABASE.md           # Database integration docs
│   ├── requirements.txt      # Combined dependencies
│   └── setup.sh              # Setup script
│
├── audio/                    # Audio processing pipeline
│   ├── audio.py              # Main audio pipeline
│   ├── config.py             # Configuration
│   └── output/               # Generated transcripts
│
├── video/                    # Video processing pipeline
│   ├── functions.py          # Face recognition utilities
│   ├── process_video.py      # Video processing script
│   ├── public/images/        # Face database images
│   └── encodings/            # Face encodings database
│
└── public/                   # Test videos and reference audio
    ├── IMG_4308.mov
    └── neal-voice.m4a
Output Examples
Combined JSON Output
{
"video_info": {
"filename": "IMG_4308.mov",
"duration": 120.5,
"fps": 30.0
},
"recognized_faces": [
{ "name": "Neal", "first_seen": 0.5, "confidence": 0.891 },
{ "name": "Jay", "first_seen": 2.1, "confidence": 0.923 }
],
"transcript": [
{
"speaker": "Neal",
"start": 0.5,
"end": 3.2,
"text": "Hello, how are you doing today?"
}
]
}
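As a rough illustration of consuming this output, the sketch below parses a payload with the structure shown above and totals speaking time per speaker. The `speaking_time` helper and the inline `SAMPLE` string are hypothetical and exist only for this example; they are not part of the project.

```python
import json
from collections import defaultdict

# Inline sample mirroring the JSON structure shown above (illustrative values).
SAMPLE = """
{
  "video_info": {"filename": "IMG_4308.mov", "duration": 120.5, "fps": 30.0},
  "recognized_faces": [
    {"name": "Neal", "first_seen": 0.5, "confidence": 0.891},
    {"name": "Jay", "first_seen": 2.1, "confidence": 0.923}
  ],
  "transcript": [
    {"speaker": "Neal", "start": 0.5, "end": 3.2,
     "text": "Hello, how are you doing today?"}
  ]
}
"""


def speaking_time(result):
    """Return total seconds of speech per speaker label (hypothetical helper)."""
    totals = defaultdict(float)
    for segment in result["transcript"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


result = json.loads(SAMPLE)
totals = speaking_time(result)
face_names = [face["name"] for face in result["recognized_faces"]]
print(totals, face_names)
```

In practice the payload would be read from whichever file the pipeline writes; the output path is not specified here, so the sample is kept inline.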
Combined Text Output
================================================================================
UNIFIED VIDEO TRANSCRIPT
================================================================================
Video: IMG_4308.mov
Duration: 120.50 seconds
Recognized Faces: Neal, Jay
================================================================================
Neal [00:00:00.500 -> 00:00:03.200]
Hello, how are you doing today?
Jay [00:00:03.500 -> 00:00:05.800]
I'm doing great, thanks for asking!
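The timestamp layout in the text output (HH:MM:SS.mmm) can be reproduced from the float seconds stored in the JSON. The following is a minimal sketch; `format_timestamp` and `render_segment` are hypothetical names and may not match how the project formats its output.

```python
def format_timestamp(seconds):
    """Format a duration in seconds as HH:MM:SS.mmm."""
    # Work in integer milliseconds to avoid float drift in the fields.
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def render_segment(segment):
    """Render one transcript entry as a speaker line followed by its text."""
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    return f'{segment["speaker"]} [{start} -> {end}]\n{segment["text"]}'


segment = {
    "speaker": "Neal",
    "start": 0.5,
    "end": 3.2,
    "text": "Hello, how are you doing today?",
}
print(render_segment(segment))
```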
How It Works
The unified pipeline processes videos in 5 steps:
- Video Metadata Extraction - Get FPS, duration, resolution
- Audio Processing - Transcription + speaker diarization
- Face Detection - Scan video for recognized faces
- Speaker Processing - Keep speaker labels from audio diarization
- Output Generation - Combine transcript with face recognition results
Audio Diarization
The audio stage combines pyannote.audio (speaker diarization) and Faster Whisper (transcription) to:
- Identify when different speakers are talking
- Label speakers as "Speaker 1", "Speaker 2", etc.
- Transcribe speech using Faster Whisper
- Provide timestamps for each speaker turn
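How diarization turns and transcribed segments are merged is not described here. One common approach, shown as a sketch under that assumption, gives each transcribed segment the speaker whose turn overlaps it the most. The `assign_speakers` function and the dictionary shapes are hypothetical and do not use the pyannote.audio or Faster Whisper APIs.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Unknown"):
    """Label each transcript segment with the most-overlapping speaker turn."""
    labeled = []
    for seg in segments:
        best_label, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_label, best_overlap = turn["speaker"], shared
        labeled.append({**seg, "speaker": best_label})
    return labeled


# Illustrative inputs: diarization turns and transcribed segments.
turns = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 3.4},
    {"speaker": "Speaker 2", "start": 3.4, "end": 6.0},
]
segments = [
    {"start": 0.5, "end": 3.2, "text": "Hello, how are you doing today?"},
    {"start": 3.5, "end": 5.8, "text": "I'm doing great, thanks for asking!"},
    {"start": 7.0, "end": 8.0, "text": "(no diarized turn covers this)"},
]
labeled = assign_speakers(segments, turns)
for item in labeled:
    print(item["speaker"], item["text"])
```

Segments with no overlapping turn fall back to a placeholder label rather than being dropped, so no transcribed text is lost.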
Face Recognition
The system scans the video to:
- Detect all faces throughout the video
- Match faces against the known database (using dlib)
- Track when each person first appears
- Calculate confidence scores for matches
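The tracking step can be sketched as a reduction over per-frame matches. The example below assumes matches arrive as `(frame_index, name, confidence)` tuples and that the highest confidence seen is the one reported; both are assumptions, and `summarize_faces` is a hypothetical helper rather than a function from video/functions.py.

```python
def summarize_faces(frame_matches, fps):
    """Collapse per-frame matches into one record per recognized person."""
    summary = {}
    for frame_index, name, confidence in frame_matches:
        timestamp = frame_index / fps  # convert frame number to seconds
        entry = summary.get(name)
        if entry is None:
            summary[name] = {
                "name": name,
                "first_seen": timestamp,
                "confidence": confidence,
            }
        else:
            entry["first_seen"] = min(entry["first_seen"], timestamp)
            # Assumption: report the best confidence observed for this person.
            entry["confidence"] = max(entry["confidence"], confidence)
    return sorted(summary.values(), key=lambda e: e["first_seen"])


# Illustrative matches from a 30 fps video.
matches = [(15, "Neal", 0.85), (63, "Jay", 0.923), (90, "Neal", 0.891)]
faces = summarize_faces(matches, fps=30.0)
print(faces)
```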
Requirements
- Python 3.8+
- OpenCV (video processing)
- dlib (face recognition)
- PyTorch (audio models)
- Faster Whisper (speech recognition)
- pyannote.audio (speaker diarization)
- Hugging Face account (for diarization models)
Configuration
Face Recognition
Adjust in video/functions.py:
- threshold: Face matching sensitivity (default: 0.5)
- metrix: Distance metric ("euclidean" or "cosine")
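To illustrate how a threshold and a distance metric interact, here is a minimal sketch that treats the threshold as a maximum allowed distance, so a lower value is stricter. That interpretation is an assumption, as are the `best_match` name and its `metric` keyword; the actual signature in video/functions.py may differ. The toy 3-dimensional vectors stand in for real face encodings.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def best_match(encoding, known, threshold=0.5, metric="euclidean"):
    """Return (name, distance) for the closest known face.

    The name is None when the closest distance exceeds the threshold.
    """
    distance_fn = {"euclidean": euclidean_distance, "cosine": cosine_distance}[metric]
    name, distance = min(
        ((n, distance_fn(encoding, e)) for n, e in known.items()),
        key=lambda pair: pair[1],
    )
    return (name, distance) if distance <= threshold else (None, distance)


# Toy encodings for illustration only.
known = {"Neal": [0.1, 0.9, 0.2], "Jay": [0.8, 0.1, 0.5]}
print(best_match([0.12, 0.88, 0.21], known))
print(best_match([5.0, 5.0, 5.0], known))
```

Note that the two metrics live on different scales, so a threshold tuned for one generally needs retuning for the other.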
Audio Processing
Adjust in audio/config.py:
- WHISPER_MODEL: Model size (tiny, base, small, medium, large)
- DIARIZATION_NUM_SPEAKERS: Expected number of speakers
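A small sanity check on these two settings can catch typos before a long run starts. The sketch below is hypothetical: `validate_audio_settings` is not part of audio/config.py, and allowing `None` for the speaker count (meaning "let the diarizer decide") is an assumption.

```python
VALID_WHISPER_MODELS = ("tiny", "base", "small", "medium", "large")


def validate_audio_settings(whisper_model, num_speakers=None):
    """Validate the two documented settings and return them as a dict."""
    if whisper_model not in VALID_WHISPER_MODELS:
        raise ValueError(
            f"WHISPER_MODEL must be one of {VALID_WHISPER_MODELS}, "
            f"got {whisper_model!r}"
        )
    # Assumption: None means the number of speakers is left unspecified.
    if num_speakers is not None and (
        not isinstance(num_speakers, int) or num_speakers < 1
    ):
        raise ValueError("DIARIZATION_NUM_SPEAKERS must be a positive integer or None")
    return {
        "WHISPER_MODEL": whisper_model,
        "DIARIZATION_NUM_SPEAKERS": num_speakers,
    }


settings = validate_audio_settings("small", num_speakers=2)
print(settings)
```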
Performance
- Processing Speed: 2-5x real-time (depends on hardware)
- Accuracy: High for clear videos with visible faces
- Memory Usage: 2-4GB RAM for typical videos
Troubleshooting
No faces recognized?
- Add images to the video/public/images/ folder
- Check face visibility in training images
- Adjust threshold in video/functions.py
Slow processing?
- Use smaller Whisper model
- Reduce video resolution
- Increase frame sampling interval
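To see why a larger sampling interval helps, the sketch below computes which frame indices would be scanned for a given interval in seconds. The `sampled_frame_indices` helper is hypothetical, and the name of the project's actual sampling setting is not given here.

```python
def sampled_frame_indices(total_frames, fps, interval_seconds):
    """Return the frame indices to scan when sampling every interval_seconds."""
    # Always advance by at least one frame, even for very small intervals.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps has 300 frames.
every_half_second = sampled_frame_indices(300, 30.0, 0.5)
every_second = sampled_frame_indices(300, 30.0, 1.0)
print(len(every_half_second), len(every_second))
```

Doubling the interval halves the number of frames that go through face detection, at the cost of possibly missing faces that appear only briefly.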
Documentation
- disco/README.md - Complete unified pipeline documentation
- disco/QUICKSTART.md - Quick start guide
- disco/DATABASE.md - Database integration guide
- audio/readme.md - Audio pipeline details
- video/readme.md - Video pipeline details
Credits
Built with:
- pyannote.audio - Speaker diarization
- Faster Whisper - Speech recognition
- dlib - Face recognition
- OpenCV - Video processing
License
MIT License - See LICENSE file for details
