VeriStream
What I actually built
Imagine pointing an AI compiler at a Twitch stream and getting live verdicts on what's real, what's fake, and why. VeriStream is that pipeline.
TL;DR
- Deepfake scores + fact-check verdicts while the stream is still playing
- Chained compiler: CV → Whisper → LLM → Knowledge Graph
- Built the whole stack: FastAPI, React, Apache Kafka, Apache Spark, Neo4j
| Role | What I shipped | Stack |
|---|---|---|
| Product + ML Engineer | Real-time misinformation scanner + dashboard | FastAPI, PyTorch, Whisper, Groq, Neo4j |
| Distributed Systems | Kafka + Spark streaming backbone | Apache Kafka, ZooKeeper, PySpark |
| Frontend | Analyst-facing console | React, Chart.js, Leaflet |
Why I built this
Election season + AI-generated video chaos = nobody knows what’s legit. Journalists told me their current workflow is “download clip → manually scrub it → Google the claims”. VeriStream short-circuits that by acting like a compiler for live media—tokenizing frames, running inference passes, then linking evidence into a knowledge graph that can be queried instantly.
System Highlights
Dual-path stream compiler
- Path A (FastAPI): low-latency direct mode (chunked FFmpeg capture → async inference → WebSocket push, ~20s end-to-end delay)
- Path B (Spark + Kafka): high-throughput mode that chews through ~1,800 frames/min with micro-batches and writes verdicts back to Kafka.
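Path A's chunk-and-push loop can be sketched with plain asyncio. This is a simplified stand-in, not the production code: the FFmpeg capture and the inference model are stubbed, and `push` plays the role of the WebSocket send.

```python
import asyncio
import json

# Hypothetical sketch of Path A: a capture task feeds fixed-length media
# chunks into a bounded queue, an inference task scores each chunk, and
# results are pushed to subscribers (a WebSocket in the real app).

async def capture_chunks(queue, n_chunks=3):
    """Stand-in for chunked FFmpeg capture of a live stream."""
    for i in range(n_chunks):
        await queue.put({"chunk_id": i, "frames": f"<frames-{i}>"})
    await queue.put(None)  # sentinel: stream ended

async def run_inference(chunk):
    """Stand-in for the async deepfake/claim inference pass."""
    return {"chunk_id": chunk["chunk_id"], "deepfake_score": 0.12}

async def pipeline(push):
    queue = asyncio.Queue(maxsize=8)  # bounded so capture can't outrun inference
    producer = asyncio.create_task(capture_chunks(queue))
    while (chunk := await queue.get()) is not None:
        verdict = await run_inference(chunk)
        await push(json.dumps(verdict))  # in the app: websocket.send_text(...)
    await producer

# Collect pushed verdicts in a list instead of a WebSocket client.
sent = []
async def fake_push(msg):
    sent.append(msg)

asyncio.run(pipeline(fake_push))
```

The bounded queue is the key design choice: if inference falls behind, capture blocks instead of ballooning memory.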
Attention-driven deepfake radar
Fine-tuned DINOv2 ViT produces frame-level probabilities and heatmaps so an analyst can literally see which facial regions look synthetic.
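The overlay step reduces to normalizing per-patch attention scores into a grid. A minimal sketch, assuming the scores are already extracted from the ViT (in the real pipeline they come from DINOv2 attention tensors; here they are just floats):

```python
# Turn flat per-patch attention scores into an overlay-ready heatmap grid.

def attention_heatmap(scores, grid_size):
    """Reshape flat per-patch scores to a grid and min-max normalize to [0, 1]."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero on flat attention
    norm = [(s - lo) / span for s in scores]
    return [norm[r * grid_size:(r + 1) * grid_size] for r in range(grid_size)]

# 2x2 patch grid: the highest-attention patch maps to 1.0, the lowest to 0.0.
heat = attention_heatmap([0.1, 0.9, 0.3, 0.5], grid_size=2)
```

The normalized grid is what gets alpha-blended over the frame so the analyst can see which regions drove the score.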
Multilingual narrative watchdog
Whisper → Groq translation pipeline keeps both the source language and the English transcript so fact-checkers don’t lose nuance.
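Keeping both languages mostly comes down to the record shape. A sketch of that dual-language segment, with the Whisper and Groq calls stubbed (the field names here are illustrative, not the exact schema):

```python
from dataclasses import dataclass

# Sketch of the dual-language transcript record the pipeline keeps.
# In the real system `source_text` comes from Whisper and `english_text`
# from a Groq translation call; both are stubbed here.

@dataclass
class TranscriptSegment:
    start: float          # seconds into the stream
    end: float
    language: str         # detected source language code
    source_text: str      # Whisper output, untranslated
    english_text: str     # translation the fact-checker consumes

def translate_stub(text, language):
    """Stand-in for the Groq Llama3 translation call."""
    return text if language == "en" else f"[en] {text}"

def make_segment(start, end, language, source_text):
    return TranscriptSegment(start, end, language, source_text,
                             translate_stub(source_text, language))

seg = make_segment(12.0, 15.5, "es", "el video es falso")
```

Storing both texts means a fact-checker can always fall back to the source wording when a translation flattens nuance.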
Fact-checking compiler
- Lexical pass: spaCy + regex identify claims worth verifying.
- Evidence pass: Google Fact-Check API + FAISS RAG + Neo4j knowledge graph.
- Synthesis pass: LLM writes a verdict + justification with confidence scores.
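The lexical pass can be illustrated with the regex half alone (the real version also runs spaCy NER). This simplified stand-in flags a sentence as check-worthy if it contains a number, percentage, or reporting verb:

```python
import re

# Simplified lexical pass: keep only sentences likely to contain a
# verifiable claim. Patterns here are illustrative, not the full set.

CLAIM_PATTERNS = [
    re.compile(r"\b\d[\d,.]*\s*%?"),                      # numbers / percentages
    re.compile(r"\b(said|claims?|reported|according to)\b", re.I),
]

def checkworthy_sentences(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if any(p.search(s) for p in CLAIM_PATTERNS)]

claims = checkworthy_sentences(
    "The crowd was huge. Officials reported 40% turnout. Nice weather today."
)
```

Filtering early like this keeps the expensive evidence and synthesis passes off small talk and opinion.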
Knowledge graph memory
Every processed clip plots entities, claims, and verdicts inside Neo4j so repeat misinformation gets flagged faster next time.
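The "flagged faster next time" behavior hinges on deduplicating claims at write time. A sketch of the parameterized Cypher upsert, with hypothetical labels and properties (`Claim`, `Entity`, `seen_count` are illustrative, not the exact schema):

```python
# Build a parameterized Cypher MERGE so a repeated claim increments a
# counter instead of creating a duplicate node. In the app this query
# would run through the neo4j driver.

def claim_upsert_query(claim, entity, verdict, stream_id):
    query = (
        "MERGE (c:Claim {text: $claim}) "
        "ON CREATE SET c.seen_count = 1 "
        "ON MATCH SET c.seen_count = c.seen_count + 1 "
        "MERGE (e:Entity {name: $entity}) "
        "MERGE (e)-[:MENTIONED_IN]->(c) "
        "SET c.last_verdict = $verdict, c.last_stream = $stream_id"
    )
    params = {"claim": claim, "entity": entity,
              "verdict": verdict, "stream_id": stream_id}
    return query, params

query, params = claim_upsert_query(
    "Video shows real event", "CandidateX", "false", "stream-42")
```

A `seen_count` greater than 1 is exactly what a "seen before" badge can key off.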
Tech I Actually Used
| Layer | Notes |
|---|---|
| Data plane | Apache Kafka (5 MB max message size) + Spark 3.5.3 micro-batches (2s trigger, checkpointed) |
| Inference | DINOv2 Vision Transformer, Whisper base, custom BERT political bias classifier |
| LLM | Groq Llama3-8B (translations + verdict synthesis) |
| Storage | Neo4j (graph), FAISS (vector search), temp media store, JSON caches |
| UI | React + WebSockets + Chart.js/Plotly + Leaflet heatmaps |
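The data-plane row above translates roughly into this Structured Streaming setup. It is a configuration sketch, not the exact job: topic names, the broker address, the checkpoint path, and `score_frames` are all placeholders.

```python
from pyspark.sql import SparkSession

# Sketch of the Kafka-in / Kafka-out streaming job: 2-second micro-batch
# trigger, checkpointed sink, as described in the table above.

spark = SparkSession.builder.appName("veristream").getOrCreate()

frames = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "frames")            # placeholder topic
          .load())

verdicts = score_frames(frames)  # placeholder for the inference stage

(verdicts.writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "localhost:9092")
 .option("topic", "verdicts")                       # placeholder topic
 .option("checkpointLocation", "/tmp/veristream-ckpt")
 .trigger(processingTime="2 seconds")
 .start())
```

The checkpoint location is what lets the job resume from the last committed Kafka offsets after a restart.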
Impact / Wins
- 92.4% deepfake accuracy with explainable heatmaps (attention overlays saved straight from PyTorch tensors).
- 15–30s stream latency in direct mode, 2–5s frame-to-verdict in Spark mode.
- Detects 116 emotional trigger patterns + 150+ stereotype templates to score manipulation risk.
- Builds a Neo4j knowledge graph per stream, so repeat misinformation gets a “seen before” badge automatically.
What I Learned (and Shipped)
- Getting Apache Kafka + Spark to play nice with OpenCV frames meant inventing a base64 frame codec and dedupe keys.
- Ran all heavy models as singletons inside FastAPI so I don’t nuke RAM on every request.
- Built a background fact-check buffer: accumulate 30 seconds of transcription, then kick off Groq verdicts without blocking the stream.
- Designing for analysts meant obsessing over small UX touches (heatmap gallery, political bias gauges, manipulation score pills).
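The base64 frame codec from the first bullet can be sketched with the standard library alone. This is a minimal version assuming JPEG-encoded frame bytes (in the real pipeline `cv2.imencode` produces them); the message fields are illustrative.

```python
import base64
import hashlib
import json

# Sketch of the frame codec: OpenCV frames travel through Kafka as
# base64 payloads, with a content-hash dedupe key so identical frames
# can be dropped downstream.

def encode_frame(frame_bytes, stream_id, ts):
    """Wrap JPEG bytes in a JSON-safe Kafka message with a dedupe key."""
    payload = base64.b64encode(frame_bytes).decode("ascii")
    dedupe_key = hashlib.sha256(frame_bytes).hexdigest()[:16]
    return json.dumps({"stream": stream_id, "ts": ts,
                       "key": dedupe_key, "frame": payload})

def decode_frame(message):
    msg = json.loads(message)
    return base64.b64decode(msg["frame"]), msg["key"]

msg = encode_frame(b"\xff\xd8fake-jpeg", "stream-42", 1.25)
frame, key = decode_frame(msg)
```

Hashing the raw bytes (rather than the base64 string) means two identical frames always share a key, regardless of how they were wrapped.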
🔗 Links: Watch the walkthrough | GitHub Repo