Pose-Based Detection: How GuardianAI Sees Without Faces

Stage one of GuardianAI's two-stage pipeline turns each video frame into a list of 17-point skeletons. Everything downstream operates on those skeletons, never on the raw pixels.

Last updated: April 25, 2026

Why pose, not pixels

A traditional video-analytics model runs end-to-end on raw RGB frames. It ingests a person's face, clothing, gait, and surroundings, and learns to fire on patterns of pixels that correlate with violence. That works, but it also creates a system that could be repurposed to identify a specific student. We don't want the system to be capable of that, not merely for regulatory comfort but as a fundamental architectural property.

So we split the pipeline. Stage one — the page you're reading about — maps each frame to 17 (x, y, confidence) tuples per person. Stage two — the spatiotemporal graph classifier — operates only on those numbers. The classifier never sees the underlying pixels. By construction, it cannot identify a person; it can only describe motion.
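
As a sketch of that boundary (the function names are illustrative, not GuardianAI's actual code), the interface between the stages admits keypoint arrays and nothing else:

```python
import numpy as np

def extract_skeletons(frame: np.ndarray) -> np.ndarray:
    """Stage one: one video frame in, bare keypoints out.

    Returns an (N, 17, 3) float32 array: N detected people, 17 COCO
    joints each, channels (x, y, confidence). Pixels stop here.
    """
    raise NotImplementedError  # see the YOLO11n-Pose example below

def classify_window(window: np.ndarray) -> float:
    """Stage two: score a (T, M, V, C) keypoint window for violence.

    The signature admits only keypoint arrays, so this stage cannot
    inspect faces, clothing, or backgrounds even in principle.
    """
    raise NotImplementedError  # the spatiotemporal graph classifier
```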

The model: YOLO11n-Pose

We use Ultralytics' YOLO11n-Pose, the smallest variant of the YOLO11 series with a pose-estimation head. The choice was deliberate:

  • Speed. ~3 ms per frame on an NVIDIA Jetson Orin Nano (8 GB), roughly a tenth of the 33 ms frame budget for real-time analysis at 30 fps.
  • Accuracy on COCO. 50.0 mAP@.50:.95 for the pose task on the COCO keypoints validation split, which is plenty for our downstream task: we don't need pixel-perfect skeletons, we need consistent ones.
  • Open-source license. AGPL-3.0, with a commercial license available from Ultralytics. Compatible with our customer-deployment terms.
  • Multi-person out of the box. A single forward pass detects every person in the frame and estimates each one's keypoints; no person-counting hack, no separate tracker.
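
A minimal inference sketch with the Ultralytics Python API, assuming the published yolo11n-pose.pt weights are available (the helper name and the confidence sorting are ours, not part of the library):

```python
import numpy as np
from ultralytics import YOLO

model = YOLO("yolo11n-pose.pt")  # smallest YOLO11 pose variant

def skeletons_for_frame(frame: np.ndarray) -> np.ndarray:
    """Run pose estimation on one BGR frame.

    Returns an (N, 17, 3) float32 array of (x, y, confidence) per
    detected person, sorted by detection confidence, highest first.
    """
    result = model(frame, verbose=False)[0]
    if result.keypoints is None or len(result.boxes) == 0:
        return np.zeros((0, 17, 3), dtype=np.float32)
    kpts = result.keypoints.data.cpu().numpy()            # (N, 17, 3)
    order = result.boxes.conf.cpu().numpy().argsort()[::-1]
    return kpts[order].astype(np.float32)
```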

The COCO 17-keypoint skeleton

The output of YOLO11n-Pose is the standard COCO Keypoints format: for each detected person we get 17 (x, y, confidence) triples in this order:

  1. nose
  2. left_eye, right_eye
  3. left_ear, right_ear
  4. left_shoulder, right_shoulder
  5. left_elbow, right_elbow
  6. left_wrist, right_wrist
  7. left_hip, right_hip
  8. left_knee, right_knee
  9. left_ankle, right_ankle
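
The same order as code, handy when indexing the 17-joint axis of a keypoint array (the names follow the COCO standard; the constants themselves are ours):

```python
# Standard COCO keypoint order; index into the 17-joint axis.
COCO_KEYPOINTS = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]
KP_INDEX = {name: i for i, name in enumerate(COCO_KEYPOINTS)}

# e.g. the left wrist of person 0 in an (N, 17, 3) array:
# x, y, conf = skeletons[0, KP_INDEX["left_wrist"]]
```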

Per frame, per person, the entire pose is described by 51 floating-point numbers. A 64-frame window for a single fight (64 frames × 2 people × 51 values) involves 6,528 numbers, about 26 kilobytes uncompressed. There is no image data anywhere in that representation, even before encryption.
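
The arithmetic, spelled out:

```python
values_per_skeleton = 17 * 3                       # 51 floats per person per frame
values_per_window = 64 * 2 * values_per_skeleton   # 6,528 floats per window
bytes_uncompressed = values_per_window * 4         # float32: 26,112 bytes (~26 KB)
```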

Multi-person tracking

Violence is by definition multi-person — at least two skeletons need to be interacting. We sort detected persons per frame by detection confidence and keep the top two; the spatiotemporal classifier expects exactly that input shape. When a person enters or leaves the frame mid-window, we pad the missing slot with the last known skeleton (matching how the model was trained).

Person identity across frames doesn't actually need to be solved for our task: we don't care who is fighting, only that two bodies are interacting in an aggressive way. We deliberately do not run a person re-ID model, because re-ID is a tracking signal that can be abused ("follow this person across cameras"), and that would defeat the privacy guarantee.
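
A sketch of that slot-filling logic, reconstructed from the description above (not GuardianAI's actual code):

```python
import numpy as np

def fill_two_slots(detections: np.ndarray, last_known: np.ndarray) -> np.ndarray:
    """Reduce one frame's detections to exactly two skeletons.

    detections: (N, 17, 3) keypoints, already sorted by detection
                confidence, highest first; N may be 0, 1, or more.
    last_known: (2, 17, 3) skeletons carried over from the previous frame.

    Keeps the top two detections; an empty slot is padded with the last
    known skeleton for that slot, matching training-time behavior.
    """
    slots = last_known.copy()      # default: carry both skeletons forward
    n = min(len(detections), 2)
    slots[:n] = detections[:n]     # overwrite with this frame's top detections
    return slots
```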

What goes into the next stage

Stage two consumes a tensor of shape (T=64, M=2, V=17, C=3):

  • T = 64 consecutive frames (~2 s at 30 fps).
  • M = 2 persons.
  • V = 17 joints (the COCO skeleton above).
  • C = 3 channels (x, y, confidence).
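
A sketch of assembling that window from per-frame skeletons (the buffer and helper names are illustrative):

```python
from collections import deque

import numpy as np

WINDOW = 64  # T: frames per window (~2 s at 30 fps)

buffer = deque(maxlen=WINDOW)  # holds the last 64 (2, 17, 3) frames

def push_frame(skeletons: np.ndarray):
    """Append one frame's (2, 17, 3) skeletons; return a full
    (64, 2, 17, 3) window once enough frames have accumulated."""
    buffer.append(skeletons)
    if len(buffer) < WINDOW:
        return None
    window = np.stack(buffer)  # (T, M, V, C) = (64, 2, 17, 3)
    assert window.shape == (WINDOW, 2, 17, 3)
    return window
```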

That tensor is the only thing that crosses the boundary between pose extraction and classification. Read on to see how the spatiotemporal classifier turns that tensor into a violence/no-violence label, or learn how this pipeline is deployed in K-12 schools and on university campuses.