# Pose-Based Detection — How GuardianAI Sees Without Faces

> Stage one of GuardianAI's two-stage pipeline turns each video frame into a list of 17-point skeletons. Everything downstream operates on those skeletons, never on the raw pixels.

## Why pose, not pixels
A traditional video-analytics model runs end-to-end on raw RGB frames. It ingests a person's face, clothing, gait, and surroundings, and learns to fire on patterns of pixels that correlate with violence. That works, and it also creates a system that *could* be repurposed to identify a specific student.

We don't want the system to be capable of that. So we split the pipeline.

Stage one — covered here — maps each frame to **17 (x, y, confidence) tuples per person**. Stage two — the spatiotemporal graph classifier — operates only on those numbers. The classifier never sees the underlying pixels. By construction, it cannot identify a person; it can only describe motion.

## The model: YOLO11n-Pose
Stage one runs Ultralytics' YOLO11n-Pose, the smallest variant of the YOLO11 series with a pose-estimation head. We chose it for four reasons:

- **Speed.** ~3 ms per frame on an NVIDIA Jetson Orin Nano (8 GB). Real-time at 30 fps with significant headroom.
- **Accuracy on COCO.** 50.0 pose mAP@0.50:0.95. Plenty for our downstream task.
- **Open-source license.** AGPL-3.0; commercial licensing available from Ultralytics.
- **Multi-person.** Detection and pose estimation happen in a single forward pass: the model finds every person in the frame and predicts each one's keypoints. No person-counting hack, no separate tracker.
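
As a concrete illustration, here is a minimal inference sketch using the Ultralytics Python API. The weight file name and the frame source are stand-ins, not our deployment configuration:

```python
from ultralytics import YOLO
import cv2

# Load the nano pose model; Ultralytics fetches the weights on first use.
model = YOLO("yolo11n-pose.pt")

# Stand-in for one decoded video frame from the camera feed.
frame = cv2.imread("frame.jpg")

# One forward pass: person boxes plus 17 keypoints per detected person.
results = model(frame, verbose=False)

# keypoints.data is a (num_persons, 17, 3) tensor of (x, y, confidence).
kpts = results[0].keypoints.data
print(kpts.shape)  # e.g. torch.Size([2, 17, 3]) when two people are visible
```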

## The COCO 17-keypoint skeleton
For each detected person, the output is 17 (x, y, confidence) triples, in this order:

1. nose
2. left_eye, right_eye
3. left_ear, right_ear
4. left_shoulder, right_shoulder
5. left_elbow, right_elbow
6. left_wrist, right_wrist
7. left_hip, right_hip
8. left_knee, right_knee
9. left_ankle, right_ankle
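
For reference, here is the same ordering as a zero-indexed list, which is how the joints appear in the output tensor (the constant name is ours; the order itself is the standard COCO keypoint order):

```python
# Standard COCO keypoint order; the index is the row in the 17-row output.
COCO_KEYPOINTS = [
    "nose",            # 0
    "left_eye",        # 1
    "right_eye",       # 2
    "left_ear",        # 3
    "right_ear",       # 4
    "left_shoulder",   # 5
    "right_shoulder",  # 6
    "left_elbow",      # 7
    "right_elbow",     # 8
    "left_wrist",      # 9
    "right_wrist",     # 10
    "left_hip",        # 11
    "right_hip",       # 12
    "left_knee",       # 13
    "right_knee",      # 14
    "left_ankle",      # 15
    "right_ankle",     # 16
]
```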

Per frame, per person, the entire pose is therefore 17 × 3 = 51 floats. A 64-frame window with two tracked persons comes to 64 × 2 × 51 = 6,528 floats, about 26 kB uncompressed as float32. No image data appears anywhere in that representation.

## Multi-person tracking
Violence is multi-person — at least two skeletons need to be interacting. We sort detected persons per frame by detection confidence and keep the top two. When a person enters or leaves the frame mid-window, we pad the missing slot with the last known skeleton (matching how the model was trained).
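
A minimal NumPy sketch of that selection-and-padding rule (function and variable names are illustrative, not our production code):

```python
import numpy as np

M, V, C = 2, 17, 3  # persons, joints, (x, y, confidence) channels

def select_and_pad(kpts, conf, last):
    """Keep the top-M persons by detection confidence; any slot without
    a detection this frame repeats its last known skeleton.

    kpts: (N, V, C) keypoints for the N persons detected this frame.
    conf: (N,) detection confidence per person.
    last: (M, V, C) skeletons carried over from the previous frame.
    """
    out = last.copy()                    # default: pad with last known pose
    top = np.argsort(conf)[::-1][:M]     # indices of the top-M detections
    for slot, idx in enumerate(top):
        out[slot] = kpts[idx]
    return out
```

Slot assignment here is by per-frame confidence rank, so two skeletons may swap slots between frames; as the next paragraph explains, that is acceptable because the task does not depend on stable identities.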

Person identity across frames doesn't actually need to be solved for our task: we don't care *who* is fighting, only *that* two bodies are interacting in an aggressive way. We deliberately do not run a person re-ID model, because re-ID is a tracking signal that can be abused ("follow this person across cameras"), which would defeat the privacy guarantee.

## What goes into the next stage
Stage two consumes a tensor of shape `(T=64, M=2, V=17, C=3)`:

- **T** = 64 consecutive frames (~2 s at 30 fps).
- **M** = 2 persons.
- **V** = 17 joints (the COCO skeleton above).
- **C** = 3 channels (x, y, confidence).

That tensor is the only thing that crosses the boundary between pose extraction and classification.
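
A sketch of how 64 of those per-frame arrays stack into the boundary tensor, with the byte count from the skeleton section falling out of the shapes (pure NumPy, dummy data):

```python
import numpy as np

T, M, V, C = 64, 2, 17, 3

# Stand-in: 64 per-frame (M, V, C) arrays from the padding step above.
frames = [np.zeros((M, V, C), dtype=np.float32) for _ in range(T)]

window = np.stack(frames)            # shape (T, M, V, C)
assert window.shape == (T, M, V, C)

# 64 * 2 * 17 * 3 floats * 4 bytes = 26,112 bytes, the ~26 kB quoted above.
print(window.nbytes)
```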

## Related reading
- [How the spatiotemporal classifier turns this tensor into a violence/no-violence label](https://guardianai.tech/technology/spatiotemporal-graph/index.md)
- [K-12 school deployments](https://guardianai.tech/use-cases/schools/index.md)
- [University campus deployments](https://guardianai.tech/use-cases/campuses/index.md)

---
*Markdown mirror of https://guardianai.tech/technology/pose-detection.*
