# Spatiotemporal Graph Networks — How GuardianAI Classifies Motion

> Stage two of GuardianAI's pipeline. A graph convolutional network treats the human body as a graph, time as a sequence of graphs, and asks: is this motion pattern aggressive?

## The shape of the problem
Stage one emits a tensor of shape `(T=64, M=2, V=17, C=3)`: 64 frames of two people's 17-joint skeletons, each with (x, y, confidence). The job of the second stage: turn that tensor into a single number between 0 and 1 — the probability that this two-second clip contains physical aggression.
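
For concreteness, here is that contract as a minimal PyTorch sketch; the tensor is random noise standing in for real stage-one output, and `model` is a placeholder for the full second stage:

```python
import torch

# One stage-one sample: 64 frames x 2 people x 17 COCO keypoints x (x, y, conf).
skeletons = torch.randn(64, 2, 17, 3)  # (T, M, V, C)

# The entire second stage reduces this to one probability in [0, 1]:
# prob = model(skeletons.unsqueeze(0))  # shape (1,)
```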

The naive approach is to flatten everything and feed it to a feed-forward network. It works poorly because it throws away two structures that matter: the skeleton is a **graph** (joints connected by bones), and time is a **sequence** (each frame depends on the previous one).

## CTR-GCN — Channel-wise Topology Refinement Graph Convolutional Network
Our base architecture is CTR-GCN (ICCV 2021). The core idea: instead of using a fixed adjacency matrix that says "left wrist is connected to left elbow", the network *learns the connection strength per channel*. Different feature channels can pay attention to different bone connections.
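
A simplified sketch of the channel-wise idea, assuming a fixed bone adjacency plus a freely learned per-channel refinement. The real CTR-GCN block additionally infers the refinement from the input features themselves, so treat this as an approximation of the mechanism, not the shipped layer:

```python
import torch
import torch.nn as nn

class ChannelwiseGraphConv(nn.Module):
    """Simplified channel-wise topology refinement (not the full CTR-GCN block).

    A fixed skeleton adjacency A is shared across channels; a learned
    per-channel refinement dA lets each output channel reweight edges.
    """
    def __init__(self, in_ch: int, out_ch: int, A: torch.Tensor):
        super().__init__()
        V = A.shape[0]
        self.register_buffer("A", A)                       # (V, V) fixed bones
        self.dA = nn.Parameter(torch.zeros(out_ch, V, V))  # per-channel refinement
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C_in, T, V) -> (N, C_out, T, V)
        x = self.proj(x)
        adj = self.A.unsqueeze(0) + self.dA                # (C_out, V, V)
        # Aggregate each joint's neighbors with a channel-specific adjacency.
        return torch.einsum("nctv,cvw->nctw", x, adj)
```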

We adapted the original paper's NTU RGB+D skeleton format (25 joints) to the COCO 17-joint format from YOLO11n-Pose, and rewrote the adjacency matrix accordingly. The network has 2.7M parameters, small enough to run on the same Jetson Orin Nano that hosts the pose extractor.
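
For reference, the standard COCO 17-keypoint edge list and a plain adjacency builder; GuardianAI's production matrix (e.g., exactly how it links head to torso) may differ in detail:

```python
import torch

# COCO 17-keypoint skeleton, standard index order:
# 0 nose, 1-2 eyes, 3-4 ears, 5-6 shoulders, 7-8 elbows,
# 9-10 wrists, 11-12 hips, 13-14 knees, 15-16 ankles.
COCO_EDGES = [
    (0, 1), (0, 2), (1, 3), (2, 4),           # head
    (5, 6), (5, 7), (7, 9), (6, 8), (8, 10),  # arms
    (5, 11), (6, 12), (11, 12),               # torso
    (11, 13), (13, 15), (12, 14), (14, 16),   # legs
]

def build_adjacency(num_joints: int = 17) -> torch.Tensor:
    """Symmetric adjacency with self-loops, the usual GCN input."""
    A = torch.eye(num_joints)
    for i, j in COCO_EDGES:
        A[i, j] = A[j, i] = 1.0
    return A
```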

## Three input streams, one ensemble
A single skeleton tensor is information-rich but representationally narrow. We compute three views and run three classifiers (a sketch of the derivations follows the list):

- **Joint stream.** Raw (x, y, conf) coordinates. Captures absolute body posture.
- **Bone stream.** Vector differences between connected joints (e.g., wrist - elbow). Captures limb orientation, invariant to position in the frame.
- **Velocity stream.** Frame-to-frame deltas. Captures motion magnitude — the difference between "reaching" and "throwing a punch".
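
A minimal sketch of how the bone and velocity views can be derived from the joint tensor. `COCO_EDGES` is the edge list from the adjacency section; the confidence-channel handling here is an assumption:

```python
import torch
from typing import Dict, List, Tuple

def make_streams(joints: torch.Tensor,
                 edges: List[Tuple[int, int]]) -> Dict[str, torch.Tensor]:
    """Derive the bone and velocity views from a (T, M, V, 3) joint tensor."""
    bones = torch.zeros_like(joints)
    bones[..., 2] = joints[..., 2]  # carry keypoint confidence through unchanged
    for parent, child in edges:
        # Limb vector: child joint minus its parent (e.g., wrist - elbow).
        bones[:, :, child, :2] = joints[:, :, child, :2] - joints[:, :, parent, :2]

    velocity = torch.zeros_like(joints)
    velocity[..., 2] = joints[..., 2]
    velocity[1:, ..., :2] = joints[1:, ..., :2] - joints[:-1, ..., :2]  # frame deltas

    return {"joint": joints, "bone": bones, "velocity": velocity}
```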

The three logit outputs are combined with weights `[0.5, 0.3, 0.2]` (tuned on the validation set), then a sigmoid produces the final probability. Each stream uses the same CTR-GCN architecture. The ensemble buys roughly 3 F1 points over the best single stream.
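
In code the fusion is a few lines; the weight-table name is illustrative:

```python
import torch

STREAM_WEIGHTS = {"joint": 0.5, "bone": 0.3, "velocity": 0.2}

def ensemble_probability(logits: dict) -> torch.Tensor:
    """Weighted sum of the per-stream logits, then a sigmoid."""
    fused = sum(STREAM_WEIGHTS[name] * logit for name, logit in logits.items())
    return torch.sigmoid(fused)
```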

## Sliding-window inference
The classifier is a 64-frame model, but real video is longer. For real-time camera streams, we run the model on a 64-frame window with a **32-frame stride**, yielding a fresh prediction every ~1 second. For uploaded clips longer than 2 s, we slide across the whole clip and report the maximum-confidence window plus its timestamp. At 32 fps, a 30-second clip produces 29 windows; taking the maximum across them is robust to the model missing the event in any single window.
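
A sketch of the clip-scanning loop, assuming a `model` callable that maps one `(1, 64, M, V, C)` window to a scalar probability:

```python
import torch

WINDOW, STRIDE = 64, 32  # frames: ~2 s window, ~1 s hop at 32 fps

def scan_clip(skeletons: torch.Tensor, model) -> tuple:
    """Slide a 64-frame window over a whole clip.

    skeletons: (T_total, M, V, C) pose tensor for the clip.
    Returns the peak probability and the start frame of its window.
    """
    best_prob, best_start = 0.0, 0
    for start in range(0, skeletons.shape[0] - WINDOW + 1, STRIDE):
        window = skeletons[start:start + WINDOW].unsqueeze(0)  # add batch dim
        prob = float(model(window))
        if prob > best_prob:
            best_prob, best_start = prob, start
    return best_prob, best_start
```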

## Why this beats end-to-end video CNNs
- **Data hunger.** A 3D-CNN with comparable accuracy needs 10–100× more labeled video. School-CCTV violence datasets are small; skeleton-based models lean on the body-pose prior already captured by the pose extractor, so they need far less violence-specific data.
- **Bias on appearance.** Pixel-based models latch onto skin tone, clothing, lighting, and background. Skeleton-based models cannot — they only see joint coordinates. This is also a fairness property.
- **Privacy.** The classifier never touches pixels, so it is architecturally incapable of being repurposed as a face-recognition or person-identification model.

## Failure modes (honest)
- **Pose-extraction errors propagate.** A mis-localized wrist over several frames creates a phantom punch in the velocity stream. We mitigate with a 0.3 keypoint-confidence floor (see the masking sketch after this list).
- **Two-person assumption.** The shipped model handles 2 interacting bodies. A 5-on-1 brawl is detected (any pair fires) but exact crowd-violence semantics are lost.
- **Crowd occlusion.** Packed cafeterias may show only shoulder-up keypoints. We've trained on partial visibility but precision in those scenarios is ~10 points lower.
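
One plausible implementation of the confidence floor; whether suppressed keypoints are zeroed (as here), interpolated, or carried forward from the previous frame is a deployment detail this sketch does not settle:

```python
import torch

CONF_FLOOR = 0.3  # keypoint-confidence threshold from the failure-mode list

def mask_low_confidence(joints: torch.Tensor) -> torch.Tensor:
    """Zero out keypoints the pose extractor is unsure about, so a jittery
    wrist cannot fabricate motion in the velocity stream.

    joints: (T, M, V, 3) with channel 2 holding per-keypoint confidence.
    """
    keep = (joints[..., 2:3] >= CONF_FLOOR).float()  # (T, M, V, 1) mask
    masked = joints.clone()
    masked[..., :2] *= keep  # suppress (x, y); keep the confidence channel
    return masked
```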

## Continuous improvement loop
Every confirmed detection and false alarm from production feeds a per-customer fine-tuning pipeline. The base model is shared; the last two layers are tuned per site over the first 90 days (a sketch follows). Drift is monitored monthly; a fresh fine-tune is triggered only when precision drops by more than 3 points.
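
A sketch of the per-site freeze, assuming the "last two layers" correspond to the model's final two top-level modules; the actual layer granularity depends on how the network is structured:

```python
import torch.nn as nn

def prepare_for_site_finetune(model: nn.Module, trainable_tail: int = 2) -> nn.Module:
    """Freeze the shared backbone; leave only the last modules trainable.

    trainable_tail=2 mirrors the "last two layers" tuned per site.
    """
    children = list(model.children())
    for module in children[:-trainable_tail]:
        for p in module.parameters():
            p.requires_grad = False  # backbone stays shared across customers
    return model
```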

## Related reading
- [How pose detection works (stage one)](https://guardianai.tech/technology/pose-detection/index.md)
- [K-12 school deployments](https://guardianai.tech/use-cases/schools/index.md)
- [University campus deployments](https://guardianai.tech/use-cases/campuses/index.md)

---
*Markdown mirror of https://guardianai.tech/technology/spatiotemporal-graph.*
