Spatiotemporal Graph Networks: How GuardianAI Classifies Motion

Stage two of GuardianAI's pipeline. A graph convolutional network treats the human body as a graph, time as a sequence of graphs, and asks: is this motion pattern aggressive?

Last updated: April 25, 2026

The shape of the problem

Stage one — covered on the pose-detection page — emits a tensor of shape (T=64, M=2, V=17, C=3): 64 frames of two people's 17-joint skeletons, each with (x, y, confidence). The job of the second stage is to turn that tensor into a single number between 0 and 1: the probability that this two-second clip contains physical aggression.
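
For concreteness, here is a minimal PyTorch sketch of that handoff. The batched (N, C, T, V, M) layout is the usual convention in the ST-GCN family of models; treat the snippet as illustrative rather than our exact internal code:

    import torch

    # Stage-one output: 64 frames, 2 people, 17 COCO joints, (x, y, confidence).
    T, M, V, C = 64, 2, 17, 3
    clip = torch.randn(T, M, V, C)             # stand-in for real pose output

    # The classifier consumes a batched (N, C, T, V, M) tensor.
    x = clip.permute(3, 0, 2, 1).unsqueeze(0)
    print(x.shape)                             # torch.Size([1, 3, 64, 17, 2])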

The naive approach is to flatten everything and feed it to a feed-forward network. It works poorly because it throws away two structures that matter: the skeleton is a graph (joints are connected by bones), and time is a sequence (this frame depends on the previous one). We use a model designed to respect both.

CTR-GCN — Channel-wise Topology Refinement Graph Convolutional Network

Our base architecture is CTR-GCN, an ICCV 2021 paper from the Chinese Academy of Sciences group at the State Key Lab of Pattern Recognition. The core idea: instead of using a fixed adjacency matrix that says "left wrist is connected to left elbow", the network learns the connection strength per channel. Different feature channels can pay attention to different bone connections.
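
The refinement can be sketched in a few lines of PyTorch. This is a simplified reading of the paper (Chen et al., ICCV 2021), not our production module; the relation-channel width, normalization, and names are illustrative:

    import torch
    import torch.nn as nn

    class CTRGraphConv(nn.Module):
        # Simplified channel-wise topology refinement: a shared skeleton
        # adjacency A is refined per output channel by a correlation term
        # inferred from the input features.
        def __init__(self, in_ch, out_ch, A, rel_ch=8):
            super().__init__()
            self.register_buffer("A", A)                # (V, V) physical skeleton
            self.theta = nn.Conv2d(in_ch, rel_ch, 1)    # source-joint embedding
            self.phi = nn.Conv2d(in_ch, rel_ch, 1)      # target-joint embedding
            self.expand = nn.Conv2d(rel_ch, out_ch, 1)  # lift refinement to out_ch
            self.proj = nn.Conv2d(in_ch, out_ch, 1)     # feature transform
            self.alpha = nn.Parameter(torch.zeros(1))   # refinement gate

        def forward(self, x):                   # x: (N, C, T, V)
            a = self.theta(x).mean(dim=2)       # (N, R, V) after temporal pooling
            b = self.phi(x).mean(dim=2)
            corr = torch.tanh(a.unsqueeze(-1) - b.unsqueeze(-2))  # (N, R, V, V)
            topo = self.A + self.alpha * self.expand(corr)  # per-channel adjacency
            feat = self.proj(x)                 # (N, out_ch, T, V)
            return torch.einsum("nkuv,nktv->nktu", topo, feat)

    # e.g. CTRGraphConv(3, 64, torch.eye(17))(torch.randn(8, 3, 64, 17))
    # -> shape (8, 64, 64, 17)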

We adapted the original paper's NTU-RGB+D-skeleton format (25 joints) to the COCO 17-joint format that comes out of YOLO11n-Pose, and rewrote the adjacency matrix accordingly. The network has 2.7M parameters — small enough to run on the same Jetson Orin Nano that hosts the pose extractor.
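
For reference, the COCO-17 skeleton we wire into the model looks like this; the edge list is the standard COCO keypoint skeleton, while the exact normalization in the shipped network is an internal detail, so take this as a structural sketch:

    import torch

    # COCO keypoint order: 0 nose, 1-2 eyes, 3-4 ears, 5-6 shoulders,
    # 7-8 elbows, 9-10 wrists, 11-12 hips, 13-14 knees, 15-16 ankles.
    COCO_EDGES = [
        (0, 1), (0, 2), (1, 3), (2, 4),           # head
        (5, 7), (7, 9), (6, 8), (8, 10),          # arms
        (5, 6), (5, 11), (6, 12), (11, 12),       # torso
        (11, 13), (13, 15), (12, 14), (14, 16),   # legs
    ]

    def coco_adjacency(num_joints=17):
        # Symmetric adjacency with self-loops, row-normalized by degree.
        a = torch.eye(num_joints)
        for i, j in COCO_EDGES:
            a[i, j] = a[j, i] = 1.0
        return a / a.sum(dim=1, keepdim=True)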

Three input streams, one ensemble

A single skeleton tensor is information-rich but representationally narrow. We compute three views of the same data and run three copies of the classifier (a sketch of the derivation follows the list):

  • Joint stream. The raw (x, y, conf) coordinates. Captures absolute body posture.
  • Bone stream. Vector differences between connected joints (e.g., wrist - elbow). Captures limb orientation, invariant to where the person is in the frame.
  • Velocity stream. Frame-to-frame deltas. Captures motion magnitude — the difference between "reaching" and "throwing a punch".
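
Here is a minimal sketch of how the bone and velocity views fall out of the joint tensor. The (child, parent) bone pairing below is illustrative, and the confidence channel is differenced along with (x, y) purely for brevity:

    import torch

    # (child, parent) pairs in COCO order, e.g. (9, 7) = left wrist - left elbow.
    BONE_PAIRS = [(1, 0), (2, 0), (3, 1), (4, 2), (7, 5), (9, 7), (8, 6),
                  (10, 8), (5, 11), (6, 12), (13, 11), (15, 13), (14, 12), (16, 14)]

    def make_streams(joint):
        # joint: (T, M, V, C) from stage one.
        bone = torch.zeros_like(joint)
        for child, parent in BONE_PAIRS:
            bone[:, :, child] = joint[:, :, child] - joint[:, :, parent]
        velocity = torch.zeros_like(joint)
        velocity[1:] = joint[1:] - joint[:-1]   # frame-to-frame deltas
        return joint, bone, velocity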

The three logit outputs are combined with weights [0.5, 0.3, 0.2], tuned on a held-out validation split, and a sigmoid produces the final probability. Each stream uses the same CTR-GCN architecture; the ensemble buys us about +3 F1 points over the best single stream.
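
In code, the fusion is just a weighted sum of logits followed by a sigmoid (a sketch; the three per-stream logits are assumed to come from the three CTR-GCN copies):

    import torch

    STREAM_WEIGHTS = torch.tensor([0.5, 0.3, 0.2])   # joint, bone, velocity

    def ensemble_probability(joint_logit, bone_logit, velocity_logit):
        logits = torch.stack([joint_logit, bone_logit, velocity_logit])
        return torch.sigmoid((STREAM_WEIGHTS * logits).sum())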

Sliding-window inference

The classifier is a 64-frame model, but real video is longer than 64 frames. For real-time camera streams, we run the model on a 64-frame window with a 32-frame stride — a fresh prediction every ~1 second. For uploaded clips longer than 2 s, we step across the whole clip and report the maximum-confidence window plus its timestamp. A 30-second clip at 30 fps yields 14 non-overlapping 64-frame windows; taking the maximum across them makes the verdict robust to the model missing any single window.
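
A sketch of the offline path (classify_window is a stand-in for the three-stream model; the stride and fps defaults match the numbers above):

    def best_window(classify_window, clip, window=64, stride=64, fps=30):
        # clip: (T, M, V, C); returns (max probability, start time in seconds).
        best_p, best_t = 0.0, 0.0
        for start in range(0, clip.shape[0] - window + 1, stride):
            p = float(classify_window(clip[start:start + window]))
            if p > best_p:
                best_p, best_t = p, start / fps
        return best_p, best_t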

Why this beats end-to-end video CNNs

The obvious alternative is a 3D-CNN (I3D, X3D) or a video transformer (TimeSformer, VideoMAE) trained directly on raw video. Those models work, but they have three problems for our setting:

  1. Data hunger. A 3D-CNN with comparable accuracy needs 10–100× more labeled video. The available corpus of school-CCTV violence data is small (UBI-Fights, RWF-2000, the NTU-RGB+D fight subset). A skeleton-based model inherits the body-pose prior already learned by the pose extractor, so it generalizes from a few thousand clips.
  2. Bias on appearance. Pixel-based models latch onto skin tone, clothing, lighting, and background. Skeleton-based models cannot — they see only joint coordinates, which factor out demographic and environmental confounds. This is also a fairness property: the model cannot accidentally learn that one demographic is "more aggressive" because of how that demographic appears in training data.
  3. Privacy. Already covered, but worth restating: the classifier never touches pixels. It is architecturally incapable of functioning as a face-recognition or person-identification model.

Failure modes (honest)

A skeleton-based model has weaknesses too, and we don't hide them:

  • Pose-extraction errors propagate. If YOLO11n-Pose mis-localizes a wrist for several frames in a row, the velocity stream sees a phantom punch. We mitigate with a 0.3 keypoint-confidence floor and by zero-imputing low-confidence joints in the bone/velocity computations (see the sketch after this list).
  • Two-person assumption. The shipped model handles two interacting bodies. A 5-on-1 brawl is still detected (any two-subject pair can fire), but the exact crowd-violence semantics are lost.
  • Crowd occlusion. In a packed cafeteria, only the shoulder-up keypoints may be visible. We've trained on partial visibility, but precision in those scenarios is ~10 points lower than in open spaces. The fix is more diverse training data, which is on the roadmap.
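
The confidence-floor mitigation from the first bullet is a one-liner in tensor code (a sketch over the (T, M, V, C=3) layout from stage one):

    import torch

    CONF_FLOOR = 0.3

    def impute_low_confidence(joint):
        # Zero out (x, y) wherever confidence < CONF_FLOOR; keep the
        # confidence channel so downstream code can still see the mask.
        keep = (joint[..., 2:3] >= CONF_FLOOR).float()
        out = joint.clone()
        out[..., :2] *= keep
        return out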

Continuous improvement loop

Every confirmed and false-alarm event from production deployments is fed back into a per-customer fine-tuning pipeline. The base model is shared; the last two layers are tuned per site over the first 90 days. After the tuning window closes, drift is monitored monthly and a fresh fine-tune is triggered only when precision drops more than 3 points. This keeps the system responsive to local patterns (uniforms, lighting, camera angle) without accumulating customer-specific data centrally.
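
In PyTorch terms, "tune the last two layers" amounts to freezing the shared backbone and building an optimizer over the head alone. The module names below are hypothetical; the real ones depend on the model definition:

    import torch

    def site_finetune_optimizer(model, head_prefixes=("gcn_blocks.9", "fc")):
        # Freeze the shared backbone; train only the named head modules.
        for name, param in model.named_parameters():
            param.requires_grad = name.startswith(head_prefixes)
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.AdamW(trainable, lr=1e-4)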

See the pose-extraction page for stage one, or read about deployments in K-12 schools and on university campuses.