
Experiments

How we trained both models, from dataset preparation to fine-tuning to evaluation. All training ran on Kaggle's free T4 GPU at $0 cost.

Understanding YOLOv12s

YOLOv12s

Attention-Centric Detection

What it is

YOLOv12 is the latest in the YOLO (You Only Look Once) family of real-time object detectors. The "s" variant has 9.3M parameters: small enough for fast inference, yet large enough for accurate detection. The key innovation is the Area Attention (A²) module, which replaces traditional convolution blocks with attention mechanisms.

Why it's good for signatures

Signatures are thin, elongated stroke patterns on cluttered document backgrounds. The attention mechanism can focus on these fine details better than pure convolution, which tends to average over local patches and miss thin lines.
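To make "attention focuses on fine details" concrete, here is a minimal pure-Python sketch of scaled dot-product attention. This is a toy illustration, not the actual A² module: each query pools values from the positions whose keys it matches, so stroke-like features can reinforce each other across the image instead of being averaged away by a local convolution.

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    query: list[float]; keys/values: lists of list[float].
    Returns a weighted sum of `values`, weighted by query-key similarity.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# A query aligned with the first key attends mostly to the first value:
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [0.0]])
```

A² restricts this computation to image areas to keep it fast, but the core idea is the same: weights are computed per query, not fixed like a convolution kernel.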

YOLOv12s specs:
Parameters: 9.3M
COCO mAP@0.5:0.95: 48.0 (vs YOLOv8s: 44.9)
Inference: 2.61ms on T4 GPU
Architecture: R-ELAN backbone + Area Attention + FlashAttention

Dataset

tech4humans/signature-detection (HuggingFace)

1,980 train / 420 validation / 419 test images

Combined from Tobacco800 (scanned business documents) + Roboflow 100 signatures. Apache 2.0 license.
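YOLO trainers expect labels as normalized center-format text lines. Assuming the dataset's annotations come as pixel-space corner boxes (an assumption; check the dataset card for the actual field layout), a converter might look like:

```python
def to_yolo_line(x_min, y_min, x_max, y_max, img_w, img_h, cls=0):
    """Convert a pixel-space corner box to a YOLO label line.

    YOLO labels are "<class> <x_center> <y_center> <width> <height>",
    all normalized to [0, 1] by the image dimensions.
    """
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# e.g. a 200x50 px signature box at (100, 400) on an 850x1100 scan:
line = to_yolo_line(100, 400, 300, 450, 850, 1100)
```

One such line per detected signature goes into a `.txt` file alongside each image.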

Training Process

Phase 1 — Head warm-up (epochs 1-20):
Freeze: backbone (10 layers)
Train: detection head only
LR: 0.01, Optimizer: AdamW
Phase 2 — Full fine-tuning (epochs 21-100):
Freeze: nothing (all layers trainable)
LR: 0.0001 (100x lower — prevent catastrophic forgetting)
Augmentation: rotation ±10°, shear, brightness ±20%, mosaic, blur
GPU: Kaggle T4 (free), ~3-4 hours total
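The two phases above can be captured as hyperparameter dicts. The keys follow Ultralytics train-settings names (`freeze`, `lr0`, `optimizer`, `degrees`, `shear`, `mosaic`, `hsv_v`); the shear and brightness magnitudes are illustrative assumptions, since the text only names the augmentations:

```python
# Phase 1: head warm-up, epochs 1-20.
PHASE_1 = {
    "epochs": 20,
    "freeze": 10,          # freeze the 10 backbone layers
    "lr0": 0.01,
    "optimizer": "AdamW",
}

# Phase 2: full fine-tuning, epochs 21-100.
PHASE_2 = {
    "epochs": 80,
    "freeze": 0,           # all layers trainable
    "lr0": 0.0001,         # 100x lower to avoid catastrophic forgetting
    "optimizer": "AdamW",
    "degrees": 10.0,       # rotation +/-10 degrees
    "shear": 2.0,          # assumed magnitude
    "hsv_v": 0.2,          # brightness jitter (+/-20%, assumed mapping)
    "mosaic": 1.0,
}

# Each phase would then be run as, e.g.:
# model.train(data="signatures.yaml", **PHASE_1)
```

Splitting warm-up from fine-tuning this way lets the randomly initialized head settle before the pretrained backbone weights are touched.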

Results

mAP@0.5: 0.910
mAP@0.5:0.95: 0.533
Precision: 0.916
Recall: 0.884

Phase 1 → Phase 2 improvement: mAP@0.5 improved from 0.846 to 0.910 (a relative +7.6%), and precision jumped from 0.812 to 0.916 (a relative +12.8%). Trained on RunPod RTX 4090 in ~20 minutes total, cost ~$0.15.
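The percentages above are relative changes, not absolute point gains; a quick sketch confirms the arithmetic:

```python
def rel_change(before, after):
    """Relative change between two metric values, in percent."""
    return (after - before) / before * 100

map_gain = rel_change(0.846, 0.910)        # mAP@0.5: +6.4 points absolute
precision_gain = rel_change(0.812, 0.916)  # precision: +10.4 points absolute
# map_gain ~= 7.6, precision_gain ~= 12.8
```

Worth keeping in mind when comparing runs: a +7.6% relative gain here corresponds to only +6.4 absolute mAP points.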