How the Pipeline Works
Three stages transform a document image into a signature verification verdict. Each stage solves a specific part of the problem.
1. Signature Detection
YOLOv12s · ~3ms (GPU) / ~100ms (CPU)
Objective: Locate all signature regions in a document image and return their pixel coordinates as bounding boxes.
How it works
- Input document is resized to 640×640px
- YOLOv12s (attention-centric detector with Area Attention Module) processes the image in a single forward pass
- Non-maximum suppression filters overlapping detections (IoU threshold 0.45)
- Output: bounding boxes with confidence scores for each detected signature
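The non-maximum suppression step above can be sketched as a minimal greedy pass in NumPy (the production NMS runs inside the YOLO inference code; this standalone version is illustrative):

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    # Boxes are [x1, y1, x2, y2]; vectorized IoU of one box against many.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.45) -> list:
    """Greedy NMS: keep the highest-scoring box, drop overlapping rivals, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep
```

With the pipeline's threshold of 0.45, two detections of the same signature that overlap heavily collapse into the single higher-confidence box.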
Why this approach
YOLOv12s reaches 48.0 mAP (IoU 0.50:0.95) on COCO — 3.1 points above YOLOv8s. Its attention mechanism is especially good at picking out thin stroke features like signatures against cluttered document backgrounds.
2. Preprocessing
Pillow + OpenCV · < 5ms
Objective: Normalize the cropped signature to a consistent format the verification model expects.
How it works
- Crop the bounding box region from the original document
- Convert to grayscale (single channel)
- Apply Otsu thresholding to binarize (separate ink from background)
- Crop to content bounding box (remove surrounding whitespace)
- Resize preserving aspect ratio, pad to 220×155px
- Normalize pixel values to [0, 1]
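The steps above can be sketched end-to-end. This is a dependency-free NumPy version — Otsu's threshold implemented by hand and a nearest-neighbour resize — rather than the actual Pillow/OpenCV calls, and it assumes the crop contains at least one ink pixel:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    w0 = sum0 = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def preprocess(gray: np.ndarray, size=(155, 220)) -> np.ndarray:
    """Binarize, crop to ink, resize with aspect ratio, pad to H=155 x W=220, scale to [0, 1]."""
    ink = gray <= otsu_threshold(gray)            # ink = dark pixels
    ys, xs = np.nonzero(ink)
    crop = ink[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    scale = min(size[0] / h, size[1] / w)         # fit inside target, keep aspect
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    rows = (np.arange(new_h) / scale).astype(int)  # nearest-neighbour resize
    cols = (np.arange(new_w) / scale).astype(int)
    resized = crop[rows][:, cols]
    canvas = np.zeros(size, dtype=np.float32)     # pad with background (0)
    y0, x0 = (size[0] - new_h) // 2, (size[1] - new_w) // 2
    canvas[y0:y0 + new_h, x0:x0 + new_w] = resized
    return canvas
```

The output shape `(155, 220)` corresponds to the 220×155 px (W×H) input the verification model expects, with ink as 1.0 and background as 0.0.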
Why this approach
The Siamese network expects fixed-size grayscale input. Otsu thresholding removes background artifacts (paper texture, grid lines) that would confuse the encoder.
3. Signature Verification
SigNet + Projection Head · ~8ms (GPU) / ~50ms (CPU)
Objective: Compare the detected signature against a reference and determine if they were written by the same person.
How it works
- Both signatures pass through the same SigNet encoder (shared weights) — a 5-layer CNN pretrained on signature data
- SigNet outputs 2048-dim features for each signature
- Compute |emb_a - emb_b| — the absolute difference between the two embedding vectors
- A binary classifier head processes the difference: Linear(2048→512→128→1) → sigmoid
- Output probability [0, 1]: >0.5 = GENUINE, <0.5 = FORGED, mapped to confidence levels
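A minimal NumPy sketch of the classifier head described above. The SigNet encoder is omitted — `emb_a` and `emb_b` stand in for its 2048-dim outputs — and the randomly initialized weights are placeholders for the trained parameters:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

class VerificationHead:
    """|emb_a - emb_b| -> Linear(2048->512->128->1) -> sigmoid.

    Weights here are random placeholders for the trained parameters.
    """
    def __init__(self, dims=(2048, 512, 128, 1), seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [(rng.normal(0.0, 0.02, (i, o)), np.zeros(o))
                       for i, o in zip(dims[:-1], dims[1:])]

    def __call__(self, emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
        x = np.abs(emb_a - emb_b)          # element-wise embedding difference
        for k, (w, b) in enumerate(self.layers):
            x = x @ w + b
            if k < len(self.layers) - 1:   # ReLU between hidden layers
                x = np.maximum(x, 0.0)
        return sigmoid(x)                  # genuine probability in (0, 1)
```

Note that with zero biases, an identical pair gives a zero difference vector and therefore an output of exactly 0.5 — the trained head's biases shift this so that close matches score well above the threshold.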
Why this approach
Siamese networks generalize to new signers without retraining — just provide a reference. SigNet's pretrained weights already understand stroke patterns. The binary classifier learns which dimensions of the embedding difference matter for genuine vs forged — stronger than raw cosine similarity for this task.
Technical Decisions
Why YOLOv12s over YOLOv8
YOLOv12s uses Area Attention modules that are better suited to thin, elongated features like signatures. It scores 3.1 mAP points higher than YOLOv8s on COCO at similar inference speed.
Why Siamese over classification
A classifier needs to be retrained for every new signer. A Siamese network compares embeddings — just provide a reference signature and it works on any signer, zero-shot. No retraining needed.
Why binary classification over metric learning
We tested triplet loss and contrastive loss — both plateaued at ~25% EER. SigNet features are classification-oriented, not metric-oriented. A binary classifier on |emb_a - emb_b| learns which dimensions matter for the decision, giving a stronger gradient signal than raw distance comparison.
Why two-phase training
Phase 1 freezes the pretrained encoder and trains only the new layers. Phase 2 fine-tunes everything with lower learning rates. This prevents catastrophic forgetting of learned stroke features.
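The two phases can be sketched in PyTorch. The modules below are toy stand-ins for the real SigNet encoder and classifier head, and the learning rates are illustrative, not the project's actual hyperparameters:

```python
import torch
from torch import nn

# Toy stand-ins for the pretrained SigNet encoder and the new classifier head.
encoder = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2048))
head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                     nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

# Phase 1: freeze the pretrained encoder; only the new layers receive gradients.
for p in encoder.parameters():
    p.requires_grad = False
phase1_opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# ... train the head to convergence ...

# Phase 2: unfreeze everything; the encoder fine-tunes at a much smaller
# learning rate so its learned stroke features are not overwritten.
for p in encoder.parameters():
    p.requires_grad = True
phase2_opt = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-4},
])
```

The per-parameter-group learning rates in phase 2 are what guard against catastrophic forgetting: the encoder moves an order of magnitude more slowly than the head.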
Models Used
| Model | Type | Params | Input | Output |
|---|---|---|---|---|
| YOLOv12s | Object detection | 9.3M | Document (640×640) | Bounding boxes |
| SigNet + Classifier | Binary classification | 16.9M | Signature pair (220×155) | Genuine probability [0, 1] |