ammarinjtk.com

How the Pipeline Works

Three stages transform a document image into a signature verification verdict. Each stage solves a specific part of the problem.

Document Image + Reference Signature
Stage 1: Signature Detection

YOLOv12s · ~3ms (GPU) / ~100ms (CPU)

Objective: Locate all signature regions in a document image and return their pixel coordinates as bounding boxes.

How it works

  1. Input document is resized to 640×640px
  2. YOLOv12s (attention-centric detector with Area Attention Module) processes the image in a single forward pass
  3. Non-maximum suppression filters overlapping detections (IoU threshold 0.45)
  4. Output: bounding boxes with confidence scores for each detected signature
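The NMS step above can be sketched in plain NumPy. This is a minimal greedy implementation for illustration; the actual detector uses its framework's built-in NMS, and the box format here (corner coordinates) is an assumption:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, each (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest confidence first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Drop any remaining box that overlaps the kept box too much
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```

With the 0.45 threshold from step 3, two near-duplicate detections of the same signature collapse to the higher-confidence one, while boxes elsewhere on the page survive.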

Why this approach

YOLOv12s achieves 48.0 mAP on COCO, a 3.1-point gain over YOLOv8s. Its attention mechanism is especially good at detecting thin stroke features like signatures in cluttered document backgrounds.

Example input / output

Input:
Document image (JPEG/PNG, any resolution)
Output:
Bounding boxes: [{x: 120, y: 450, w: 280, h: 90, confidence: 0.94}]
Count: 1 signature detected
Cost: $0 (self-hosted) · Latency: ~3ms (GPU) / ~100ms (CPU)
Stage 2: Preprocessing

Pillow + OpenCV · < 5ms

Objective: Normalize the cropped signature to a consistent format the verification model expects.

How it works

  1. Crop the bounding box region from the original document
  2. Convert to grayscale (single channel)
  3. Apply Otsu thresholding to binarize (separate ink from background)
  4. Crop to content bounding box (remove surrounding whitespace)
  5. Resize preserving aspect ratio, pad to 220×155px
  6. Normalize pixel values to [0, 1]
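Steps 2–6 can be sketched in plain NumPy. This is an illustration only: the real pipeline uses Pillow and OpenCV, and this sketch pads the content crop into the target canvas instead of doing an aspect-ratio-preserving resize:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0 = cum[t - 1] / total          # weight of the "ink" class
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = cum_mean[t - 1] / cum[t - 1]
        mu1 = (cum_mean[-1] - cum_mean[t - 1]) / (total - cum[t - 1])
        var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def preprocess(gray, size=(155, 220)):
    """Binarize (ink=1), crop to content, place on a 155x220 canvas in [0, 1]."""
    t = otsu_threshold(gray)
    ink = (gray < t).astype(np.float32)   # dark strokes -> 1, background -> 0
    ys, xs = np.nonzero(ink)
    ink = ink[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # content crop
    h, w = ink.shape
    out = np.zeros(size, dtype=np.float32)
    out[:h, :w] = ink  # assumes the crop fits; real code resizes first
    return out[None]   # shape (1, 155, 220), matching the model input
```

On a synthetic page (light paper, dark strokes), Otsu lands between the two intensity modes, so paper texture drops out and only the stroke pixels survive into the tensor.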

Why this approach

The Siamese network expects fixed-size grayscale input. Otsu thresholding removes background artifacts (paper texture, grid lines) that would confuse the encoder.

Example input / output

Input:
Cropped signature region (color, variable size)
Output:
Normalized tensor: (1, 155, 220) grayscale
Consistent format regardless of input resolution
Cost: $0 · Latency: < 5ms
Stage 3: Signature Verification

SigNet + Projection Head · ~8ms (GPU) / ~50ms (CPU)

Objective: Compare the detected signature against a reference and determine if they were written by the same person.

How it works

  1. Both signatures pass through the same SigNet encoder (shared weights) — a 5-layer CNN pretrained on signature data
  2. SigNet outputs 2048-dim features for each signature
  3. Compute |emb_a - emb_b| — the absolute difference between the two embedding vectors
  4. A binary classifier head processes the difference: Linear(2048→512→128→1) → sigmoid
  5. Output probability in [0, 1]: ≥ 0.5 = GENUINE, < 0.5 = FORGED, mapped to confidence levels
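The steps above can be illustrated with a NumPy forward pass. The weights here are hypothetical random stand-ins, not the trained head; only the shapes and the |emb_a − emb_b| → Linear(2048→512→128→1) → sigmoid structure mirror the description:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical random weights standing in for the trained classifier head
w1, b1 = rng.normal(size=(2048, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.normal(size=(512, 128)) * 0.01, np.zeros(128)
w3, b3 = rng.normal(size=(128, 1)) * 0.01, np.zeros(1)

def verify(emb_a, emb_b, threshold=0.5):
    """Absolute-difference feature -> 3-layer MLP -> genuine probability."""
    d = np.abs(emb_a - emb_b)        # |emb_a - emb_b|, elementwise
    h = relu(d @ w1 + b1)
    h = relu(h @ w2 + b2)
    p = float(sigmoid(h @ w3 + b3)[0])
    return p, ("GENUINE" if p >= threshold else "FORGED")
```

Note the useful boundary behavior: identical embeddings give a zero difference vector, which (with zero biases) propagates to sigmoid(0) = 0.5, the decision boundary; the trained head's biases shift this in practice.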

Why this approach

Siamese networks generalize to new signers without retraining — just provide a reference. SigNet's pretrained weights already understand stroke patterns. The binary classifier learns which dimensions of the embedding difference matter for genuine vs forged — stronger than raw cosine similarity for this task.

Example input / output

Input:
Two preprocessed signatures: query (from detection) + reference (uploaded)
Output:
Similarity score: 0.82
Verdict: GENUINE
Confidence: HIGH
Cost: $0 (self-hosted) · Latency: ~8ms (GPU) / ~50ms (CPU)

Technical Decisions

Why YOLOv12s over YOLOv8

YOLOv12s uses Area Attention modules that are better at detecting thin, elongated features like signatures. It delivers a 3.1-point mAP improvement over YOLOv8s on COCO at similar inference speed.

Why Siamese over classification

A classifier needs to be retrained for every new signer. A Siamese network compares embeddings — just provide a reference signature and it works on any signer, zero-shot. No retraining needed.

Why binary classification over metric learning

We tested triplet loss and contrastive loss — both plateaued at ~25% EER. SigNet features are classification-oriented, not metric-oriented. A binary classifier on |emb_a - emb_b| learns which dimensions matter for the decision, giving a stronger gradient signal than raw distance comparison.

Why two-phase training

Phase 1 freezes the pretrained encoder and trains only the new layers. Phase 2 fine-tunes everything with lower learning rates. This prevents catastrophic forgetting of learned stroke features.
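The two-phase schedule can be sketched in PyTorch. The modules and learning rates below are small hypothetical stand-ins (a single `Linear` each for the encoder and head), not the real SigNet architecture; only the freeze/unfreeze pattern is the point:

```python
import torch.nn as nn
from torch.optim import Adam

encoder = nn.Linear(2048, 512)  # stand-in for the pretrained SigNet encoder
head = nn.Linear(512, 1)        # stand-in for the new classifier head

# Phase 1: freeze the pretrained encoder, train only the new layers
for p in encoder.parameters():
    p.requires_grad = False
opt_phase1 = Adam(head.parameters(), lr=1e-3)  # hypothetical lr

# Phase 2: unfreeze everything, fine-tune with lower learning rates,
# gentler on the pretrained weights to avoid catastrophic forgetting
for p in encoder.parameters():
    p.requires_grad = True
opt_phase2 = Adam([
    {"params": encoder.parameters(), "lr": 1e-5},  # hypothetical lr
    {"params": head.parameters(), "lr": 1e-4},     # hypothetical lr
])
```

The per-parameter-group learning rates in phase 2 are what keep the pretrained stroke features intact while the head continues to adapt.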

Models Used

Model                Type                   Params  Input                     Output
YOLOv12s             Object detection       9.3M    Document (640×640)        Bounding boxes
SigNet + Classifier  Binary classification  16.9M   Signature pair (220×155)  Genuine probability [0, 1]