ammarinjtk.com

How the Pipeline Works

Three stages transform a document image into a signature verification verdict. Each stage solves a specific part of the problem.

Document Image + Reference Signature
Stage 1: Signature Detection

YOLOv12s · ~3ms (GPU) / ~100ms (CPU)

Objective: Locate all signature regions in a document image and return their pixel coordinates as bounding boxes.

How it works

  1. Input document is resized to 640×640px
  2. YOLOv12s (attention-centric detector with Area Attention Module) processes the image in a single forward pass
  3. Non-maximum suppression filters overlapping detections (IoU threshold 0.45)
  4. Output: bounding boxes with confidence scores for each detected signature
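The NMS step above can be sketched in plain NumPy. This is a minimal greedy implementation for illustration; the actual detector uses its framework's built-in NMS, and the box format here (corner coordinates) is an assumption:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, each (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]  # highest confidence first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Drop any remaining box that overlaps the kept box too much
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```

With the 0.45 threshold from step 3, two near-duplicate detections of the same signature collapse to the higher-confidence one, while boxes elsewhere on the page survive.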

Why this approach

YOLOv12s achieves 48.0 mAP on COCO, a 3.1-point gain over YOLOv8s. Its attention mechanism is especially good at detecting thin stroke features like signatures in cluttered document backgrounds.

Example input / output

Input:
Document image (JPEG/PNG, any resolution)
Output:
Bounding boxes: [{x: 120, y: 450, w: 280, h: 90, confidence: 0.94}]
Count: 1 signature detected
Cost: $0 (self-hosted) · Latency: ~3ms (GPU) / ~100ms (CPU)
Stage 2: Preprocessing

Pillow + OpenCV · < 5ms

Objective: Normalize the cropped signature to a consistent format the verification model expects.

How it works

  1. Crop the bounding box region from the original document
  2. Convert to grayscale (single channel)
  3. Apply Otsu thresholding to binarize (separate ink from background)
  4. Crop to content bounding box (remove surrounding whitespace)
  5. Resize preserving aspect ratio, pad to 220×155px
  6. Normalize pixel values to [0, 1]
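Steps 2–6 can be sketched in plain NumPy. This is an illustration only: the real pipeline uses Pillow and OpenCV, and this sketch pads the content crop into the target canvas instead of doing an aspect-ratio-preserving resize:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0 = cum[t - 1] / total          # weight of the "ink" class
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = cum_mean[t - 1] / cum[t - 1]
        mu1 = (cum_mean[-1] - cum_mean[t - 1]) / (total - cum[t - 1])
        var = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def preprocess(gray, size=(155, 220)):
    """Binarize (ink=1), crop to content, place on a 155x220 canvas in [0, 1]."""
    t = otsu_threshold(gray)
    ink = (gray < t).astype(np.float32)   # dark strokes -> 1, background -> 0
    ys, xs = np.nonzero(ink)
    ink = ink[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # content crop
    h, w = ink.shape
    out = np.zeros(size, dtype=np.float32)
    out[:h, :w] = ink  # assumes the crop fits; real code resizes first
    return out[None]   # shape (1, 155, 220), matching the model input
```

On a synthetic page (light paper, dark strokes), Otsu lands between the two intensity modes, so paper texture drops out and only the stroke pixels survive into the tensor.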

Why this approach

The Siamese network expects fixed-size grayscale input. Otsu thresholding removes background artifacts (paper texture, grid lines) that would confuse the encoder.

Example input / output

Input:
Cropped signature region (color, variable size)
Output:
Normalized tensor: (1, 155, 220) grayscale
Consistent format regardless of input resolution
Cost: $0 · Latency: < 5ms
Stage 3: Signature Verification

SigNet + Projection Head · ~8ms (GPU) / ~50ms (CPU)

Objective: Compare the detected signature against a reference and determine if they were written by the same person.

How it works

  1. Both signatures pass through the same SigNet encoder (shared weights) — a 5-layer CNN pretrained on signature data
  2. SigNet outputs 2048-dim features for each signature
  3. Compute |emb_a - emb_b| — the absolute difference between the two embedding vectors
  4. A binary classifier head processes the difference: Linear(2048→512→128→1) → sigmoid
  5. Output probability in [0, 1]: ≥ 0.5 = GENUINE, < 0.5 = FORGED, mapped to confidence levels
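The steps above can be illustrated with a NumPy forward pass. The weights here are hypothetical random stand-ins, not the trained head; only the shapes and the |emb_a − emb_b| → Linear(2048→512→128→1) → sigmoid structure mirror the description:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical random weights standing in for the trained classifier head
w1, b1 = rng.normal(size=(2048, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.normal(size=(512, 128)) * 0.01, np.zeros(128)
w3, b3 = rng.normal(size=(128, 1)) * 0.01, np.zeros(1)

def verify(emb_a, emb_b, threshold=0.5):
    """Absolute-difference feature -> 3-layer MLP -> genuine probability."""
    d = np.abs(emb_a - emb_b)        # |emb_a - emb_b|, elementwise
    h = relu(d @ w1 + b1)
    h = relu(h @ w2 + b2)
    p = float(sigmoid(h @ w3 + b3)[0])
    return p, ("GENUINE" if p >= threshold else "FORGED")
```

Note the useful boundary behavior: identical embeddings give a zero difference vector, which (with zero biases) propagates to sigmoid(0) = 0.5, the decision boundary; the trained head's biases shift this in practice.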

Why this approach

Siamese networks generalize to new signers without retraining — just provide a reference. SigNet's pretrained weights already understand stroke patterns. The binary classifier learns which dimensions of the embedding difference matter for genuine vs forged — stronger than raw cosine similarity for this task.

Example input / output

Input:
Two preprocessed signatures: query (from detection) + reference (uploaded)
Output:
Similarity score: 0.82
Verdict: GENUINE
Confidence: HIGH
Cost: $0 (self-hosted) · Latency: ~8ms (GPU) / ~50ms (CPU)

Technical Decisions

Why YOLOv12s over YOLOv8

YOLOv12s uses Area Attention modules that are better at detecting thin, elongated features like signatures. It delivers a 3.1-point mAP improvement over YOLOv8s on COCO at similar inference speed.

Why Siamese over classification

A classifier needs to be retrained for every new signer. A Siamese network compares embeddings — just provide a reference signature and it works on any signer, zero-shot. No retraining needed.

Why binary classification over metric learning

We tested triplet loss and contrastive loss — both plateaued at ~25% EER. SigNet features are classification-oriented, not metric-oriented. A binary classifier on |emb_a - emb_b| learns which dimensions matter for the decision, giving a stronger gradient signal than raw distance comparison.

Why two-phase training

Phase 1 freezes the pretrained encoder and trains only the new layers. Phase 2 fine-tunes everything with lower learning rates. This prevents catastrophic forgetting of learned stroke features.
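The two-phase schedule can be sketched in PyTorch. The modules and learning rates below are small hypothetical stand-ins (a single `Linear` each for the encoder and head), not the real SigNet architecture; only the freeze/unfreeze pattern is the point:

```python
import torch.nn as nn
from torch.optim import Adam

encoder = nn.Linear(2048, 512)  # stand-in for the pretrained SigNet encoder
head = nn.Linear(512, 1)        # stand-in for the new classifier head

# Phase 1: freeze the pretrained encoder, train only the new layers
for p in encoder.parameters():
    p.requires_grad = False
opt_phase1 = Adam(head.parameters(), lr=1e-3)  # hypothetical lr

# Phase 2: unfreeze everything, fine-tune with lower learning rates,
# gentler on the pretrained weights to avoid catastrophic forgetting
for p in encoder.parameters():
    p.requires_grad = True
opt_phase2 = Adam([
    {"params": encoder.parameters(), "lr": 1e-5},  # hypothetical lr
    {"params": head.parameters(), "lr": 1e-4},     # hypothetical lr
])
```

The per-parameter-group learning rates in phase 2 are what keep the pretrained stroke features intact while the head continues to adapt.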

Models Used

Model                Type                   Params  Input                     Output
YOLOv12s             Object detection       9.3M    Document (640×640)        Bounding boxes
SigNet + Classifier  Binary classification  16.9M   Signature pair (220×155)  Genuine probability [0, 1]