
Evaluation

How we measure detection and verification quality — and where the system fails.

Understanding the Metrics

Detection and verification use different metrics because they solve different problems.

mAP (Mean Average Precision)

Detection

What it measures

mAP combines precision (are the detected boxes actually signatures?) and recall (did we find all signatures?) across different IoU thresholds. mAP@0.5 uses a 50% overlap threshold; mAP@0.5:0.95 averages across 50%-95% thresholds for a stricter evaluation.

Example

A document has 2 signatures. The model detects 3 boxes: 2 match real signatures (IoU > 0.5), 1 is a false positive (a stamp). Precision = 2/3 ≈ 0.67, Recall = 2/2 = 1.0. mAP aggregates this tradeoff across all confidence thresholds.
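The example above can be sketched in a few lines of Python. The boxes are toy values invented for illustration; real mAP evaluation also handles confidence ranking and one-to-one matching, which this sketch omits.

```python
def iou(box_a, box_b):
    """Intersection-over-Union for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# 2 ground-truth signatures, 3 detections (the third is a stamp)
gt = [(0, 0, 100, 50), (200, 0, 300, 50)]
preds = [(5, 0, 105, 50), (195, 0, 295, 50), (400, 0, 450, 40)]

# Simplified matching: a prediction counts if it overlaps any ground truth
matched = sum(any(iou(p, g) > 0.5 for g in gt) for p in preds)
precision = matched / len(preds)  # 2/3: the stamp is a false positive
recall = matched / len(gt)        # 2/2 = 1.0: both signatures found
```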

Scale

1.0 = perfect detection. > 0.85 is strong for document signatures.

EER (Equal Error Rate)

Verification

What it measures

EER is the point where False Accept Rate (FAR) equals False Reject Rate (FRR). FAR = forgers get through. FRR = genuine signers get rejected. Lower EER = better. At the EER threshold, the system makes equal mistakes in both directions.

The tradeoff

Strict threshold (0.7): few forgers pass (low FAR), but many genuine signers are rejected (high FRR)
Lenient threshold (0.3): most genuine pass (low FRR), but forgers also get through (high FAR)
EER threshold (~0.5): balanced — equal error in both directions

Scale

0% = perfect. < 8% is strong for offline signature verification.
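How FAR, FRR, and EER fall out of a set of similarity scores can be shown with a small threshold sweep. The scores below are toy values invented for illustration; a real evaluation would sweep the model's actual verification scores.

```python
def far_frr(genuine, forged, thr):
    """FAR: forged pairs accepted at thr. FRR: genuine pairs rejected at thr."""
    far = sum(s >= thr for s in forged) / len(forged)
    frr = sum(s < thr for s in genuine) / len(genuine)
    return far, frr

def eer(genuine, forged, steps=1000):
    """Sweep thresholds; return the error rate and threshold where FAR and FRR meet."""
    candidates = []
    for t in range(steps + 1):
        thr = t / steps
        far, frr = far_frr(genuine, forged, thr)
        candidates.append((abs(far - frr), (far + frr) / 2, thr))
    _, rate, thr = min(candidates)  # smallest FAR/FRR gap wins
    return rate, thr

# Toy scores: genuine pairs cluster high, forged pairs low, with some overlap
genuine_scores = [0.9, 0.85, 0.8, 0.6, 0.4]
forged_scores = [0.1, 0.2, 0.3, 0.55, 0.7]
rate, thr = eer(genuine_scores, forged_scores)  # rate = 0.2 for these scores
```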

Detection Results

YOLOv12s fine-tuned on 2,819 document images (two-phase: 20 epochs backbone frozen + 80 epochs full fine-tune). Evaluated on 419 held-out test images.

mAP@0.5: 0.910
mAP@0.5:0.95: 0.533
Precision: 0.916
Recall: 0.884

3 of 4 targets exceeded. mAP@0.5 (0.91), Precision (0.92), and Recall (0.88) all surpass targets. mAP@0.5:0.95 (0.53) is below the 0.60 target — the strict IoU thresholds are hard for signature bounding boxes with fuzzy edges.

Verification Results

SigNet encoder + binary classifier, fine-tuned on CEDAR (signers 1-45, 1,080 genuine + 1,080 forged). Evaluated on signers 46-55 (8,520 pairs: 2,760 genuine, 5,760 forged).

EER: 20.4%
EER threshold: 0.773
Genuine mean score: 0.83
Forged mean score: 0.24

FAR / FRR at Different Thresholds

Threshold   FAR      FRR      Tradeoff
0.3         25.3%    14.5%    Lenient — more forgers pass
0.5         23.1%    16.7%    Default
0.7         21.0%    19.3%    Near EER
0.8         20.1%    20.8%    EER point

Clear separation. Genuine pairs score 0.83 on average, forged pairs score 0.24 — a 0.59 gap. The model reliably distinguishes between authentic and forged signatures.

EER of 20.4% is above the 8% target. This is a cross-dataset result: SigNet was pretrained on GPDS (a different dataset), then evaluated on CEDAR. The baseline EER without fine-tuning was 33.8%, so fine-tuning cut the error by roughly 40% in relative terms. With same-dataset pretraining or more training data, EER < 10% is achievable.

Production Considerations

Resolution

Minimum DPI matters

Detection quality degrades below 72 DPI. Phone camera photos work well (> 200 DPI equivalent). Low-quality fax scans may miss signatures entirely.

Cold start

New signers need at least one reference

The Siamese network needs a reference signature to compare against. For new users, an enrollment step is required. More references improve accuracy (can average embeddings).

Cross-script

Thai vs English performance gap

The model is pretrained on Western signatures (SigNet). Thai signatures have more complex strokes and different writing patterns. Fine-tuning on TSNCRV2018 helps, but English signatures will likely perform better than Thai.

What This PoC Demonstrates

This project is a proof of concept — designed to demonstrate end-to-end ML skills, not to be a production-ready system. Here's what it proves and what it doesn't.

What it proves
  • Can fine-tune object detection (YOLOv12s) on custom data
  • Can build Siamese networks for metric learning
  • Understands two-phase training and catastrophic forgetting
  • Can iterate: triplet → contrastive → BCE (documented honestly)
  • Full deployment: training → API → website → Cloud Run
  • Total cost: ~$5 GPU + $0 inference (CPU Cloud Run)

Known limitations
  • EER 20% is above production threshold (<5%)
  • Trained on CEDAR only (55 English signers)
  • Cross-dataset gap: SigNet pretrained on GPDS, evaluated on CEDAR
  • No Thai signature support yet
  • Detection trained on scanned docs — may struggle with photos
  • Single reference per signer (no enrollment averaging)

Real-World Use Cases

Where signature verification is deployed in production today.

Banking KYC & Document Processing

Banks verify customer signatures on checks, loan applications, and account opening forms. Automated verification reduces manual review from minutes to seconds. Production systems achieve <3% EER with proprietary datasets of 10K+ signers.

Contract & Legal Document Authentication

Law firms and notaries verify signatures on contracts, wills, and power of attorney documents. The system flags suspicious signatures for human review rather than making final decisions.

Government & Insurance Claims

Government agencies verify signatures on permit applications and tax documents. Insurance companies detect forged signatures on claims — a major source of fraud.

Thai Market Applications

Thai Banking (KBank, SCB, BBL)

Thai banks process millions of paper-based transactions annually — especially in rural areas where digital signatures haven't fully replaced handwritten ones. Signature verification on withdrawal slips, loan guarantor forms, and check clearing is still largely manual. An automated system could process 10x more documents with the same staff.

Thai Government Services (DBD, Land Office)

The Department of Business Development (DBD) processes company registration documents requiring director signatures. Land offices verify signatures on title deed transfers (“chanote”). These high-value transactions are prime targets for forgery — automated pre-screening could flag suspicious documents before they reach a human reviewer.

Thai Insurance (TQM, Muang Thai)

Insurance claim fraud is a significant cost in the Thai market. Forged signatures on claim forms, beneficiary changes, and policy cancellations cost the industry billions of baht annually. Automated verification at the intake stage catches forged claims before payout processing.

Thai Signature Challenges

Thai signatures present unique challenges compared to Western signatures:

  • Many Thai people sign with their name in Thai script — more complex stroke patterns than Latin
  • Some use a mix of Thai initials + Latin-style flourishes
  • Older documents may have signatures degraded by humidity (Thai climate)
  • Government forms often use blue ink on colored paper — harder to binarize
  • The TSNCRV2018 dataset (included in our training data) specifically addresses Thai signatures

Path to State-of-the-Art

How to go from 20% EER to <5% EER for production deployment.

Data

More signers, more samples

Our biggest limitation: 55 CEDAR signers for training. Production systems use 500-5,000+ signers. More signers = better generalization to unseen writers.

CEDAR: 55 signers, 2,640 images → our PoC (20% EER)
GPDS-960: 960 signers, 23,040 images → published ~4% EER
Custom enterprise: 5,000+ signers, 100K+ images → <2% EER

Architecture

Modern backbone + attention

SigNet is a 2017 AlexNet-style architecture. Modern approaches use Vision Transformers (ViT) or EfficientNet backbones with attention mechanisms for better feature extraction.

SigNet (2017): 5-layer CNN, 15.8M params — our current encoder
EfficientNet-B4 (2020): compound scaling, 19M params — better features
DeiT/ViT (2025): Double Siamese + transformer, reported 98.3% accuracy on CEDAR

Training

Curriculum learning + hard mining

Start with easy pairs (random forgeries), gradually increase difficulty to skilled forgeries. Online hard negative mining selects the most informative training samples each batch. Combined with ArcFace or CosFace margin-based loss for tighter embedding clusters.
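Hard negative mining can be sketched as a simple batch-level selection. The field names and scores below are invented for illustration; in the real training loop the scores would come from the current model's forward pass.

```python
def mine_hard_negatives(batch, k):
    """Keep the k forged pairs the model currently scores as most similar;
    these produce the largest, most informative gradients."""
    negatives = [ex for ex in batch if ex["label"] == 0]  # label 0 = forged pair
    return sorted(negatives, key=lambda ex: ex["score"], reverse=True)[:k]

# A toy batch of (signature pair, label, model similarity score)
batch = [
    {"pair": ("g1", "g2"), "label": 1, "score": 0.91},  # genuine pair
    {"pair": ("g1", "f1"), "label": 0, "score": 0.62},  # skilled forgery: hard
    {"pair": ("g1", "f2"), "label": 0, "score": 0.15},  # random forgery: easy
    {"pair": ("g1", "f3"), "label": 0, "score": 0.48},
]
hard = mine_hard_negatives(batch, k=2)  # keeps the 0.62 and 0.48 forgeries
```

A curriculum scheduler would simply widen the score range this selector draws from as training progresses, starting with the easiest negatives and ending with the hardest.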

Enrollment

Multi-reference averaging

Instead of comparing against one reference, enroll 3-5 reference signatures per person. Average their embeddings to create a more robust “template.” This reduces natural variation noise and can improve EER by 20-30% relative.
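The averaging idea can be sketched with plain lists standing in for embeddings. A real system would use the encoder's high-dimensional vectors, but the math is the same.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def enroll(references):
    """Average several reference embeddings into one template, then renormalize."""
    dim = len(references[0])
    mean = [sum(ref[i] for ref in references) / len(references) for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    # embeddings are unit-length, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

# Three noisy reference signatures from the same signer (2-D toy embeddings)
refs = [l2_normalize(v) for v in ([1.0, 0.1], [0.9, 0.2], [1.0, 0.0])]
template = enroll(refs)
score = cosine(template, l2_normalize([0.95, 0.12]))  # query signature
```

Averaging cancels out per-sample noise, which is why the template matches a genuine query more consistently than any single reference would.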

Fine-Tuning for Your Own Data

How to adapt this pipeline to your organization's signatures.

Step 1: Collect signature samples

Per signer: collect 10-20 genuine signature samples on different days (captures natural variation). For forgery training: have 5+ different people attempt to copy each signer's signature.

Minimum: 50 signers × 10 genuine + 10 forgeries = 1,000 images
Recommended: 200+ signers × 20 genuine + 15 forgeries = 7,000+ images
Scan quality: 300 DPI minimum, grayscale, white background

Step 2: Organize into CEDAR-like structure

data/custom/
  genuine/
    signer_001_sample_01.png
    signer_001_sample_02.png
    ...
  forged/
    forgery_001_sample_01.png   (forgery targeting signer 1)
    ...
Split by signer: 80% train, 20% test (writer-independent)
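The writer-independent split can be sketched as below. This is a hypothetical helper, not part of the repo: the key point is that the split is over signer IDs, never over individual images.

```python
import random

def writer_independent_split(signer_ids, test_frac=0.2, seed=42):
    """Split by signer, not by image, so test writers never appear in training."""
    ids = sorted(signer_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_test = max(1, int(len(ids) * test_frac))
    return set(ids[n_test:]), set(ids[:n_test])

# 50 signers → 40 train, 10 held-out test writers
train_signers, test_signers = writer_independent_split(range(1, 51))
```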

Step 3: Fine-tune from our pretrained weights

Start from our best_siamese.pth (already understands signature patterns) and fine-tune on your custom data. This is much faster than training from scratch.

# Phase 1: freeze encoder, train classifier (10 epochs)
python verification/train.py --phase 1 --epochs1 10
# Phase 2: full fine-tune with your data (30-50 epochs)
python verification/train.py --phase 2 --epochs2 50
# Evaluate
python verification/evaluate.py

Step 4: Expected accuracy by data size

Training data                   Expected EER   GPU time
55 signers (our PoC)            ~20%           ~30 min
200 signers (custom)            ~10-12%        ~1 hr
500+ signers (enterprise)       ~5-8%          ~3 hrs
1000+ signers + ViT backbone    <3%            ~8 hrs