The Cardinal Benchmark

Jianna Liu

Sep 23, 2025

Hey everyone! We’re back, with a new benchmark! We developed this benchmark to provide a rigorous, standardized evaluation framework for comparing OCR systems across diverse real-world scenarios. Creating this benchmark required addressing two fundamental challenges: (1) establishing a comprehensive dataset and (2) implementing a reliable evaluation methodology.

Our investigation revealed a significant gap in existing OCR evaluation resources. While general-purpose OCR datasets with ground-truth Markdown labels are essential for systematic comparison, they proved remarkably scarce. OCR-Bench, for example, offered extensive input materials—tables, receipts, handwritten documents—but lacked the corresponding Markdown annotations necessary for quantitative assessment. The limited alternatives we examined, including olmOCR, did not provide the data diversity we required for comprehensive evaluation.

Since the raw data from OCR-Bench was promising, we built our own labels on top of it. The result is a new OCR evaluation dataset that spans a wide range of real-world scenarios: messy tables, blurry photos, receipts, handwriting, and more.

👉 The dataset and labels are fully open source. You can grab them on Hugging Face and use them to benchmark your own OCR systems.
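
If you want to pull the data down programmatically, a minimal sketch using the Hugging Face `datasets` library looks like this (the repository ID, split, and field names below are placeholders; check the dataset card on Hugging Face for the actual ones):

```python
from datasets import load_dataset  # pip install datasets

# Placeholder repo ID and split: substitute the real values from the
# Hugging Face dataset card.
ds = load_dataset("cardinal/ocr-benchmark", split="test")

# Each example pairs a source document with its ground-truth Markdown label.
# The field names are illustrative; the actual schema may differ.
sample = ds[0]
print(sample.keys())
```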

We selected Markdown as our target output format based on prevalent industry requirements and its dual advantages: it represents structure (tables, lists, headers) and it is standardized enough to enable systematic cross-provider comparison.
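
To make that concrete, a ground-truth label for a simple receipt might look like the (entirely hypothetical) Markdown below, with headers and a table capturing the document's structure:

```markdown
# ACME Grocery

**Date:** 2024-03-14

| Item   | Qty | Price |
|--------|-----|-------|
| Coffee | 1   | $3.50 |
| Bagels | 2   | $4.00 |

**Total: $7.50**
```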

Just like in our OCR Showdown, we put this benchmark to the test against the providers we hear about most often on customer calls. That’s why you’ll see this mix of LLMs and traditional OCR engines (and let us know if you’d like us to add others):

  • Gemini 2.5 Flash

  • GPT-5

  • Claude 4 Sonnet

  • Azure Document Intelligence

  • AWS Textract

  • Mistral OCR

  • Tesseract

This gives us a clear, apples-to-apples view of how today’s leading systems perform on the same tricky inputs — tables, receipts, handwriting, and more.

Among these systems, Claude 4 Sonnet is the highest-cost option, while Tesseract is the most economical. In terms of speed, GPT-5 has the longest processing times, while Tesseract is the fastest.

Below is the accuracy of all of the systems on our dataset:

We graded outputs by checking how closely they match the ground-truth labels. First, we clean both versions so small formatting differences (like bullet styles or link URLs) don’t affect the score. Then we compare them in two ways: (1) do they use the same words in the same order, and (2) how similar do the texts look overall. We combine those into one score, weighted 70% on word-level overlap (ROUGE-L F1) and 30% on overall text similarity (character match). This way the benchmark is fair, consistent, and not thrown off by tiny Markdown quirks. You can view our evaluation script here!
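
As a rough illustration of that recipe, here is a minimal scoring sketch in Python. It assumes the `rouge-score` package for ROUGE-L and uses `difflib` for the character-level comparison; the normalization rules shown are simplified stand-ins, and the actual evaluation script may differ in its details.

```python
import difflib
import re

from rouge_score import rouge_scorer  # pip install rouge-score


def normalize(text: str) -> str:
    """Simplified cleanup so Markdown quirks don't affect the score."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)    # keep link text, drop URL
    text = re.sub(r"^\s*[-*+]\s+", "- ", text, flags=re.M)  # unify bullet styles
    return re.sub(r"\s+", " ", text).strip().lower()


def combined_score(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)

    # (1) Word-level overlap: ROUGE-L F1 on the cleaned texts.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    rouge_f1 = scorer.score(ref, pred)["rougeL"].fmeasure

    # (2) Overall similarity: character-level match ratio.
    char_sim = difflib.SequenceMatcher(None, ref, pred).ratio()

    # Weighted 70% word overlap / 30% character similarity.
    return 0.7 * rouge_f1 + 0.3 * char_sim


if __name__ == "__main__":
    reference = "# Receipt\n\n- Coffee: $3.50\n- Bagel: $2.00"
    prediction = "# Receipt\n\n* Coffee: $3.50\n* Bagel: $2.00"
    # Bullet styles differ but normalize identically, so the score is 1.0.
    print(f"score: {combined_score(prediction, reference):.3f}")
```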

If you want to try out Cardinal today, feel free to visit our dashboard or create a new API key :)