OCR Is Broken for Complex Documents - Here’s How We Fixed It

Jianna Liu

Sep 7, 2025


Hey everyone!

This is Jianna, CTO at Cardinal. Cardinal partners with innovative teams to turn complex, messy documents into accurate, usable data.

Since our launch 7 days ago, we’ve already processed over 50k pages through our platform and have had some incredible feedback from our early users.

Since our launch, a lot of folks have asked us about how our processing technology works under the hood. I’m writing this article to help answer that question, as well as explain the why behind Cardinal and the problem we’re solving.


I. Our Story

This is my second company alongside my co-founder Devi. We met as Harvard and MIT CS undergrads, and our first company together was a utility data processing company (part of YC S23).

Processing utility data turned out to be a trickier challenge than we initially expected. It was also a high-stakes one: many of our clients relied on our parsing to pay their utility bills. It was critical that we extracted the correct numbers from each bill. Our goal was to pass in a utility bill and return a structured JSON output of the contents of the bill.

After trying both LLMs and incumbent OCR solutions, we realized that all of the solutions out there were bad at maintaining semantic structure and producing an accurate Markdown representation of the text, especially for annotations and complex columnar layouts.


II. The Problem We Solve

Cardinal is built to deliver both JSON and markdown outputs for the most complex documents across different verticals.

To show you the problem, let’s look at the following doctor’s note:

I’ll run the document through both Azure Document Intelligence (a common incumbent OCR tool) and Cardinal for comparison.

Here’s the original document:

Here’s the output from Azure:

and Cardinal:

It’s clear the Cardinal output is better: (1) its annotations are more accurate, since Azure often marks text without properly indicating what is actually annotated, and (2) it preserves the structure of the document in a way that closely reflects the original input.

Using these Markdown outputs, let’s have a common LLM extract data from the document according to my preferred JSON schema. For this example I’m using GPT-5. This is the schema I’ll be using:


{
  "type": "object",
  "properties": {
    "observations": {
      "type": "array",
      "description": "List of observations, scores, or metrics extracted from the form",
      "items": {
        "type": "object",
        "properties": {
          "label": { "type": "string" },
          "value": { "type": ["string", "number"] },
          "unit": { "type": "string" }
        },
        "required": ["label", "value"]
      }
    }
  }
}
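
To make this step concrete, here is roughly what the extraction call looks like. This is a minimal sketch assuming the OpenAI Python SDK; the file names, prompt wording, and helper function are illustrative and not part of Cardinal’s pipeline.

import json
from openai import OpenAI  # OpenAI Python SDK (v1.x)

# The JSON Schema shown above, used to steer the extraction.
SCHEMA = {
    "type": "object",
    "properties": {
        "observations": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string"},
                    "value": {"type": ["string", "number"]},
                    "unit": {"type": "string"},
                },
                "required": ["label", "value"],
            },
        }
    },
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract(markdown: str) -> dict:
    """Ask the model to fill the schema from one OCR Markdown output."""
    response = client.chat.completions.create(
        model="gpt-5",  # the model used for this comparison
        response_format={"type": "json_object"},  # force a JSON-only reply
        messages=[
            {"role": "system",
             "content": "Extract data from the document below. Respond with "
                        "JSON only, matching this JSON Schema:\n" + json.dumps(SCHEMA)},
            {"role": "user", "content": markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Run once per OCR output, e.g. the Azure Markdown and the Cardinal Markdown.
print(json.dumps(extract(open("doctors_note_azure.md").read()), indent=2))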


Here are the results from Azure (top) and Cardinal (bottom): 

{
  "observations": [
    { "label": "Date of accident", "value": "2019-01-06" },
    { "label": "Time of accident", "value": "9:00 AM" },
    { "label": "Police responded", "value": "Yes" },
    { "label": "Ambulance responded", "value": "No" },
    { "label": "Copy of police report", "value": "No" },
    { "label": "Location of accident (City, State)", "value": "Eagan, MN" },
    { "label": "Role in accident", "value": "Driver" },
    { "label": "Collision type", "value": "Two vehicle crash" },
    { "label": "Force of impact", "value": "Severe" },
    { "label": "Vehicle type (self)", "value": "Full-size car" },
    { "label": "Seat belt worn", "value": "No" },
    { "label": "Other vehicle type", "value": "Semi truck" },
    { "label": "Airbag deployed", "value": "Yes" },
    { "label": "Accident description", "value": "Front passenger of car; head-on crash by a semi truck backing up" }
  ]
}


{
  "observations": [
    { "label": "Date of accident", "value": "01-06-2021" },
    { "label": "Time", "value": "9:00" },
    { "label": "City", "value": "Eagan, MN" },
    { "label": "Police responded", "value": "Yes" },
    { "label": "Ambulance responded", "value": "No" },
    { "label": "Police report available", "value": "Yes" },
    { "label": "Role in accident", "value": "Front Passenger" },
    { "label": "Collision type", "value": "Two vehicle, Head-on crash" },
    { "label": "Force of impact", "value": "Severe" },
    { "label": "Damage to vehicle", "value": "Front" },
    { "label": "Vehicle type", "value": "Full-size car" },
    { "label": "Seat belt worn", "value": "Yes" },
    { "label": "Other vehicle type", "value": "Semi truck" },
    { "label": "Airbag deployed", "value": "No" },
    { "label": "Accident description", "value": "Was Front Passenger of car, had a head-on crash with a semi truck backing up." }
  ]
}


It’s clear that even the best LLMs can struggle to output correct information when the data they’re fed isn’t structured in a way that semantically matches the document. In this case, when the Azure Markdown is sent in, many fields are off. For example, “Copy of police report” should be marked “Yes”, and “Role in accident” is marked “Driver” even though the form indicates the patient was the front passenger.

Note that Cardinal provides both Markdown and schema-conforming JSON outputs, so there’s no need to run GPT-5 on top. Our JSON extraction is built directly on well-structured Markdown, which is what keeps its accuracy high.

Out-of-the-box LLMs are also not yet able to OCR these tricky documents accurately.

To show you what we mean, let’s look at a table:

I’ll run the document through both Gemini 2.5 Pro (a leading general-purpose multimodal LLM) and Cardinal for comparison. Here are the outputs from Gemini (top) and Cardinal (bottom):

Output from Gemini

Output from Cardinal


As you can see, when Gemini is asked to produce Markdown, it fails to accurately reproduce the document’s structure and content.
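
For reference, the Gemini side of this comparison can be reproduced with a prompt along these lines. This is a sketch assuming the google-generativeai Python SDK; the file name and prompt wording are illustrative.

# Ask Gemini 2.5 Pro to transcribe a scanned table as Markdown.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")

response = model.generate_content([
    Image.open("table_page.png"),  # the table shown above
    "Transcribe this page as GitHub-flavored Markdown, preserving the exact "
    "table structure, merged cells, and every cell value.",
])
print(response.text)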


III. Cardinal’s Architecture

Cardinal’s accuracy comes from the diverse data we collected over the past few years at our last company. We used this data to fine-tune a custom VLM that processes documents with high accuracy while preserving their semantic structure. Here’s how it works:

Stage 1: Foundation Layer - Annotation & Table Detection

Our core OCR focuses on the edge cases that most OCR models miss, namely complex tables and annotations (handwriting, circles, checkmarks, signatures). Instead of flattening documents into plain text, we preserve structure, capturing handwriting, margin notes, and even deeply nested tables with their full hierarchy intact. We capture each element together with its bounding box and return Markdown for Stage 2.
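
To make this concrete, here is a minimal illustration of the kind of intermediate representation this stage produces. The names and fields are hypothetical, not Cardinal’s actual code.

# Illustrative only: detected elements carry a type, a bounding box, and their
# Markdown rendering; serializing them in reading order yields the Stage 1 output.
from dataclasses import dataclass

@dataclass
class Element:
    kind: str                                # "table", "checkbox", "handwriting", ...
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    markdown: str                            # the element rendered as Markdown

def page_to_markdown(elements: list[Element]) -> str:
    """Order elements top-to-bottom, left-to-right and join their Markdown."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    return "\n\n".join(e.markdown for e in ordered)

# A checked box and a handwritten note keep both their meaning and their position.
page = [
    Element("handwriting", (40, 160, 300, 190), "> Eagan, MN *(handwritten)*"),
    Element("checkbox", (40, 120, 60, 140), "- [x] Police responded"),
]
print(page_to_markdown(page))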

Stage 2: Intelligence Layer - Fine-tuned VLM

The structured Markdown output feeds into our fine-tuned vision-language model (VLM), trained specifically on annotated documents and complex tabular data, to give us a more refined Markdown or JSON output.
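
Our fine-tuned VLM isn’t public, but the shape of this stage looks roughly like the sketch below, with a generic OpenAI-compatible vision model standing in for our own. The model name, prompt, and helper function are illustrative.

# Illustrative only: pair the page image with the Stage 1 Markdown and ask a
# vision-language model for the refined output.
import base64
from openai import OpenAI

client = OpenAI()  # point base_url/model at whichever VLM endpoint you use

def refine(image_path: str, stage1_markdown: str, model: str = "gpt-4o") -> str:
    image_b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Refine this draft Markdown so it matches the page exactly, "
                         "preserving tables, checkboxes, and handwriting:\n\n"
                         + stage1_markdown},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content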


IV. Why Now?

It’s no secret that 80% of enterprise data is still locked in PDFs and other forms of unstructured, underutilized data.

But that doesn’t explain the why now for Cardinal.

We’re at an inflection point where every company is racing to adopt AI solutions. AI is powerful - it can reason, generate, and automate in ways businesses could only dream of a few years ago.

But for enterprises, the real bottleneck isn’t the models, it’s the data. The value of AI comes down to whether it can actually access and understand a company’s unstructured data. This shows up in many ways:

  • Enterprises want reliable knowledge bases.

    Every company is trying to centralize its knowledge into searchable systems, whether for internal documentation, customer support, or compliance. Retrieval-augmented generation (RAG) has become the default way to make those systems useful, but it only works if the underlying documents are accurate. When OCR loses semantics or breaks structure, knowledge bases retrieve the wrong context and the entire system becomes unreliable.

  • Enterprises need structured data to automate.

    Beyond knowledge bases, enterprises are also looking to turn documents into structured outputs they can plug directly into workflows, namely JSONs, tables, and key-value pairs for ERPs, insurance claims, or analytics pipelines. Without accurate OCR and semantic parsing, these automations break down and risk corrupting the entire workflow.

The next wave of enterprise AI won’t be defined by bigger models - it will be defined by better data. We’re moving from prompt engineering to context engineering, where the quality of inputs determines the reliability of outputs - that’s where Cardinal comes in.

Try us out by uploading a PDF here: https://dashboard.trycardinal.ai/

Excited to hear your feedback! If you made it this far, thanks for reading!
