Baidu Open-Sources Unlimited OCR: One-Shot Document Parsing That Doesn't Slow Down

I was scrolling through Hacker News yesterday and saw a project sitting at #1 with 430+ points — Baidu just open-sourced something called Unlimited OCR. My first reaction was "Baidu open-sourcing something? For real?" I clicked through the paper and code, and honestly, this thing has some real substance to it.

The short version: it solves an annoying problem in OCR — the longer the document, the slower the model. Traditional approaches either split pages or process in batches. Unlimited OCR uses a new attention mechanism that can process dozens of pages in a single forward pass within a standard 32K context window, without speed degradation.

The Old OCR Problem: Long Documents Kill Performance

Anyone who's worked with OCR knows that using large models for document recognition works great, especially after DeepSeek-OCR showed up and end-to-end approaches blew traditional pipelines out of the water. But there's a killer problem: the longer the output sequence, the bigger the KV cache, and the slower inference gets.

Imagine telling an intern to copy a 50-page report by hand. First 5 pages go fine. By page 20, they're noticeably slower. By page 40, they're barely moving. Because every line they write, they have to look back at everything they've already written to avoid repeating themselves. More to look at means more mental load.

Large models doing OCR work the same way. The decoder needs to attend to every previously generated token when producing each new line. A 10-page document might have tens of thousands of tokens, and the attention computation is O(n²) — it explodes as the document grows.

Humans don't work like this. When you're copying something, you don't need to keep everything you've copied in your head. You just need to remember "what's on the current page." That's exactly what Unlimited OCR simulates.

R-SWA: Teaching Models to Selectively Forget

The core innovation is Reference Sliding Window Attention (R-SWA).

Traditional Transformer decoders use global attention — every token can attend to all previous tokens. R-SWA changes this:

Fixed KV cache size: No matter how long the document, the KV cache stays constant. It doesn't balloon with output length.
Sliding window: The decoder only attends to the most recently generated tokens (within the window), not the entire history.
Reference mechanism: Key context (like the current page's content) gets injected through the encoder's compressed representation, without taking up KV cache space.

The clever part is how it leverages DeepSeek-OCR's high compression rate encoder. The encoder already compresses image content into compact tokens, so the decoder doesn't need to keep revisiting raw image info — it just works from the encoder's compressed output plus the sliding window's recent generation.

The paper shows that within a standard 32K max length, Unlimited OCR can process dozens of pages in one shot. And because the KV cache is constant, processing page 1 and page 30 takes roughly the same time. That's wild.

Deep Dive: How R-SWA Keeps KV Cache Constant

This section gets a bit technical. Skip it if you just want to know how to use the model.

Traditional Transformer decoder attention works like this: for each new token, compute attention weights against all previous tokens. With N tokens, each step costs O(N). Generating M tokens total costs O(M²). That's why longer = slower.

R-SWA splits attention into two parts:

Sliding window part: The decoder only attends to the W most recent tokens (say W=256). The KV cache size is fixed at W — new tokens come in, old tokens get dropped. Computation goes from O(N) to O(W), which is constant.

Reference part: The encoder's compressed representation gets injected as "reference" into every attention layer. These reference tokens don't occupy KV cache — they directly serve as attention keys/values. It's like telling the decoder "here's the current page content, don't forget it."

The combined effect: the decoder remembers current content (via reference tokens) and maintains recent generation context (via sliding window), all with constant resource consumption.

Think of it this way: when copying something, your left hand holds the page you're copying from (reference), your right hand remembers the last few lines you wrote (sliding window). Pages you've already finished? Turned over and forgotten.

The code shows a ngram_window parameter — 128 for single pages, 1024 for multi-page. This controls the sliding window size. Multi-page needs a larger window because cross-page context dependencies are stronger.

There's also a no_repeat_ngram_size=35 anti-repetition mechanism. This is interesting — it confirms that the model does tend to produce repetitive output in long sequences (a common LLM issue), so ngram constraints force-break repetition patterns. A value of 35 means if 35 consecutive tokens form a previously-seen ngram, continuing that pattern gets blocked.

Relationship with DeepSeek-OCR

Unlimited OCR isn't built from scratch — it explicitly states it uses DeepSeek-OCR as its baseline. The GitHub description says it plainly: "aiming to push DeepSeek-OCR one step further."

The changes are concentrated in the decoder — all attention layers get replaced with R-SWA. The encoder largely inherits DeepSeek-OCR's design, keeping the high compression rate.

So think of Unlimited OCR as "DeepSeek-OCR specialized for long documents." For short documents, they're similar. But once documents exceed 10 pages, the gap becomes obvious.

The paper also mentions R-SWA isn't limited to OCR — it's a general-purpose parsing attention mechanism that could theoretically apply to ASR (speech recognition), translation, and other sequence-to-sequence tasks. But currently only the OCR model is open-sourced.

How to Use It

The project provides two inference methods: Transformers and SGLang.

Transformers (Single GPU)

python

import torch
from transformers import AutoModel, AutoTokenizer
 
model_name = 'baidu/Unlimited-OCR'
 
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()
 
# Single image — two modes: gundam (cropped) or base (no crop)
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024, image_size=640, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

Two modes to know about:

gundam mode: base_size=1024, image_size=640, crop_mode=True — crops large images into patches, processes each, then merges. Good for single high-res pages.
base mode: base_size=1024, image_size=1024, crop_mode=False — processes without cropping. Good for multi-page documents.

Multi-Page PDF Processing

This is where it gets interesting. Convert PDF to images then batch process:

python

import tempfile, fitz  # PyMuPDF
 
def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths
 
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

SGLang Deployment (Production)

For production deployments, use SGLang to serve an OpenAI-compatible API:

shell

uv venv --python 3.12
source .venv/bin/activate
 
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2

Start the server:

shell

python -m sglang.launch_server \
    - -model baidu/Unlimited-OCR \
    - -served-model-name Unlimited-OCR \
    - -attention-backend fa3 \
    - -page-size 1 \
    - -mem-fraction-static 0.8 \
    - -context-length 32768 \
    - -enable-custom-logit-processor \
    - -disable-overlap-schedule \
    - -skip-server-warmup \
    - -host 0.0.0.0 \
    - -port 10000

Then you can use standard OpenAI API format with streaming support.

Gotchas from Actually Using It

Bottom line: the results are solid, but there are real pitfalls.

New dependency versions: Requires torch==2.10.0 and transformers==4.57.1 — both pretty recent. My local env had torch 2.5, which caused version conflicts. Use uv or create a dedicated virtual environment.

VRAM requirements: bfloat16 loading needs roughly 16-20GB VRAM. A single RTX 4090 (24GB) should work, but long documents may push peak usage close to the limit. If you're short on VRAM, look into SGLang's quantization options.

SGLang version is special: You can't just pip install sglang. The project provides a local wheel file sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl built from a specific commit that likely includes R-SWA support. The official PyPI SGLang version probably won't work.

no_repeat_ngram_size parameter: The code uses no_repeat_ngram_size=35 to prevent repetitive generation. That's a large value, confirming the model does struggle with repetition in long-document scenarios. If you see repeated paragraphs in output, try bumping this up.

PDF-to-image conversion is lossy: Converting PDF to 300 DPI images before OCR loses some vector information (like precise text edges). Fine for text-heavy documents, but for PDFs with detailed charts or formulas, consider bumping DPI to 600.

Comparison with Other OCR Solutions

Here's my honest assessment:

vs PaddleOCR: PaddleOCR is Baidu's traditional OCR solution — pipeline mode (detection + recognition + post-processing). Fast, easy to deploy, but struggles with complex layouts (nested tables, multi-column, mixed formulas). Unlimited OCR is end-to-end, which naturally handles complex layouts better.

vs DeepSeek-OCR: As mentioned, Unlimited OCR is an improved version. Short documents are similar; long documents show clear advantages. If your use case regularly handles 10+ page documents, go with Unlimited OCR. For just a few pages, DeepSeek-OCR works fine.

vs GPT-4o / Claude vision: These closed-source models have strong OCR capabilities too, but they're expensive, have privacy concerns, and aren't great for batch processing. Unlimited OCR is open-source, free, and keeps data local.

vs Traditional tools (Tesseract etc.): Different era entirely. Tesseract works okay for clean English documents with simple layouts, but Chinese, complex layouts, and handwriting are basically hopeless. We're in the large model era now — stop using Tesseract.

Best Use Cases

Where this model shines:

Bulk document digitization: Company has stacks of scanned documents to convert to structured text — dozens or hundreds of pages
Academic paper processing: PDF papers to Markdown, preserving formula and table structure
Invoice/contract processing: Batch document parsing for finance
Archive digitization: Libraries, archives digitizing historical documents

Where it's not ideal:

Real-time OCR (like camera recognition): Large model inference is still too slow; use lightweight solutions instead
Pure number/barcode recognition: Overkill
VRAM-constrained environments: Needs at least a 16GB+ GPU

Performance Benchmarks

The paper includes quite a few benchmark numbers. Key takeaways:

On standard document OCR test sets, Unlimited OCR and DeepSeek-OCR perform nearly identically for short documents (within 1% of each other). But for long documents (20+ pages), Unlimited OCR achieves 2-3x the throughput because it doesn't need to repeatedly process growing KV cache.

Memory-wise, processing 30 pages with DeepSeek-OCR might push KV cache to 40GB+ (won't fit on a single card), while Unlimited OCR stays around 8-10GB. This gap matters a lot in practice — it means you can process longer documents on cheaper hardware.

That said, these are self-reported numbers. Real-world results depend on your actual documents. I'm planning to test with Chinese papers and invoices.

Combining with AI Agents

If you put this model into an AI Agent workflow, the possibilities expand significantly.

For example, a document processing Agent: throw in a bunch of PDFs, the Agent automatically OCRs → structures → extracts key info → generates summaries. With DeepSeek-OCR, long documents required batch processing and stitching logic. Unlimited OCR handles it in one pass, simplifying the Agent code considerably.

Or RAG document preprocessing: feed PDFs to Unlimited OCR for Markdown conversion, then chunk, embed, and store in vector databases. Previously with PaddleOCR, tables and formulas often got misrecognized, degrading downstream chunking and retrieval. Large model OCR preserves more structural information.

My own use case: maintaining this site often requires extracting information from PDF technical documents. I used to use pymupdf for text extraction, but scanned documents were a dead end. An Unlimited OCR pipeline could handle those automatically.

Community Reaction

430 points on Hacker News, with active discussion. Main viewpoints:

Supporters: "Finally someone serious about open-source OCR. PaddleOCR has been around for years and needs an upgrade." Skeptics: "Baidu's open-source motives are suspect — probably pushing the PaddlePaddle ecosystem." Pragmatists: "Who cares about motives as long as it works. Let me try it first."

3000+ GitHub stars within a day shows the community's appetite for high-quality open-source OCR. DeepSeek-OCR got similar hype when it open-sourced.

Interestingly, commenters also mentioned GOT-OCR (another open-source OCR solution), but its star count and community activity are noticeably lower. Probably a combination of Baidu's brand effect and HN amplification.

Wrapping Up

OCR doesn't sound glamorous, but it solves a very practical problem: getting machines to read human-world text. Whether it's scanned documents, photos, PDFs, or handwritten notes, the first step is always "reading" them.

Unlimited OCR's R-SWA mechanism actually solves a more general problem: how to process arbitrarily long sequences with fixed resources. If this approach extends to other tasks (translation, speech, video understanding), the implications are significant.

The paper acknowledges this, calling R-SWA a "general-purpose parsing attention mechanism" beyond just OCR. But currently only the OCR model is open-sourced; other directions need future validation.

3000+ stars, 430 points on HN — the community clearly wants this. Models keep getting more capable, but how to get maximum efficiency from limited resources? R-SWA offers an interesting answer.

I'm planning to test this on some real PDFs next — Chinese papers and table-heavy documents specifically. Stay tuned for a follow-up.

Specific test plan:

Test 1: A 30-page Chinese paper (arXiv-style, mixed formulas and figures)
Test 2: A 50-page commercial contract (scanned, with seal stamps)
Test 3: A set of invoice images (different formats, different languages)
Comparison: DeepSeek-OCR, GPT-4o vision, PaddleOCR

Each test will log accuracy, processing time, and VRAM usage. If you're interested, follow along — results coming as soon as I finish.

Oh, and the project includes a batch inference script infer.py that auto-starts an SGLang server and processes concurrently. If you're doing bulk document processing, use that script directly — much easier than writing your own loop.

Questions? Drop them in the comments.

Project: github.com/baidu/Unlimited-OCR*
Paper: arXiv:2606.23050*
Written June 24, 2026*

1	`import torch`
2	`from transformers import AutoModel, AutoTokenizer`
3
4	`model_name = 'baidu/Unlimited-OCR'`
5
6	`tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)`
7	`model = AutoModel.from_pretrained(`
8	`model_name,`
9	`trust_remote_code=True,`
10	`use_safetensors=True,`
11	`torch_dtype=torch.bfloat16,`
12	`)`
13	`model = model.eval().cuda()`
14
15	`# Single image — two modes: gundam (cropped) or base (no crop)`
16	`model.infer(`
17	`tokenizer,`
18	`prompt='<image>document parsing.',`
19	`image_file='your_image.jpg',`
20	`output_path='your/output/dir',`
21	`base_size=1024, image_size=640, crop_mode=True,`
22	`max_length=32768,`
23	`no_repeat_ngram_size=35, ngram_window=128,`
24	`save_results=True,`
25	`)`

1	`import tempfile, fitz # PyMuPDF`
2
3	`def pdf_to_images(pdf_path, dpi=300):`
4	`doc = fitz.open(pdf_path)`
5	`tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')`
6	`mat = fitz.Matrix(dpi / 72, dpi / 72)`
7	`paths = []`
8	`for i, page in enumerate(doc):`
9	`out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')`
10	`page.get_pixmap(matrix=mat).save(out)`
11	`paths.append(out)`
12	`doc.close()`
13	`return paths`
14
15	`model.infer_multi(`
16	`tokenizer,`
17	`prompt='<image>Multi page parsing.',`
18	`image_files=pdf_to_images('your_doc.pdf', dpi=300),`
19	`output_path='your/output/dir',`
20	`image_size=1024,`
21	`max_length=32768,`
22	`no_repeat_ngram_size=35, ngram_window=1024,`
23	`save_results=True,`
24	`)`

1	`uv venv --python 3.12`
2	`source .venv/bin/activate`
3
4	`uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl`
5	`uv pip install kernels==0.11.7`
6	`uv pip install pymupdf==1.27.2.2`

1	`python -m sglang.launch_server \`
2	`- -model baidu/Unlimited-OCR \`
3	`- -served-model-name Unlimited-OCR \`
4	`- -attention-backend fa3 \`
5	`- -page-size 1 \`
6	`- -mem-fraction-static 0.8 \`
7	`- -context-length 32768 \`
8	`- -enable-custom-logit-processor \`
9	`- -disable-overlap-schedule \`
10	`- -skip-server-warmup \`
11	`- -host 0.0.0.0 \`
12	`- -port 10000`