I live in Germany, and I am still learning German. If you didn't know, Germany is famous for handling all processes via letters. Not emails, but physical letters. Not knowing the language well enough has always been a challenge, especially when dealing with letters.
I can use translation services or AI apps to understand what these letters are about. However, I have always been uneasy about uploading sensitive documents to third-party services. Contracts, invoices, tax forms: these contain information that I would rather not send to someone else's servers. But whenever I needed to extract knowledge from a stack of PDFs, the options were limited: read them manually (kein Deutsch), ship them to an external API for processing (hello, data privacy concerns), or build a custom RAG pipeline with a dozen moving parts.
I wanted to see if I could build a complete document Q&A system where the data never leaves Cloudflare's network. No third-party LLM APIs. No external vector databases. No data flying off to services I don't control. The result is DocFlare, a chat-based app where you upload PDFs, ask questions in natural language, and get answers grounded in the documents. Everything runs on Cloudflare's edge infrastructure: Workers, Durable Objects, R2, AI Search, Sandbox containers, and Workers AI.
In this article, I'll walk you through how DocFlare works, the architectural decisions I made, and the problems I ran into along the way.
But before that, here's a quick demo.
The Problem with PDFs
PDFs are the cockroaches of the digital world: they're everywhere, they survive everything, and they're nearly impossible to work with programmatically. When I started building DocFlare, the PDF extraction piece was the challenge I was most worried about. More on that later.
Architecture at a Glance
Before diving into the details, here's how the system fits together:
The key thing I want to highlight: every box in that diagram is a Cloudflare product. R2 for storage. AI Search for the RAG pipeline. Workers AI for generation. Sandbox containers for OCR. Durable Objects for stateful chat sessions. There's no external dependency in the critical path.
Two-Strategy PDF Extraction
This was the hardest problem to solve well, and the part of DocFlare I'm most proud of. While it is not perfect, in my testing it reliably extracts meaningful text from a wide variety of PDFs, including scanned documents, handwritten notes, and image-heavy files, without hallucinating content.
Strategy 1: env.AI.toMarkdown()
Cloudflare's Workers AI binding includes a toMarkdown() method that extracts text from PDFs and converts it to structured markdown. It's fast, it's included in Workers AI at no extra cost, and it works beautifully for text-layer PDFs, the kind generated by Word, LaTeX, or any modern document tool.
    const results = await ai.toMarkdown([
      {
        name: fileName,
        blob: new Blob([pdfBytes], { type: "application/pdf" }),
      },
    ]);

    const result = results[0];
    if (!result || result.format === "error") {
      return null;
    }

    // Strip metadata headers toMarkdown always includes, then check
    // that there's at least 50 characters of actual content
    const contentsMatch = result.data.match(/## Contents\s*\n([\s\S]*)/);
    const contentsSection = contentsMatch?.[1] ?? "";
    const stripped = contentsSection.replace(/###\s+Page\s+\d+/g, "").trim();

    if (stripped.length >= 50) {
      return {
        fileName,
        markdown: result.data,
        hasContent: true,
        method: "toMarkdown",
      };
    }

The critical detail here: I strip out the metadata section that toMarkdown() always includes (page headers, etc.) and check that the remaining content is at least 50 characters. If it isn't, we're probably looking at a scanned document where toMarkdown() found little or no text layer, and we need to fall back.
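The same check can be pulled out into a small pure function, which makes the 50-character threshold easy to test in isolation. This is only a sketch: the hasUsableTextLayer name and both sample strings are hypothetical, and the samples merely approximate the layout that the regexes above expect from toMarkdown().

```typescript
// Minimum number of non-header characters before we trust the text layer.
const MIN_CONTENT_CHARS = 50;

// Returns true when the markdown has enough real content under "## Contents".
// Assumption: the output contains a "## Contents" section with "### Page N"
// headers, as implied by the regexes in the snippet above.
function hasUsableTextLayer(
  markdown: string,
  minChars: number = MIN_CONTENT_CHARS,
): boolean {
  const contentsMatch = markdown.match(/## Contents\s*\n([\s\S]*)/);
  const contentsSection = contentsMatch?.[1] ?? "";
  const stripped = contentsSection.replace(/###\s+Page\s+\d+/g, "").trim();
  return stripped.length >= minChars;
}

// Hypothetical output for a PDF with a real text layer.
const textLayerSample = [
  "# invoice.pdf",
  "## Metadata",
  "- PDFFormatVersion=1.7",
  "## Contents",
  "### Page 1",
  "Rechnung Nr. 2024-0042: Bitte überweisen Sie den Betrag von 120,00 EUR bis zum 15.03.",
].join("\n");

// Hypothetical output for a scanned PDF: page headers only, no text.
const scannedSample = [
  "# scan.pdf",
  "## Metadata",
  "- PDFFormatVersion=1.4",
  "## Contents",
  "### Page 1",
  "",
  "### Page 2",
].join("\n");

console.log(hasUsableTextLayer(textLayerSample)); // enough text: keep Strategy 1
console.log(hasUsableTextLayer(scannedSample)); // headers only: fall back to OCR
```

Keeping the threshold as a parameter makes it easy to tune if short but legitimate documents, such as a one-line notice, get sent to OCR unnecessarily.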
Why Not Use a Vision LLM for OCR?
This was a tempting shortcut. Modern vision LLMs can "read" images, right? But there's a fundamental problem: vision LLMs hallucinate when used as OCR. They'll confidently "read" text that isn't there, rearrange numbers in tables, and invent content. For a document Q&A system where accuracy is the entire point, this was a non-starter for me.
Strategy 2: RapidOCR in a Sandbox Container
For scanned PDFs, DocFlare falls back to classical OCR: specifically, RapidOCR running inside a Cloudflare Sandbox container.
RapidOCR uses the same PaddleOCR models (text detection, direction classification, text recognition) but runs them through ONNX Runtime instead of PaddlePaddle. This drops the runtime overhead from ~500 MiB to ~80 MiB, a big deal when you're running inside a container with constrained resources.
The OCR container processes PDFs page by page to keep memory usage at ~25 MiB per page:
    import numpy as np
    from pdf2image import convert_from_path, pdfinfo_from_path

    # `engine` (the RapidOCR instance) and `path` (the input PDF) are set up
    # earlier in the script.

    # Get page count first, then convert one page at a time to keep peak
    # memory low (~25 MiB per page instead of all pages in memory at once).
    info = pdfinfo_from_path(str(path))
    num_pages = info["Pages"]

    pages = []
    for i in range(1, num_pages + 1):
        images = convert_from_path(str(path), dpi=300, first_page=i, last_page=i)
        img_array = np.array(images[0])
        result = engine(img_array)
        if result and result.txts:
            pages.append({"page": i, "text": "\n".join(result.txts)})

On the Worker side, the Sandbox container is invoked through Cloudflare's @cloudflare/sandbox package. The PDF is written to the sandbox filesystem, then the Python script is executed directly:
    const sandbox = getSandbox(sandboxNs, "ocr");

    // Write the PDF to the sandbox filesystem
    const base64 = Buffer.from(pdfBytes).toString("base64");
    await sandbox.writeFile("/workspace/input.pdf", base64, { encoding: "base64" });

    // Run RapidOCR and parse JSON from stdout
    const result = await sandbox.exec("python3 /app/ocr.py /workspace/input.pdf");
    const ocrResult = JSON.parse(result.stdout);

The result is a clean, structured markdown extraction that works reliably on scanned documents, handwritten-ish text, and image-heavy PDFs, with zero hallucination risk. I was genuinely impressed with how well this worked.
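How the per-page OCR output becomes markdown is not shown above, so here is one plausible shape for that step. It assumes the script emits a list of { page, text } objects like the ones the Python loop builds. The ocrPagesToMarkdown name and the "### Page N" heading layout are assumptions made for illustration, not necessarily what DocFlare does.

```typescript
// Shape of one OCR result, mirroring the dicts built in the Python loop.
interface OcrPage {
  page: number;
  text: string;
}

// Joins per-page OCR text into one markdown document.
// Hypothetical helper: the heading layout is an assumption for illustration.
function ocrPagesToMarkdown(fileName: string, pages: OcrPage[]): string {
  const sections = [...pages]
    .sort((a, b) => a.page - b.page) // keep pages in reading order
    .filter((p) => p.text.trim().length > 0) // skip pages with no recognized text
    .map((p) => `### Page ${p.page}\n\n${p.text.trim()}`);
  return [`# ${fileName}`, ...sections].join("\n\n");
}

// Made-up sample data: pages arrive out of order and one is blank.
const sampleMarkdown = ocrPagesToMarkdown("scan.pdf", [
  { page: 2, text: "Mit freundlichen Grüßen" },
  { page: 3, text: "   " },
  { page: 1, text: "Sehr geehrte Damen und Herren,\n" },
]);
console.log(sampleMarkdown);
```

Copying the array before sorting keeps the caller's data untouched, and dropping blank pages avoids indexing empty sections.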
AI Search: RAG Without the Plumbing
If you've built a RAG system before, you know the pain: chunk your documents (but what chunk size? overlap?), generate embeddings (which model? dimensions?), store them in a vector database (which one? how do you index?), retrieve with similarity search (cosine? dot product?), maybe rerank, then generate.
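To make the first of those questions concrete, a fixed-size chunker with overlap might look like the sketch below. The chunkText name and the character-based sizes are illustrative assumptions; real pipelines often split on tokens or sentence boundaries instead.

```typescript
// Splits text into fixed-size character windows, where each window repeats
// the last `overlap` characters of the previous one.
// Hypothetical helper for illustration only.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be in [0, chunkSize)");
  }

  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once this window has reached the end of the text, so we do not
    // emit a trailing chunk made entirely of overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(chunkText("abcdefghij", 4, 2));
```

Every choice here (window size, overlap, where to cut) affects retrieval quality, which is exactly the tuning work a managed service takes off your plate.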
Cloudflare AI Search handles all of it as a managed service. You point it at an R2 bucket, it indexes the contents, and you get a search API. That's it.
Here's the part that made me smile: my original plan included a full custom pipeline with bge-m3 embeddings, Durable Object SQLite storage, and JavaScript cosine similarity. I scrapped all of that in favor of a single AI Search call:
    const searchResponse = await this.env.AI.autorag("docsflare-search").search({
      query,
      rewrite_query: true,
      max_num_results: 8,
      ranking_options: {
        score_threshold: 0.15,
      },
    });

One call. That replaces chunking, embedding, vector storage, retrieval, and reranking. I love when things get simpler.
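For a sense of what that single call stands in for, here is a minimal sketch of just the retrieval step of a hand-rolled pipeline: cosine similarity over in-memory vectors, with a score cutoff and a result cap that loosely mirror score_threshold and max_num_results. The function names, the tiny two-dimensional vectors, and the index shape are all made up for illustration, and this is not a description of how AI Search works internally.

```typescript
interface IndexedChunk {
  id: string;
  vector: number[];
}

interface ScoredChunk {
  id: string;
  score: number;
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new RangeError("vectors must have equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scores every chunk, drops weak matches, and returns the best ones first.
function topMatches(
  query: number[],
  index: IndexedChunk[],
  maxResults: number,
  scoreThreshold: number,
): ScoredChunk[] {
  return index
    .map((entry) => ({ id: entry.id, score: cosineSimilarity(query, entry.vector) }))
    .filter((match) => match.score >= scoreThreshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, maxResults);
}

// Toy index with hand-written vectors; real embeddings have hundreds of dimensions.
const toyIndex: IndexedChunk[] = [
  { id: "contract.pdf#1", vector: [1, 0] },
  { id: "invoice.pdf#3", vector: [0, 1] },
  { id: "tax-form.pdf#2", vector: [1, 1] },
];
console.log(topMatches([1, 0], toyIndex, 8, 0.15));
```

Even this toy version leaves out embedding generation, storage, and reranking, which is why dropping the custom pipeline was an easy decision.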
Why search() Instead of aiSearch()?
AI Search offers two APIs:
- aiSearch(): retrieval + generation in one call. Convenient, but you lose control.
- search(): retrieval only. You handle generation yourself.
I deliberately use search() because I needed control over:
- The system prompt: DocFlare identifies itself as a retrieval assistant with specific behavioral instructions: ground answers in retrieved context, acknowledge when context is insufficient, and include source filenames in responses.
- Conversation history: Multi-turn chat requires injecting prior messages into the LLM context. aiSearch() doesn't support this.
- Streaming: Responses stream back over WebSocket in real time. I needed direct access to the streamText() call.
- Model selection: I use @cf/nvidia/nemotron-3-120b-a12b specifically.
The ChatAgent builds context from search results and passes it to Workers AI with the full conversation history:
    // Build a context string from retrieved chunks, labelled by source filename
    const contextText = chunks
      .map((chunk, index) => {
        const source = chunk.filename ?? `Document ${index + 1}`;
        const confidence = chunk.score ? ` (score ${chunk.score.toFixed(2)})` : "";
        const text = chunk.content
          .filter((entry) => entry.type === "text")
          .map((entry) => entry.text?.trim())
          .join("\n");
        return `[${source}${confidence}]\n${text}`;
      })
      .join("\n\n");

    const workersAI = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: workersAI("@cf/nvidia/nemotron-3-120b-a12b"),
      system: [
        "You are Docflare, a retrieval assistant for indexed PDF documents.",
        "Answer only with information grounded in the retrieved context.",
        "If context is insufficient, say so directly.",
        "Include the source file names in your answer when possible.",
        "",
        "Retrieved context:",
        contextText,
      ].join("\n"),
      messages: modelMessages,
    });

Privacy by Architecture
This is the part I care about the most, and it's not a feature bolted on after the fact. It's a consequence of how the system is built.
| Step | Where It Happens | Data Leaves Cloudflare? |
|---|---|---|
| PDF upload & storage | R2 | No |
| Text extraction (Strategy 1) | Workers AI (toMarkdown()) | No |
| OCR extraction (Strategy 2) | Sandbox container | No |
| Chunking & indexing | AI Search | No |
| Retrieval | AI Search | No |
| LLM generation | Workers AI (Nemotron 3 120B) | No |
| Chat state | Durable Objects | No |
| WebSocket transport | Workers | No |
Every single step runs on Cloudflare infrastructure. The original PDFs sit in R2. The extracted text sits in R2. The embeddings and index live in AI Search. The LLM runs on Workers AI. The chat sessions live in Durable Objects.
If you're working with sensitive documents (legal contracts, financial records, medical information), this matters. You're not shipping your data to OpenAI, Anthropic, or any other third party. The documents stay in your Cloudflare account. Privacy is a structural guarantee, not a policy promise.
The Tech Stack
| Layer | Technology |
|---|---|
| Frontend | React 19, TanStack Start (SSR), TanStack Router |
| Runtime | Cloudflare Workers |
| Chat Agent | AIChatAgent (Cloudflare Durable Object) |
| Real-time | WebSocket via useAgent + useAgentChat hooks |
| LLM | @cf/nvidia/nemotron-3-120b-a12b via Workers AI |
| PDF Extraction | env.AI.toMarkdown() + RapidOCR (ONNX Runtime) |
| Object Storage | Cloudflare R2 |
| RAG Pipeline | Cloudflare AI Search |
| OCR Container | Cloudflare Sandbox (Python 3.11 + poppler + PaddleOCR ONNX) |
| UI Components | @cloudflare/kumo + Tailwind CSS v4 |
A Note on the UI
I wanted DocFlare's interface to feel different from the typical "AI chat" look. The design draws from archival documents and dossiers: parchment-colored backgrounds (#F4F1EA), vermillion red accents (#E3342F), zero border radius everywhere, monospace system labels like [AWAITING_COMMAND] and [GENERATING_RESPONSE], and a subtle noise texture overlay.
It's a small detail, but it reinforces what the tool is: a system for interrogating documents. Not another chatbot with rounded corners and a gradient.
What's Next?
DocFlare is currently single-tenant: one user, one document collection. Here are some things I want to build next:
- Multi-tenancy: per-user document namespaces and chat histories
- Document management: delete, re-index, and organize uploaded documents
- Richer citations: link directly to source pages within PDFs
- More file formats: extend beyond PDF to DOCX, plain text, and HTML
Wrapping Up
Building DocFlare was a fun exercise in seeing how far Cloudflare's edge platform can go. The key pieces that came together:
- Two-strategy extraction solves the "PDFs are hard" problem reliably: toMarkdown() for text-layer PDFs, RapidOCR in Sandbox containers for scanned documents.
- AI Search eliminates the entire custom RAG pipeline: no chunking code, no embedding generation, no vector database to manage.
- Edge-native architecture means documents never leave Cloudflare's network: privacy is a structural guarantee, not a policy promise.
The entire project is open source. If you're building on Cloudflare and working with documents, take a look.
If you have questions or want to share how you're building with these tools, feel free to reach out on LinkedIn or X (Twitter). I'd love to hear about your use case.