What you get
A plain UTF-8 .txt file containing the text from the PDF, with page markers. The text is extracted with Google Cloud Vision OCR, so it works whether the PDF contains real text or scanned images of text; both produce the same kind of output.
Why OCR a PDF?
- Searchable. Pull text out so you can grep it, paste it, feed it into a spreadsheet, or run it through a script.
- Editable. A scanned PDF is just an image, so opening it in Word won't let you change a single word. OCR turns it into text first.
- Accessible. Screen readers can't read images; they can read text.
- Re-formattable. Once it's text, you can drop it into anything — DOCX, HTML, Markdown, a CMS.
How it works
- Open the converter. Go to the Formatly converter — no signup required.
- Drop your PDF. Drag and drop one or more PDFs into the upload box (up to five files, 20 MB each). Both real text PDFs and scanned image-PDFs work.
- Pick OCR (Extract Text) from the dropdown. Each page is rendered to an image and sent to Google Cloud Vision — typical processing time is 1 to 3 seconds per page.
- Convert and download the .txt. Click Convert; a download link appears for a UTF-8 text file, page-delimited with --- Page N --- markers for easy splitting.
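Once you have the .txt, splitting it by page takes only a few lines. Here is a minimal Python sketch, assuming each --- Page N --- marker sits on a line of its own (the exact spacing in the real output may differ, so adjust the pattern if needed). The names split_pages and pages_mentioning are illustrative and not part of the converter.

```python
import re

# Assumed marker shape: a line consisting only of "--- Page N ---".
# The optional \r tolerates Windows line endings.
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---[ \t]*\r?$", re.MULTILINE)


def split_pages(text: str) -> dict[int, str]:
    """Map each page number to the text that follows its marker."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(text))
    for i, match in enumerate(matches):
        start = match.end()
        # A page runs until the next marker, or to the end of the file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[start:end].strip()
    return pages


def pages_mentioning(text: str, term: str) -> list[int]:
    """Return the page numbers whose text contains term, ignoring case."""
    needle = term.lower()
    return [n for n, body in split_pages(text).items() if needle in body.lower()]


sample = "--- Page 1 ---\nInvoice 2041\n\n--- Page 2 ---\nTotal due: 90 EUR\n"
print(split_pages(sample))
print(pages_mentioning(sample, "total"))
```

To run it on a real download, read the file with open("output.txt", encoding="utf-8").read() and pass the result to split_pages; the filename here is a placeholder.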
What works well
- Standard text PDFs — extracted nearly perfectly.
- Clean, high-contrast scans (300+ DPI).
- Latin-script European languages (English, French, Spanish, German, Italian, Portuguese, Dutch).
- Multi-column layouts — Vision is good at reading order.
What doesn't
- Faint, skewed, or low-resolution scans — accuracy drops fast.
- Handwritten pages — partial recognition only.
- Heavily designed pages with text on textured backgrounds.
- Password-protected PDFs — remove the password first.
Tips
- If the source is a paper document, scan at 300 DPI or higher with good lighting.
- For long PDFs, expect processing to take a minute or more: each page is rendered and then sent to Vision, at roughly 1 to 3 seconds per page.
- If you only need the text from one page, extract that page first in Preview / Adobe Reader, then upload the smaller file.
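Before uploading, a quick local check against the limits mentioned earlier (up to five files, 20 MB each) can save a failed attempt. The sketch below is one way to do it in Python. It treats "20 MB" as 20 × 1024 × 1024 bytes, which is an assumption about how the converter counts, and the preflight function name is hypothetical. The header check is only a rough sanity test, and it does not detect password protection.

```python
from pathlib import Path

MAX_FILES = 5                 # per-upload file limit stated on this page
MAX_BYTES = 20 * 1024 * 1024  # "20 MB" read as 20 MiB: an assumption


def preflight(paths: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks fine."""
    problems: list[str] = []
    if len(paths) > MAX_FILES:
        problems.append(f"{len(paths)} files selected; the limit is {MAX_FILES}")
    for name in paths:
        path = Path(name)
        if not path.is_file():
            problems.append(f"{name}: not found")
            continue
        size = path.stat().st_size
        if size > MAX_BYTES:
            problems.append(f"{name}: {size / 1_048_576:.1f} MB exceeds 20 MB")
        # PDF files normally begin with the bytes "%PDF-".
        with path.open("rb") as handle:
            if handle.read(5) != b"%PDF-":
                problems.append(f"{name}: does not start with a PDF header")
    return problems
```

Calling preflight(["scan1.pdf", "scan2.pdf"]) returns a list of messages; if the list is empty, the files fit the stated limits.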
FAQ
Does this work on scanned PDFs? Yes — that's exactly what OCR is for. A scanned PDF is just a stack of images, so plain copy-paste or text extraction won't find anything. The converter renders each page to an image, runs it through Google Cloud Vision OCR, and assembles the recognized text into a single .txt file with per-page delimiters.
Does the PDF to text output preserve formatting? No. Output is plain UTF-8 text, page-delimited with --- Page N --- markers. Bold, italics, fonts, columns, and tables are not preserved. If you need a formatted, editable version of the PDF, convert to DOCX instead — that's a different tool because the OCR pipeline is text-only.
What languages does the PDF OCR support? English by default, plus most Latin-script European languages (French, Spanish, German, Italian, Portuguese, Dutch). CJK and right-to-left scripts (Arabic, Hebrew) are available on request via the contact form.
How long does PDF OCR take? Roughly 1 to 3 seconds per page. A 50-page document typically takes about a minute to render and OCR. If you only need a few pages, extract them first in Preview or Adobe Reader and upload the smaller file — that's much faster than processing the full document.
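For a rough estimate on your own document, multiply the page count by the 1 to 3 seconds per page quoted above. This small Python sketch does that; it covers only the per-page OCR figure and assumes nothing about upload or queue time. The function name estimate_ocr_seconds is illustrative.

```python
def estimate_ocr_seconds(
    pages: int, low: float = 1.0, high: float = 3.0
) -> tuple[float, float]:
    """Return (best case, worst case) OCR time in seconds for a page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * low, pages * high


best, worst = estimate_ocr_seconds(50)
print(f"50 pages: about {best:.0f} to {worst:.0f} seconds")
# prints "50 pages: about 50 to 150 seconds"
```

The "about a minute" figure for 50 pages sits near the fast end of that band, so dense or low-quality scans may take longer.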
What happens to my PDF after the OCR runs? The uploaded PDF and the resulting text are auto-deleted from our servers after one hour. We don't store or analyze your document beyond completing the OCR pass. See Security for the full data-handling policy.
Related
- Image → Text (OCR): same OCR, image source
- PDF → DOCX: for layout-preserving edits
- DOCX → PDF: save the cleaned-up text back as PDF
- How OCR works
- Why scanned PDFs look bad
- How to extract text from a screenshot