PDF Text Extractor and Search Tool - Extract Text from PDF with Page Range and In-Browser Search
This tool extracts the embedded text layer from PDF files entirely in your browser, with page-range filtering, incremental search across all pages, and one-click TXT download. Unlike online converters that upload your documents to remote servers, this tool runs 100% locally — PDFs never leave your device, making it safe for confidential reports, contracts, papers, and personal records. It is ideal for quickly grepping a long PDF, copying a specific section to plain text, or feeding extracted content into other tools.
All processing is performed entirely in your browser using client-side JavaScript. No files are uploaded to any server. Your PDF documents never leave your device.
- This tool is provided "AS IS" without any warranties of any kind.
- The author accepts no responsibility for data loss, corrupted output, or any issues arising from the use of this tool.
- Extraction works only for PDFs with embedded text. Scanned/image-only PDFs require OCR, which is not supported.
- Encrypted or password-protected PDFs cannot be opened.
- Very large PDFs may require significant browser memory. Close other tabs if you experience slowness.
- Always keep backups of your original PDF files.
- By using this tool, you accept full responsibility for any outcomes.
This tool uses client-side JavaScript for all processing. No data is transmitted to servers and no files are uploaded; everything happens locally in your browser. Once loaded, the tool continues to work even without an internet connection. For more details, please refer to our Web Tools Disclaimer.
Extraction Settings
Page range: supports page numbers (1, 3, 5), ranges (1-5, 3-end), and keywords (odd, even, first, last, all). Combine them with commas, for example: 1-3, 5, odd. Leave the field empty to extract all pages.
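The range syntax above can be sketched as a small parser. This is a minimal illustration only, not the tool's actual code: the function name parsePageRange, its error messages, and the choice to silently drop out-of-range pages are all assumptions.

```javascript
// Hypothetical helper: turn a page-range expression into a sorted list of
// 1-based page numbers. The tool's real parser may behave differently.
function parsePageRange(expr, pageCount) {
  const all = Array.from({ length: pageCount }, (_, i) => i + 1);
  const text = expr.trim().toLowerCase();
  if (text === "") return all; // an empty field means every page

  const selected = new Set();
  for (const rawToken of text.split(",")) {
    const token = rawToken.trim();
    if (token === "") continue; // tolerate stray commas
    let match;
    if (token === "all") {
      all.forEach((p) => selected.add(p));
    } else if (token === "odd" || token === "even") {
      const remainder = token === "odd" ? 1 : 0;
      all.filter((p) => p % 2 === remainder).forEach((p) => selected.add(p));
    } else if (token === "first") {
      selected.add(1);
    } else if (token === "last") {
      selected.add(pageCount);
    } else if ((match = /^(\d+)\s*-\s*(\d+|end)$/.exec(token))) {
      const start = Number(match[1]);
      const end = match[2] === "end" ? pageCount : Number(match[2]);
      if (start > end) throw new Error(`Invalid range: ${token}`);
      // Clamp to the document so huge ranges do not loop needlessly.
      const from = Math.max(start, 1);
      const to = Math.min(end, pageCount);
      for (let p = from; p <= to; p++) selected.add(p);
    } else if (/^\d+$/.test(token)) {
      selected.add(Number(token));
    } else {
      throw new Error(`Unrecognized page token: ${token}`);
    }
  }
  // Drop pages outside the document and return them in ascending order.
  return [...selected]
    .filter((p) => p >= 1 && p <= pageCount)
    .sort((a, b) => a - b);
}
```

For a 7-page document, parsePageRange("1-3, 5, odd", 7) would select pages 1, 2, 3, 5, and 7. Using a Set keeps overlapping tokens from producing duplicate pages.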
Features
- Embedded Text Extraction: Extracts the existing text layer of a PDF using pdf.js's getTextContent() API for accurate, lossless text retrieval.
- Smart Page Range Filtering: Use intelligent range expressions (1-3, 5, odd, even, last, 3-end) to extract just the pages you need, saving time and memory on large PDFs.
- Incremental In-Page Search: Type to highlight all matching occurrences across every extracted page in real time, with an "n of N" counter and previous/next navigation.
- Case-Sensitive & Whole-Word Options: Refine your search with case-sensitivity and whole-word toggles for precise lookups, with friendly fallback to literal matching when needed.
- Active-Hit Auto-Scroll: The current match is highlighted distinctly in orange and automatically scrolled into view when navigating with the arrow buttons or Enter / Shift+Enter.
- Per-Page & Combined TXT Download: Download the text of any individual page or all extracted pages as a single TXT file with clear page-number headers (--- Page N ---).
- Useful Statistics: See total characters, Latin words, CJK character count, non-blank lines, and number of extracted pages at a glance.
- "No Text Layer" Detection: Pages without extractable text are clearly flagged so you know which parts of the PDF would require OCR (not supported by this tool).
- Per-Page Progress Bar: Watch the extraction progress in real time as each page is processed, with descriptive status messages.
- Drag-and-Drop or Click to Browse: Open a PDF with one drag, with full keyboard accessibility (Enter / Space) for the drop zone.
- 100% Client-Side: All processing happens in your browser. No PDF data is ever transmitted to any server. Safe for confidential documents.
- Works Offline: Once the page is loaded, the tool functions without an internet connection.
- Keyboard Shortcuts: Enter in the page-range field to start extraction; Enter in the search field to jump to the next hit; Shift+Enter for the previous hit; Escape to clear the search.
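The statistics listed above could be computed roughly as follows. This is a sketch under stated assumptions: the function name textStats is hypothetical, and the exact definitions of a "Latin word" (a run of Latin-script letters, optionally joined by apostrophes or hyphens) and a "CJK character" (Han, Hiragana, Katakana, or Hangul) are guesses, since the page does not specify them.

```javascript
// Hypothetical helper: compute simple statistics for extracted text.
// The counting rules here are assumptions, not the tool's documented rules.
function textStats(text) {
  // Runs of Latin-script letters, allowing inner apostrophes and hyphens.
  const latinWords =
    text.match(/\p{Script=Latin}+(?:['’-]\p{Script=Latin}+)*/gu) || [];
  // Individual characters from common CJK scripts.
  const cjkChars =
    text.match(
      /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu
    ) || [];
  // Lines that contain something other than whitespace.
  const nonBlankLines = text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "");
  return {
    characters: [...text].length, // counts code points, not UTF-16 units
    latinWords: latinWords.length,
    cjkChars: cjkChars.length,
    nonBlankLines: nonBlankLines.length,
  };
}
```

Counting CJK characters separately from Latin words is useful because CJK text has no spaces to split on, so a single word count would under-report its length.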
How to Use
- Load a PDF file by dragging it into the drop zone or clicking to browse your files. Encrypted PDFs cannot be opened.
- Optionally specify a page range (for example: 1-5; odd; 3-end; or 1-3, 7, last). Leave the field empty to extract every page.
- Click "Extract Text" (or press Enter in the page-range field). Watch per-page progress while the text is recovered.
- Review the statistics — characters, Latin words, CJK characters, non-blank lines, and pages.
- Search inside the extracted text by typing in the search box. Hits are highlighted in yellow; the active hit is highlighted in orange and scrolled into view. Use the ◀ / ▶ buttons or Enter / Shift+Enter to navigate.
- Download the text: click "Download TXT" on any page header to save just that page, or "Download Combined TXT" to save all extracted pages with page-number headers in a single file.
- Click "Clear All" to reset and start with a new PDF.
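The combined download in the steps above can be approximated with the sketch below. It assumes each extracted page is held as an object with pageNumber and text fields; that shape, the function names, and the blank line between pages are illustrative choices. The header and placeholder strings are the ones quoted on this page.

```javascript
// Placeholder text copied verbatim from the notes on this page.
const NO_TEXT_PLACEHOLDER = "(no text layer on this page — OCR not supported)";

// Hypothetical helper: join pages into one string with page-number headers.
function buildCombinedText(pages) {
  return pages
    .map(({ pageNumber, text }) => {
      const body = text.trim() === "" ? NO_TEXT_PLACEHOLDER : text;
      return `--- Page ${pageNumber} ---\n${body}`;
    })
    .join("\n\n");
}

// Browser-only helper: offer a string as a TXT download without any upload.
// Blob, URL.createObjectURL, and the anchor "download" attribute are
// standard web APIs; the function itself is an illustration.
function downloadText(filename, content) {
  const blob = new Blob([content], { type: "text/plain;charset=utf-8" });
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.download = filename;
  link.click();
  // Release the object URL once the click has been dispatched.
  setTimeout(() => URL.revokeObjectURL(url), 0);
}
```

Because the file is built from an in-memory Blob, the text goes straight from the page to the local download folder, which fits the tool's no-upload design.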
Important Notes
- Extraction works only for PDFs with embedded text. Scanned/image-only PDFs require OCR, which is not supported.
- Pages without an extractable text layer are flagged with a yellow "No text layer detected" notice. The combined download replaces such pages with the placeholder line (no text layer on this page — OCR not supported).
- Reading order in the output approximates the visual order of the original PDF. Highly multi-column or design-heavy layouts may not preserve the exact human reading order; tables and complex structures are flattened to plain text.
- Encrypted or password-protected PDFs cannot be opened by this tool.
- Very large PDFs (hundreds of pages or many embedded fonts) may use significant browser memory and CPU. If the page becomes slow, try a narrower page range.
- The "Whole word" option treats sequences of letters, digits, and underscores as words. For CJK languages, which have no spaces between words, a literal match (Whole word disabled) is usually what you want.
- All processing happens entirely in your browser. No PDF data is ever sent to any server. This makes it safe for confidential documents such as contracts, financial statements, medical records, and legal documents.
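The search options described in these notes can be illustrated with a small matcher. This is a sketch, not the tool's implementation: the function names are hypothetical, and treating any Unicode letter or digit as a word character is an assumption that is consistent with the CJK caveat above.

```javascript
// Escape regex metacharacters so the query is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a global regex for the query and options.
function buildSearchRegex(query, { caseSensitive = false, wholeWord = false } = {}) {
  if (query === "") return null;
  let source = escapeRegExp(query);
  if (wholeWord) {
    // Require that no letter, digit, or underscore touches either side.
    // Lookarounds are used instead of \b so queries such as "C++" work.
    source = `(?<![\\p{L}\\p{N}_])${source}(?![\\p{L}\\p{N}_])`;
  }
  return new RegExp(source, caseSensitive ? "gu" : "giu");
}

// Return the start and end offsets of every hit, in document order.
function findMatches(text, query, options) {
  const regex = buildSearchRegex(query, options);
  if (!regex) return [];
  return [...text.matchAll(regex)].map((m) => ({
    start: m.index,
    end: m.index + m[0].length,
  }));
}
```

With these definitions, searching "日本語" for "日本" with Whole word enabled finds nothing, because the following character counts as a letter. Disabling Whole word finds the match, as the note above recommends for CJK text.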
About PDF Text Extraction
A PDF file may contain text in two fundamentally different ways. Text-layer PDFs store the document's text as actual character data with positioning information — this is what this tool reads using pdf.js's getTextContent() API, which returns each text item along with its position and a hasEOL flag that hints at line breaks. Image-only PDFs (typically produced by scanners or "Save as PDF" of an image) contain only rasterized page images with no underlying character data. To recover text from image-only PDFs, optical character recognition (OCR) is needed, which is intentionally out of scope for this tool to keep it lightweight and fully client-side.
If you are unsure which type your PDF is, simply load it. Pages with no extractable text will be clearly marked, and you can switch to an OCR-capable tool for those pages when needed.
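As a rough illustration of the text-layer handling described above, the sketch below joins the items returned by getTextContent() into plain text, using hasEOL to place line breaks, and treats a page whose joined text is empty as having no text layer. The helper names are hypothetical and the tool's real joining logic may be more elaborate (for example, using item positions to insert spaces).

```javascript
// Hypothetical helper: join pdf.js text items into plain text.
// Each item is expected to carry a "str" string and a "hasEOL" flag.
function itemsToText(items) {
  let text = "";
  for (const item of items) {
    if (typeof item.str !== "string") continue; // skip non-text entries
    text += item.str;
    if (item.hasEOL) text += "\n"; // hasEOL hints at the end of a line
  }
  return text;
}

// A page with only whitespace (or no items) is treated as image-only.
function hasTextLayer(items) {
  return itemsToText(items).trim() !== "";
}

// Browser usage with pdf.js: pdfDocument is assumed to be the object
// resolved by pdf.js when a PDF is loaded.
async function extractPageText(pdfDocument, pageNumber) {
  const page = await pdfDocument.getPage(pageNumber);
  const content = await page.getTextContent();
  return itemsToText(content.items);
}
```

Keeping itemsToText free of pdf.js calls means it can be exercised with plain arrays, while extractPageText is the only part that needs a loaded document.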
Third-Party Libraries:
- PDF.js (v4.10.38) - License: Apache 2.0 (Mozilla Foundation) - https://github.com/mozilla/pdf.js