How to convert PDF to text on Linux (GUI and command line)

This article covers two tools for converting PDF documents to editable text on Linux: Calibre, a graphical application, and pdftotext, a command line utility.
Note that if the PDF consists of images (e.g. scanned pages or pictures), neither of the two tools covered in this article can extract its text.

Convert PDF to text using Calibre (GUI)

Calibre is a free and open source e-book software suite. It supports organizing, displaying, editing and converting e-books in multiple formats. The application runs on Linux, macOS, and Microsoft Windows.
Calibre should be available in the repositories of your Linux distribution, and you should be able to install it from any software store on your system. For example, to install it on Debian, Ubuntu, Linux Mint, Fedora, openSUSE or Arch Linux, use:

  • Debian, Ubuntu or Linux Mint:
sudo apt install calibre
  • Fedora:
sudo dnf install calibre
  • openSUSE:
sudo zypper install calibre
  • Arch Linux:
sudo pacman -S calibre

Calibre can also be installed on Linux from Flathub (this requires setting up Flathub / Flatpak on some Linux distributions).
The application's download page describes another way to install Calibre on Linux. You can also find macOS and Windows binaries there.
Now that Calibre is installed on your system, launch it and click Add books to add the PDF you want to convert to text (or several PDFs, since Calibre supports batch conversion of multiple PDF files to text). From the book list, select the PDF you want to convert (or select multiple PDFs to batch convert them to .txt) and click the Convert books button. In the upper right corner of the conversion window, select TXT as the Output format.

You can adjust many options in this conversion dialog. For example, you can choose to automatically remove the space between paragraphs or insert a blank line between paragraphs (Look & Feel -> Layout). You can also set the character encoding and end-of-line style (system, unix, windows, old_mac), and even format it as markdown.
After completing the configuration, click the OK button to start converting the PDF to text. The converted .txt file can be found in the directory you set as the Calibre library location, in an AuthorName/BookName subfolder; if the author or book name cannot be determined, the subfolder is called “unknown”.
What Calibre lacks in this case is a way to convert only a single page or a page range; currently only whole PDF files can be converted to text.

Convert PDF to text using pdftotext (command line)

pdftotext is a command line utility that converts PDF files to plain text. It has many options, including specifying the range of pages to convert, keeping the original physical layout of the text as closely as possible, setting the end-of-line convention (unix, dos or mac), and working with password-protected PDF files.
pdftotext is part of the poppler / poppler-utils / poppler-tools package (the name depends on the Linux distribution you are using). Install this package as follows:

  • Debian, Ubuntu, Linux Mint, and other Debian / Ubuntu-based Linux distributions:
sudo apt install poppler-utils
  • Fedora:
sudo dnf install poppler-utils
  • openSUSE:
sudo zypper install poppler-tools
  • Arch Linux:
sudo pacman -S poppler

In other Linux distributions, use the package manager to install the poppler / poppler-utils package.
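Before going further, it can help to confirm that the binary actually ended up on your PATH, since the package name differs between distributions. The snippet below is a minimal sketch; check_pdftotext is a hypothetical helper name chosen for this example, not something shipped with poppler.

```shell
# Hypothetical helper (not part of poppler): succeeds if pdftotext is on the PATH.
check_pdftotext() {
    if command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext found"
    else
        echo "pdftotext not found; install the poppler package for your distribution" >&2
        return 1
    fi
}
```

Running check_pdftotext && pdftotext -v should then also print the installed version.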
Now that the package is installed, you can convert a PDF file to plain text while keeping its layout (I recommend using the -layout option to keep the original physical layout, but you can also try without it):

pdftotext -layout input.pdf output.txt

Replace input.pdf with the name of the PDF file, and output.txt with the name of the TXT file you want to generate. You can also add a path before the file name if needed (e.g. ~/Documents/mypdf.pdf). If no output text file is specified, pdftotext names the output after the original PDF file, with the .pdf extension replaced by .txt. The -layout option preserves the PDF layout when converting to text, even for multi-column PDFs.
To convert only a page range of a PDF to text rather than the entire file, use -f (the first page to convert) and -l (the last page to convert), each followed by a page number:

pdftotext -layout -f M -l N input.pdf

Replace M and N with the page numbers of the first and last pages to extract, and input.pdf with the PDF file name.
Want to use mac, dos, or unix end-of-line characters? Use -eol followed by mac, dos or unix. For example, for Unix line endings:

pdftotext -layout -eol unix input.pdf
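The options shown so far can be combined in a single call. As a rough sketch, the function below wraps -layout, -eol, -f and -l and names the output file after the page range so that several extracts from the same PDF do not overwrite each other. The function name extract_pages and the _pM-N.txt naming scheme are assumptions made for this example.

```shell
# Hypothetical helper: extract pages FIRST..LAST of a PDF into a text file
# named after the PDF and the range, e.g. report.pdf -> report_p2-5.txt.
extract_pages() {
    first=$1
    last=$2
    pdf=$3
    out="${pdf%.pdf}_p${first}-${last}.txt"
    # Print the name of the generated file only if pdftotext succeeded.
    pdftotext -layout -eol unix -f "$first" -l "$last" "$pdf" "$out" && echo "$out"
}
```

For example, extract_pages 2 5 report.pdf would write pages 2 through 5 of report.pdf to report_p2-5.txt.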

If you don’t want to insert page breaks between pages, append -nopgbrk:

pdftotext -layout -nopgbrk input.pdf
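Page breaks also matter when checking for the image-only PDFs mentioned at the start of this article: for such a file the output typically contains nothing but page breaks and blank lines, so the .txt file is not necessarily empty. A minimal sketch of a check, assuming a hypothetical helper named pdf_has_text, is to send the text to standard output (by passing - as the output file) and look for any non-whitespace character:

```shell
# Hypothetical helper: succeeds if pdftotext extracts at least one
# non-whitespace character from the PDF given as the first argument.
# Passing "-" as the output file sends the text to standard output.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

It could be used as pdf_has_text input.pdf || echo "input.pdf seems to contain no extractable text". Note that the check also fails if pdftotext itself cannot open the file.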

Want to batch convert all PDF files in a folder to text files? pdftotext does not support batch conversion of PDFs to text (pdftotext *.pdf does not work), but you can use a Bash for loop to convert every PDF file in the folder to a text file:

for file in *.pdf; do pdftotext -layout "$file"; done
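The one-liner above is enough for most folders. If you want it to cope with a folder that contains no PDFs and to report files that fail to convert, it can be expanded along these lines. This is only a sketch; pdf_batch_to_text is a hypothetical function name, and the output is written next to each PDF with the .pdf extension replaced by .txt.

```shell
# Hypothetical helper: convert every *.pdf in a directory (default: current)
# to a .txt file next to it, and report files that fail to convert.
pdf_batch_to_text() {
    dir=${1:-.}
    failed=0
    for file in "$dir"/*.pdf; do
        # With no matches the pattern stays literal, so skip non-existent paths.
        [ -e "$file" ] || continue
        if pdftotext -layout "$file" "${file%.pdf}.txt"; then
            echo "converted: $file"
        else
            echo "failed: $file" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every conversion succeeded.
    [ "$failed" -eq 0 ]
}
```

Call it as pdf_batch_to_text ~/Documents, or without an argument to process the current folder.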

For more options, run man pdftotext or pdftotext --help.
