Using Python to Retrieve Text from a PDF File

Introduction

When working with reports and research papers, extracting text from PDF files is a common task.

Doing it manually for every file with desktop software or web tools is time-consuming.

In this lesson, we'll look at a few lines of Python code that may be used to extract text from PDF files.

The PyPDF2 Python library is required to continue with this lesson.

If you don't already have it installed, open "Command Prompt" (on Windows) and run the following command:

pip install PyPDF2
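To confirm that the installation worked before moving on, you can ask Python whether the package is visible to the interpreter. This is a minimal sketch that uses only the standard library; the helper name is_installed is made up for illustration and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when no top-level module of that name exists
    return importlib.util.find_spec(module_name) is not None


# Prints True once "pip install PyPDF2" has succeeded in this environment
print(is_installed('PyPDF2'))
```

If this prints False, the package was most likely installed for a different Python interpreter than the one running the script.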

Sample PDF file

Here is the PDF file we will use in this tutorial:

https://pyshark.com/wp-content/uploads/2022/10/sample_file.pdf

This PDF file will reside in the same folder as the main.py with our code.

The project folder therefore contains two files: main.py and sample_file.pdf.
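With main.py and sample_file.pdf side by side, the path to the PDF can be built relative to the script rather than the current working directory, so the script also works when launched from another folder. A minimal sketch, assuming a hypothetical helper named sibling_path:

```python
from pathlib import Path


def sibling_path(script_path, file_name):
    """Return the path of file_name in the same folder as script_path."""
    # resolve() makes the script path absolute; parent is its folder
    return Path(script_path).resolve().parent / file_name


# Inside main.py this would typically be used as:
#   pdf_file_name = sibling_path(__file__, 'sample_file.pdf')
```

The returned Path object can be passed to open() in place of a plain string.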

Extract text from PDF using Python

Now we have everything we need and can extract text from the PDF file using Python:

from PyPDF2 import PdfReader

#Define path to PDF file
pdf_file_name = 'sample_file.pdf'

#Open the file in binary mode for reading
with open(pdf_file_name, 'rb') as pdf_file:
    #Read the PDF file
    pdf_reader = PdfReader(pdf_file)
    #Iterate over each page in the PDF file
    for page in pdf_reader.pages:
        #Extract text from the given PDF file page
        text = page.extract_text()
        #Print text
        print(text)

Running the script prints the text of each page:

Sample Page 1
Sample Page 2
Sample Page 3
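If you would rather collect the text than print it, the same loop can be wrapped in a function that returns one string. The sketch below is only one way to do it: the names pages_to_text and extract_pdf_text are made up for illustration, and it assumes a PyPDF2 release that provides PdfReader, reader.pages and page.extract_text().

```python
def pages_to_text(pages, separator='\n'):
    """Join the text of every page into a single string."""
    # "or ''" keeps the join safe if a page yields no text
    return separator.join((page.extract_text() or '') for page in pages)


def extract_pdf_text(pdf_path):
    """Open the PDF at pdf_path and return the text of all its pages."""
    # Assumption: imported here so pages_to_text stays usable without PyPDF2
    from PyPDF2 import PdfReader

    with open(pdf_path, 'rb') as pdf_file:
        reader = PdfReader(pdf_file)
        # The pages are read while the file is still open
        return pages_to_text(reader.pages)
```

Calling extract_pdf_text('sample_file.pdf') would then return the text of all pages as a single string that can be searched or written to a file.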

Conclusion

In this tutorial, we used Python and PyPDF2 to extract text from a PDF file.

If you have any questions or suggestions for edits, feel free to leave them in the comments below. You can also take a look at my other tutorials on Python Programming.