Extracting Text from PDFs with Python

Full notebook found here.

Extracting text from PDF files is a common task when working with documents, and Python's ecosystem of libraries makes it relatively straightforward. In this post, we'll walk through converting a PDF file to text using the pdfplumber library.

Setting Up the Environment

Before we dive into the code, we need to ensure that we have the necessary library installed. pdfplumber is a Python library that provides a simple way to extract text from PDFs. If you don't have it installed, you can install it using pip:

pip install pdfplumber
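
If you're not sure whether the install worked in the environment your notebook is using, a quick check like the one below can help. This is only a minimal sketch: is_installed is a helper name made up for this post, and it relies on the standard library's importlib rather than anything from pdfplumber itself.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be found for import."""
    # find_spec returns None when a top-level module cannot be located
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("pdfplumber"):
        print("pdfplumber is available")
    else:
        print("pdfplumber is missing; run: pip install pdfplumber")
```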

Defining the Text Extraction Function

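import pdfplumber

def extract_text_from_pdf(pdf_path):
    # Open the PDF; the with statement closes it when the block ends
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            # extract_text() can return None for a page with no text in some versions
            text += page.extract_text() or ''
    return text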

Here, we define a function extract_text_from_pdf that takes the path to a PDF file as an argument. Inside the function, we use pdfplumber.open() to open the PDF file. The with statement ensures that the file is properly closed after we're done with it.

We then iterate over each page in the PDF using a for loop. The extract_text() method is called on each page to extract the text, which is then appended to the text variable.

Finally, the function returns the accumulated text extracted from all the pages.
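
One thing to watch for: when page text is appended directly, the last line of one page usually runs straight into the first line of the next. A possible variation, sketched below, collects the text of each page in a list and joins the pieces with a newline, skipping pages that yield no text. The names combine_page_text and extract_text_with_page_breaks are hypothetical helpers for this post, not part of pdfplumber; the only pdfplumber calls used are the same open(), pages and extract_text() seen above.

```python
def combine_page_text(pages, separator="\n"):
    """Join the text of page-like objects, skipping pages without text."""
    parts = []
    for page in pages:
        page_text = page.extract_text()
        if page_text:  # skips both None and empty strings
            parts.append(page_text)
    return separator.join(parts)


def extract_text_with_page_breaks(pdf_path):
    """Hypothetical variant of extract_text_from_pdf built on the helper above."""
    import pdfplumber  # imported here so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        return combine_page_text(pdf.pages)
```

Keeping the joining logic in its own function means it can be tried out on any object that has an extract_text() method, without needing a real PDF on hand.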

Specifying the PDF Path and Extracting Text

pdf_path = r"C:\Users\SalomeGrasland\Desktop\Blogging\This is a sample PDF.pdf"

# Variable containing the extracted text
extracted_text = extract_text_from_pdf(pdf_path)

Here, we specify the path to the PDF file from which we want to extract text. We then call the extract_text_from_pdf function with this path and store the extracted text in the extracted_text variable.
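
A mistyped path is probably the most common reason this step fails, so it can be worth validating the path before handing it to the extraction function. The sketch below uses pathlib from the standard library; checked_pdf_path is a hypothetical helper name, and the checks it performs (a .pdf extension and an existing file) are assumptions about what is useful here rather than anything pdfplumber requires.

```python
from pathlib import Path


def checked_pdf_path(path_str):
    """Return a Path for an existing .pdf file, or raise a descriptive error."""
    path = Path(path_str).expanduser()  # expands a leading ~ if present
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")
    return path
```

With this in place, the call could look like extract_text_from_pdf(str(checked_pdf_path(pdf_path))), which fails early with a clear message if the file isn't where we expect it.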

Saving the Extracted Text to a File

with open('PDFtoText.txt', 'w', encoding='utf-8') as file:
    file.write(extracted_text)

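Finally, we open a new text file (PDFtoText.txt) in write mode ('w') and use the write() method to save the extracted text to it. Note that write mode overwrites the file if it already exists. The encoding='utf-8' argument makes sure the file is written as UTF-8 rather than the platform's default encoding, so accented letters and other non-ASCII characters are saved correctly.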
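
If you'd rather not hard-code the output filename, one option is to derive it from the PDF's own name. The sketch below is an illustration under that assumption: save_text_next_to_pdf is a hypothetical helper that swaps the .pdf extension for .txt using pathlib and writes the text with the same UTF-8 encoding.

```python
from pathlib import Path


def save_text_next_to_pdf(pdf_path, text):
    """Write text to a .txt file that shares the PDF's name and folder."""
    out_path = Path(pdf_path).with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")  # overwrites an existing file
    return out_path
```

Calling save_text_next_to_pdf(pdf_path, extracted_text) would then place the text file in the same folder as the PDF, under the same name with a .txt extension.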

Conclusion

In this blog, we've seen how to use the pdfplumber library in Python to extract text from a PDF file and save it to a text file. This can be particularly useful for processing and analyzing large volumes of PDF documents in various data analysis or natural language processing tasks.
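
To give a rough idea of how this scales to many documents, the sketch below loops over every PDF in a folder and saves one text file per PDF. It is a sketch under assumptions: convert_folder is a hypothetical helper, the extraction function is passed in as an argument (for example the extract_text_from_pdf function defined earlier), and only files matching *.pdf directly inside the folder are picked up.

```python
from pathlib import Path


def convert_folder(folder, extract):
    """Run `extract` on every PDF in `folder` and save each result as .txt."""
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = extract(str(pdf_path))  # extract is any function: path -> text
        out_path = pdf_path.with_suffix(".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

A call such as convert_folder("pdfs", extract_text_from_pdf), where "pdfs" is a placeholder folder name, would return the list of text files it wrote.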

With just a few lines of code, we can automate the extraction of text from PDFs, making our workflow more efficient and streamlined.

Check out the full notebook here.

Author: Salome Grasland
© 2024 The Information Lab