PDF files contain unstructured data, and turning that data into something structured and meaningful is a challenging task. They often hold useful information that can benefit a predictive or NLP model. There are many Python libraries for working with PDF files: they can extract text, tables, images, and more, and the results are often used for text analysis. In this tutorial, you will learn how to extract text from a PDF file using Python.
Step By Step Guide to Extract Text
Step 1: Import the necessary libraries
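Many libraries are available for extracting text from a PDF file. For this demonstration, I am using PyPDF2. Note that the code in this tutorial uses the older PyPDF2 names (PdfFileReader, getPage(), extractText()); PyPDF2 3.0 and later replaced them with PdfReader, reader.pages and extract_text(), so use an earlier release to follow along as written.
# import the library
import PyPDF2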
Step 2: Open the PDF File
Now open the PDF file in 'rb' (read binary) mode using Python's built-in open() function. PyPDF2 will read from this file object in the next step.
# open the pdf file
pdf_file = open('data/FOMC_report.pdf', 'rb')
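As a side note, opening the file inside a with block closes it automatically when you are done. The sketch below is not part of the original tutorial: it opens a file in 'rb' mode and checks whether it starts with the %PDF- marker that PDF files normally begin with. The helper name looks_like_pdf is hypothetical.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker."""
    # 'rb' gives raw bytes, so the comparison is against a bytes literal
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(5)
    return header == b"%PDF-"

# Example call (the path is the one used in this tutorial):
# looks_like_pdf("data/FOMC_report.pdf")
```

A quick check like this can catch a wrong path or a mislabeled file before you hand it to PyPDF2.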
Step 3: Read PDF and Check for Encryption
After opening the file, read it using the PyPDF2.PdfFileReader() class and check for encryption using the getIsEncrypted() method. This check matters because the text of an encrypted PDF cannot be extracted until the file is decrypted. Then use numPages to find out how many pages the file has. Use the code below.
# read pdf
read_pdf = PyPDF2.PdfFileReader(pdf_file)
# check whether the pdf is encrypted or not
read_pdf.getIsEncrypted()
# number of pages
read_pdf.numPages
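The two checks above can be combined into one small helper. This is a minimal sketch, assuming the older PyPDF2 API used in this tutorial, where decrypt(password) returns 0 on failure. The function name inspect_pdf and its password parameter are hypothetical.

```python
def inspect_pdf(read_pdf, password=None):
    """Return (is_encrypted, page_count); page_count is None if the file stays locked."""
    encrypted = read_pdf.getIsEncrypted()
    if encrypted:
        # Without a working password the page count cannot be read reliably
        if password is None or not read_pdf.decrypt(password):
            return True, None
    return encrypted, read_pdf.numPages

# Example call with the reader created above:
# encrypted, pages = inspect_pdf(read_pdf)
```

Returning None for the page count makes it clear to the caller that the file still needs a password.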
Step 4: Extract the text
Once you know the number of pages, you can extract text using the getPage() and extractText() methods. The getPage() method returns the page object at the given index, and extractText() extracts the text from that page. Page indexes start at 0, so page number 1 is getPage(0). For example, to extract text from page number 1, use the following code.
# extract text from page number 1
page1 = read_pdf.getPage(0)
page1.extractText()
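To go beyond a single page, you can loop over every page index from 0 to numPages - 1. The following is a rough sketch, assuming the same older PyPDF2 reader object as above; the function name extract_all_text is hypothetical.

```python
def extract_all_text(read_pdf):
    """Return a list with the extracted text of each page, in page order."""
    pages_text = []
    for index in range(read_pdf.numPages):
        page = read_pdf.getPage(index)  # indexes are zero-based
        pages_text.append(page.extractText())
    return pages_text

# Example call with the reader created earlier:
# all_pages = extract_all_text(read_pdf)
# first_page_text = all_pages[0]
```

Keeping one string per page, instead of joining everything, lets you trace any piece of text back to its page.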
In the output, each line break appears as \n. You can split the text into lines using the split('\n') method, which converts the extracted text into a list.
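As an illustration, here is a small sketch that splits extracted text on the newline character and drops empty lines. The helper name text_to_lines is hypothetical, and the cleanup rules (strip whitespace, skip blanks) are an assumption about what you usually want, not something PyPDF2 does for you.

```python
def text_to_lines(text):
    """Split text on newlines, trim each line, and drop blank lines."""
    lines = []
    for line in text.split("\n"):
        cleaned = line.strip()
        if cleaned:
            lines.append(cleaned)
    return lines

# Example call on text extracted from a page:
# lines = text_to_lines(page1.extractText())
```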
Converting unstructured text from a PDF into structured data is useful if you want to apply Natural Language Processing (NLP). After extracting the text, you can go on to tasks such as text preprocessing, word anagrams, etc.
Hope this post has answered your question on how to extract text from a PDF file using Python. Please contact us if you have any questions. We are always ready to help you.
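As one possible first preprocessing step, you could lowercase the extracted text and count how often each word occurs. This is only a sketch using the standard library; the function name word_counts and the very simple tokenization rule (runs of letters and apostrophes) are assumptions, and a real NLP pipeline would likely use a dedicated tokenizer.

```python
import re
from collections import Counter

def word_counts(text):
    """Return a Counter mapping each lowercase word to its frequency."""
    # Treat any run of letters or apostrophes as one word
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Example call on text extracted from a page:
# counts = word_counts(page1.extractText())
# counts.most_common(10)
```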