Python for PDF
Why Python for PDF processing
PDF processing falls under text analytics. Many of the popular text analytics libraries and frameworks are written in Python, which gives Python an edge for this kind of work. Once you extract the useful information from a PDF, you can easily feed that data into any machine learning or natural language processing model.
Common Python Libraries
Here is a list of some Python libraries that can be used to handle PDF files:
- PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data.
- PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.
- tabula-py is a simple Python wrapper of tabula-java that can read tables from a PDF and convert them into pandas DataFrames. It also lets you convert a PDF file into a CSV, TSV, or JSON file.
- Slate is a wrapper implementation of PDFMiner.
- PDFQuery is a light wrapper around pdfminer, lxml and pyquery. It’s designed to reliably extract data from sets of PDFs with as little code as possible.
- xpdf is a Python wrapper for xpdf (currently just the “pdftotext” utility).
Extracting Text from PDF
First, we need to install PyPDF2:
!pip install PyPDF2
The following code extracts plain text from a PDF using PyPDF2:
# module for reading PDF files
import PyPDF2
# open the PDF file in binary mode; the with block closes it when done
# you can find the PDF file with the complete code below
with open('example.pdf', 'rb') as pdfFileObj:
    # PDF reader object (PdfReader replaces PdfFileReader, which no longer works in PyPDF2 3.0)
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    # number of pages in the PDF
    print(len(pdfReader.pages))
    # a page object (pages are indexed from 0)
    pageObj = pdfReader.pages[0]
    # extract the text from the page
    # this prints the text; you can also save it into a string
    print(pageObj.extract_text())
You can read more details here.
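The snippet above reads only the first page. A minimal sketch of collecting the text of every page could look like the following. The helper names extract_all_text and read_pdf_text are hypothetical, and the sketch assumes PyPDF2 2.x or later, where the reader exposes a pages sequence and each page has an extract_text() method.

```python
def extract_all_text(reader):
    """Join the text of every page of a PyPDF2-style reader, one page after another."""
    # `or ""` is defensive: image-only pages may yield no text at all
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported here so extract_all_text stays usable without PyPDF2 installed
    from PyPDF2 import PdfReader
    return extract_all_text(PdfReader(path))
```

Calling read_pdf_text('example.pdf') would then return one string that you can pass on to a text analytics step.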
Reading Table Data from PDF
To work with table data in a PDF, we can use tabula-py:
pip install tabula-py
The following code extracts a table from a PDF using tabula-py:
import tabula
# read the PDF file that contains the table data
# you can find the PDF file with the complete code below
# read_pdf returns the tables it finds as a list of pandas DataFrames
dfs = tabula.read_pdf("offense.pdf")
# take the first table and show its first 5 rows
df = dfs[0]
df.head()
You can also copy a single page that contains a table into a new PDF with PyPDF2:
import PyPDF2
PDFfilename = "Sammamish.pdf"  # filename of your PDF, or the path where it is stored
pfr = PyPDF2.PdfReader(PDFfilename)  # PdfReader object (PdfFileReader before PyPDF2 3.0)
pg4 = pfr.pages[126]  # page 127 (pages are indexed from 0)
writer = PyPDF2.PdfWriter()  # PdfWriter object (PdfFileWriter before PyPDF2 3.0)
# add the page
writer.add_page(pg4)
NewPDFfilename = "allTables.pdf"  # filename or path for the new PDF
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write the page to the new PDF
# the tables are returned as a list of DataFrames; working with them requires pandas
import os
import pandas as pd
import tabula
file = "filename.pdf"
# os.path.join inserts the path separator between the directory and the filename
path = os.path.join('enter your directory path here', file)
df = tabula.read_pdf(path, pages='1', multiple_tables=True)
print(df)
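One detail worth keeping in mind when combining the two libraries: PyPDF2 counts pages from 0 (index 126 is page 127), while tabula's pages argument counts from 1. A small sketch of converting between the two follows; to_tabula_pages is a hypothetical helper name, not part of either library.

```python
def to_tabula_pages(zero_based_indices):
    """Convert 0-based PyPDF2 page indices into a 1-based tabula `pages` string."""
    # Drop duplicates, sort, shift by one, and join in the "1,3,7" form tabula accepts
    unique_sorted = sorted(set(zero_based_indices))
    return ",".join(str(index + 1) for index in unique_sorted)
```

For example, to_tabula_pages([126]) gives "127", which could be passed as pages="127" to tabula.read_pdf on the original file instead of first copying the page into a new PDF.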
If your PDF file contains multiple tables:
df = tabula.read_pdf("offense.pdf", multiple_tables=True)
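With multiple tables, read_pdf hands back a list of DataFrames, so it helps to look at the size of each one before choosing which to work with. A minimal sketch, assuming each item has a pandas-style shape attribute; summarize_tables is a hypothetical helper, not a tabula-py function.

```python
def summarize_tables(tables):
    """Return (index, row_count, column_count) for each table in the list."""
    # DataFrame.shape is a (rows, columns) tuple
    return [(index, table.shape[0], table.shape[1])
            for index, table in enumerate(tables)]
```

Looping over summarize_tables(df) and printing each tuple shows at a glance which entry is the large data table and which are small fragments.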
You can extract information from a specific part of a specific page of the PDF. The area is given as (top, left, bottom, right) in points:
tabula.read_pdf("offense.pdf", area=(126,149,212,462), pages=1)
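The four area values are top, left, bottom, and right, measured in PDF points, where 72 points make one inch. If you measured the region in inches, a small conversion helper can build the tuple. This is only a sketch, and area_from_inches is a hypothetical name used for illustration.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def area_from_inches(top, left, bottom, right):
    """Build a tabula `area` tuple (top, left, bottom, right) in points from inches."""
    if bottom <= top or right <= left:
        raise ValueError("bottom must be below top and right must be past left")
    return tuple(float(value) * POINTS_PER_INCH
                 for value in (top, left, bottom, right))
```

The result can be passed straight to read_pdf, for example area=area_from_inches(1.75, 2.0, 3.0, 6.5).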
If you want the output in JSON format:
tabula.read_pdf("offense.pdf", output_format="json")
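With JSON output, each table comes back as a dictionary rather than a DataFrame. The sketch below assumes the layout tabula-java commonly emits, where a table has a "data" key holding rows of cells and each cell carries its string under "text". Check this against your own output before relying on it. json_table_to_rows is a hypothetical helper.

```python
def json_table_to_rows(table):
    """Flatten one JSON table into a list of rows, each a list of cell strings."""
    # Assumed layout: {"data": [[{"text": "..."}, ...], ...], ...}
    return [[cell.get("text", "") for cell in row]
            for row in table.get("data", [])]
```

Applied to each table in the returned list, this gives plain nested lists that are easy to inspect or write out with the csv module.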
Export PDF into Excel
You can use the code below to convert the PDF data into a CSV file that Excel can open. convert_into writes CSV, TSV, or JSON rather than a native .xlsx workbook:
tabula.convert_into("offense.pdf", "offense_testing.csv", output_format="csv")
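tabula-py documents CSV, TSV, and JSON as the output formats of convert_into, so for a real .xlsx workbook one option is to go through pandas. The sketch below writes each extracted table to its own sheet. It assumes pandas and openpyxl are installed, and tables_to_excel and the table_N sheet names are hypothetical choices made for illustration.

```python
def tables_to_excel(tables, out_path):
    """Write each DataFrame in `tables` to its own sheet of an .xlsx workbook."""
    # Imported here so the dependency is only needed when exporting
    import pandas as pd

    sheet_names = []
    with pd.ExcelWriter(out_path) as writer:
        for number, table in enumerate(tables, start=1):
            name = f"table_{number}"
            # index=False keeps the pandas row index out of the sheet
            table.to_excel(writer, sheet_name=name, index=False)
            sheet_names.append(name)
    return sheet_names
```

A typical call would be tables_to_excel(tabula.read_pdf("offense.pdf", pages="all"), "offense_testing.xlsx").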
Further Readings
You can find the complete code and PDF files at this GitHub link.
- This question on StackOverflow also has a lot of useful links in its answers: How to extract table as text from the PDF using Python?
- Working with PDF files in Python using PyPDF2
- Working with PDF and Word Documents
- 3 WAYS TO SCRAPE TABLES FROM PDFS WITH PYTHON
- How to Convert a PDF to Excel
Article source: towardsdatascience.com.