tabula-py: Extract table from PDF into Python DataFrame
As of Oct. 2019, I have launched a documentation site and a Google Colab notebook for tabula-py. The FAQ is a good place to start if you want more accurate extraction.
tabula-py is a simple wrapper of tabula-java that lets you extract tables into a DataFrame or JSON with Python. You can also extract tables from a PDF into a CSV, TSV, or JSON file.
tabula is a tool for extracting tables from PDFs. It is GUI-based software, whereas tabula-java is a command-line (CUI) tool. Although Ruby, R, and Node.js bindings of tabula-java existed, there was no Python binding before tabula-py. I believe PyData is a great ecosystem for data analysis, and that’s why I created tabula-py. If you are familiar with R, I highly recommend tabulizer, which has the richest bindings, including a rich GUI.
You can install tabula-py via pip:
pip install tabula-py
With tabula-py, you can get a DataFrame with the read_pdf() method.
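A minimal sketch of such a call, assuming tabula-py and a Java runtime are installed. The helper name read_tables and the file name data.pdf are hypothetical. Passing multiple_tables=True explicitly requests a list with one DataFrame per detected table.

```python
def read_tables(pdf_path, pages="all"):
    """Return a list of pandas DataFrames, one per table detected in the PDF."""
    # Imported lazily so this file can be loaded without tabula-py or Java present.
    import tabula

    # multiple_tables=True asks for a list with one DataFrame per table.
    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


# Usage (hypothetical file name):
#   tables = read_tables("data.pdf")
#   print(tables[0].head())
```

The pages argument accepts a page number, a list of page numbers, or "all".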
example of read_pdf()
You can also extract tables in JSON format:
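One way this might look, as a sketch under the same assumptions (tabula-py and Java installed). The helper name tables_to_json_text and the file name are hypothetical. With output_format="json", read_pdf() returns plain Python lists and dicts instead of DataFrames, so the result can be serialized with the standard json module.

```python
import json


def tables_to_json_text(pdf_path, pages="all"):
    """Extract tables as JSON-compatible objects and return them as a JSON string."""
    # Imported lazily so this file can be loaded without tabula-py or Java present.
    import tabula

    # output_format="json" yields parsed JSON (lists and dicts), not DataFrames.
    tables = tabula.read_pdf(pdf_path, pages=pages, output_format="json")
    return json.dumps(tables, indent=2)


# Usage (hypothetical file name):
#   print(tables_to_json_text("data.pdf"))
```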
example of JSON
You can write the extracted tables to a JSON, CSV, or TSV file with the convert_into() method.
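A short sketch of writing to a file, again assuming tabula-py and Java are installed. The helper name pdf_to_csv and both file names are hypothetical. Changing output_format to "json" or "tsv" selects the other formats mentioned above.

```python
def pdf_to_csv(pdf_path, csv_path, pages="all"):
    """Write every table found in the PDF to a single CSV file and return its path."""
    # Imported lazily so this file can be loaded without tabula-py or Java present.
    import tabula

    # convert_into() writes straight to disk instead of returning the tables.
    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
    return csv_path


# Usage (hypothetical file names):
#   pdf_to_csv("data.pdf", "data.csv")
```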
You can see more examples in the Jupyter notebook.
Waiting for your collaboration!
If you have any trouble with tabula-py, please file an issue on GitHub. I’d prefer not to receive emails, because the answers can’t be shared with other people. Please fill in the issue template; it greatly reduces the effort it takes me to solve the problem. I also check Stack Overflow, so you can ask there as well.
Other tabula-py articles
- https://blog.chezo.uno/tabula-py-now-able-to-extract-remote-pdf-and-multiple-tables-at-once-6108e24ac07c
- https://blog.chezo.uno/a-recent-update-of-tabula-py-a923d2ab667b