Camelot: PDF Table Extraction for Humans

Release v0.3.2. (Installation)

Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!

Note

You can also check out Excalibur, which is a web interface for Camelot!


Here’s how you can extract tables from PDF files. Check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!
Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed        Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%                  9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%                  0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%                  1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%                 0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%                 1.6%             2.1%             0.5%

There’s a command-line interface too!

Note

Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, “If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based”.)

Why Camelot?

  • You are in control. Unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot lets you tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel and HTML.
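As a rough illustration of discarding bad tables by their metrics, here is a minimal sketch. It assumes each table exposes a parsing_report dict with 'accuracy' and 'whitespace' keys, as in the example above. The helper name filter_tables, the threshold values, and the second sample report are hypothetical and chosen only for illustration; stand-in objects are used in place of real tables so the sketch runs without a PDF.

```python
from types import SimpleNamespace


def filter_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Keep tables whose parsing_report passes both thresholds.

    The thresholds are hypothetical defaults; tune them for your PDFs.
    """
    kept = []
    for table in tables:
        report = table.parsing_report
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            kept.append(table)
    return kept


# Stand-ins for table objects (illustrative data, not real extraction results).
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24,
                                    "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 54.10, "whitespace": 61.50,
                                    "order": 2, "page": 1}),
]

good = filter_tables(sample)
print(len(good))
```

With real data, the same function could presumably be applied to the list returned by camelot.read_pdf, since it only reads the parsing_report attribute of each item.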

See the comparison with other PDF table extraction libraries and tools.

The API Documentation/Guide

If you are looking for information on a specific function, class, or method, this part of the documentation is for you.

The Contributor Guide

If you want to contribute to the project, this part of the documentation is for you.