Quickstart
In a hurry to extract tables from PDFs? This document gives a good introduction to help you get started with Camelot.
You can also check out our quickstart notebook.
Read the PDF
Reading a PDF to extract tables with Camelot is very simple.
Begin by importing the Camelot module:
>>> import camelot
Now, let’s try to read a PDF. (You can check out the PDF used in this example here.) Since the PDF has a table with clearly demarcated lines, we will use the Lattice method here.
Note
Lattice is used by default. You can use Stream with flavor='stream', Network with flavor='network', Hybrid with flavor='hybrid', or flavor='auto' to let Camelot probe the first page and choose between lattice and network. When auto is selected a UserWarning names the chosen flavor.
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
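When flavor='auto' is used, the chosen flavor is reported through a UserWarning, as the note above describes. The following is a minimal sketch of collecting that warning with the standard warnings module. The helper name read_and_collect_warnings is hypothetical and not part of Camelot, and the reader argument stands in for camelot.read_pdf so that the sketch assumes nothing about the exact warning text.

```python
import warnings


def read_and_collect_warnings(reader, path, **kwargs):
    """Call reader(path, **kwargs) and return (result, UserWarning messages).

    `reader` is any callable with a read_pdf-like signature, for example
    camelot.read_pdf. This helper is hypothetical, not a Camelot API.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        result = reader(path, **kwargs)
    # Keep only UserWarning (and subclasses); ignore unrelated warning categories.
    messages = [str(w.message) for w in caught if issubclass(w.category, UserWarning)]
    return result, messages
```

With Camelot installed, this could be called as read_and_collect_warnings(camelot.read_pdf, 'foo.pdf', flavor='auto'), and the returned messages inspected to see which flavor was picked.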
You can also pass raw PDF bytes or any binary stream (io.BytesIO, a file opened in 'rb' mode, the .raw attribute of a requests response) wherever a filepath is accepted. This is useful for PDFs that arrive over HTTP and never touch the disk. See the Reading PDFs from memory section in the advanced guide for the full pattern.
Now, we have a TableList object called tables, which is a list of Table objects. We can get everything we need from this object.
We can access each table using its index. From the code snippet above, we can see that the tables object has only one table, since n=1. Let’s access the table using the index 0 and take a look at its shape.
>>> tables[0]
<Table shape=(7, 7)>
Let’s print the parsing report.
>>> print(tables[0].parsing_report)
{
'accuracy': 99.02,
'whitespace': 12.24,
'confidence': 0.87,
'order': 1,
'page': 1
}
Woah! The accuracy is top-notch and there is little whitespace, which means the table was most likely extracted correctly. The confidence value is a single score in [0, 1], computed as (accuracy / 100) * (1 - whitespace / 100). It is convenient for “keep tables above X” filtering in production pipelines, since you do not have to combine the two raw fields yourself. You can access the table as a pandas DataFrame by using the table object’s df property.
>>> tables[0].df
Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings |                 |                 |
           |           |               | Improved Speed       | Decreased Accel | Eliminate Stops | Decreased Idle
2012_2     | 3.30      | 1.3           | 5.9%                 | 9.5%            | 29.2%           | 17.4%
2145_1     | 0.68      | 11.2          | 2.4%                 | 0.1%            | 9.5%            | 2.7%
4234_1     | 0.59      | 58.7          | 8.5%                 | 1.3%            | 8.5%            | 3.3%
2032_2     | 0.17      | 57.8          | 21.7%                | 0.3%            | 2.7%            | 1.2%
4171_1     | 0.07      | 173.9         | 58.1%                | 1.6%            | 2.1%            | 0.5%
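Before moving on, the confidence formula described above can be checked by hand against the parsing report. The sketch below recomputes the score from the accuracy and whitespace fields and filters a list of reports by a threshold. The function names and the 0.8 threshold are illustrative assumptions, not part of Camelot.

```python
def confidence(accuracy, whitespace):
    """Combine accuracy and whitespace (both percentages) into a [0, 1] score."""
    return (accuracy / 100) * (1 - whitespace / 100)


def keep_confident(reports, threshold=0.8):
    """Return the parsing reports whose confidence meets the threshold.

    The default threshold is an arbitrary example value.
    """
    return [r for r in reports if r["confidence"] >= threshold]


# The parsing report printed earlier in this section.
report = {"accuracy": 99.02, "whitespace": 12.24, "confidence": 0.87, "order": 1, "page": 1}

print(round(confidence(report["accuracy"], report["whitespace"]), 2))  # 0.87
print(len(keep_confident([report], threshold=0.8)))  # 1
```

With a real TableList, the same filter could be written as [t for t in tables if t.parsing_report['confidence'] >= 0.8].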
Looks good! You can now export the table as a CSV file using its to_csv() method. Alternatively, you can use the to_json(), to_excel(), to_html(), to_markdown() or to_sqlite() methods to export the table as a JSON, Excel, HTML or Markdown file, or to a SQLite database, respectively.
>>> tables[0].to_csv('foo.csv')
This will export the table as a CSV file at the path specified. In this case, it is foo.csv in the current directory.
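If the output format is only known at run time, the per-format methods listed above can be selected by name. The following is a minimal sketch: the helper export_table and the EXPORTERS mapping are hypothetical, and it assumes every to_*() method accepts the output path as its first argument, as to_csv() does in the example above.

```python
# Maps a format name to the Table export method named in this section.
EXPORTERS = {
    "csv": "to_csv",
    "json": "to_json",
    "excel": "to_excel",
    "html": "to_html",
    "markdown": "to_markdown",
    "sqlite": "to_sqlite",
}


def export_table(table, path, fmt="csv"):
    """Export `table` to `path` using the method that matches `fmt`.

    Hypothetical helper; assumes each method takes the path as its
    first positional argument.
    """
    if fmt not in EXPORTERS:
        raise ValueError(f"unsupported format: {fmt!r}")
    getattr(table, EXPORTERS[fmt])(path)
    return path
```

For example, export_table(tables[0], 'foo.json', 'json') would call tables[0].to_json('foo.json').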
You can also export all tables at once, using the tables object’s export() method.
>>> tables.export('foo.csv', f='csv')
Tip
Here’s how you can do the same with the command-line interface.
$ camelot --format csv --output foo.csv lattice foo.pdf
This will export all tables as CSV files at the path specified. Alternatively, you can use f='json', f='excel', f='html', f='markdown' or f='sqlite'.
Note
The export() method exports files with a page-*-table-* suffix. In the example above, the single table in the list will be exported to foo-page-1-table-1.csv. If the list contains multiple tables, multiple CSV files will be created. To avoid filling up your path with multiple files, you can use compress=True, which will create a single ZIP file at your path with all the CSV files.
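The naming pattern described in this note can be sketched in a few lines, which is handy when a later pipeline step needs to locate the exported files. The helper name export_name is hypothetical, and the sketch assumes the table number in the suffix corresponds to the order field of the parsing report. It mirrors the documented pattern and is not Camelot's own code.

```python
import os


def export_name(path, page, order):
    """Build the per-table file name for an export path such as 'foo.csv'.

    Follows the page-*-table-* suffix described above. Treating `order`
    as the table number is an assumption.
    """
    root, ext = os.path.splitext(path)
    return f"{root}-page-{page}-table-{order}{ext}"


print(export_name("foo.csv", page=1, order=1))  # foo-page-1-table-1.csv
```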
Note
Camelot handles rotated PDF pages automatically. As an exercise, try to extract the table out of this PDF.
Specify page numbers
By default, Camelot only uses the first page of the PDF to extract tables. To specify multiple pages, you can use the pages keyword argument:
>>> camelot.read_pdf('your.pdf', pages='1,2,3')
Tip
Here’s how you can do the same with the command-line interface.
$ camelot --pages 1,2,3 lattice your.pdf
The pages keyword argument accepts a comma-separated string of page numbers. You can also specify page ranges, for example pages='1,4-10,20-30' or pages='1,4-10,20-end'.
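To make the page syntax concrete, here is a small illustrative parser that expands such a string into explicit page numbers. It is a sketch of the syntax only, not Camelot's own parser: the function name expand_pages is hypothetical, and silently dropping pages beyond the end of the document is an assumption.

```python
def expand_pages(spec, page_count):
    """Expand a spec such as '1,4-10,20-end' into a sorted list of page numbers.

    `page_count` is the total number of pages in the PDF. Pages outside
    1..page_count are dropped (an assumption made for this sketch).
    """
    if spec == "all":
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            # 'end' stands for the last page of the document.
            last = page_count if end.strip() == "end" else int(end)
            pages.update(range(int(start), last + 1))
        else:
            pages.add(int(part))
    return sorted(p for p in pages if 1 <= p <= page_count)


print(expand_pages("1,4-6,9-end", page_count=10))  # [1, 4, 5, 6, 9, 10]
```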
When parallel=True, Camelot processes pages concurrently using one worker per CPU. Bound the worker count with cpu_count=N (defaults to all cores; clamped to [1, multiprocessing.cpu_count()]):
>>> camelot.read_pdf('long.pdf', pages='all', parallel=True, cpu_count=4)
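The clamping rule for cpu_count can be written out explicitly. The sketch below shows how a requested worker count would map to an effective one under the rule stated above. The function name effective_workers is hypothetical and this is not Camelot's internal code.

```python
import multiprocessing


def effective_workers(cpu_count=None):
    """Return the worker count after clamping to [1, multiprocessing.cpu_count()].

    None means "use all cores", matching the default described above.
    """
    available = multiprocessing.cpu_count()
    if cpu_count is None:
        return available
    return max(1, min(cpu_count, available))
```

On a machine with 8 cores, a request for 4 workers would stay at 4, a request for 64 would be reduced to 8, and a request for 0 would be raised to 1.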
If different pages need different settings (per-page table_areas, a different flavor on one page, etc.), use the per_page keyword argument. See the Per-page parameter overrides section in the advanced guide.
Reading encrypted PDFs
To extract tables from encrypted PDF files you must provide a password when calling read_pdf().
>>> tables = camelot.read_pdf('foo.pdf', password='userpass')
>>> tables
<TableList n=1>
Tip
Here’s how you can do the same with the command-line interface.
$ camelot --password userpass lattice foo.pdf
Camelot supports PDFs with all encryption types supported by playa. This might require installing PyCryptodome. An exception is thrown if the PDF cannot be read. This may be due to no password being provided, an incorrect password, or an unsupported encryption algorithm.
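Because a failed read surfaces as an exception, a caller holding several candidate passwords can try them in turn. The following is a minimal sketch: the helper read_with_passwords is hypothetical, the reader argument stands in for camelot.read_pdf, and since this section does not name the exception type, the sketch catches Exception broadly.

```python
def read_with_passwords(reader, path, passwords):
    """Try each password in order and return the first successful result.

    `reader` is a callable such as camelot.read_pdf. Raises ValueError
    if no password works. Hypothetical helper, not a Camelot API.
    """
    last_error = None
    for password in passwords:
        try:
            return reader(path, password=password)
        except Exception as exc:  # exact exception type is not specified here
            last_error = exc
    raise ValueError(f"could not read {path!r} with the supplied passwords") from last_error
```

A call such as read_with_passwords(camelot.read_pdf, 'foo.pdf', ['userpass', 'ownerpass']) would return the TableList from the first password that works. Note that the broad except also hides failures unrelated to encryption, so narrowing it once the real exception type is known would be safer.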
Support for further encryption algorithms may be added in the future. In the meantime, if your PDF files use an unsupported encryption algorithm, you are advised to remove the encryption before calling read_pdf(). This can be done with third-party tools such as QPDF.
$ qpdf --password=<PASSWORD> --decrypt input.pdf output.pdf
Ready for more? Check out the advanced section.