API Reference¶

Main Interface¶

camelot.read_pdf(filepath, pages='1', password=None, flavor='lattice', suppress_stdout=False, layout_kwargs={}, **kwargs)[source]¶

Read PDF and return extracted tables.

Note: kwargs annotated with ^ can only be used with flavor='stream', and kwargs annotated with * can only be used with flavor='lattice'.

Parameters:

filepath (str) – Filepath or URL of the PDF file.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
password (str, optional (default: None)) – Password for decryption.
flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.
suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.
layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns^ (list, optional (default: None)) – List of strings, each holding comma-separated column x-coordinates.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
process_background* (bool, optional (default: False)) – Process background lines.
line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the lines that are detected. Making it very large will lead to text being detected as lines.
copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize* (int, optional (default: 15)) –
Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant* (int, optional (default: -2)) –
Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

For more information, refer to OpenCV’s adaptiveThreshold.
iterations* (int, optional (default: 0)) –
Number of times erosion/dilation is applied.

For more information, refer to OpenCV’s dilate.
resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.

Returns:

tables

Return type:

camelot.core.TableList
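
The note on flavor-specific kwargs can be checked before any PDF is opened. The following is a minimal sketch, not part of Camelot: check_flavor_kwargs and extract_tables are hypothetical helper names, and the two sets simply restate the ^ and * annotations from the parameter list above.

```python
# Keyword arguments the parameter list marks as flavor-specific.
STREAM_ONLY = {"columns", "row_tol", "column_tol"}  # annotated with ^
LATTICE_ONLY = {  # annotated with *
    "process_background", "line_scale", "copy_text", "shift_text",
    "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
    "iterations", "resolution",
}


def check_flavor_kwargs(flavor, **kwargs):
    """Hypothetical helper: reject kwargs that do not match the flavor."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    forbidden = STREAM_ONLY if flavor == "lattice" else LATTICE_ONLY
    misplaced = sorted(forbidden.intersection(kwargs))
    if misplaced:
        raise ValueError(f"{misplaced} cannot be used with flavor={flavor!r}")
    return kwargs


def extract_tables(filepath, pages="1", flavor="lattice", **kwargs):
    """Validate the kwargs, then delegate to camelot.read_pdf."""
    import camelot  # imported lazily so the check above works without it

    check_flavor_kwargs(flavor, **kwargs)
    return camelot.read_pdf(filepath, pages=pages, flavor=flavor, **kwargs)
```

A call such as extract_tables("report.pdf", flavor="stream", row_tol=10) would then return a camelot.core.TableList, where "report.pdf" is a placeholder file name.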

Lower-Level Classes¶

class camelot.handlers.PDFHandler(filepath, pages='1', password=None)[source]¶

Handles operations such as creating a temporary directory, splitting the file into single-page PDFs, parsing each PDF, and then removing the temporary directory.

Parameters:

filepath (str) – Filepath or URL of the PDF file.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
password (str, optional (default: None)) – Password for decryption.
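
To make the pages syntax concrete, here is an illustrative expansion of a pages string into page numbers. This is a sketch under assumptions, not Camelot's own implementation: expand_pages is a hypothetical name, and the total page count is passed in by hand rather than read from the PDF.

```python
def expand_pages(spec, total_pages):
    """Expand a pages string such as '1,4-end' into a sorted list of ints."""
    if spec == "all":
        return list(range(1, total_pages + 1))
    pages = set()
    for part in spec.split(","):
        if "-" in part:
            # A range such as '4-end' or '2-5'; 'end' means the last page.
            start, end = part.split("-")
            last = total_pages if end == "end" else int(end)
            pages.update(range(int(start), last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

For a six-page document, '1,4-end' would select pages 1, 4, 5 and 6.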

parse(flavor='lattice', suppress_stdout=False, layout_kwargs={}, **kwargs)[source]¶

Extracts tables by calling parser.get_tables on all single-page PDFs.

Parameters:

flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.
suppress_stdout (bool (default: False)) – Suppress logs and warnings.
layout_kwargs (dict, optional (default: {})) –
A dict of pdfminer.layout.LAParams kwargs.
kwargs (dict) – See camelot.read_pdf kwargs.

Returns:

tables – List of tables found in PDF.

Return type:

camelot.core.TableList

class camelot.parsers.Stream(table_regions=None, table_areas=None, columns=None, split_text=False, flag_size=False, strip_text='', edge_tol=50, row_tol=2, column_tol=0, **kwargs)[source]¶

Stream method of parsing looks for spaces between text to parse the table.

If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.

Parameters:

table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns (list, optional (default: None)) – List of strings, each holding comma-separated column x-coordinates.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
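
The string formats for table_areas and columns, and the rule that both lists must have the same length, can be sketched as follows. The helper names are hypothetical, and the coordinate check assumes PDF coordinate space with a left-bottom origin, so the left-top corner has the larger y value.

```python
def area_string(x1, y1, x2, y2):
    """Format a table area as 'x1,y1,x2,y2' (left-top, then right-bottom)."""
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected (x1, y1) left-top and (x2, y2) right-bottom")
    return f"{x1},{y1},{x2},{y2}"


def columns_string(xs):
    """Format column x-coordinates as one comma-separated string."""
    return ",".join(str(x) for x in xs)


def stream_kwargs(areas, columns=None):
    """Build table_areas/columns kwargs, one columns entry per table area."""
    kwargs = {"table_areas": [area_string(*area) for area in areas]}
    if columns is not None:
        if len(columns) != len(areas):
            raise ValueError("need exactly one column list per table area")
        kwargs["columns"] = [columns_string(xs) for xs in columns]
    return kwargs
```

The resulting dict could then be passed on, for example as camelot.read_pdf(path, flavor='stream', **stream_kwargs(...)).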

class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=['l', 't'], split_text=False, flag_size=False, strip_text='', line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, resolution=300, backend='ghostscript', **kwargs)[source]¶

Lattice method of parsing looks for lines between text to parse the table.

Parameters:

table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
process_background (bool, optional (default: False)) – Process background lines.
line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the lines that are detected. Making it very large will lead to text being detected as lines.
copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize (int, optional (default: 15)) –
Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant (int, optional (default: -2)) –
Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

For more information, refer to OpenCV’s adaptiveThreshold.
iterations (int, optional (default: 0)) –
Number of times erosion/dilation is applied.

For more information, refer to OpenCV’s dilate.
resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
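
To illustrate how threshold_blocksize and threshold_constant interact, here is a simplified one-dimensional version of mean adaptive thresholding in plain Python. It is only a sketch of the rule "pixel is kept if it exceeds the neighborhood mean minus the constant"; it does not call OpenCV, its edge handling (a clipped window) may differ from OpenCV's, and it is not how Camelot itself is implemented.

```python
def adaptive_mean_threshold(row, blocksize=15, constant=-2, max_value=255):
    """Threshold each pixel against the mean of its neighborhood minus constant."""
    if blocksize % 2 == 0 or blocksize < 3:
        raise ValueError("blocksize must be an odd integer >= 3")
    half = blocksize // 2
    out = []
    for i, pixel in enumerate(row):
        # Neighborhood centered on i, clipped at the ends of the row.
        window = row[max(0, i - half): i + half + 1]
        threshold = sum(window) / len(window) - constant
        out.append(max_value if pixel > threshold else 0)
    return out
```

With a negative constant such as the default -2, the threshold sits slightly above the local mean, so a perfectly flat region produces no foreground pixels, while a pixel that stands out from its neighbors still does.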

Lower-Lower-Level Classes¶

class camelot.core.TableList(tables)[source]¶

Defines a list of camelot.core.Table objects. Each table can be accessed using its index.

n¶

Number of tables in the list.

Type:: int

export(path, f='csv', compress=False)[source]¶

Exports the list of tables to specified file format.

Parameters:

path (str) – Output filepath.
f (str) – File format. Can be csv, excel, html, json, markdown or sqlite.
compress (bool) – Whether or not to add files to a ZIP archive.
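
A small wrapper can guard against unsupported format names before calling export. This is a sketch only: export_tables is a hypothetical helper, and the tuple of formats restates the list given for the f parameter above.

```python
SUPPORTED_FORMATS = ("csv", "excel", "html", "json", "markdown", "sqlite")


def export_tables(tables, path, f="csv", compress=False):
    """Check the format name, then delegate to TableList.export."""
    if f not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format {f!r}; expected one of {SUPPORTED_FORMATS}")
    tables.export(path, f=f, compress=compress)
    return path
```

Here tables is expected to be the camelot.core.TableList returned by camelot.read_pdf.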

class camelot.core.Table(cols, rows)[source]¶

Defines a table with coordinates relative to a left-bottom origin. (PDF coordinate space)

Parameters:

cols (list) – List of tuples representing column x-coordinates in increasing order.
rows (list) – List of tuples representing row y-coordinates in decreasing order.
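
The relationship between cols, rows and individual cell boxes can be sketched as below. This assumes each column tuple is (left x, right x) and each row tuple is (top y, bottom y), which matches the increasing and decreasing orders described above; cell_boxes is a hypothetical helper, not a Camelot function.

```python
def cell_boxes(cols, rows):
    """Return a 2-D list of (x1, y1, x2, y2) boxes, one per table cell.

    (x1, y1) is the left-bottom corner and (x2, y2) the right-top corner,
    in PDF coordinate space with a left-bottom origin.
    """
    return [
        [(left, bottom, right, top) for (left, right) in cols]
        for (top, bottom) in rows
    ]
```

For two columns and two rows this yields a 2 x 2 grid whose first row is the topmost row of the table.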

df¶

Type:: pandas.DataFrame

shape¶

Shape of the table.

Type:: tuple

accuracy¶

Accuracy with which text was assigned to the cell.

Type:: float

whitespace¶

Percentage of whitespace in the table.

Type:: float

order¶

Table number on PDF page.

Type:: int

page¶

PDF page number.

Type:: int

property data¶: Returns two-dimensional list of strings in table.

property parsing_report¶: Returns a parsing report with %accuracy, %whitespace, table number on page and page number.
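
The accuracy and whitespace attributes can be used to discard tables that were probably parsed poorly. The following is a minimal sketch: keep_reliable is a hypothetical helper, the threshold values are arbitrary, both metrics are assumed to be percentages, and the stand-in objects only mimic the documented attributes so that the example runs without a PDF.

```python
from types import SimpleNamespace


def keep_reliable(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Keep tables with high accuracy and limited whitespace (both in percent)."""
    return [
        t for t in tables
        if t.accuracy >= min_accuracy and t.whitespace <= max_whitespace
    ]


# Stand-ins with the documented attributes; real code would pass the
# tables returned by camelot.read_pdf instead.
tables = [
    SimpleNamespace(page=1, order=1, accuracy=99.2, whitespace=12.5),
    SimpleNamespace(page=1, order=2, accuracy=61.0, whitespace=70.0),
]
reliable = keep_reliable(tables)
```

Only the first stand-in table passes both thresholds here.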

set_all_edges()[source]¶: Sets all table edges to True.

set_border()[source]¶: Sets table border edges to True.

set_edges(vertical, horizontal, joint_tol=2)[source]¶

Sets a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.

Parameters:

vertical (list) – List of detected vertical lines.
horizontal (list) – List of detected horizontal lines.

set_span()[source]¶: Sets a cell’s hspan or vspan attribute to True depending on whether the cell spans horizontally or vertically.

to_csv(path, **kwargs)[source]¶

Writes Table to a comma-separated values (csv) file.

For kwargs, check pandas.DataFrame.to_csv().

Parameters:: path (str) – Output filepath.

to_excel(path, **kwargs)[source]¶

Writes Table to an Excel file.

For kwargs, check pandas.DataFrame.to_excel().

Parameters:: path (str) – Output filepath.

to_html(path, **kwargs)[source]¶

Writes Table to an HTML file.

For kwargs, check pandas.DataFrame.to_html().

Parameters:: path (str) – Output filepath.

to_json(path, **kwargs)[source]¶

Writes Table to a JSON file.

For kwargs, check pandas.DataFrame.to_json().

Parameters:: path (str) – Output filepath.

to_markdown(path, **kwargs)[source]¶

Writes Table to a Markdown file.

For kwargs, check pandas.DataFrame.to_markdown().

Parameters:: path (str) – Output filepath.

to_sqlite(path, **kwargs)[source]¶

Writes Table to a SQLite database.

For kwargs, check pandas.DataFrame.to_sql().

Parameters:: path (str) – Output filepath.
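
Since each writer takes a path plus keyword arguments that are forwarded to pandas, the choice of writer can be driven by the file extension. This is a sketch under assumptions: write_table and the extension mapping are illustrative conventions, not part of Camelot.

```python
import os

# Hypothetical mapping from file extension to the Table writer method.
WRITERS = {
    ".csv": "to_csv",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".json": "to_json",
    ".md": "to_markdown",
    ".sqlite": "to_sqlite",
}


def write_table(table, path, **kwargs):
    """Pick a writer from the extension and forward kwargs to it."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in WRITERS:
        raise ValueError(f"no writer for extension {ext!r}")
    method = WRITERS[ext]
    getattr(table, method)(path, **kwargs)
    return method
```

For example, write_table(table, "out.csv", sep=";") would call table.to_csv("out.csv", sep=";"), with sep passed through to pandas.DataFrame.to_csv().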

class camelot.core.Cell(x1, y1, x2, y2)[source]¶

Defines a cell in a table with coordinates relative to a left-bottom origin. (PDF coordinate space)

Parameters:

x1 (float) – x-coordinate of left-bottom point.
y1 (float) – y-coordinate of left-bottom point.
x2 (float) – x-coordinate of right-top point.
y2 (float) – y-coordinate of right-top point.

lb¶

Tuple representing left-bottom coordinates.

Type:: tuple

lt¶

Tuple representing left-top coordinates.

Type:: tuple

rb¶

Tuple representing right-bottom coordinates.

Type:: tuple

rt¶

Tuple representing right-top coordinates.

Type:: tuple
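
How the four corner tuples follow from the constructor arguments can be sketched with a small stand-alone class. Box is a hypothetical illustration of the geometry only; it is not camelot.core.Cell and omits the edge, span and text attributes.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rectangle given by its left-bottom (x1, y1) and right-top (x2, y2) points."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def lb(self):
        return (self.x1, self.y1)  # left-bottom

    @property
    def lt(self):
        return (self.x1, self.y2)  # left-top

    @property
    def rb(self):
        return (self.x2, self.y1)  # right-bottom

    @property
    def rt(self):
        return (self.x2, self.y2)  # right-top
```

Because the origin is at the left-bottom of the page, y2 is larger than y1 for a well-formed box.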

left¶

Whether or not cell is bounded on the left.

Type:: bool

right¶

Whether or not cell is bounded on the right.

Type:: bool

top¶

Whether or not cell is bounded on the top.

Type:: bool

bottom¶

Whether or not cell is bounded on the bottom.

Type:: bool

hspan¶

Whether or not cell spans horizontally.

Type:: bool

vspan¶

Whether or not cell spans vertically.

Type:: bool

text¶

Text assigned to cell.

Type:: string
