API Reference

Main Interface

camelot.read_pdf(filepath, pages='1', password=None, flavor='lattice', suppress_stdout=False, layout_kwargs={}, **kwargs)

Read PDF and return extracted tables.

Note: kwargs annotated with ^ can only be used with flavor='stream', and kwargs annotated with * can only be used with flavor='lattice'.

Parameters
  • filepath (str) – Filepath or URL of the PDF file.

  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.

  • password (str, optional (default: None)) – Password for decryption.

  • flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.

  • suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.

  • layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns^ (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates within each string are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.

  • process_background* (bool, optional (default: False)) – Process background lines.

  • line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the detected lines. Making it very large will lead to text being detected as lines.

  • copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.

  • shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.

  • line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.

  • joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.

  • threshold_blocksize* (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant* (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations* (int, optional (default: 0)) –

    Number of times erosion/dilation is applied.

    For more information, refer to OpenCV’s dilate.

  • resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.

Returns

tables

Return type

camelot.core.TableList
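
The note on ^ and * annotations can be turned into a small pre-flight check. The following is a minimal sketch, not part of Camelot: check_flavor_kwargs and read_tables are hypothetical helper names, and the two sets are copied from the annotated entries in the parameter list above.

```python
# Keyword arguments annotated with ^ (stream only) in the parameter list.
STREAM_ONLY = {"columns", "row_tol", "column_tol"}

# Keyword arguments annotated with * (lattice only) in the parameter list.
LATTICE_ONLY = {
    "process_background", "line_scale", "copy_text", "shift_text",
    "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
    "iterations", "resolution",
}


def check_flavor_kwargs(flavor, kwargs):
    """Return the sorted names in kwargs that belong to the other flavor.

    Hypothetical helper, not part of Camelot.
    """
    if flavor not in ("lattice", "stream"):
        raise ValueError("flavor must be 'lattice' or 'stream'")
    disallowed = STREAM_ONLY if flavor == "lattice" else LATTICE_ONLY
    return sorted(set(kwargs) & disallowed)


def read_tables(filepath, pages="1", flavor="lattice", **kwargs):
    """Validate kwargs, then delegate to camelot.read_pdf (hypothetical wrapper)."""
    misplaced = check_flavor_kwargs(flavor, kwargs)
    if misplaced:
        raise ValueError(f"not valid for flavor={flavor!r}: {misplaced}")
    import camelot  # imported lazily so the check above runs without Camelot

    return camelot.read_pdf(filepath, pages=pages, flavor=flavor, **kwargs)
```

With this sketch, read_tables('report.pdf', pages='1,4-end', flavor='stream', row_tol=10) passes the check, while adding line_scale=40 to the same call raises ValueError before any file is opened. The file name is a placeholder.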

Lower-Level Classes

class camelot.handlers.PDFHandler(filepath, pages='1', password=None)

Handles all operations like temp directory creation, splitting file into single page PDFs, parsing each PDF and then removing the temp directory.

Parameters
  • filepath (str) – Filepath or URL of the PDF file.

  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.

  • password (str, optional (default: None)) – Password for decryption.
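
The pages string accepts single numbers, ranges, the keyword end, and all. As a rough sketch of how such a string could be expanded into page numbers, assuming the total page count is already known: expand_pages is a hypothetical helper written for illustration and is not necessarily how PDFHandler resolves pages internally.

```python
def expand_pages(pages, page_count):
    """Expand a pages string such as '1,4-end' into a sorted list of ints."""
    if pages == "all":
        return list(range(1, page_count + 1))
    selected = set()
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '4-7' or '4-end'; 'end' means the last page.
            start, end = part.split("-", 1)
            last = page_count if end.strip() == "end" else int(end)
            selected.update(range(int(start), last + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


print(expand_pages("1,4-end", page_count=6))  # [1, 4, 5, 6]
```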

parse(flavor='lattice', suppress_stdout=False, layout_kwargs={}, **kwargs)

Extracts tables by calling parser.get_tables on all single page PDFs.

Parameters
  • flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.

  • suppress_stdout (bool (default: False)) – Suppress logs and warnings.

  • layout_kwargs (dict, optional (default: {})) –

    A dict of pdfminer.layout.LAParams kwargs.

  • kwargs (dict) – See camelot.read_pdf kwargs.

Returns

tables – List of tables found in PDF.

Return type

camelot.core.TableList

class camelot.parsers.Stream(table_regions=None, table_areas=None, columns=None, split_text=False, flag_size=False, strip_text='', edge_tol=50, row_tol=2, column_tol=0, **kwargs)

Stream method of parsing looks for spaces between text to parse the table.

If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.

Parameters
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates within each string are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.

  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
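
The table_areas and columns values are plain strings, and the two lists must have the same length when both are given. A minimal sketch of building them from numbers follows; the helper names and the coordinates are made up for illustration and are not part of Camelot.

```python
def area_string(left, top, right, bottom):
    """Format a table area as 'x1,y1,x2,y2' (left-top, then right-bottom)."""
    # PDF coordinate space has its origin at the left-bottom corner,
    # so the top edge has the larger y value.
    if not (left < right and bottom < top):
        raise ValueError("expected left < right and bottom < top")
    return ",".join(str(v) for v in (left, top, right, bottom))


def columns_string(x_coords):
    """Format column separators as one comma-separated string."""
    return ",".join(str(x) for x in sorted(x_coords))


def stream_area_kwargs(areas, columns=None):
    """Build table_areas/columns kwargs, enforcing equal list lengths."""
    kwargs = {"table_areas": [area_string(*a) for a in areas]}
    if columns is not None:
        if len(columns) != len(areas):
            raise ValueError("columns and table_areas must have equal length")
        kwargs["columns"] = [columns_string(c) for c in columns]
    return kwargs


kwargs = stream_area_kwargs(
    areas=[(316, 499, 566, 337)],
    columns=[[350, 420, 500]],
)
print(kwargs)
# {'table_areas': ['316,499,566,337'], 'columns': ['350,420,500']}
```

The resulting dict could then be passed on as keyword arguments together with flavor='stream'.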

class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=['l', 't'], split_text=False, flag_size=False, strip_text='', line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, resolution=300, **kwargs)

Lattice method of parsing looks for lines between text to parse the table.

Parameters
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • process_background (bool, optional (default: False)) – Process background lines.

  • line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the detected lines. Making it very large will lead to text being detected as lines.

  • copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.

  • shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.

  • joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.

  • threshold_blocksize (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations (int, optional (default: 0)) –

    Number of times erosion/dilation is applied.

    For more information, refer to OpenCV’s dilate.

  • resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
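
To make threshold_blocksize and threshold_constant more concrete, here is a simplified pure-Python sketch of mean-based adaptive thresholding: each pixel is compared against the mean of its neighborhood minus the constant. This is an illustration only. It is not Camelot's or OpenCV's implementation, the function name is hypothetical, and the border handling (reusing the nearest edge pixel) is an assumption made to keep the example small.

```python
def adaptive_mean_threshold(image, blocksize=15, constant=-2):
    """Binarize a 2D list of grayscale values with a local mean threshold.

    A pixel becomes 255 when it is brighter than the mean of its
    blocksize x blocksize neighborhood minus constant, otherwise 0.
    """
    if blocksize < 3 or blocksize % 2 == 0:
        raise ValueError("blocksize must be odd and at least 3")
    height, width = len(image), len(image[0])
    half = blocksize // 2
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # Clamp indices so border pixels reuse the nearest edge value.
                    ny = min(max(y + dy, 0), height - 1)
                    nx = min(max(x + dx, 0), width - 1)
                    total += image[ny][nx]
            threshold = total / (blocksize * blocksize) - constant
            row.append(255 if image[y][x] > threshold else 0)
        output.append(row)
    return output


flat = [[100] * 4 for _ in range(4)]
# With a negative constant the threshold sits above the local mean,
# so a uniform region is mapped entirely to 0.
print(adaptive_mean_threshold(flat, blocksize=3, constant=-2)[0])  # [0, 0, 0, 0]
```

A negative constant such as the default -2 means only pixels clearly brighter than their surroundings survive, which is why a flat region produces no foreground at all.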

Lower-Lower-Level Classes

class camelot.core.TableList(tables)

Defines a list of camelot.core.Table objects. Each table can be accessed using its index.

n

Number of tables in the list.

Type

int

export(path, f='csv', compress=False)

Exports the list of tables to specified file format.

Parameters
  • path (str) – Output filepath.

  • f (str) – File format. Can be csv, json, excel, html or sqlite.

  • compress (bool) – Whether or not to add files to a ZIP archive.
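
A small sketch of calling export with a validated format string follows. export_tables is a hypothetical wrapper, not part of Camelot; it accepts any object that offers the export(path, f, compress) method described above, such as the TableList returned by camelot.read_pdf.

```python
# Format names taken from the description of the f parameter above.
SUPPORTED_FORMATS = ("csv", "json", "excel", "html", "sqlite")


def export_tables(tables, path, fmt="csv", compress=False):
    """Check fmt against the documented formats, then call tables.export."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    tables.export(path, f=fmt, compress=compress)
```

For instance, with tables = camelot.read_pdf('report.pdf'), the call export_tables(tables, 'report.csv', compress=True) would request CSV output added to a ZIP archive. The file names are placeholders.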

class camelot.core.Table(cols, rows)

Defines a table with coordinates relative to a left-bottom origin (PDF coordinate space).

Parameters
  • cols (list) – List of tuples representing column x-coordinates in increasing order.

  • rows (list) – List of tuples representing row y-coordinates in decreasing order.
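
As an illustration of the expected ordering, the sketch below turns boundary coordinates into consecutive tuples, increasing for columns and decreasing for rows. It assumes each tuple is a (start, end) pair of adjacent boundaries; that layout, the helper name, and the coordinates are assumptions made for the example.

```python
def boundary_pairs(boundaries, descending=False):
    """Turn boundary coordinates into consecutive (start, end) tuples."""
    ordered = sorted(boundaries, reverse=descending)
    return list(zip(ordered, ordered[1:]))


cols = boundary_pairs([72, 200, 350, 540])               # increasing x
rows = boundary_pairs([700, 680, 660], descending=True)  # decreasing y
print(cols)  # [(72, 200), (200, 350), (350, 540)]
print(rows)  # [(700, 680), (680, 660)]
```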

df

Type

pandas.DataFrame

shape

Shape of the table.

Type

tuple

accuracy

Accuracy with which text was assigned to the cell.

Type

float

whitespace

Percentage of whitespace in the table.

Type

float

order

Table number on PDF page.

Type

int

page

PDF page number.

Type

int

property data

Returns two-dimensional list of strings in table.

property parsing_report

Returns a parsing report with %accuracy, %whitespace, table number on page and page number.
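
The attributes above are convenient for filtering out poorly parsed tables. The following is a minimal sketch under two assumptions: accuracy and whitespace are on a 0 to 100 percentage scale (as the % signs in the parsing report suggest), and shape is ordered (rows, columns). The function names and threshold values are hypothetical.

```python
def select_reliable(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Keep tables whose accuracy and whitespace pass the given thresholds."""
    return [
        t for t in tables
        if t.accuracy >= min_accuracy and t.whitespace <= max_whitespace
    ]


def summarize(table):
    """Build a one-line summary from the documented Table attributes."""
    n_rows, n_cols = table.shape
    return (
        f"page {table.page}, table {table.order}: {n_rows}x{n_cols}, "
        f"accuracy {table.accuracy:.1f}, whitespace {table.whitespace:.1f}"
    )
```

Both helpers only read attributes, so they work on any iterable of Table objects, including a TableList.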

set_all_edges()

Sets all table edges to True.

set_border()

Sets table border edges to True.

set_edges(vertical, horizontal, joint_tol=2)

Sets a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.

Parameters
  • vertical (list) – List of detected vertical lines.

  • horizontal (list) – List of detected horizontal lines.

set_span()

Sets a cell’s hspan or vspan attribute to True depending on whether the cell spans horizontally or vertically.

to_csv(path, **kwargs)

Writes Table to a comma-separated values (csv) file.

For kwargs, check pandas.DataFrame.to_csv().

Parameters

path (str) – Output filepath.

to_excel(path, **kwargs)

Writes Table to an Excel file.

For kwargs, check pandas.DataFrame.to_excel().

Parameters

path (str) – Output filepath.

to_html(path, **kwargs)

Writes Table to an HTML file.

For kwargs, check pandas.DataFrame.to_html().

Parameters

path (str) – Output filepath.

to_json(path, **kwargs)

Writes Table to a JSON file.

For kwargs, check pandas.DataFrame.to_json().

Parameters

path (str) – Output filepath.

to_sqlite(path, **kwargs)

Writes Table to a SQLite database.

For kwargs, check pandas.DataFrame.to_sql().

Parameters

path (str) – Output filepath.
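
Since every writer takes a path plus pandas keyword arguments, the right method can be picked from the file suffix. This is a sketch only: save_table and the suffix mapping are assumptions for illustration and are not part of Camelot.

```python
from pathlib import Path

# Assumed mapping from file suffix to the Table writer methods documented above.
WRITERS = {
    ".csv": "to_csv",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".json": "to_json",
    ".sqlite": "to_sqlite",
}


def save_table(table, path, **kwargs):
    """Write table with the to_* method that matches the suffix of path."""
    suffix = Path(path).suffix.lower()
    if suffix not in WRITERS:
        raise ValueError(f"no writer for suffix {suffix!r}")
    # Extra keyword arguments are forwarded to the chosen writer unchanged.
    getattr(table, WRITERS[suffix])(str(path), **kwargs)
    return WRITERS[suffix]
```

Given a table such as tables[0], save_table(tables[0], 'first.csv') would call to_csv, and save_table(tables[0], 'first.json') would call to_json. The file names are placeholders.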

class camelot.core.Cell(x1, y1, x2, y2)

Defines a cell in a table with coordinates relative to a left-bottom origin (PDF coordinate space).

Parameters
  • x1 (float) – x-coordinate of left-bottom point.

  • y1 (float) – y-coordinate of left-bottom point.

  • x2 (float) – x-coordinate of right-top point.

  • y2 (float) – y-coordinate of right-top point.

lb

Tuple representing left-bottom coordinates.

Type

tuple

lt

Tuple representing left-top coordinates.

Type

tuple

rb

Tuple representing right-bottom coordinates.

Type

tuple

rt

Tuple representing right-top coordinates.

Type

tuple

left

Whether or not cell is bounded on the left.

Type

bool

right

Whether or not cell is bounded on the right.

Type

bool

top

Whether or not cell is bounded on the top.

Type

bool

bottom

Whether or not cell is bounded on the bottom.

Type

bool

hspan

Whether or not cell spans horizontally.

Type

bool

vspan

Whether or not cell spans vertically.

Type

bool

text

Text assigned to cell.

Type

string
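
The relationship between the constructor arguments and the four corner attributes can be sketched with a small stand-in class. CellSketch is written for illustration only and is not camelot.core.Cell; the bounded_sides property is a hypothetical addition.

```python
from dataclasses import dataclass


@dataclass
class CellSketch:
    """Stand-in for illustration only, not camelot.core.Cell."""

    x1: float  # x of the left-bottom point
    y1: float  # y of the left-bottom point
    x2: float  # x of the right-top point
    y2: float  # y of the right-top point
    left: bool = False
    right: bool = False
    top: bool = False
    bottom: bool = False

    @property
    def lb(self):
        return (self.x1, self.y1)

    @property
    def lt(self):
        return (self.x1, self.y2)

    @property
    def rb(self):
        return (self.x2, self.y1)

    @property
    def rt(self):
        return (self.x2, self.y2)

    @property
    def bounded_sides(self):
        """Number of sides on which the cell is bounded (hypothetical)."""
        return sum((self.left, self.right, self.top, self.bottom))


cell = CellSketch(72, 660, 200, 680, left=True, top=True)
print(cell.lt, cell.rb)    # (72, 680) (200, 660)
print(cell.bounded_sides)  # 2
```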