API Reference

Main Interface

camelot.read_pdf(filepath, pages='1', password=None, flavor='lattice', suppress_stdout=False, layout_kwargs={}, **kwargs)

Read PDF and return extracted tables.

Note: kwargs annotated with ^ can only be used with flavor=’stream’ and kwargs annotated with * can only be used with flavor=’lattice’.

Parameters:
  • filepath (str) – Filepath or URL of the PDF file.
  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
  • password (str, optional (default: None)) – Password for decryption.
  • flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.
  • suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.
  • layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.
  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
  • columns^ (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates in each string are comma-separated.
  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
  • row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
  • column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
  • process_background* (bool, optional (default: False)) – Process background lines.
  • line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the detected lines. Making it very large will lead to text being detected as lines.
  • copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
  • shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
  • line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
  • joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
  • threshold_blocksize* (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant* (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations* (int, optional (default: 0)) –

    Number of times erosion/dilation is applied.

    For more information, refer to OpenCV’s dilate.

  • resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
Returns:

tables

Return type:

camelot.core.TableList
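
The flavor-specific keyword arguments above can be checked before calling read_pdf. The following is a minimal sketch, not part of Camelot: check_flavor_kwargs and extract_tables are hypothetical helpers, and the two name sets are copied from the ^ and * annotations in the parameter list.

```python
# Keyword arguments marked ^ (stream only) and * (lattice only) in the list above.
STREAM_ONLY = {"columns", "row_tol", "column_tol"}
LATTICE_ONLY = {
    "process_background", "line_scale", "copy_text", "shift_text",
    "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
    "iterations", "resolution",
}


def check_flavor_kwargs(flavor, **kwargs):
    """Hypothetical helper: reject kwargs that belong to the other flavor."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    forbidden = STREAM_ONLY if flavor == "lattice" else LATTICE_ONLY
    misplaced = sorted(forbidden.intersection(kwargs))
    if misplaced:
        raise ValueError(f"{misplaced} cannot be used with flavor={flavor!r}")
    return kwargs


def extract_tables(filepath, pages="1", flavor="lattice", **kwargs):
    """Validate the kwargs, then hand them to camelot.read_pdf."""
    import camelot  # imported lazily so the check above works without Camelot

    check_flavor_kwargs(flavor, **kwargs)
    return camelot.read_pdf(filepath, pages=pages, flavor=flavor, **kwargs)
```

A call such as extract_tables("report.pdf", pages="1,4-end", flavor="stream", row_tol=10) would then return a camelot.core.TableList; "report.pdf" is a placeholder path.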

Lower-Level Classes

class camelot.handlers.PDFHandler(filepath, pages='1', password=None)

Handles all operations like temp directory creation, splitting file into single page PDFs, parsing each PDF and then removing the temp directory.

Parameters:
  • filepath (str) – Filepath or URL of the PDF file.
  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
  • password (str, optional (default: None)) – Password for decryption.
parse(flavor='lattice', suppress_stdout=False, layout_kwargs={}, **kwargs)

Extracts tables by calling parser.get_tables on all single page PDFs.

Parameters:
  • flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.
  • suppress_stdout (bool (default: False)) – Suppress logs and warnings.
  • layout_kwargs (dict, optional (default: {})) –

    A dict of pdfminer.layout.LAParams kwargs.

  • kwargs (dict) – See camelot.read_pdf kwargs.
Returns:

tables – List of tables found in PDF.

Return type:

camelot.core.TableList
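
To illustrate the pages string format and the handler workflow described above, here is a small sketch. expand_pages is a hypothetical helper that only mirrors the documented examples ('1,3,4', '1,4-end', 'all'); it is not Camelot's own page resolution code. parse_with_handler follows the PDFHandler and parse signatures shown above.

```python
def expand_pages(pages, page_count):
    """Hypothetical helper: expand a pages string into sorted page numbers."""
    if pages == "all":
        return list(range(1, page_count + 1))
    selected = set()
    for part in pages.split(","):
        if "-" in part:
            start, end = part.split("-")
            # 'end' stands for the last page of the document.
            last = page_count if end == "end" else int(end)
            selected.update(range(int(start), last + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


def parse_with_handler(filepath, pages="1", flavor="lattice", **kwargs):
    """Call the lower-level PDFHandler directly instead of camelot.read_pdf."""
    from camelot.handlers import PDFHandler  # lazy import; requires Camelot

    handler = PDFHandler(filepath, pages=pages)
    return handler.parse(flavor=flavor, **kwargs)
```

For a six-page document, expand_pages("1,4-end", 6) selects pages 1, 4, 5 and 6.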

class camelot.parsers.Stream(table_regions=None, table_areas=None, columns=None, split_text=False, flag_size=False, strip_text='', edge_tol=50, row_tol=2, column_tol=0, **kwargs)

Stream method of parsing looks for spaces between text to parse the table.

If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
  • columns (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates in each string are comma-separated.
  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
  • edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
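
The table_areas and columns strings can be assembled programmatically. The sketch below assumes PDF coordinate space with a left-bottom origin, so the left-top corner has the larger y value; area_string, columns_string and stream_kwargs are hypothetical helpers, not Camelot functions, and the coordinates in the usage note are made up.

```python
def area_string(left_top, right_bottom):
    """Format one table area as 'x1,y1,x2,y2' (left-top, then right-bottom)."""
    x1, y1 = left_top
    x2, y2 = right_bottom
    # With a left-bottom origin, the top edge has the larger y coordinate.
    if not (x1 < x2 and y1 > y2):
        raise ValueError("left-top must lie above and to the left of right-bottom")
    return f"{x1},{y1},{x2},{y2}"


def columns_string(x_coordinates):
    """Format the column separators of one table as a comma-separated string."""
    return ",".join(str(x) for x in sorted(x_coordinates))


def stream_kwargs(areas, columns):
    """Build table_areas and columns lists, enforcing that their lengths match."""
    if len(areas) != len(columns):
        raise ValueError("table_areas and columns must have the same length")
    return {
        "table_areas": [area_string(lt, rb) for lt, rb in areas],
        "columns": [columns_string(xs) for xs in columns],
    }
```

The resulting dict can be unpacked into a stream call, for example camelot.read_pdf(path, flavor='stream', **stream_kwargs([((72, 720), (540, 100))], [[120, 250]])).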
class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=['l', 't'], split_text=False, flag_size=False, strip_text='', line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, resolution=300, **kwargs)

Lattice method of parsing looks for lines between text to parse the table.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
  • process_background (bool, optional (default: False)) – Process background lines.
  • line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the detected lines. Making it very large will lead to text being detected as lines.
  • copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
  • shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
  • line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
  • joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
  • threshold_blocksize (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations (int, optional (default: 0)) –

    Number of times erosion/dilation is applied.

    For more information, refer to OpenCV’s dilate.

  • resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
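
The interaction between resolution and line_scale can be pictured with a rough calculation. This sketch rests on two assumptions that are not stated above: that the page is rendered at resolution pixels per inch (72 PDF points per inch), and that the shortest detectable line is roughly the image length divided by line_scale. Both helper names are hypothetical.

```python
def approx_min_line_pixels(page_length_pt, resolution=300, line_scale=15):
    """Estimate the shortest detectable line, in pixels (assumed model)."""
    # Assumption: the page image is rendered at `resolution` DPI, 72 pt per inch.
    image_length = round(page_length_pt * resolution / 72)
    # Assumption: lines shorter than image_length // line_scale are ignored.
    return image_length // line_scale


def lattice_kwargs(line_scale=15, threshold_blocksize=15, resolution=300):
    """Collect lattice kwargs, checking that the block size is 3, 5, 7, ..."""
    if threshold_blocksize < 3 or threshold_blocksize % 2 == 0:
        raise ValueError("threshold_blocksize must be an odd number >= 3")
    return {
        "line_scale": line_scale,
        "threshold_blocksize": threshold_blocksize,
        "resolution": resolution,
    }
```

Under these assumptions, raising line_scale from 15 to 40 on a 792-point-tall page lowers the estimate from 220 to 82 pixels, which is consistent with the warning that very large values cause text to be detected as lines.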

Lower-Lower-Level Classes

class camelot.core.TableList(tables)

Defines a list of camelot.core.Table objects. Each table can be accessed using its index.

n

int – Number of tables in the list.

export(path, f='csv', compress=False)

Exports the list of tables to specified file format.

Parameters:
  • path (str) – Output filepath.
  • f (str) – File format. Can be csv, json, excel, html or sqlite.
  • compress (bool) – Whether or not to add files to a ZIP archive.
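
A thin wrapper around export can guard against unsupported formats. This is a sketch only: export_tables is a hypothetical helper, and it relies solely on the export signature and the n attribute documented above.

```python
# File formats listed for the f parameter above.
SUPPORTED_FORMATS = ("csv", "json", "excel", "html", "sqlite")


def export_tables(tables, path, fmt="csv", compress=False):
    """Export a TableList after checking the format; return the table count."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; choose from {SUPPORTED_FORMATS}")
    tables.export(path, f=fmt, compress=compress)
    return tables.n
```

With a TableList returned by camelot.read_pdf, export_tables(tables, "out.csv", compress=True) would write the tables and add them to a ZIP archive; "out.csv" is a placeholder path.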
class camelot.core.Table(cols, rows)

Defines a table with coordinates relative to a left-bottom origin (PDF coordinate space).

Parameters:
  • cols (list) – List of tuples representing column x-coordinates in increasing order.
  • rows (list) – List of tuples representing row y-coordinates in decreasing order.
df

pandas.DataFrame

shape

tuple – Shape of the table.

accuracy

float – Accuracy with which text was assigned to the cell.

whitespace

float – Percentage of whitespace in the table.

order

int – Table number on PDF page.

page

int – PDF page number.

data

Returns two-dimensional list of strings in table.

parsing_report

Returns a parsing report with %accuracy, %whitespace, table number on page and page number.
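
The accuracy and whitespace attributes can be used to discard poorly parsed tables. The sketch below assumes both values are percentages on a 0 to 100 scale; the thresholds are arbitrary illustrations, and select_reliable and describe are hypothetical helpers.

```python
def select_reliable(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Keep tables whose accuracy is high enough and whitespace low enough."""
    return [
        table
        for table in tables
        if table.accuracy >= min_accuracy and table.whitespace <= max_whitespace
    ]


def describe(table):
    """One-line summary built from the documented Table attributes."""
    return (
        f"page {table.page}, table {table.order}: "
        f"accuracy {table.accuracy:.1f}%, whitespace {table.whitespace:.1f}%"
    )
```

Any sequence of Table objects works here; for a TableList named tables, the sequence can be built by index as [tables[i] for i in range(tables.n)].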

set_all_edges()

Sets all table edges to True.

set_border()

Sets table border edges to True.

set_edges(vertical, horizontal, joint_tol=2)

Sets a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.

Parameters:
  • vertical (list) – List of detected vertical lines.
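  • horizontal (list) – List of detected horizontal lines.
  • joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.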
set_span()

Sets a cell’s hspan or vspan attribute to True depending on whether the cell spans horizontally or vertically.

to_csv(path, **kwargs)

Writes Table to a comma-separated values (csv) file.

For kwargs, check pandas.DataFrame.to_csv().

Parameters:
  • path (str) – Output filepath.
to_excel(path, **kwargs)

Writes Table to an Excel file.

For kwargs, check pandas.DataFrame.to_excel().

Parameters:
  • path (str) – Output filepath.
to_html(path, **kwargs)

Writes Table to an HTML file.

For kwargs, check pandas.DataFrame.to_html().

Parameters:
  • path (str) – Output filepath.
to_json(path, **kwargs)

Writes Table to a JSON file.

For kwargs, check pandas.DataFrame.to_json().

Parameters:
  • path (str) – Output filepath.
to_sqlite(path, **kwargs)

Writes Table to a SQLite database.

For kwargs, check pandas.DataFrame.to_sql().

Parameters:
  • path (str) – Output filepath.
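
Since the five writers share the same (path, **kwargs) shape, a single table can be saved by dispatching on the file extension. This is a sketch under assumptions: save_table is a hypothetical helper, and the extension-to-method mapping is an illustrative choice, not something Camelot defines.

```python
import os

# Illustrative mapping from file extension to the Table writer methods above.
WRITERS = {
    ".csv": "to_csv",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".json": "to_json",
    ".sqlite": "to_sqlite",
}


def save_table(table, path, **kwargs):
    """Pick a writer from the extension of path and forward kwargs to it."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in WRITERS:
        raise ValueError(f"no writer for extension {extension!r}")
    method_name = WRITERS[extension]
    getattr(table, method_name)(path, **kwargs)
    return method_name
```

Extra keyword arguments pass straight through, so save_table(table, "out.json", orient="records") would reach pandas.DataFrame.to_json via to_json.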
class camelot.core.Cell(x1, y1, x2, y2)

Defines a cell in a table with coordinates relative to a left-bottom origin (PDF coordinate space).

Parameters:
  • x1 (float) – x-coordinate of left-bottom point.
  • y1 (float) – y-coordinate of left-bottom point.
  • x2 (float) – x-coordinate of right-top point.
  • y2 (float) – y-coordinate of right-top point.
lb

tuple – Tuple representing left-bottom coordinates.

lt

tuple – Tuple representing left-top coordinates.

rb

tuple – Tuple representing right-bottom coordinates.

rt

tuple – Tuple representing right-top coordinates.

left

bool – Whether or not cell is bounded on the left.

right

bool – Whether or not cell is bounded on the right.

top

bool – Whether or not cell is bounded on the top.

bottom

bool – Whether or not cell is bounded on the bottom.

hspan

bool – Whether or not cell spans horizontally.

vspan

bool – Whether or not cell spans vertically.

text

string – Text assigned to cell.
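
The relationship between the constructor arguments and the four corner attributes can be shown with a stand-in class. CellSketch is a hypothetical illustration of the geometry described above (left-bottom origin, (x1, y1) as left-bottom, (x2, y2) as right-top); it is not Camelot's Cell implementation and omits the edge, span and text attributes.

```python
from dataclasses import dataclass


@dataclass
class CellSketch:
    """Stand-in for the corner geometry of camelot.core.Cell."""

    x1: float  # x-coordinate of left-bottom point
    y1: float  # y-coordinate of left-bottom point
    x2: float  # x-coordinate of right-top point
    y2: float  # y-coordinate of right-top point

    @property
    def lb(self):
        return (self.x1, self.y1)

    @property
    def lt(self):
        return (self.x1, self.y2)

    @property
    def rb(self):
        return (self.x2, self.y1)

    @property
    def rt(self):
        return (self.x2, self.y2)
```

For CellSketch(10, 20, 110, 50), the top edge runs from lt = (10, 50) to rt = (110, 50), above the bottom edge at y = 20.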