API Reference#

Main Interface#

camelot.read_pdf(filepath: str | IO[Any] | Path, pages='1', password=None, flavor='lattice', suppress_stdout=False, parallel=False, layout_kwargs=None, debug=False, **kwargs)[source]#

Read PDF and return extracted tables.

Note: kwargs annotated with ^ can only be used with flavor=’stream’ or flavor=’network’ and kwargs annotated with * can only be used with flavor=’lattice’. The hybrid parser accepts kwargs with both annotations.

Parameters:
  • filepath (str, Path, IO) – Filepath or URL of the PDF file.

  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.

  • password (str, optional (default: None)) – Password for decryption.

  • flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’, ‘stream’, ‘network’ or ‘hybrid’). Lattice is used by default.

  • suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.

  • parallel (bool, optional (default: False)) – Process pages in parallel using all available CPU cores.

  • layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns^ (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates in each string are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.

  • process_background* (bool, optional (default: False)) – Process background lines.

  • line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the detected lines. Making it very large will lead to text being detected as lines.

  • copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.

  • shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.

  • line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.

  • joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.

  • threshold_blocksize* (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant* (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations* (int, optional (default: 0)) –

    Number of times erosion/dilation is applied.

    For more information, refer to OpenCV’s dilate.

  • backend* (str, optional (default: 'pdfium')) – The backend to use for converting the PDF to an image so it can be processed by OpenCV.

  • use_fallback* (bool, optional (default: True)) – Fall back to another backend if the chosen one is unavailable.

  • resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.

Returns:

tables

Return type:

camelot.core.TableList
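
The annotation rule in the note above can be checked before calling read_pdf. The following is a minimal sketch and not part of camelot: STREAM_ONLY, LATTICE_ONLY, check_kwargs and read_tables are hypothetical names, and the two sets simply copy the ^ and * annotations from the parameter list.

```python
# Hypothetical helper, not part of camelot: the sets below mirror the
# ^ and * annotations from the read_pdf parameter list.
STREAM_ONLY = {"columns", "row_tol", "column_tol"}  # annotated with ^
LATTICE_ONLY = {  # annotated with *
    "process_background", "line_scale", "copy_text", "shift_text",
    "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
    "iterations", "backend", "use_fallback", "resolution",
}


def check_kwargs(flavor, **kwargs):
    """Return the kwargs that the note says do not apply to `flavor`."""
    if flavor == "hybrid":
        return set()  # the hybrid parser accepts both kinds of kwargs
    if flavor == "lattice":
        disallowed = STREAM_ONLY
    elif flavor in ("stream", "network"):
        disallowed = LATTICE_ONLY
    else:
        raise ValueError(f"unknown flavor: {flavor!r}")
    return disallowed & set(kwargs)


def read_tables(filepath, flavor="lattice", pages="1", **kwargs):
    """Validate kwargs against the flavor, then delegate to camelot.read_pdf."""
    bad = check_kwargs(flavor, **kwargs)
    if bad:
        raise ValueError(f"{sorted(bad)} cannot be used with flavor={flavor!r}")
    import camelot  # imported lazily so the check itself works without camelot

    return camelot.read_pdf(filepath, pages=pages, flavor=flavor, **kwargs)
```

Calling read_tables("report.pdf", flavor="stream", row_tol=5) would pass the check, while adding line_scale=40 to the same call would raise before any PDF is opened. The file name here is only a placeholder.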

Lower-Level Classes#

class camelot.handlers.PDFHandler(filepath: str | IO[Any] | Path, pages='1', password=None, debug=False)[source]#

Handles all operations on the PDFs.

Handles all operations like temp directory creation, splitting file into single page PDFs, parsing each PDF and then removing the temp directory.

Parameters:
  • filepath (str, Path, IO) – Filepath or URL of the PDF file.

  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.

  • password (str, optional (default: None)) – Password for decryption.

  • debug (bool, optional (default: False)) – Whether the parser should store debug information during parsing.
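
To make the pages syntax concrete, here is a minimal sketch of how a string such as '1,4-end' could be expanded into page numbers. It is an illustration only, not camelot's own parser; expand_pages and its page_count argument (the total number of pages in the PDF) are hypothetical.

```python
def expand_pages(pages, page_count):
    """Expand a page string such as '1,4-end' into a sorted list of page numbers."""
    if pages == "all":
        return list(range(1, page_count + 1))
    selected = set()
    for part in pages.split(","):
        if "-" in part:
            # A range such as '4-end' or '2-5'; 'end' means the last page.
            start, end = part.split("-")
            last = page_count if end == "end" else int(end)
            selected.update(range(int(start), last + 1))
        else:
            selected.add(int(part))
    return sorted(selected)
```

For a six-page document, '1,4-end' selects pages 1, 4, 5 and 6 under this reading of the syntax.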

parse(flavor: str = 'lattice', suppress_stdout: bool = False, parallel: bool = False, layout_kwargs: dict[str, Any] | None = None, **kwargs)[source]#

Extract tables by calling parser.get_tables on all single page PDFs.

Parameters:
  • flavor (str (default: 'lattice')) – The parsing method to use. Lattice is used by default.

  • suppress_stdout (bool (default: False)) – Suppress logs and warnings.

  • parallel (bool (default: False)) – Process pages in parallel using all available CPU cores.

  • layout_kwargs (dict, optional (default: {})) –

    A dict of pdfminer.layout.LAParams kwargs.

  • kwargs (dict) – See camelot.read_pdf kwargs.

Returns:

tables – List of tables found in PDF.

Return type:

camelot.core.TableList

class camelot.parsers.Stream(table_regions=None, table_areas=None, columns=None, split_text=False, flag_size=False, strip_text='', edge_tol=50, row_tol=2, column_tol=0, **kwargs)[source]#

Stream method of parsing looks for spaces between text to parse the table.

If you want to specify columns when specifying multiple table areas, make sure that the lengths of both lists are equal.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates in each string are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.

  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
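
The string formats for table_areas and columns, and the requirement that both lists have the same length, can be sketched as follows. area_string and stream_kwargs are hypothetical helpers, not camelot functions; the ordering check relies on the left-bottom origin of PDF coordinate space described elsewhere in this reference.

```python
def area_string(x1, y1, x2, y2):
    """Format a table area as 'x1,y1,x2,y2' (left-top, then right-bottom)."""
    # The PDF origin is at the left-bottom, so the top edge has the larger y.
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected (x1, y1) left-top and (x2, y2) right-bottom")
    return f"{x1},{y1},{x2},{y2}"


def stream_kwargs(areas, columns=None):
    """Build table_areas/columns kwargs with one columns string per area."""
    kwargs = {"table_areas": [area_string(*a) for a in areas]}
    if columns is not None:
        if len(columns) != len(areas):
            raise ValueError("columns and table_areas must have equal length")
        # Each entry becomes one comma-separated string of x-coordinates.
        kwargs["columns"] = [",".join(str(x) for x in xs) for xs in columns]
    return kwargs
```

The resulting dict could then be unpacked into a call such as camelot.read_pdf(path, flavor='stream', **kwargs).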

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)#

Prepare the page for parsing.

record_parse_metadata(table)[source]#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

List of table bounding boxes, sorted by position.

Return type:

list

class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=None, split_text=False, flag_size=False, strip_text='', line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, resolution=300, use_fallback=True, backend='pdfium', **kwargs)[source]#

Lattice method looks for lines between text to parse the table.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • process_background (bool, optional (default: False)) – Process background lines.

  • line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.

  • copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.

  • shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.

  • joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.

  • threshold_blocksize (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations (int, optional (default: 0)) –

    Number of times erosion/dilation is applied.

    For more information, refer to OpenCV’s dilate.

  • backend (str, optional (default: 'pdfium')) – The backend to use for converting the PDF to an image so it can be processed by OpenCV.

  • use_fallback (bool, optional (default: True)) – Fall back to another backend if the chosen one is unavailable.

  • resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
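
As an illustration of what a tolerance such as line_tol means, the sketch below merges line positions that lie within a tolerance of each other. It is a simplified stand-in written for this reference; merge_close_lines is a hypothetical name and camelot's actual merging logic may differ.

```python
def merge_close_lines(positions, line_tol=2):
    """Merge line positions that lie within `line_tol` of the previous one."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= line_tol:
            groups[-1].append(pos)  # close enough: treat as the same line
        else:
            groups.append([pos])  # too far away: start a new line
    # Represent each merged line by the mean of its members.
    return [sum(group) / len(group) for group in groups]
```

With the default tolerance, positions 100 and 101 collapse into a single line, whereas line_tol=0 keeps them apart.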

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)#

Prepare the page for parsing.

record_parse_metadata(table)[source]#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

List of table bounding boxes, sorted by position.

Return type:

list

class camelot.parsers.Network(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', edge_tol=None, row_tol=2, column_tol=0, debug=False, **kwargs)[source]#

Network method looks for spaces between text to parse the table.

If you want to specify columns when specifying multiple table areas, make sure that the lengths of both lists are equal.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates in each string are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • edge_tol (int, optional (default: None)) – Tolerance parameter for extending textedges vertically.

  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)#

Prepare the page for parsing.

record_parse_metadata(table)#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

List of table bounding boxes, sorted by position.

Return type:

list

class camelot.parsers.Hybrid(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', edge_tol=None, row_tol=2, column_tol=0, debug=False, **kwargs)[source]#

Defines a hybrid parser, leveraging both network and lattice parsers.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns (list, optional (default: None)) – List of column x-coordinate strings, where the coordinates in each string are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • edge_tol (int, optional (default: None)) – Tolerance parameter for extending textedges vertically.

  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)[source]#

Prepare the page for parsing.

Parameters:
  • filename ([type]) – [description]

  • layout ([type]) – [description]

  • dimensions ([type]) – [description]

  • page_idx ([type]) – [description]

  • layout_kwargs ([type]) – [description]

record_parse_metadata(table)#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

List of table bounding boxes, sorted by position.

Return type:

list

Lower-Lower-Level Classes#

class camelot.core.TableList(tables: Iterable[Table])[source]#

Defines a list of camelot.core.Table objects.

Each table can be accessed using its index.

n#

Number of tables in the list.

Type:

int

export(path: str, f='csv', compress=False)[source]#

Export the list of tables to specified file format.

Parameters:
  • path (str) – Output filepath.

  • f (str) – File format. Can be csv, excel, html, json, markdown or sqlite.

  • compress (bool) – Whether or not to add files to a ZIP archive.

property n: int#

The number of tables in the list.

class camelot.core.Table(cols, rows)[source]#

Defines a table with coordinates relative to a left-bottom origin.

(PDF coordinate space)

Parameters:
  • cols (list) – List of tuples representing column x-coordinates in increasing order.

  • rows (list) – List of tuples representing row y-coordinates in decreasing order.

df#

Type:

pandas.DataFrame

shape#

Shape of the table.

Type:

tuple

accuracy#

Accuracy with which text was assigned to the cell.

Type:

float

whitespace#

Percentage of whitespace in the table.

Type:

float

filename#

Path of the original PDF.

Type:

str

order#

Table number on PDF page.

Type:

int

page#

PDF page number.

Type:

int

copy_spanning_text(copy_text=None)[source]#

Copies over text in empty spanning cells.

Parameters:

copy_text (list of str, optional (default: None)) – Select one or more of the following strings: {‘h’, ‘v’} to specify the direction in which text should be copied over when a cell spans multiple rows or columns.

Returns:

The updated table with copied text in spanning cells.

Return type:

camelot.core.Table

property data#

Returns two-dimensional list of strings in table.

get_pdf_image()[source]#

Compute pdf image and cache it.

property parsing_report#

Returns a parsing report with % accuracy, % whitespace, table number on page and page number.
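
To show how the data property and the whitespace attribute might relate, the sketch below computes a whitespace percentage from a two-dimensional list of strings. It assumes that whitespace means the share of cells with no text, which this reference does not state explicitly; whitespace_percentage is a hypothetical helper, not a camelot function.

```python
def whitespace_percentage(data):
    """Return the percentage of cells whose text is empty after stripping."""
    cells = [cell for row in data for cell in row]
    if not cells:
        return 0.0  # an empty table has no cells to count
    empty = sum(1 for cell in cells if cell.strip() == "")
    return 100.0 * empty / len(cells)
```

A 2 x 2 table with two blank cells yields 50.0 under this assumption.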

set_all_edges()[source]#

Set all table edges to True.

set_border()[source]#

Sets table border edges to True.

set_edges(vertical, horizontal, joint_tol=2)[source]#

Set the edges of the joint.

Set a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.

Parameters:
  • vertical (list) – List of detected vertical lines.

  • horizontal (list) – List of detected horizontal lines.

  • joint_tol (int, optional (default: 2)) – Tolerance for determining proximity.

to_csv(path, **kwargs)[source]#

Write Table(s) to a comma-separated values (csv) file.

For kwargs, check pandas.DataFrame.to_csv().

Parameters:

path (str) – Output filepath.

to_excel(path, **kwargs)[source]#

Write Table(s) to an Excel file.

For kwargs, check pandas.DataFrame.to_excel().

Parameters:

path (str) – Output filepath.

to_html(path, **kwargs)[source]#

Write Table(s) to an HTML file.

For kwargs, check pandas.DataFrame.to_html().

Parameters:

path (str) – Output filepath.

to_json(path, **kwargs)[source]#

Write Table(s) to a JSON file.

For kwargs, check pandas.DataFrame.to_json().

Parameters:

path (str) – Output filepath.

to_markdown(path, **kwargs)[source]#

Write Table(s) to a Markdown file.

For kwargs, check pandas.DataFrame.to_markdown().

Parameters:

path (str) – Output filepath.

to_sqlite(path, **kwargs)[source]#

Write Table(s) to sqlite database.

For kwargs, check pandas.DataFrame.to_sql().

Parameters:

path (str) – Output filepath.
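
The to_* methods above follow one naming pattern, so a format name such as those accepted by TableList.export can be mapped to the matching method. The sketch below shows one way to do that; export_table, FORMATS and FakeTable are hypothetical names used for illustration, and this is not a description of how TableList.export is implemented.

```python
# Format names listed for TableList.export in this reference.
FORMATS = ("csv", "excel", "html", "json", "markdown", "sqlite")


def export_table(table, path, f="csv", **kwargs):
    """Call the Table.to_<f> method that matches the requested format."""
    if f not in FORMATS:
        raise ValueError(f"unsupported format: {f!r}")
    return getattr(table, f"to_{f}")(path, **kwargs)


class FakeTable:
    """Stand-in for camelot.core.Table that records calls instead of writing."""

    def __init__(self):
        self.calls = []

    def to_csv(self, path, **kwargs):
        self.calls.append(("csv", path, kwargs))
```

Extra keyword arguments are forwarded unchanged, in the same way the to_* methods pass them on to the corresponding pandas.DataFrame method.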

class camelot.core.Cell(x1, y1, x2, y2)[source]#

Defines a cell in a table.

With coordinates relative to a left-bottom origin. (PDF coordinate space)

Parameters:
  • x1 (float) – x-coordinate of left-bottom point.

  • y1 (float) – y-coordinate of left-bottom point.

  • x2 (float) – x-coordinate of right-top point.

  • y2 (float) – y-coordinate of right-top point.

lb#

Tuple representing left-bottom coordinates.

Type:

tuple

lt#

Tuple representing left-top coordinates.

Type:

tuple

rb#

Tuple representing right-bottom coordinates.

Type:

tuple

rt#

Tuple representing right-top coordinates.

Type:

tuple

left#

Whether or not cell is bounded on the left.

Type:

bool

right#

Whether or not cell is bounded on the right.

Type:

bool

top#

Whether or not cell is bounded on the top.

Type:

bool

bottom#

Whether or not cell is bounded on the bottom.

Type:

bool

text#

Text assigned to cell.

Type:

str
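
The four corner attributes follow directly from the constructor arguments. The sketch below derives them with a small dataclass; CellSketch is a hypothetical stand-in that covers only the corner tuples, not the real camelot.core.Cell.

```python
from dataclasses import dataclass


@dataclass
class CellSketch:
    """Illustrative stand-in for camelot.core.Cell (corner attributes only)."""

    x1: float  # x-coordinate of the left-bottom point
    y1: float  # y-coordinate of the left-bottom point
    x2: float  # x-coordinate of the right-top point
    y2: float  # y-coordinate of the right-top point

    @property
    def lb(self):
        return (self.x1, self.y1)

    @property
    def lt(self):
        return (self.x1, self.y2)

    @property
    def rb(self):
        return (self.x2, self.y1)

    @property
    def rt(self):
        return (self.x2, self.y2)
```

Because the origin is at the left-bottom, y2 is expected to be larger than y1 for a well-formed cell.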

Plotting#

camelot.plot(table, kind='text', filename=None, ax=None)#

Classmethod for plotting methods.

class camelot.plotting.PlotMethods[source]#

Classmethod for plotting methods.

static contour(table, ax=None)[source]#

Generate a plot for all table boundaries present on the PDF page.

Parameters:
  • table (camelot.core.Table)

  • ax (matplotlib.axes.Axes, optional)

Returns:

fig

Return type:

matplotlib.figure.Figure

static grid(table, ax=None)[source]#

Generate a plot for the detected table grids on the PDF page.

Parameters:
  • table (camelot.core.Table)

  • ax (matplotlib.axes.Axes, optional)

Returns:

fig

Return type:

matplotlib.figure.Figure

static joint(table, ax=None)[source]#

Generate a plot for all line intersections present on the PDF page.

Parameters:
  • table (camelot.core.Table)

  • ax (matplotlib.axes.Axes, optional)

Returns:

fig

Return type:

matplotlib.figure.Figure

static line(table, ax=None)[source]#

Generate a plot for all line segments present on the PDF page.

Parameters:
  • table (camelot.core.Table)

  • ax (matplotlib.axes.Axes, optional)

Returns:

fig

Return type:

matplotlib.figure.Figure

Generate a plot illustrating the steps of the network table search.

Parameters:
  • table (camelot.core.Table)

  • ax (matplotlib.axes.Axes, optional)

Returns:

fig

Return type:

matplotlib.figure.Figure

text(table, ax=None)[source]#

Generate a plot for all text elements present on the PDF page.

Parameters:
  • table (camelot.core.Table)

  • ax (matplotlib.axes.Axes, optional)

Returns:

fig

Return type:

matplotlib.figure.Figure

static textedge(table, ax=None)[source]#

Generate a plot for relevant textedges.

Parameters:
  • table (camelot.core.Table)

  • ax (matplotlib.axes.Axes, optional)

Returns:

fig

Return type:

matplotlib.figure.Figure
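
A thin wrapper around camelot.plot can reject an unknown kind before any plotting work starts. The sketch below assumes that kind corresponds to the PlotMethods method names with explicit signatures in this section, which this reference does not state explicitly; plot_table and PLOT_KINDS are hypothetical names.

```python
# Method names with explicit signatures under camelot.plotting.PlotMethods.
PLOT_KINDS = ("contour", "grid", "joint", "line", "text", "textedge")


def plot_table(table, kind="text", filename=None):
    """Validate `kind`, then delegate to camelot.plot."""
    if kind not in PLOT_KINDS:
        raise ValueError(f"kind must be one of {PLOT_KINDS}, got {kind!r}")
    import camelot  # lazy import: the validation works without camelot installed

    return camelot.plot(table, kind=kind, filename=filename)
```

A typical call would pass one table from a TableList, for example plot_table(tables[0], kind='grid').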