API Reference

Main Interface

camelot.read_pdf(filepath: str | Path | bytes | bytearray | memoryview | IO[bytes], pages='1', password=None, flavor='lattice', suppress_stdout=False, parallel=False, cpu_count=None, layout_kwargs=None, per_page=None, debug=False, **kwargs)[source]#

Read PDF and return extracted tables.

Note: kwargs annotated with ^ can only be used with flavor='stream' or flavor='network', and kwargs annotated with * can only be used with flavor='lattice'. The hybrid parser accepts kwargs with either annotation.

Parameters:
  • filepath (str, Path, bytes, or binary file-like) – Source PDF. Accepts a filesystem path / URL, a bytes-like object, or any binary stream with a .read() method (io.BytesIO, an open "rb" file, requests response .raw, etc). For in-memory inputs the bytes are spilled to a temporary file once and cleaned up on context-manager exit, so the Lattice OpenCV image-conversion backend keeps working unchanged. Originally requested in #170 / #245 / #270.

  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.

  • password (str, optional (default: None)) – Password for decryption.

  • flavor (str (default: 'lattice')) –

    The parsing method to use. Valid values:

    • 'lattice' (default): line-ruled tables.

    • 'stream': borderless tables with whitespace-separated columns.

    • 'network': borderless tables via text-edge alignment connectivity.

    • 'hybrid': combines layout- and image-based analysis.

    • 'ml': neural table-structure recognition (Table Transformer) for the structure, with cell text filled from the PDF’s own text layer (no hallucinated values). Requires the optional ML dependencies: pip install 'camelot-py[ml]'. Best for borderless tables where the heuristic parsers plateau.

    • 'auto': detect the flavor per page (count ruled lines on each rendered page) and parse each group accordingly — ruled pages via lattice with engine='combined', the rest via network — then merge. Handles documents that mix text-only cover pages with ruled tables deeper in. A UserWarning reports the per-page choices. (More accurate but slower, since it renders every page for the probe.)

  • suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.

  • parallel (bool, optional (default: False)) – Process pages in parallel using all available CPU cores.

  • cpu_count (int, optional (default: None)) – Maximum number of worker processes when parallel=True. None (default) uses all available cores. Values are clamped to [1, multiprocessing.cpu_count()]. Ignored when parallel=False.

  • layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.

  • per_page (dict, optional (default: None)) –

    Per-page parameter overrides. Maps a 1-indexed page number (int or str) to a dict of any keyword argument otherwise valid for read_pdf. Values supplied here override the globally-supplied kwargs for that one page only — every other page keeps the global values. Useful for multi-layout PDFs where different pages need different table_areas, columns, flavor, etc. The per-page flavor itself may be overridden; the global flavor applies otherwise. Originally proposed by @sverma25 in #41.

    Example:

    tables = camelot.read_pdf(
        "report.pdf",
        pages="1-3",
        flavor="stream",
        split_text=True,
        per_page={2: {"table_areas": ["120, 210, 400, 90"]}},
    )
    

    Here pages 1 and 3 use only the global flavor="stream" and split_text=True; page 2 uses both of those plus the page-specific table_areas.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • header_text^ (list, optional (default: None)) – List of substrings identifying a text line above a stream table. When table_areas is not supplied and a matching line is found, its bottom coordinate becomes the top edge of the derived table area. If no match is found, Camelot falls back to automatic table detection.

  • footer_text^ (list, optional (default: None)) – List of substrings identifying a text line below a stream table. When table_areas is not supplied and a matching line is found, its top coordinate becomes the bottom edge of the derived table area. If no match is found, Camelot falls back to automatic table detection.

  • columns^ (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str or sequence of str, optional (default: '')) – Characters or substrings to strip from each cell before assignment. A str strips per-character — every character in the string is removed wherever it appears (e.g. " \n" drops all spaces and newlines). A list/tuple of str strips whole substrings (e.g. ["[1]", "[2]"] removes those footnote markers but leaves bare [/] alone). Whole-substring mode requested in #484.

  • replace_text (dict, optional (default: None)) – Mapping of substring → replacement applied to every cell’s text just before it is written into the table. Keys are matched as literal substrings (regex metacharacters are escaped). Useful for collapsing soft-broken words (e.g. {" \n": " "}), normalising abbreviations, or rewriting unit names. Distinct from strip_text which can only remove characters; this can replace with arbitrary text. Requested in #481. (#482)

  • row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.

  • process_background* (bool, optional (default: False)) – Process background lines.

  • line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.

  • copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.

  • shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.

  • line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.

  • joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.

  • threshold_blocksize* (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant* (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations* (int, optional (default: 0)) –

    Number of dilation passes applied to close small gaps in the line mask.

    For more information, refer to OpenCV’s dilate.

  • erode_iterations* (int, optional (default: 0)) – Number of erosion passes applied after dilation. Set equal to iterations for a morphological closing — bridges gaps in ruled lines without thickening the mask overall (which avoids the spurious extra-row artefact reported in #363). (#363)

  • backend* (str, optional (default: 'pdfium')) – The backend used to convert the PDF to an image so that OpenCV can process it.

  • use_fallback* (bool, optional (default: True)) – Fall back to another backend if the selected one is unavailable.

  • resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.

  • engine* (str, optional (default: 'combined')) –

    Line-detection engine for flavor='lattice' (and the lattice half of flavor='hybrid'):

    • 'combined' (default): render the page and detect ruled lines with OpenCV and union in the ruled lines read from the PDF’s native vector graphics, so tables whose rules render faintly (vector strokes, anti-aliasing) are still found. Safe by construction — raster always runs, vector lines can only add, and they’re clipped to table_regions — so it never does worse than 'raster' (#763).

    • 'raster': render the page and detect ruled lines with OpenCV only — the pre-#763 behaviour.

    • 'vector': detect tables straight from the PDF’s vector ruled lines, skipping rasterisation entirely — the fastest path, for PDFs whose tables are drawn with real vector strokes (#763).

    With flavor='hybrid' the same choices select how its lattice half finds ruled lines; engine='vector' there is the render-free hybrid — vector ruled lines merged with the network text-edge alignment — for partial-ruled / borderless tables at roughly an order of magnitude less time than the raster path (#39).
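
The two strip_text modes and replace_text described in the parameter list can be sketched in plain Python. This is only an illustration of the documented behaviour, not Camelot's implementation; strip_cell and replace_cell are hypothetical names that do not exist in the library.

```python
from typing import Mapping, Sequence, Union


def strip_cell(text: str, strip_text: Union[str, Sequence[str]] = "") -> str:
    """Hypothetical helper mirroring the two documented strip_text modes."""
    if isinstance(strip_text, str):
        # str mode: drop every listed character wherever it appears.
        return "".join(ch for ch in text if ch not in strip_text)
    # list/tuple mode: drop whole substrings only.
    for token in strip_text:
        text = text.replace(token, "")
    return text


def replace_cell(text: str, replace_text: Mapping[str, str]) -> str:
    """Hypothetical helper: literal substring replacement, no regex involved."""
    for old, new in replace_text.items():
        text = text.replace(old, new)
    return text


print(strip_cell("Net [1]\nsales", " \n"))        # per-character mode
print(strip_cell("Net [1] sales [x]", ["[1]"]))   # whole-substring mode
print(replace_cell("inter \nnational", {" \n": ""}))
```

In whole-substring mode the bare brackets in "[x]" survive, which is the difference from per-character stripping with "[]1".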

Returns:

tables

Return type:

camelot.core.TableList

Notes

Encrypted PDFs / extraction permissions (#590). Camelot honours the text-extraction permission in the /Encrypt dictionary: read_pdf raises playa.exceptions.PDFTextExtractionNotAllowed if the PDF is encrypted and the user-password permission set forbids text extraction. The check runs on the document object returned by playa.open, while the encryption metadata is still attached. This is a real behavioural change from the pre-1.0 backend, where per-page temp-PDF splitting silently dropped the metadata, so the check was effectively a no-op. The PDF specification enforces the flag only through the encryption layer: for unencrypted PDFs that carry a “no extraction” claim via /Perms there is no enforcement mechanism, and Camelot extracts the text. Supplying the document owner password through password= bypasses the user-password permission set, which matches every other PDF tool.

Examples

>>> import camelot
>>> tables = camelot.read_pdf("foo.pdf")  # xdoctest: +SKIP
>>> tables.n  # xdoctest: +SKIP
1
>>> tables[0].df  # xdoctest: +SKIP
>>> tables[0].to_csv("foo.csv")  # xdoctest: +SKIP

Select a parser and restrict extraction to a page range:

>>> tables = camelot.read_pdf(  # xdoctest: +SKIP
...     "foo.pdf", flavor="lattice", pages="1-3"
... )
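
The per_page override rule from the parameter list (page-specific values win, every other page keeps the global kwargs) can be sketched as a dictionary merge. This is a minimal sketch under that reading of the docs; effective_kwargs is a hypothetical helper, not part of Camelot.

```python
from typing import Any, Dict, Optional, Union

PageKey = Union[int, str]


def effective_kwargs(
    page: int,
    global_kwargs: Dict[str, Any],
    per_page: Optional[Dict[PageKey, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: global kwargs overlaid with one page's overrides."""
    per_page = per_page or {}
    # The docs allow int or str page keys, so check both spellings.
    overrides = per_page.get(page, per_page.get(str(page), {}))
    return {**global_kwargs, **overrides}


global_kwargs = {"flavor": "stream", "split_text": True}
per_page = {2: {"table_areas": ["120,210,400,90"]}}

for page in (1, 2, 3):
    print(page, effective_kwargs(page, global_kwargs, per_page))
```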

Lower-Level Classes

class camelot.handlers.PDFHandler(filepath: str | Path | bytes | bytearray | memoryview | IO[bytes], pages='1', password=None, debug=False)[source]#

Handles all operations on the PDF.

This includes creating the temp directory, splitting the file into single-page PDFs, parsing each one, and removing the temp directory afterwards.

Parameters:
  • filepath (str, Path, bytes, or binary file-like) – Source PDF. Accepts a filesystem path / URL, or — since #270 — a bytes-like object or any binary stream with a .read() method (io.BytesIO, an open "rb" file, requests response .raw, etc). In the in-memory cases the bytes are spilled to a temporary file once and cleaned up when the handler is closed; this keeps the rest of the pipeline (in particular the Lattice OpenCV image-conversion backend) unchanged.

  • pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.

  • password (str, optional (default: None)) – Password for decryption.

  • debug (bool, optional (default: False)) – Whether the parser should store debug information during parsing.

close() None[source]#

Delete the URL-downloaded temp file, if any.

Idempotent; safe to call from both __exit__ and an explicit handler.close() call. No-op when filepath was a user-owned path (we never delete a file the caller passed in).

property pages: list[int]#

Resolved 1-based page numbers, sorted and de-duplicated.

Lazy: only opens the PDF if the spec is something other than the default "1". Cached after first access.
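
How a pages spec such as '1,4-end' or 'all' might resolve to sorted, de-duplicated 1-based page numbers is sketched below. This assumes the total page count is already known; resolve_pages is a hypothetical helper and not the handler's actual code.

```python
from typing import List


def resolve_pages(spec: str, page_count: int) -> List[int]:
    """Hypothetical helper: expand a pages spec into sorted, unique page numbers."""
    if spec.strip() == "all":
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            # 'end' stands for the last page of the document.
            last = page_count if end.strip() == "end" else int(end)
            pages.update(range(int(start), last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(resolve_pages("1,4-end", page_count=6))
print(resolve_pages("3,1,3", page_count=5))
```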

parse(flavor: str = 'lattice', suppress_stdout: bool = False, parallel: bool = False, cpu_count: int | None = None, layout_kwargs: dict[str, Any] | None = None, per_page: dict[int, dict[str, Any]] | None = None, pages: list[int] | None = None, render_cache: dict[int, str] | None = None, **kwargs)[source]#

Extract tables by calling parser.get_tables on all single page PDFs.

Parameters:
  • flavor (str (default: 'lattice')) – The parsing method to use. Lattice is used by default.

  • suppress_stdout (bool (default: False)) – Suppress logs and warnings.

  • parallel (bool (default: False)) – Process pages in parallel using all available CPU cores.

  • cpu_count (int, optional (default: None)) – Maximum number of worker processes to use when parallel is True. None (default) uses all available cores. Values are clamped to [1, multiprocessing.cpu_count()]. Ignored when parallel is False.

  • layout_kwargs (dict, optional (default: {})) –

    A dict of pdfminer.layout.LAParams kwargs.

  • kwargs (dict) – See camelot.read_pdf kwargs.

Returns:

tables – List of tables found in PDF.

Return type:

camelot.core.TableList
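
The cpu_count clamping rule described above can be illustrated as follows. This is a sketch of the documented rule only; worker_count is a hypothetical name, and returning 1 for the serial case is an assumption made for illustration.

```python
import multiprocessing
from typing import Optional


def worker_count(parallel: bool, cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: number of worker processes implied by the docs."""
    available = multiprocessing.cpu_count()
    if not parallel:
        # cpu_count is ignored when parallel is False (assumed: one process).
        return 1
    if cpu_count is None:
        return available  # None means "use all available cores"
    # Clamp to [1, multiprocessing.cpu_count()].
    return max(1, min(cpu_count, available))


print(worker_count(parallel=True, cpu_count=2))
print(worker_count(parallel=False, cpu_count=8))
```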

class camelot.parsers.Stream(table_regions=None, table_areas=None, header_text=None, footer_text=None, columns=None, split_text=False, flag_size=False, strip_text='', replace_text=None, edge_tol=50, row_tol=2, column_tol=0, **kwargs)[source]#

Stream method of parsing looks for spaces between text to parse the table.

If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • header_text (list, optional (default: None)) – List of substrings identifying a text line above the table. When table_areas is not set, the matched line’s bottom coordinate is used as the table area’s top edge.

  • footer_text (list, optional (default: None)) – List of substrings identifying a text line below the table. When table_areas is not set, the matched line’s top coordinate is used as the table area’s bottom edge.

  • columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.

  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
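
To give a feel for what a vertical tolerance like row_tol does, the sketch below groups text items into rows when their y-coordinates lie within the tolerance of the row's first item. This is an assumed, simplified model for illustration; it is not taken from Camelot's source, and group_rows and TextItem are hypothetical names.

```python
from typing import List, Optional, Tuple

TextItem = Tuple[float, str]  # (y-coordinate, text); a simplified stand-in


def group_rows(items: List[TextItem], row_tol: float = 2) -> List[List[str]]:
    """Hypothetical helper: bucket text into rows by vertical proximity."""
    rows: List[List[str]] = []
    row_y: Optional[float] = None
    # Walk top to bottom (PDF y grows upward, so sort descending).
    for y, text in sorted(items, key=lambda item: -item[0]):
        if row_y is None or abs(row_y - y) > row_tol:
            rows.append([])  # too far from the current row: start a new one
            row_y = y
        rows[-1].append(text)
    return rows


items = [(700.0, "Name"), (699.0, "Qty"), (681.5, "4"), (680.0, "Bolt")]
print(group_rows(items, row_tol=2))
print(group_rows(items, row_tol=0))
```

A larger tolerance merges slightly misaligned text into one row; a tolerance of 0 puts every distinct y-coordinate in its own row.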

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)#

Prepare the page for parsing.

record_parse_metadata(table)[source]#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

Table bounding boxes, sorted by position.

Return type:

list

class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=None, split_text=False, flag_size=False, strip_text='', replace_text=None, line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, erode_iterations=0, resolution=300, use_fallback=True, backend='pdfium', engine='combined', **kwargs)[source]#

Lattice method looks for lines between text to parse the table.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • process_background (bool, optional (default: False)) – Process background lines.

  • line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.

  • copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.

  • shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.

  • joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.

  • threshold_blocksize (int, optional (default: 15)) –

    Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • threshold_constant (int, optional (default: -2)) –

    Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.

    For more information, refer to OpenCV’s adaptiveThreshold.

  • iterations (int, optional (default: 0)) –

    Number of dilation passes applied to close small gaps in the line mask (useful when a table’s ruled lines don’t quite meet at corners).

    For more information, refer to OpenCV’s dilate.

  • erode_iterations (int, optional (default: 0)) – Number of erosion passes applied after dilation. Set equal to iterations for a morphological closing (bridges gaps without thickening the mask, which avoids spurious extra rows above/below the detected table). See #363.

  • backend (str, optional (default: 'pdfium')) – The backend used to convert the PDF to an image so that OpenCV can process it.

  • use_fallback (bool, optional (default: True)) – Fall back to another backend if the selected one is unavailable.

  • resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.

  • engine (str, optional (default: 'combined')) –

    Line-detection engine (lattice only):

    • 'combined' (default): OpenCV on the rendered page plus the PDF’s native vector ruled lines unioned into the line masks before contour/joint detection — recovers tables whose rules render faintly. Safe by construction (raster always runs first, vector lines can only add; vector lines are clipped to table_regions so it never expands a table past the region).

    • 'raster': OpenCV on the rendered page only (the pre-#763 behaviour).

    • 'vector': detect tables purely from the PDF’s vector ruled lines, skipping rasterisation entirely — fastest, for PDFs whose tables are drawn with real vector strokes (#763).
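
Why setting erode_iterations equal to iterations bridges gaps without thickening the mask can be seen on a one-dimensional toy mask. The sketch below uses a 3-wide window in pure Python as a stand-in for OpenCV's dilate and erode on the real 2-D line mask; the function names are hypothetical and the kernel Camelot actually uses is not shown in this reference.

```python
from typing import List


def dilate(mask: List[int]) -> List[int]:
    """One dilation pass: a pixel turns on if any neighbour in a 3-wide window is on."""
    n = len(mask)
    return [max(mask[max(i - 1, 0):min(i + 2, n)]) for i in range(n)]


def erode(mask: List[int]) -> List[int]:
    """One erosion pass: a pixel stays on only if its whole 3-wide window is on."""
    n = len(mask)
    return [min(mask[max(i - 1, 0):min(i + 2, n)]) for i in range(n)]


def close_gaps(mask: List[int], iterations: int = 0, erode_iterations: int = 0) -> List[int]:
    """Dilate `iterations` times, then erode `erode_iterations` times."""
    for _ in range(iterations):
        mask = dilate(mask)
    for _ in range(erode_iterations):
        mask = erode(mask)
    return mask


line = [0, 0, 1, 1, 0, 1, 1, 0, 0]  # a ruled line with a one-pixel gap
print(close_gaps(line, iterations=1))                      # gap closed, line grown at both ends
print(close_gaps(line, iterations=1, erode_iterations=1))  # gap closed, original extent restored
```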

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)#

Prepare the page for parsing.

record_parse_metadata(table)[source]#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

Table bounding boxes, sorted by position.

Return type:

list

class camelot.parsers.Network(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', replace_text=None, edge_tol=None, row_tol=2, column_tol=0, debug=False, **kwargs)[source]#

Network method looks for spaces between text to parse the table.

If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.

  • edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.

  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)#

Prepare the page for parsing.

record_parse_metadata(table)#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

Table bounding boxes, sorted by position.

Return type:

list

class camelot.parsers.Hybrid(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', replace_text=None, edge_tol=None, row_tol=2, column_tol=0, debug=False, engine='combined', **kwargs)[source]#

Defines a hybrid parser, leveraging both network and lattice parsers.

Parameters:
  • table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.

  • columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.

  • split_text (bool, optional (default: False)) – Split text that spans across multiple cells.

  • flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.

  • strip_text (str or sequence of str, optional (default: '')) – Characters or substrings to strip from each cell. A str strips per-character; a list/tuple of str strips whole substrings (#484).

  • edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.

  • row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.

  • column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.

  • engine (str, optional (default: 'combined')) –

    Line-detection engine for hybrid’s lattice half (the network half is text-based and unaffected):

    • 'combined' (default): OpenCV on the rendered page plus the PDF’s native vector ruled lines unioned in — recovers faintly-rendered rules. Matches the flavor='lattice' default.

    • 'raster': detect ruled lines with OpenCV only (pre-#763).

    • 'vector': detect ruled lines straight from the PDF’s vector graphics, skipping rasterisation and OpenCV entirely — the render-free hybrid (network text-edge alignment merged with vector ruled lines) for partial-ruled / borderless tables at roughly an order of magnitude less time than the raster path. (#39)

compute_parse_errors(table)#

Compute parse errors for the table.

Parameters:

table (camelot.core.Table)

Returns:

Parse errors

Return type:

Tuple

extract_tables()#

Extract tables from the document.

prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)[source]#

Prepare the page for parsing.

Parameters:
  • filename ([type]) – [description]

  • layout ([type]) – [description]

  • dimensions ([type]) – [description]

  • page_idx ([type]) – [description]

  • layout_kwargs ([type]) – [description]

record_parse_metadata(table)#

Record data about the origin of the table.

table_bboxes()#

Return a list of table bounding boxes sorted by position.

Returns:

Table bounding boxes, sorted by position.

Return type:

list

Lower-Lower-Level Classes

class camelot.core.TableList(tables: Iterable[Table])[source]#

Defines a list of camelot.core.Table objects.

Each table can be accessed using its index.

n#

Number of tables in the list.

Type:

int

Examples

>>> from camelot.core import TableList
>>> tables = TableList([])
>>> tables.n
0
>>> tables
<TableList n=0>

export(path: str, f='csv', compress=False)[source]#

Export the list of tables to specified file format.

Parameters:
  • path (str) – Output filepath.

  • f (str) – File format. Can be csv, excel, html, json, markdown or sqlite.

  • compress (bool) – Whether or not to add files to a ZIP archive.

filter(min_rows: int = 1, min_columns: int = 1, min_accuracy: float = 0.0, max_whitespace: float = 100.0) TableList[source]#

Return a new TableList keeping only tables that pass all thresholds.

A post-extraction convenience for dropping noise / low-quality tables (single stray cells, mostly-empty regions, …). Parsing is unchanged — everything is still detected; this just selects from the result. Every threshold defaults to a no-op, so calling filter() with no arguments returns an equivalent list and a legitimate single-row or single-column table is never dropped unless you ask for it.

Parameters:
  • min_rows (int, optional (default: 1)) – Drop tables with fewer than this many rows.

  • min_columns (int, optional (default: 1)) – Drop tables with fewer than this many columns.

  • min_accuracy (float, optional (default: 0.0)) – Drop tables whose parsing_report accuracy (0-100) is below this value.

  • max_whitespace (float, optional (default: 100.0)) – Drop tables whose parsing_report whitespace (0-100) is above this value.

Returns:

A new list (the original is left untouched), so calls compose: tables.filter(min_rows=2).filter(min_accuracy=90).

Return type:

camelot.core.TableList
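
The threshold semantics of filter can be sketched with a small stand-in class. This only illustrates the documented rules (defaults are no-ops, a new list is returned, calls compose); TableStub and filter_tables are hypothetical and are not Camelot objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableStub:
    """Hypothetical stand-in for a table, holding only the fields needed here."""
    rows: int
    columns: int
    accuracy: float    # 0-100, as in parsing_report
    whitespace: float  # 0-100, as in parsing_report


def filter_tables(
    tables: List[TableStub],
    min_rows: int = 1,
    min_columns: int = 1,
    min_accuracy: float = 0.0,
    max_whitespace: float = 100.0,
) -> List[TableStub]:
    """Return a new list with only the tables that pass every threshold."""
    return [
        t for t in tables
        if t.rows >= min_rows
        and t.columns >= min_columns
        and t.accuracy >= min_accuracy
        and t.whitespace <= max_whitespace
    ]


tables = [
    TableStub(rows=1, columns=1, accuracy=99.0, whitespace=0.0),
    TableStub(rows=5, columns=3, accuracy=95.0, whitespace=12.0),
    TableStub(rows=4, columns=2, accuracy=60.0, whitespace=80.0),
]
kept = filter_tables(filter_tables(tables, min_rows=2), min_accuracy=90)
print(kept)
```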

property n: int#

The number of tables in the list.

stack_contiguous(match: str = 'column_count', keep_first_header: bool = False) TableList[source]#

Vertically stack tables that look like continuations across pages.

Many PDF reports break a single logical table over several pages — a header on every page, a footer on every page, body rows in between. read_pdf returns one Table per page; this helper stitches contiguous ones back together so the resulting TableList has one entry per logical table instead of one per physical page.

Parameters:
  • match (str, optional (default: 'column_count')) –

    How to decide whether two adjacent tables are continuations of each other.

    • 'column_count' — same number of columns (the rule from #628’s POC; the common case).

    • 'first_row' — same column count and identical text in the first row (catches PDFs that repeat the header on every page).

  • keep_first_header (bool, optional (default: False)) – When match='first_row', the matching first row of every continuation table is dropped (so the stacked table has exactly one header row). Set to True to keep every page’s header row in the stacked output.

Returns:

A new TableList with continuation runs collapsed. Tables that don’t continue from the previous one (different column count or different first row, depending on match) are passed through unchanged. The originals in self are not mutated.

Return type:

camelot.core.TableList

Notes

Originally proposed by @TimothyOfDelphi in #628. Consolidates #8 / #133 / #357 / #531.

Limitations:

  • The stacked Table’s page keeps the first stitched table’s page number, and order keeps the first table’s order. The page numbers of the later stitched tables are not recorded, so callers that iterate by page and order should expect those pages to be absent from the result.

  • parsing_report is averaged: accuracy and whitespace are mean-aggregated across the stitched tables; confidence is recomputed from the averaged accuracy + whitespace.

  • Cell geometry (_bbox, cells, rows) is preserved via the y-shift trick from #628’s POC, so downstream plotting on a single stitched table still works page-locally. Cells from the second-and-later tables are shifted to sit below the first table’s bottom.
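
The two match rules and keep_first_header can be sketched on plain lists of rows. This is a simplified illustration of the stitching logic described above, ignoring cell geometry and parsing_report; stack_rows and its helper are hypothetical names, not the library's implementation.

```python
from typing import List

Rows = List[List[str]]


def _continues(prev: Rows, nxt: Rows, match: str) -> bool:
    """Decide whether `nxt` looks like a continuation of `prev`."""
    if not prev or not nxt or len(prev[0]) != len(nxt[0]):
        return False
    if match == "first_row":
        return prev[0] == nxt[0]  # same header text repeated on the next page
    return True  # 'column_count': same number of columns is enough


def stack_rows(tables: List[Rows], match: str = "column_count",
               keep_first_header: bool = False) -> List[Rows]:
    """Hypothetical helper: collapse runs of continuation tables into one."""
    stacked: List[Rows] = []
    for rows in tables:
        if stacked and _continues(stacked[-1], rows, match):
            drop_header = match == "first_row" and not keep_first_header
            stacked[-1] = stacked[-1] + (rows[1:] if drop_header else rows)
        else:
            stacked.append(list(rows))  # copy, so the inputs are not mutated
    return stacked


pages = [
    [["item", "qty"], ["bolt", "4"]],
    [["item", "qty"], ["nut", "9"]],
    [["a", "b", "c"]],
]
print(stack_rows(pages, match="first_row"))
print(stack_rows(pages, match="column_count"))
```

With match='column_count' the repeated header row stays in the body, which is why 'first_row' is the better fit for PDFs that print the header on every page.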

class camelot.core.Table(cols, rows)[source]#

Defines a table with coordinates relative to a left-bottom origin.

(PDF coordinate space)

Parameters:
  • cols (list) – List of tuples representing column x-coordinates in increasing order.

  • rows (list) – List of tuples representing row y-coordinates in decreasing order.

df#

Type:

pandas.DataFrame

shape#

Shape of the table.

Type:

tuple

accuracy#

Accuracy with which text was assigned to the cell.

Type:

float

whitespace#

Percentage of whitespace in the table.

Type:

float

filename#

Path of the original PDF

Type:

str

order#

Table number on PDF page.

Type:

int

page#

PDF page number.

Type:

int

property confidence: float#

A unified per-table quality score in [0.0, 1.0].

Computed from the existing per-flavor signals as (accuracy / 100) * (1 - whitespace / 100). The intent is a single number suitable for production filtering and automated validation — confidence >= 0.8 works as a reasonable first-cut threshold; tune for the source PDFs.

Components and their meaning are identical across flavors:

  • accuracy (0-100): how well the detected cells line up with the parser’s structural hints (line joints for lattice, text alignments for stream/network/hybrid). Higher is better.

  • whitespace (0-100): percentage of cells that are empty after stripping. Lower is better (a perfectly populated table is 0; a mostly-empty one trends toward 100).

  • confidence (0-1): the composite. accuracy=90, whitespace=10 → confidence ≈ 0.81; either signal going to its worst value pulls confidence to 0.

See #659.
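
The formula above is small enough to check by hand. The sketch below recomputes it as a standalone function and applies the suggested 0.8 first-cut threshold; the function name and the sample (accuracy, whitespace) pairs are made up for illustration.

```python
def confidence(accuracy: float, whitespace: float) -> float:
    """(accuracy / 100) * (1 - whitespace / 100), both inputs on a 0-100 scale."""
    return (accuracy / 100) * (1 - whitespace / 100)


# Hypothetical (accuracy, whitespace) pairs, as they might appear in parsing_report.
reports = [(90, 10), (99, 45), (70, 5)]
for accuracy, whitespace in reports:
    score = confidence(accuracy, whitespace)
    verdict = "keep" if score >= 0.8 else "review"
    print(f"accuracy={accuracy} whitespace={whitespace} confidence={score:.2f} {verdict}")
```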

copy_spanning_text(copy_text=None)[source]#

Copies over text in empty spanning cells.

Parameters:

copy_text (list of str, optional (default: None)) – Select one or more of the following strings: {‘h’, ‘v’} to specify the direction in which text should be copied over when a cell spans multiple rows or columns.

Returns:

The updated table with copied text in spanning cells.

Return type:

camelot.core.Table

Notes

Iterates the directional copy passes until the table is stable. A single pass-per-direction misses cells spanned in both directions (a 2D span): the source cell from which the 2D-spanned cell would copy hasn’t itself been filled yet, so the empty string propagates through. Repeating the chosen passes until no cell changes converges in O(spans) iterations and fixes the symptom reported in #349.

property data#

Returns two-dimensional list of strings in table.

get_pdf_image()[source]#

Compute the PDF image and cache it.

property parsing_report#

Per-table parsing report.

Standard keys across all flavors:

page

1-based page number the table was found on.

order

1-based rank within that page (left-to-right / top-to-bottom).

accuracy

Float in [0, 100]. See confidence for component-by-component definitions.

whitespace

Float in [0, 100].

confidence

Float in [0, 1]. Unified quality score — combines accuracy and whitespace.

See #659.
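As an illustration of how the standard keys can be used, the sketch below filters report-shaped dictionaries by confidence. The report values are made up for the example, and `keep_confident` is a hypothetical helper, not part of camelot.

```python
# Made-up reports that use only the standard keys listed above.
reports = [
    {"page": 1, "order": 1, "accuracy": 99.0, "whitespace": 5.0, "confidence": 0.9405},
    {"page": 1, "order": 2, "accuracy": 70.0, "whitespace": 40.0, "confidence": 0.42},
    {"page": 2, "order": 1, "accuracy": 90.0, "whitespace": 10.0, "confidence": 0.81},
]


def keep_confident(reports, threshold=0.8):
    """Return the reports whose confidence meets the threshold."""
    return [r for r in reports if r["confidence"] >= threshold]


kept = keep_confident(reports)
kept_ids = [(r["page"], r["order"]) for r in kept]
print(kept_ids)  # [(1, 1), (2, 1)]
```

The 0.8 default mirrors the first-cut threshold suggested for confidence; it is a starting point to tune, not a fixed rule.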

set_all_edges()[source]#

Set all table edges to True.

set_border()[source]#

Set table border edges to True.

set_edges(vertical, horizontal, joint_tol=2)[source]#

Set the edges of the joint.

Set a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.

Parameters:
  • vertical (list) – List of detected vertical lines.

  • horizontal (list) – List of detected horizontal lines.

  • joint_tol (int, optional (default: 2)) – Tolerance for determining proximity.
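The tolerance-based overlap test can be pictured with a simplified sketch. It is an illustration only and does not reproduce camelot's algorithm: the tuple layouts and the helper name `touches_left_edge` are assumptions made for the example.

```python
def touches_left_edge(cell, line, joint_tol=2):
    """Check whether a vertical line bounds a cell on its left side.

    cell is (x1, y1, x2, y2) with a left-bottom origin;
    line is (x, y_start, x, y_end) for a vertical segment.
    """
    cx1, cy1, _cx2, cy2 = cell
    lx, ly_a, _, ly_b = line
    low, high = min(ly_a, ly_b), max(ly_a, ly_b)
    # The line must sit within joint_tol of the cell's left x-coordinate ...
    near = abs(lx - cx1) <= joint_tol
    # ... and cover the cell's vertical extent, again within joint_tol.
    covers = low <= cy1 + joint_tol and high >= cy2 - joint_tol
    return near and covers


cell = (10, 10, 50, 30)
print(touches_left_edge(cell, (11, 0, 11, 100)))   # True
print(touches_left_edge(cell, (20, 0, 20, 100)))   # False
```

A larger joint_tol accepts lines that are slightly misaligned with the cell boundary, at the cost of possibly attaching a line to the wrong cell.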

to_csv(path, **kwargs)[source]#

Write Table(s) to a comma-separated values (csv) file.

For kwargs, check pandas.DataFrame.to_csv().

Parameters:

path (str) – Output filepath.

to_excel(path, **kwargs)[source]#

Write Table(s) to an Excel file.

For kwargs, check pandas.DataFrame.to_excel(). The optional mode kwarg is forwarded to pandas.ExcelWriter ("w" to overwrite, "a" to append a new sheet to an existing workbook) — see #317.

Parameters:
  • path (str) – Output filepath.

  • mode (str, optional (default: 'w')) – ExcelWriter open mode. Use "a" to add a new sheet to an existing workbook (requires openpyxl).

to_html(path, **kwargs)[source]#

Write Table(s) to an HTML file.

For kwargs, check pandas.DataFrame.to_html(). The optional mode kwarg is consumed by the file-open call (#317).

Parameters:
  • path (str) – Output filepath.

  • mode (str, optional (default: 'w')) – File open mode. Pass "a" to append to an existing file rather than overwrite it.

to_json(path, **kwargs)[source]#

Write Table(s) to a JSON file.

For kwargs, check pandas.DataFrame.to_json(). The optional mode kwarg ("w" to overwrite, "a" to append) is consumed by the file-open call; the rest are forwarded to DataFrame.to_json (#317).

Parameters:
  • path (str) – Output filepath.

  • mode (str, optional (default: 'w')) – File open mode. Pass "a" to append to an existing file rather than overwrite it.

to_markdown(path, **kwargs)[source]#

Write Table(s) to a Markdown file.

For kwargs, check pandas.DataFrame.to_markdown(). The optional mode kwarg is consumed by the file-open call — passing mode="a" appends every successive call to the same file rather than overwriting (#317).

Parameters:
  • path (str) – Output filepath.

  • mode (str, optional (default: 'w')) – File open mode. Pass "a" to append to an existing file rather than overwrite it.
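The pattern of consuming mode for the file-open call while forwarding the remaining keyword arguments can be sketched with the standard library alone. `write_rendered` and `render_row` are hypothetical stand-ins for the exporter and the pandas renderer; this is not camelot's code.

```python
import os
import tempfile


def write_rendered(path, render, **kwargs):
    """Write render(**kwargs) to path, opening the file with the given mode."""
    mode = kwargs.pop("mode", "w")  # consumed here, not forwarded
    with open(path, mode, encoding="utf-8") as f:
        f.write(render(**kwargs))  # remaining kwargs go to the renderer


def render_row(value, sep="|"):
    """Stand-in renderer that produces one Markdown-style row."""
    return f"{sep} {value} {sep}\n"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tables.md")
    write_rendered(path, render_row, value="a")            # default "w" creates or overwrites
    write_rendered(path, render_row, value="b", mode="a")  # "a" appends to the same file
    with open(path, encoding="utf-8") as f:
        content = f.read()
```

After the two calls, `content` holds both rows in order; calling with the default mode a second time would have replaced the first row instead.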

to_sqlite(path, **kwargs)[source]#

Write Table(s) to a SQLite database.

For kwargs, check pandas.DataFrame.to_sql().

Parameters:

path (str) – Output filepath.

class camelot.core.Cell(x1, y1, x2, y2)[source]#

Defines a cell in a table.

Coordinates are relative to a left-bottom origin (PDF coordinate space).

Parameters:
  • x1 (float) – x-coordinate of left-bottom point.

  • y1 (float) – y-coordinate of left-bottom point.

  • x2 (float) – x-coordinate of right-top point.

  • y2 (float) – y-coordinate of right-top point.

lb#

Tuple representing left-bottom coordinates.

Type:

tuple

lt#

Tuple representing left-top coordinates.

Type:

tuple

rb#

Tuple representing right-bottom coordinates.

Type:

tuple

rt#

Tuple representing right-top coordinates.

Type:

tuple

left#

Whether or not cell is bounded on the left.

Type:

bool

right#

Whether or not cell is bounded on the right.

Type:

bool

top#

Whether or not cell is bounded on the top.

Type:

bool

bottom#

Whether or not cell is bounded on the bottom.

Type:

bool

text#

Text assigned to cell.

Type:

str
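The relationship between the constructor arguments and the four corner attributes can be sketched as follows. `Box` is a hypothetical stand-in written for illustration; it is not the `camelot.core.Cell` class and omits the edge flags and text.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rectangle given by its left-bottom and right-top points (left-bottom origin)."""

    x1: float  # x-coordinate of left-bottom point
    y1: float  # y-coordinate of left-bottom point
    x2: float  # x-coordinate of right-top point
    y2: float  # y-coordinate of right-top point

    @property
    def lb(self):
        return (self.x1, self.y1)

    @property
    def lt(self):
        return (self.x1, self.y2)

    @property
    def rb(self):
        return (self.x2, self.y1)

    @property
    def rt(self):
        return (self.x2, self.y2)


box = Box(0, 0, 10, 5)
print(box.lb, box.lt, box.rb, box.rt)  # (0, 0) (0, 5) (10, 0) (10, 5)
```

Because the origin is at the left-bottom, y2 is greater than y1 for a well-formed cell, which is the opposite of image coordinate conventions where y grows downward.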

Plotting#

camelot.plot(table, kind='text', filename=None, ax=None)#

Classmethod for plotting methods.

class camelot.plotting.PlotMethods[source]#

Classmethod for plotting methods.

static contour(table, ax=None)[source]#

Generate a plot for all table boundaries present on the PDF page.

Parameters:

table (camelot.core.Table) – Table to plot.

Returns:

fig

Return type:

matplotlib.figure.Figure

static grid(table, ax=None)[source]#

Generate a plot for the detected table grids on the PDF page.

Parameters:

table (camelot.core.Table) – Table to plot.

Returns:

fig

Return type:

matplotlib.figure.Figure

static joint(table, ax=None)[source]#

Generate a plot for all line intersections present on the PDF page.

Parameters:

table (camelot.core.Table) – Table to plot.

Returns:

fig

Return type:

matplotlib.figure.Figure

static line(table, ax=None)[source]#

Generate a plot for all line segments present on the PDF page.

Parameters:

table (camelot.core.Table) – Table to plot.

Returns:

fig

Return type:

matplotlib.figure.Figure

Generate a plot illustrating the steps of the network table search.

Parameters:

table (camelot.core.Table) – Table to plot.

Returns:

fig

Return type:

matplotlib.figure.Figure

text(table, ax=None)[source]#

Generate a plot for all text elements present on the PDF page.

Parameters:

table (camelot.core.Table) – Table to plot.

Returns:

fig

Return type:

matplotlib.figure.Figure

static textedge(table, ax=None)[source]#

Generate a plot for relevant textedges.

Parameters:

table (camelot.core.Table) – Table to plot.

Returns:

fig

Return type:

matplotlib.figure.Figure