API Reference
Main Interface
- camelot.read_pdf(filepath: str | Path | bytes | bytearray | memoryview | IO[bytes], pages='1', password=None, flavor='lattice', suppress_stdout=False, parallel=False, cpu_count=None, layout_kwargs=None, per_page=None, debug=False, **kwargs)[source]#
Read PDF and return extracted tables.
Note: kwargs annotated with ^ can only be used with flavor='stream' or flavor='network', and kwargs annotated with * can only be used with flavor='lattice'. The hybrid parser accepts kwargs with both annotations.
- Parameters:
filepath (str, Path, bytes, or binary file-like) – Source PDF. Accepts a filesystem path / URL, a bytes-like object, or any binary stream with a .read() method (io.BytesIO, a file opened in "rb" mode, requests response.raw, etc.). For in-memory inputs the bytes are spilled to a temporary file once and cleaned up on context-manager exit, so the Lattice OpenCV image-conversion backend keeps working unchanged. Originally requested in #170 / #245 / #270.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: '1,3,4' or '1,4-end' or 'all'.
password (str, optional (default: None)) – Password for decryption.
flavor (str (default: 'lattice')) –
The parsing method to use. Valid values:
'lattice' (default): line-ruled tables.
'stream': borderless tables with whitespace-separated columns.
'network': borderless tables via text-edge alignment connectivity.
'hybrid': combines layout- and image-based analysis.
'ml': neural table-structure recognition (Table Transformer) for the structure, with cell text filled from the PDF’s own text layer (no hallucinated values). Requires the optional ML dependencies: pip install 'camelot-py[ml]'. Best for borderless tables where the heuristic parsers plateau.
'auto': detect the flavor per page (count ruled lines on each rendered page) and parse each group accordingly, ruled pages via lattice with engine='combined' and the rest via network, then merge. Handles documents that mix text-only cover pages with ruled tables deeper in. A UserWarning reports the per-page choices. (More accurate but slower, since it renders every page for the probe.)
suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.
parallel (bool, optional (default: False)) – Process pages in parallel using all available cpu cores.
cpu_count (int, optional (default: None)) – Maximum number of worker processes when parallel=True. None (default) uses all available cores. Values are clamped to [1, multiprocessing.cpu_count()]. Ignored when parallel=False.
layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.
per_page (dict, optional (default: None)) –
Per-page parameter overrides. Maps a 1-indexed page number (int or str) to a dict of any keyword argument otherwise valid for read_pdf. Values supplied here override the globally supplied kwargs for that one page only; every other page keeps the global values. Useful for multi-layout PDFs where different pages need different table_areas, columns, flavor, etc. The per-page flavor itself may be overridden; the global flavor applies otherwise. Originally proposed by @sverma25 in #41.
Example:
tables = camelot.read_pdf(
    "report.pdf",
    pages="1-3",
    flavor="stream",
    split_text=True,
    per_page={2: {"table_areas": ["120, 210, 400, 90"]}},
)
Here pages 1 and 3 use only the global flavor="stream", split_text=True; page 2 uses both and the page-specific table_areas.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
header_text^ (list, optional (default: None)) – List of substrings identifying a text line above a stream table. When table_areas is not supplied and a matching line is found, its bottom coordinate becomes the top edge of the derived table area. If no match is found, Camelot falls back to automatic table detection.
footer_text^ (list, optional (default: None)) – List of substrings identifying a text line below a stream table. When table_areas is not supplied and a matching line is found, its top coordinate becomes the bottom edge of the derived table area. If no match is found, Camelot falls back to automatic table detection.
columns^ (list, optional (default: None)) – List of column x-coordinate strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str or sequence of str, optional (default: '')) – Characters or substrings to strip from each cell before assignment. A str strips per character: every character in the string is removed wherever it appears (e.g. " \n" drops all spaces and newlines). A list/tuple of str strips whole substrings (e.g. ["[1]", "[2]"] removes those footnote markers but leaves bare [ / ] alone). Whole-substring mode requested in #484.
replace_text (dict, optional (default: None)) – Mapping of substring → replacement applied to every cell’s text just before it is written into the table. Keys are matched as literal substrings (regex metacharacters are escaped). Useful for collapsing soft-broken words (e.g. {" \n": " "}), normalising abbreviations, or rewriting unit names. Distinct from strip_text, which can only remove characters; this can replace with arbitrary text. Requested in #481. (#482)
row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
process_background* (bool, optional (default: False)) – Process background lines.
line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.
copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize* (int, optional (default: 15)) –
Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.
For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant* (int, optional (default: -2)) –
Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.
For more information, refer to OpenCV’s adaptiveThreshold.
iterations* (int, optional (default: 0)) –
Number of dilation passes applied to close small gaps in the line mask.
For more information, refer to OpenCV’s dilate.
erode_iterations* (int, optional (default: 0)) – Number of erosion passes applied after dilation. Set equal to iterations for a morphological closing, which bridges gaps in ruled lines without thickening the mask overall (and so avoids the spurious extra-row artefact reported in #363). (#363)
backend* (str, optional (default: 'pdfium')) – The backend to use for converting the PDF to an image so it can be processed by OpenCV.
use_fallback* (bool, optional (default: True)) – Fall back to another backend if the selected one is unavailable.
resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
engine* (str, optional (default: 'combined')) –
Line-detection engine for flavor='lattice' (and the lattice half of flavor='hybrid'):
'combined' (default): render the page and detect ruled lines with OpenCV, then union in the ruled lines read from the PDF’s native vector graphics, so tables whose rules render faintly (vector strokes, anti-aliasing) are still found. Safe by construction: raster always runs, vector lines can only add, and they’re clipped to table_regions, so it never does worse than 'raster' (#763).
'raster': render the page and detect ruled lines with OpenCV only (the pre-#763 behaviour).
'vector': detect tables straight from the PDF’s vector ruled lines, skipping rasterisation entirely. This is the fastest path, for PDFs whose tables are drawn with real vector strokes (#763).
With flavor='hybrid' the same choices select how its lattice half finds ruled lines; engine='vector' there is the render-free hybrid (vector ruled lines merged with the network text-edge alignment) for partial-ruled / borderless tables at roughly an order of magnitude less time than the raster path (#39).
- Returns:
tables
- Return type:
camelot.core.TableList
Notes
Encrypted PDFs / extraction permissions (#590). Camelot honours the /Encrypt dictionary’s text-extraction permission: read_pdf raises playa.exceptions.PDFTextExtractionNotAllowed if the PDF is encrypted and the user-password permission set forbids text extraction. The check fires on the document object returned by playa.open while the encryption metadata is still attached. This is a real behavioural change vs the pre-1.0 backend, where per-page temp-PDF splitting silently dropped the metadata so the check was effectively a no-op. Note: the PDF spec only enforces the flag through the encryption layer; for unencrypted PDFs that carry a “no extraction” claim via /Perms, there is no enforcement mechanism and Camelot extracts. Supplying the document owner password through password= bypasses the user-password permission set (matches every other PDF tool).
Examples
>>> import camelot
>>> tables = camelot.read_pdf("foo.pdf")  # xdoctest: +SKIP
>>> tables.n  # xdoctest: +SKIP
1
>>> tables[0].df  # xdoctest: +SKIP
>>> tables[0].to_csv("foo.csv")  # xdoctest: +SKIP
Select a parser and restrict extraction to a page range:
>>> tables = camelot.read_pdf(  # xdoctest: +SKIP
...     "foo.pdf", flavor="lattice", pages="1-3"
... )
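The difference between the two strip_text modes and replace_text, as described in the parameter list above, can be illustrated with plain string operations. This is a minimal sketch under stated assumptions, not Camelot's implementation: strip_cell and replace_cell are hypothetical helper names, and the order in which Camelot applies the two steps is not shown here.

```python
def strip_cell(text, strip_text=""):
    """Mimic strip_text: a str removes characters, a list/tuple removes substrings."""
    if isinstance(strip_text, str):
        # Per-character mode: drop every character that appears in the string.
        return "".join(ch for ch in text if ch not in strip_text)
    # Whole-substring mode: remove each literal substring, leave the rest alone.
    for sub in strip_text:
        text = text.replace(sub, "")
    return text


def replace_cell(text, replace_text=None):
    """Mimic replace_text: literal substring -> replacement (no regex)."""
    for old, new in (replace_text or {}).items():
        text = text.replace(old, new)
    return text


cell = "Net income[1] \n(USD)"
print(repr(strip_cell(cell, " \n")))           # all spaces and newlines dropped
print(repr(strip_cell(cell, ["[1]"])))         # only the footnote marker dropped
print(repr(replace_cell(cell, {" \n": " "})))  # soft line break collapsed
```

The per-character mode is the blunt one: it also removes the space inside "Net income", which is why the substring and replacement modes exist.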
Lower-Level Classes
- class camelot.handlers.PDFHandler(filepath: str | Path | bytes | bytearray | memoryview | IO[bytes], pages='1', password=None, debug=False)[source]#
Handles all operations on PDFs.
Handles all operations like temp directory creation, splitting file into single page PDFs, parsing each PDF and then removing the temp directory.
- Parameters:
filepath (str, Path, bytes, or binary file-like) – Source PDF. Accepts a filesystem path / URL, or (since #270) a bytes-like object or any binary stream with a .read() method (io.BytesIO, a file opened in "rb" mode, requests response.raw, etc.). In the in-memory cases the bytes are spilled to a temporary file once and cleaned up when the handler is closed; this keeps the rest of the pipeline (in particular the Lattice OpenCV image-conversion backend) unchanged.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: '1,3,4' or '1,4-end' or 'all'.
password (str, optional (default: None)) – Password for decryption.
debug (bool, optional (default: False)) – Whether the parser should store debug information during parsing.
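The pages specification accepted above ('1,3,4', '1,4-end', 'all') can be expanded into concrete page numbers roughly as follows. This is a minimal sketch, not the handler's actual code: resolve_pages and its page_count argument are hypothetical, and the real handler reads the page count from the PDF itself.

```python
def resolve_pages(spec, page_count):
    """Expand a spec such as '1,4-end' into sorted, unique 1-based page numbers."""
    if spec == "all":
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '4-end' or '2-5'; 'end' means the last page.
            start, end = part.split("-")
            last = page_count if end == "end" else int(end)
            pages.update(range(int(start), last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(resolve_pages("1,4-end", page_count=6))
```

Collecting into a set before sorting gives the "sorted and de-duplicated" behaviour that the pages property documents.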
- close() → None[source]#
Delete the URL-downloaded temp file, if any.
Idempotent; safe to call from both __exit__ and an explicit handler.close() call. No-op when filepath was a user-owned path (we never delete a file the caller passed in).
- property pages: list[int]#
Resolved 1-based page numbers, sorted and de-duplicated.
Lazy: only opens the PDF if the spec is something other than the default
"1". Cached after first access.
- parse(flavor: str = 'lattice', suppress_stdout: bool = False, parallel: bool = False, cpu_count: int | None = None, layout_kwargs: dict[str, Any] | None = None, per_page: dict[int, dict[str, Any]] | None = None, pages: list[int] | None = None, render_cache: dict[int, str] | None = None, **kwargs)[source]#
Extract tables by calling parser.get_tables on all single page PDFs.
- Parameters:
flavor (str (default: 'lattice')) – The parsing method to use. Lattice is used by default.
suppress_stdout (bool (default: False)) – Suppress logs and warnings.
parallel (bool (default: False)) – Process pages in parallel using all available cpu cores.
cpu_count (int, optional (default: None)) – Maximum number of worker processes to use when parallel is True. None (default) uses all available cores. Values are clamped to [1, multiprocessing.cpu_count()]. Ignored when parallel is False.
layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.
kwargs (dict) – See camelot.read_pdf kwargs.
- Returns:
tables – List of tables found in PDF.
- Return type:
camelot.core.TableList
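The cpu_count rule documented for parse (None means all cores, other values are clamped to [1, multiprocessing.cpu_count()]) can be written out as a small helper. This is a sketch of the documented rule only: resolve_worker_count is a hypothetical name, and returning 1 when parallel is False is an assumption standing in for "run sequentially".

```python
import multiprocessing


def resolve_worker_count(cpu_count=None, parallel=True):
    """Return the number of worker processes implied by the documented rule."""
    if not parallel:
        # cpu_count is ignored when parallel is False (assumed: one process).
        return 1
    available = multiprocessing.cpu_count()
    if cpu_count is None:
        return available
    # Clamp the requested value into [1, available].
    return max(1, min(cpu_count, available))


print(resolve_worker_count(2))
```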
- class camelot.parsers.Stream(table_regions=None, table_areas=None, header_text=None, footer_text=None, columns=None, split_text=False, flag_size=False, strip_text='', replace_text=None, edge_tol=50, row_tol=2, column_tol=0, **kwargs)[source]#
Stream method of parsing looks for spaces between text to parse the table.
If you specify columns together with multiple table areas, make sure both lists have the same length.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
header_text (list, optional (default: None)) – List of substrings identifying a text line above the table. When table_areas is not set, the matched line’s bottom coordinate is used as the table area’s top edge.
footer_text (list, optional (default: None)) – List of substrings identifying a text line below the table. When table_areas is not set, the matched line’s top coordinate is used as the table area’s bottom edge.
columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)#
Prepare the page for parsing.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
[description]
- Return type:
[type]
- class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=None, split_text=False, flag_size=False, strip_text='', replace_text=None, line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, erode_iterations=0, resolution=300, use_fallback=True, backend='pdfium', engine='combined', **kwargs)[source]#
Lattice method looks for lines between text to parse the table.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
process_background (bool, optional (default: False)) – Process background lines.
line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.
copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize (int, optional (default: 15)) –
Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.
For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant (int, optional (default: -2)) –
Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.
For more information, refer to OpenCV’s adaptiveThreshold.
iterations (int, optional (default: 0)) –
Number of dilation passes applied to close small gaps in the line mask (useful when a table’s ruled lines don’t quite meet at corners).
For more information, refer to OpenCV’s dilate.
erode_iterations (int, optional (default: 0)) – Number of erosion passes applied after dilation. Set equal to iterations for a morphological closing (bridges gaps without thickening the mask, which avoids spurious extra rows above/below the detected table). See #363.
backend (str, optional (default: 'pdfium')) – The backend to use for converting the PDF to an image so it can be processed by OpenCV.
use_fallback (bool, optional (default: True)) – Fall back to another backend if the selected one is unavailable.
resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
engine (str, optional (default: 'combined')) –
Line-detection engine (lattice only):
'combined' (default): OpenCV on the rendered page plus the PDF’s native vector ruled lines unioned into the line masks before contour/joint detection; recovers tables whose rules render faintly. Safe by construction (raster always runs first, vector lines can only add; vector lines are clipped to table_regions so it never expands a table past the region).
'raster': OpenCV on the rendered page only (the pre-#763 behaviour).
'vector': detect tables purely from the PDF’s vector ruled lines, skipping rasterisation entirely; fastest, for PDFs whose tables are drawn with real vector strokes (#763).
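Why equal iterations and erode_iterations amount to a morphological closing can be seen on a one-dimensional toy mask. This is an illustrative sketch only: Camelot applies OpenCV's 2-D operations to the rendered page, whereas the dilate, erode and close_gaps helpers below are hypothetical pure-Python stand-ins using a 3-pixel window.

```python
def dilate(mask):
    """Set a pixel if it or either neighbour is set (3-wide window)."""
    n = len(mask)
    return [max(mask[max(0, i - 1):min(n, i + 2)]) for i in range(n)]


def erode(mask):
    """Keep a pixel only if it and both neighbours are set (3-wide window)."""
    n = len(mask)
    return [min(mask[max(0, i - 1):min(n, i + 2)]) for i in range(n)]


def close_gaps(mask, iterations=1):
    """Dilate, then erode the same number of times (morphological closing)."""
    for _ in range(iterations):
        mask = dilate(mask)
    for _ in range(iterations):
        mask = erode(mask)
    return mask


# A ruled line with a one-pixel break in the middle.
line = [0, 0, 1, 1, 0, 1, 1, 0, 0]
print(dilate(line))      # gap bridged, but the line is one pixel longer at each end
print(close_gaps(line))  # gap bridged, original extent restored
```

Dilation alone closes the gap but grows the line; the matching erosion shrinks it back while the bridged gap stays filled.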
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)#
Prepare the page for parsing.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
[description]
- Return type:
[type]
- class camelot.parsers.Network(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', replace_text=None, edge_tol=None, row_tol=2, column_tol=0, debug=False, **kwargs)[source]#
Network method looks for spaces between text to parse the table.
If you specify columns together with multiple table areas, make sure both lists have the same length.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)#
Prepare the page for parsing.
- record_parse_metadata(table)#
Record data about the origin of the table.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
[description]
- Return type:
[type]
- class camelot.parsers.Hybrid(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', replace_text=None, edge_tol=None, row_tol=2, column_tol=0, debug=False, engine='combined', **kwargs)[source]#
Defines a hybrid parser, leveraging both network and lattice parsers.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str or sequence of str, optional (default: '')) – Characters or substrings to strip from each cell. A str strips per character; a list/tuple of str strips whole substrings (#484).
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
engine (str, optional (default: 'combined')) –
Line-detection engine for hybrid’s lattice half (the network half is text-based and unaffected):
'combined' (default): OpenCV on the rendered page plus the PDF’s native vector ruled lines unioned in; recovers faintly rendered rules. Matches the flavor='lattice' default.
'raster': detect ruled lines with OpenCV only (pre-#763).
'vector': detect ruled lines straight from the PDF’s vector graphics, skipping rasterisation and OpenCV entirely. This is the render-free hybrid (network text-edge alignment merged with vector ruled lines) for partial-ruled / borderless tables at roughly an order of magnitude less time than the raster path. (#39)
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, rotation, layout_kwargs)[source]#
Prepare the page for parsing.
- Parameters:
filename ([type]) – [description]
layout ([type]) – [description]
dimensions ([type]) – [description]
page_idx ([type]) – [description]
layout_kwargs ([type]) – [description]
- record_parse_metadata(table)#
Record data about the origin of the table.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
[description]
- Return type:
[type]
Lower-Lower-Level Classes
- class camelot.core.TableList(tables: Iterable[Table])[source]#
Defines a list of camelot.core.Table objects.
Each table can be accessed using its index.
- n#
Number of tables in the list.
- Type:
int
Examples
>>> from camelot.core import TableList
>>> tables = TableList([])
>>> tables.n
0
>>> tables
<TableList n=0>
- export(path: str, f='csv', compress=False)[source]#
Export the list of tables to specified file format.
- Parameters:
path (str) – Output filepath.
f (str) – File format. Can be csv, excel, html, json, markdown or sqlite.
compress (bool) – Whether or not to add files to a ZIP archive.
- filter(min_rows: int = 1, min_columns: int = 1, min_accuracy: float = 0.0, max_whitespace: float = 100.0) → TableList[source]#
Return a new TableList keeping only tables that pass all thresholds.
A post-extraction convenience for dropping noise / low-quality tables (single stray cells, mostly-empty regions, …). Parsing is unchanged: everything is still detected; this just selects from the result. Every threshold defaults to a no-op, so calling filter() with no arguments returns an equivalent list and a legitimate single-row or single-column table is never dropped unless you ask for it.
- Parameters:
min_rows (int, optional (default: 1)) – Drop tables with fewer than this many rows.
min_columns (int, optional (default: 1)) – Drop tables with fewer than this many columns.
min_accuracy (float, optional (default: 0.0)) – Drop tables whose parsing_report accuracy (0-100) is below this value.
max_whitespace (float, optional (default: 100.0)) – Drop tables whose parsing_report whitespace (0-100) is above this value.
- Returns:
A new list (the original is left untouched), so calls compose: tables.filter(min_rows=2).filter(min_accuracy=90).
- Return type:
TableList
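The four thresholds of filter combine as a simple conjunction, and the no-op defaults fall out of the comparison directions. The sketch below illustrates that rule on stand-in objects; TableSummary, keep and filter_tables are hypothetical names, not part of Camelot, and the real method reads the values from each Table's shape and parsing_report.

```python
from dataclasses import dataclass


@dataclass
class TableSummary:
    """Hypothetical stand-in for a table: shape plus parsing_report numbers."""
    rows: int
    columns: int
    accuracy: float    # 0-100, higher is better
    whitespace: float  # 0-100, lower is better


def keep(t, min_rows=1, min_columns=1, min_accuracy=0.0, max_whitespace=100.0):
    """A table survives only if it passes every threshold."""
    return (
        t.rows >= min_rows
        and t.columns >= min_columns
        and t.accuracy >= min_accuracy
        and t.whitespace <= max_whitespace
    )


def filter_tables(tables, **thresholds):
    """Return a new list; the input list is left untouched."""
    return [t for t in tables if keep(t, **thresholds)]


found = [
    TableSummary(1, 1, 99.0, 0.0),   # single stray cell
    TableSummary(5, 3, 95.0, 10.0),  # good table
    TableSummary(4, 2, 60.0, 80.0),  # mostly empty region
]
print(filter_tables(filter_tables(found, min_rows=2), min_accuracy=90))
```

Because each call returns a fresh list, the chained form behaves the same as passing both thresholds at once.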
- property n: int#
The number of tables in the list.
- stack_contiguous(match: str = 'column_count', keep_first_header: bool = False) → TableList[source]#
Vertically stack tables that look like continuations across pages.
Many PDF reports break a single logical table over several pages: a header on every page, a footer on every page, body rows in between. read_pdf returns one Table per page; this helper stitches contiguous ones back together so the resulting TableList has one entry per logical table instead of one per physical page.
- Parameters:
match (str, optional (default: 'column_count')) –
How to decide whether two adjacent tables are continuations of each other.
'column_count': same number of columns (the rule from #628’s POC; the common case).
'first_row': same column count and identical text in the first row (catches PDFs that repeat the header on every page).
keep_first_header (bool, optional (default: False)) – When match='first_row', the matching first row of every continuation table is dropped (so the stacked table has exactly one header row). Set to True to keep every page’s header row in the stacked output.
- Returns:
A new TableList with continuation runs collapsed. Tables that don’t continue from the previous one (different column count or different first row, depending on match) are passed through unchanged. The originals in self are not mutated.
- Return type:
TableList
Notes
Originally proposed by @TimothyOfDelphi in #628. Consolidates #8 / #133 / #357 / #531.
Limitations:
The stacked Table’s page keeps the first stitched table’s page number; order keeps the first table’s order. Callers iterating with both shouldn’t be surprised by a missing row-of-pages.
parsing_report is averaged: accuracy and whitespace are mean-aggregated across the stitched tables; confidence is recomputed from the averaged accuracy + whitespace.
Cell geometry (_bbox, cells, rows) is preserved via the y-shift trick from #628’s POC, so downstream plotting on a single stitched table still works page-locally. Cells from the second-and-later tables are shifted to sit below the first table’s bottom.
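The matching rules of stack_contiguous can be tried out on plain lists of rows. This is a simplified sketch, not the library's implementation: it models each table as a non-empty list of rows (lists of strings), ignores page, parsing_report and cell geometry, and the function below is a hypothetical stand-in that only shares the method's name and parameters.

```python
def stack_contiguous(tables, match="column_count", keep_first_header=False):
    """Stack adjacent row-lists that look like continuations of each other."""
    stacked = []
    for table in tables:
        previous = stacked[-1] if stacked else None
        # Both rules require the same column count as the previous table.
        continues = previous is not None and len(previous[0]) == len(table[0])
        if continues and match == "first_row":
            # 'first_row' additionally requires an identical first row.
            continues = previous[0] == table[0]
        if not continues:
            # Start a new logical table; copy rows so inputs stay untouched.
            stacked.append([list(row) for row in table])
            continue
        drop_header = match == "first_row" and not keep_first_header
        body = table[1:] if drop_header else table
        previous.extend(list(row) for row in body)
    return stacked


page1 = [["id", "name"], ["1", "a"]]
page2 = [["id", "name"], ["2", "b"]]
other = [["x", "y", "z"]]
print(stack_contiguous([page1, page2, other], match="first_row"))
```

With the default match='column_count' the repeated header row of page2 stays in the stacked output, since header dropping is tied to match='first_row'.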
- class camelot.core.Table(cols, rows)[source]#
Defines a table with coordinates relative to a left-bottom origin.
(PDF coordinate space)
- Parameters:
cols (list) – List of tuples representing column x-coordinates in increasing order.
rows (list) – List of tuples representing row y-coordinates in decreasing order.
- df#
- Type:
pandas.DataFrame
- shape#
Shape of the table.
- Type:
tuple
- accuracy#
Accuracy with which text was assigned to the cell.
- Type:
float
- whitespace#
Percentage of whitespace in the table.
- Type:
float
- filename#
Path of the original PDF
- Type:
str
- order#
Table number on PDF page.
- Type:
int
- page#
PDF page number.
- Type:
int
- property confidence: float#
A unified per-table quality score in [0.0, 1.0].
Computed from the existing per-flavor signals as (accuracy / 100) * (1 - whitespace / 100). The intent is a single number suitable for production filtering and automated validation. confidence >= 0.8 works as a reasonable first-cut threshold; tune for the source PDFs.
Components and their meaning are identical across flavors:
accuracy (0-100): how well the detected cells line up with the parser’s structural hints (line joints for lattice, text alignments for stream/network/hybrid). Higher is better.
whitespace (0-100): percentage of cells that are empty after stripping. Lower is better (a perfectly populated table is 0; a mostly-empty one trends toward 100).
confidence (0-1): the composite. accuracy=90, whitespace=10 → confidence≈0.81; either signal going to its worst value pulls confidence to 0.
See #659.
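The confidence formula quoted above is easy to check by hand. The helper below is a direct transcription of that formula; the function name is a hypothetical standalone version of the property, which reads accuracy and whitespace from the table itself.

```python
def confidence(accuracy, whitespace):
    """(accuracy / 100) * (1 - whitespace / 100), with both inputs in [0, 100]."""
    return (accuracy / 100) * (1 - whitespace / 100)


# The worked example from the description: 0.9 * 0.9.
print(round(confidence(90, 10), 2))
# Either signal at its worst value drives the score to zero.
print(confidence(0, 10), confidence(90, 100))
```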
- copy_spanning_text(copy_text=None)[source]#
Copies over text in empty spanning cells.
- Parameters:
copy_text (list of str, optional (default: None)) – Select one or more of the following strings: {‘h’, ‘v’} to specify the direction in which text should be copied over when a cell spans multiple rows or columns.
- Returns:
The updated table with copied text in spanning cells.
- Return type:
camelot.core.Table
Notes
Iterates the directional copy passes until the table is stable. A single pass-per-direction misses cells spanned in both directions (a 2D span): the source cell from which the 2D-spanned cell would copy hasn’t itself been filled yet, so the empty string propagates through. Repeating the chosen passes until no cell changes converges in O(spans) iterations and fixes the symptom reported in #349.
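The reason a single pass per direction can leave a cell empty, and why repeating the passes until nothing changes fixes it, can be shown on a toy grid. This is a minimal sketch under stated assumptions, not the method's code: the grid is a list of lists of strings, spans is a hypothetical dict mapping (row, column) to the directions in which that cell is spanned, 'h' copies from the left neighbour and 'v' from the cell above.

```python
def copy_pass(grid, spans, direction):
    """Run one directional pass in place; return True if any cell changed."""
    changed = False
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if text or direction not in spans.get((r, c), ""):
                continue  # already filled, or not spanned in this direction
            src_r, src_c = (r, c - 1) if direction == "h" else (r - 1, c)
            if src_r < 0 or src_c < 0:
                continue  # no neighbour to copy from
            if grid[src_r][src_c]:
                grid[r][c] = grid[src_r][src_c]
                changed = True
    return changed


def copy_until_stable(grid, spans, directions):
    """Repeat the chosen passes until a full round changes nothing."""
    rounds = 0
    while True:
        rounds += 1
        results = [copy_pass(grid, spans, d) for d in directions]
        if not any(results):
            return rounds


# (1, 0) copies from above; (1, 1) copies from its left neighbour (1, 0).
spans = {(1, 0): "v", (1, 1): "h"}

once = [["A", "x"], ["", ""]]
for d in ("h", "v"):
    copy_pass(once, spans, d)
print(once)  # (1, 1) is still empty: its source was filled too late

stable = [["A", "x"], ["", ""]]
copy_until_stable(stable, spans, ("h", "v"))
print(stable)  # every spanned cell is filled
```

In the single round the 'h' pass runs before the 'v' pass has filled the source cell, so the empty string is all there is to copy; the second round picks up the value.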
- property data#
Returns two-dimensional list of strings in table.
- property parsing_report#
Per-table parsing report.
Standard keys across all flavors:
page – 1-based page number the table was found on.
order – 1-based rank within that page (left-to-right / top-to-bottom).
accuracy – Float in [0, 100]. See confidence for component-by-component definitions.
whitespace – Float in [0, 100].
confidence – Float in [0, 1]. Unified quality score that combines accuracy and whitespace.
See #659.
- set_edges(vertical, horizontal, joint_tol=2)[source]#
Set the edges of the joint.
Set a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.
- Parameters:
vertical (list) – List of detected vertical lines.
horizontal (list) – List of detected horizontal lines.
joint_tol (int, optional) – Tolerance for determining proximity, by default 2
- to_csv(path, **kwargs)[source]#
Write Table(s) to a comma-separated values (csv) file.
For kwargs, check pandas.DataFrame.to_csv().
- Parameters:
path (str) – Output filepath.
- to_excel(path, **kwargs)[source]#
Write Table(s) to an Excel file.
For kwargs, check pandas.DataFrame.to_excel(). The optional mode kwarg is forwarded to pandas.ExcelWriter ("w" to overwrite, "a" to append a new sheet to an existing workbook). See #317.
- Parameters:
path (str) – Output filepath.
mode (str, optional (default: 'w')) – ExcelWriter open mode. Use "a" to add a new sheet to an existing workbook (requires openpyxl).
- to_html(path, **kwargs)[source]#
Write Table(s) to an HTML file.
For kwargs, check pandas.DataFrame.to_html(). The optional mode kwarg is consumed by the file-open call (#317).
- Parameters:
path (str) – Output filepath.
mode (str, optional (default: 'w')) – File open mode. Pass "a" to append to an existing file rather than overwrite it.
- to_json(path, **kwargs)[source]#
Write Table(s) to a JSON file.
For kwargs, check pandas.DataFrame.to_json(). The optional mode kwarg ("w" to overwrite, "a" to append) is consumed by the file-open call; the rest are forwarded to DataFrame.to_json (#317).
- Parameters:
path (str) – Output filepath.
mode (str, optional (default: 'w')) – File open mode. Pass "a" to append to an existing file rather than overwrite it.
- to_markdown(path, **kwargs)[source]#
Write Table(s) to a Markdown file.
For kwargs, check pandas.DataFrame.to_markdown(). The optional mode kwarg is consumed by the file-open call: with mode="a", each successive call appends to the same file rather than overwriting it (#317).
- Parameters:
path (str) – Output filepath.
mode (str, optional (default: 'w')) – File open mode. Pass "a" to append to an existing file rather than overwrite it.
- class camelot.core.Cell(x1, y1, x2, y2)[source]#
Defines a cell in a table.
Coordinates are relative to a left-bottom origin (PDF coordinate space).
- Parameters:
x1 (float) – x-coordinate of left-bottom point.
y1 (float) – y-coordinate of left-bottom point.
x2 (float) – x-coordinate of right-top point.
y2 (float) – y-coordinate of right-top point.
- lb#
Tuple representing left-bottom coordinates.
- Type:
tuple
- lt#
Tuple representing left-top coordinates.
- Type:
tuple
- rb#
Tuple representing right-bottom coordinates.
- Type:
tuple
- rt#
Tuple representing right-top coordinates.
- Type:
tuple
- left#
Whether or not cell is bounded on the left.
- Type:
bool
- right#
Whether or not cell is bounded on the right.
- Type:
bool
- top#
Whether or not cell is bounded on the top.
- Type:
bool
- bottom#
Whether or not cell is bounded on the bottom.
- Type:
bool
- text#
Text assigned to cell.
- Type:
string
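How the four corner tuples follow from the two constructor points can be shown with a small stand-in class. `CellSketch` is hypothetical and covers only the geometry; it assumes each corner is an `(x, y)` tuple in the left-bottom-origin space described above.

```python
from dataclasses import dataclass


@dataclass
class CellSketch:
    """Illustrative stand-in: (x1, y1) is left-bottom, (x2, y2) is right-top."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def lb(self):
        return (self.x1, self.y1)  # left-bottom

    @property
    def lt(self):
        return (self.x1, self.y2)  # left-top: left x, top y

    @property
    def rb(self):
        return (self.x2, self.y1)  # right-bottom: right x, bottom y

    @property
    def rt(self):
        return (self.x2, self.y2)  # right-top


c = CellSketch(10.0, 20.0, 110.0, 50.0)
print(c.lt, c.rb)  # (10.0, 50.0) (110.0, 20.0)
```

Because the origin is at the bottom-left, the top edge has the larger y value, unlike image coordinates where y grows downward.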
Plotting#
- camelot.plot(table, kind='text', filename=None, ax=None)#
Classmethod for plotting methods.
- class camelot.plotting.PlotMethods[source]#
Classmethod for plotting methods.
- static contour(table, ax=None)[source]#
Generate a plot for all table boundaries present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static grid(table, ax=None)[source]#
Generate a plot for the detected table grids on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static joint(table, ax=None)[source]#
Generate a plot for all line intersections present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static line(table, ax=None)[source]#
Generate a plot for all line segments present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static network_table_search(table, ax=None)[source]#
Generate a plot illustrating the steps of the network table search.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- text(table, ax=None)[source]#
Generate a plot for all text elements present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static textedge(table, ax=None)[source]#
Generate a plot for relevant textedges.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
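A possible way to use the plotting entry point for debugging, based only on the signatures listed above. The helper name `save_debug_plots`, the output file naming, and the choice of the `'text'` and `'grid'` kinds are illustrative assumptions; it also assumes the returned figure is a regular matplotlib figure, so `savefig` is available.

```python
def save_debug_plots(pdf_path, kinds=("text", "grid"), prefix="table0"):
    """Render the first table found on page 1 once per plot kind; return saved paths."""
    # Imported lazily so this sketch can be defined without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages="1", flavor="lattice")
    if len(tables) == 0:
        return []  # nothing detected, nothing to plot

    saved = []
    for kind in kinds:
        fig = camelot.plot(tables[0], kind=kind)
        out = f"{prefix}_{kind}.png"
        fig.savefig(out)  # standard matplotlib Figure method
        saved.append(out)
    return saved
```

Calling `save_debug_plots("report.pdf")` with a real file (the path here is a placeholder) would write one PNG per requested kind into the working directory.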