API Reference#
Main Interface#
- camelot.read_pdf(filepath: str | IO[Any] | Path, pages='1', password=None, flavor='lattice', suppress_stdout=False, parallel=False, layout_kwargs=None, debug=False, **kwargs)[source]#
Read PDF and return extracted tables.
Note: kwargs annotated with ^ can only be used with flavor='stream' or flavor='network', and kwargs annotated with * can only be used with flavor='lattice'. The hybrid parser accepts kwargs with both annotations.
- Parameters:
filepath (str, Path, IO) – Filepath or URL of the PDF file.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
password (str, optional (default: None)) – Password for decryption.
flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’, ‘stream’, ‘network’ or ‘hybrid’). Lattice is used by default.
suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.
parallel (bool, optional (default: False)) – Process pages in parallel using all available cpu cores.
layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns^ (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
process_background* (bool, optional (default: False)) – Process background lines.
line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value, the smaller the detected lines. Making it very large will lead to text being detected as lines.
copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize* (int, optional (default: 15)) –
Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.
For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant* (int, optional (default: -2)) –
Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.
For more information, refer to OpenCV’s adaptiveThreshold.
iterations* (int, optional (default: 0)) –
Number of times erosion/dilation is applied.
For more information, refer to OpenCV’s dilate.
backend* (str, optional (default: 'pdfium')) – The backend to use for converting the PDF to an image so it can be processed by OpenCV.
use_fallback* (bool, optional (default: True)) – Fall back to another backend if the selected one is unavailable.
resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
- Returns:
tables
- Return type:
camelot.core.TableList
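As a minimal sketch of the annotation rule in the note above, the helper below checks keyword arguments against the chosen flavor before handing them to camelot.read_pdf. The names STREAM_ONLY, LATTICE_ONLY, check_flavor_kwargs and read_tables are hypothetical and not part of camelot. The two sets cover only some of the annotated parameters, and camelot is imported lazily so the check can run without it installed.

```python
# Hypothetical helper; only camelot.read_pdf itself comes from the library.
STREAM_ONLY = {"columns", "row_tol", "column_tol"}        # annotated with ^
LATTICE_ONLY = {"process_background", "line_scale", "copy_text",
                "shift_text", "line_tol", "joint_tol"}    # annotated with *


def check_flavor_kwargs(flavor, kwargs):
    """Return the kwargs that the given flavor should not receive."""
    if flavor in ("stream", "network"):
        return sorted(set(kwargs) & LATTICE_ONLY)
    if flavor == "lattice":
        return sorted(set(kwargs) & STREAM_ONLY)
    if flavor == "hybrid":
        return []  # the hybrid parser accepts both groups
    raise ValueError(f"unknown flavor: {flavor!r}")


def read_tables(filepath, flavor="lattice", pages="1", **kwargs):
    """Validate kwargs, then delegate to camelot.read_pdf."""
    rejected = check_flavor_kwargs(flavor, kwargs)
    if rejected:
        raise TypeError(f"{rejected} cannot be used with flavor={flavor!r}")
    import camelot  # imported lazily; requires camelot to be installed
    return camelot.read_pdf(filepath, pages=pages, flavor=flavor, **kwargs)
```

A call such as read_tables("report.pdf", flavor="stream", row_tol=10) would pass the check, where "report.pdf" is a placeholder path; adding line_scale=40 to the same call would raise TypeError before camelot is touched.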
Lower-Level Classes#
- class camelot.handlers.PDFHandler(filepath: str | IO[Any] | Path, pages='1', password=None, debug=False)[source]#
Handles all operations on the PDF.
These include creating a temporary directory, splitting the file into single-page PDFs, parsing each PDF, and then removing the temporary directory.
- Parameters:
filepath (str) – Filepath or URL of the PDF file.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
password (str, optional (default: None)) – Password for decryption.
debug (bool, optional (default: False)) – Whether the parser should store debug information during parsing.
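The pages string can be read as a small grammar: single numbers, ranges, the keyword end, and the keyword all. The function below is an illustrative sketch of that reading, not PDFHandler's own parsing code; expand_pages and page_count are hypothetical names, and the total page count is assumed to be known in advance.

```python
def expand_pages(spec, page_count):
    """Expand a page string such as '1,4-end' into a sorted list of ints."""
    if spec == "all":
        return list(range(1, page_count + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '4-end' or '2-5'; 'end' means the last page.
            start, end = part.split("-", 1)
            last = page_count if end == "end" else int(end)
            pages.update(range(int(start), last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

Under this reading, '1,4-end' on a six-page document selects pages 1, 4, 5 and 6.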
- parse(flavor: str = 'lattice', suppress_stdout: bool = False, parallel: bool = False, layout_kwargs: dict[str, Any] | None = None, **kwargs)[source]#
Extract tables by calling parser.get_tables on all single page PDFs.
- Parameters:
flavor (str (default: 'lattice')) – The parsing method to use. Lattice is used by default.
suppress_stdout (bool (default: False)) – Suppress logs and warnings.
parallel (bool (default: False)) – Process pages in parallel using all available cpu cores.
layout_kwargs (dict, optional (default: {})) –
A dict of pdfminer.layout.LAParams kwargs.
kwargs (dict) – See camelot.read_pdf kwargs.
- Returns:
tables – List of tables found in PDF.
- Return type:
camelot.core.TableList
- class camelot.parsers.Stream(table_regions=None, table_areas=None, columns=None, split_text=False, flag_size=False, strip_text='', edge_tol=50, row_tol=2, column_tol=0, **kwargs)[source]#
Stream method of parsing looks for spaces between text to parse the table.
If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
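The string formats for table_areas and columns are easy to get wrong, so a short sketch may help. The helpers below are hypothetical (area_string, columns_string and stream_kwargs are not camelot functions). They assume PDF coordinate space with a left-bottom origin, so the top edge has a larger y value than the bottom edge, and they enforce the equal-length rule for columns and table_areas described above.

```python
def area_string(left, top, right, bottom):
    """Format one table area as 'x1,y1,x2,y2' (left-top, then right-bottom)."""
    if not (left < right and bottom < top):
        raise ValueError("expected left < right and bottom < top in PDF space")
    return f"{left},{top},{right},{bottom}"


def columns_string(xs):
    """Format column separator x-coordinates as one comma-separated string."""
    return ",".join(str(x) for x in sorted(xs))


def stream_kwargs(areas, columns=None):
    """Build table_areas/columns kwargs, enforcing equal list lengths."""
    kwargs = {"table_areas": [area_string(*a) for a in areas]}
    if columns is not None:
        if len(columns) != len(areas):
            raise ValueError("columns and table_areas must have equal length")
        kwargs["columns"] = [columns_string(c) for c in columns]
    return kwargs
```

The resulting dict could then be unpacked into a call such as camelot.read_pdf(path, flavor='stream', **stream_kwargs(...)).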
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)#
Prepare the page for parsing.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
Table bounding boxes, sorted by position.
- Return type:
list
- class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=None, split_text=False, flag_size=False, strip_text='', line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, resolution=300, use_fallback=True, backend='pdfium', **kwargs)[source]#
Lattice method looks for lines between text to parse the table.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
process_background (bool, optional (default: False)) – Process background lines.
line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.
copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize (int, optional (default: 15)) –
Size of a pixel neighborhood that is used to calculate a threshold value for the pixel: 3, 5, 7, and so on.
For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant (int, optional (default: -2)) –
Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well.
For more information, refer to OpenCV’s adaptiveThreshold.
iterations (int, optional (default: 0)) –
Number of times erosion/dilation is applied.
For more information, refer to OpenCV’s dilate.
backend (str, optional (default: 'pdfium')) – The backend to use for converting the PDF to an image so it can be processed by OpenCV.
use_fallback (bool, optional (default: True)) – Fall back to another backend if the selected one is unavailable.
resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
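The inverse relationship between line_scale and the length of detected lines can be illustrated with a small calculation. This sketch assumes that the shortest detectable line is the page image's extent in pixels divided by line_scale (integer division); that formula is an assumption made for illustration and is not stated in this reference. Both function names are hypothetical. The second helper encodes the "3, 5, 7, and so on" rule for threshold_blocksize.

```python
def min_line_length_px(image_extent_px, line_scale=15):
    """Assumed minimum detectable line length for a given image extent."""
    return image_extent_px // line_scale


def is_valid_blocksize(blocksize):
    """A pixel neighborhood size must be odd and at least 3."""
    return blocksize >= 3 and blocksize % 2 == 1


# An 11-inch page side rendered at resolution=300 is 3300 pixels long.
extent = 11 * 300
short = min_line_length_px(extent, line_scale=40)
long_ = min_line_length_px(extent, line_scale=15)
```

Under this assumption, raising line_scale from 15 to 40 lowers the minimum length, which is consistent with the warning that very large values cause text strokes to be detected as lines.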
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)#
Prepare the page for parsing.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
Table bounding boxes, sorted by position.
- Return type:
list
- class camelot.parsers.Network(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', edge_tol=None, row_tol=2, column_tol=0, debug=False, **kwargs)[source]#
Network method looks for spaces between text to parse the table.
If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
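To make the role of row_tol concrete, the sketch below groups text y-coordinates into rows: a value joins the current row when it lies within row_tol of that row's topmost value. This is a simplified illustration of tolerance-based grouping, not camelot's implementation, and group_rows is a hypothetical name.

```python
def group_rows(y_coords, row_tol=2):
    """Group y-coordinates into rows using a vertical tolerance."""
    rows = []
    for y in sorted(y_coords, reverse=True):  # top of the page first
        if rows and abs(rows[-1][0] - y) <= row_tol:
            rows[-1].append(y)   # close enough to the row's topmost value
        else:
            rows.append([y])     # start a new row
    return rows


ys = [700, 699, 680, 681.5, 650]
loose = group_rows(ys, row_tol=2)
strict = group_rows(ys, row_tol=0)
```

With row_tol=2 the five coordinates collapse into three rows; with row_tol=0 each distinct coordinate becomes its own row. The same idea applies horizontally to column_tol.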
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)#
Prepare the page for parsing.
- record_parse_metadata(table)#
Record data about the origin of the table.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
Table bounding boxes, sorted by position.
- Return type:
list
- class camelot.parsers.Hybrid(table_regions=None, table_areas=None, columns=None, flag_size=False, split_text=False, strip_text='', edge_tol=None, row_tol=2, column_tol=0, debug=False, **kwargs)[source]#
Defines a hybrid parser, leveraging both network and lattice parsers.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
- compute_parse_errors(table)#
Compute parse errors for the table.
- Parameters:
table (camelot.core.Table)
- Returns:
Parse errors
- Return type:
Tuple
- extract_tables()#
Extract tables from the document.
- prepare_page_parse(filename, layout, dimensions, page_idx, images, horizontal_text, vertical_text, layout_kwargs)[source]#
Prepare the page for parsing.
- Parameters:
filename ([type]) – [description]
layout ([type]) – [description]
dimensions ([type]) – [description]
page_idx ([type]) – [description]
layout_kwargs ([type]) – [description]
- record_parse_metadata(table)#
Record data about the origin of the table.
- table_bboxes()#
Return a list of table bounding boxes sorted by position.
- Returns:
Table bounding boxes, sorted by position.
- Return type:
list
Lower-Lower-Level Classes#
- class camelot.core.TableList(tables: Iterable[Table])[source]#
Defines a list of camelot.core.Table objects.
Each table can be accessed using its index.
- n#
Number of tables in the list.
- Type:
int
- export(path: str, f='csv', compress=False)[source]#
Export the list of tables to specified file format.
- Parameters:
path (str) – Output filepath.
f (str) – File format. Can be csv, excel, html, json, markdown or sqlite.
compress (bool) – Whether or not to add files to a ZIP archive.
- property n: int#
The number of tables in the list.
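A thin wrapper around export can guard against unsupported format names. The sketch below relies only on the export signature and the n attribute documented above; export_tables and SUPPORTED_FORMATS are hypothetical names, and the tables argument is assumed to be a camelot.core.TableList (or any object with the same two members).

```python
SUPPORTED_FORMATS = {"csv", "excel", "html", "json", "markdown", "sqlite"}


def export_tables(tables, path, fmt="csv", compress=False):
    """Validate fmt against the documented formats, then export.

    Returns the number of tables in the list.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    tables.export(path, f=fmt, compress=compress)
    return tables.n
```

For example, export_tables(tables, "out.csv", fmt="csv", compress=True) would request CSV output added to a ZIP archive, where "out.csv" is a placeholder path.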
- class camelot.core.Table(cols, rows)[source]#
Defines a table with coordinates relative to a left-bottom origin.
(PDF coordinate space)
- Parameters:
cols (list) – List of tuples representing column x-coordinates in increasing order.
rows (list) – List of tuples representing row y-coordinates in decreasing order.
- df#
- Type:
pandas.DataFrame
- shape#
Shape of the table.
- Type:
tuple
- accuracy#
Accuracy with which text was assigned to the cell.
- Type:
float
- whitespace#
Percentage of whitespace in the table.
- Type:
float
- filename#
Path of the original PDF.
- Type:
str
- order#
Table number on PDF page.
- Type:
int
- page#
PDF page number.
- Type:
int
- copy_spanning_text(copy_text=None)[source]#
Copies over text in empty spanning cells.
- Parameters:
copy_text (list of str, optional (default: None)) – Select one or more of the following strings: {‘h’, ‘v’} to specify the direction in which text should be copied over when a cell spans multiple rows or columns.
- Returns:
The updated table with copied text in spanning cells.
- Return type:
camelot.core.Table
- property data#
Returns two-dimensional list of strings in table.
- property parsing_report#
Returns a parsing report with % accuracy, % whitespace, table number on page, and page number.
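The attributes listed above are enough to build a report by hand or to filter out poorly parsed tables. The sketch below uses stand-in objects rather than real camelot tables so that it runs on its own; summarize and filter_by_accuracy are hypothetical helpers, and the dictionary keys and two-decimal rounding are assumptions, not a statement of what parsing_report returns.

```python
from types import SimpleNamespace


def summarize(table):
    """Collect the four quality and position attributes of a table."""
    return {
        "accuracy": round(table.accuracy, 2),
        "whitespace": round(table.whitespace, 2),
        "order": table.order,
        "page": table.page,
    }


def filter_by_accuracy(tables, minimum=90.0):
    """Keep only tables whose accuracy (a percentage) meets the minimum."""
    return [t for t in tables if t.accuracy >= minimum]


# Stand-ins that mimic the documented attributes of camelot.core.Table.
good = SimpleNamespace(accuracy=99.021, whitespace=12.5, order=1, page=3)
poor = SimpleNamespace(accuracy=41.7, whitespace=60.0, order=2, page=3)
kept = filter_by_accuracy([good, poor])
```

With real output, the same functions could be applied to each element of the list returned by camelot.read_pdf.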
- set_edges(vertical, horizontal, joint_tol=2)[source]#
Set the edges of the joint.
Set a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.
- Parameters:
vertical (list) – List of detected vertical lines.
horizontal (list) – List of detected horizontal lines.
joint_tol (int, optional (default: 2)) – Tolerance for determining proximity.
- to_csv(path, **kwargs)[source]#
Write Table(s) to a comma-separated values (csv) file.
For kwargs, check pandas.DataFrame.to_csv().
- Parameters:
path (str) – Output filepath.
- to_excel(path, **kwargs)[source]#
Write Table(s) to an Excel file.
For kwargs, check pandas.DataFrame.to_excel().
- Parameters:
path (str) – Output filepath.
- to_html(path, **kwargs)[source]#
Write Table(s) to an HTML file.
For kwargs, check pandas.DataFrame.to_html().
- Parameters:
path (str) – Output filepath.
- to_json(path, **kwargs)[source]#
Write Table(s) to a JSON file.
For kwargs, check pandas.DataFrame.to_json().
- Parameters:
path (str) – Output filepath.
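Because the four writers share the same shape (a path plus pandas keyword arguments), one way to use them is to pick the method from the output file's extension. The dispatcher below is a sketch: save_table and WRITERS are hypothetical names, and the extension mapping is an assumption chosen for illustration.

```python
from pathlib import Path

# Assumed mapping from file extension to the Table writer method name.
WRITERS = {
    ".csv": "to_csv",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".json": "to_json",
}


def save_table(table, path, **kwargs):
    """Write a table using the writer that matches the path's extension.

    Extra keyword arguments are forwarded unchanged to the writer.
    Returns the name of the method that was called.
    """
    suffix = Path(path).suffix.lower()
    if suffix not in WRITERS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    method_name = WRITERS[suffix]
    getattr(table, method_name)(str(path), **kwargs)
    return method_name
```

A call such as save_table(table, "out.csv", sep=";") would forward sep to the CSV writer, where "out.csv" is a placeholder path.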
- class camelot.core.Cell(x1, y1, x2, y2)[source]#
Defines a cell in a table.
With coordinates relative to a left-bottom origin. (PDF coordinate space)
- Parameters:
x1 (float) – x-coordinate of left-bottom point.
y1 (float) – y-coordinate of left-bottom point.
x2 (float) – x-coordinate of right-top point.
y2 (float) – y-coordinate of right-top point.
- lb#
Tuple representing left-bottom coordinates.
- Type:
tuple
- lt#
Tuple representing left-top coordinates.
- Type:
tuple
- rb#
Tuple representing right-bottom coordinates.
- Type:
tuple
- rt#
Tuple representing right-top coordinates.
- Type:
tuple
- left#
Whether or not cell is bounded on the left.
- Type:
bool
- right#
Whether or not cell is bounded on the right.
- Type:
bool
- top#
Whether or not cell is bounded on the top.
- Type:
bool
- bottom#
Whether or not cell is bounded on the bottom.
- Type:
bool
- text#
Text assigned to cell.
- Type:
str
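The four corner attributes follow directly from the constructor arguments, since (x1, y1) is the left-bottom point and (x2, y2) is the right-top point. The standalone dataclass below is a sketch of that relationship only; CellSketch is a hypothetical name and is not camelot.core.Cell.

```python
from dataclasses import dataclass


@dataclass
class CellSketch:
    """Corner points of a cell in PDF coordinate space (left-bottom origin)."""
    x1: float  # left
    y1: float  # bottom
    x2: float  # right
    y2: float  # top

    @property
    def lb(self):
        return (self.x1, self.y1)

    @property
    def lt(self):
        return (self.x1, self.y2)

    @property
    def rb(self):
        return (self.x2, self.y1)

    @property
    def rt(self):
        return (self.x2, self.y2)


cell = CellSketch(10, 20, 110, 60)
```

Here cell.lt combines the left x-coordinate with the top y-coordinate, and cell.rb combines the right x-coordinate with the bottom y-coordinate.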
Plotting#
- camelot.plot(table, kind='text', filename=None, ax=None)#
Classmethod for plotting methods.
- class camelot.plotting.PlotMethods[source]#
Classmethod for plotting methods.
- static contour(table, ax=None)[source]#
Generate a plot for all table boundaries present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static grid(table, ax=None)[source]#
Generate a plot for the detected table grids on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static joint(table, ax=None)[source]#
Generate a plot for all line intersections present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static line(table, ax=None)[source]#
Generate a plot for all line segments present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static network_table_search(table, ax=None)[source]#
Generate a plot illustrating the steps of the network table search.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- text(table, ax=None)[source]#
Generate a plot for all text elements present on the PDF page.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- static textedge(table, ax=None)[source]#
Generate a plot for relevant textedges.
- Parameters:
table (camelot.core.Table)
ax (matplotlib.axes.Axes (optional))
- Returns:
fig
- Return type:
matplotlib.figure.Figure
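The kind argument of camelot.plot corresponds to the method names listed above. As a sketch, the wrapper below rejects unknown kinds before calling camelot.plot with the signature shown in this section. plot_table and PLOT_KINDS are hypothetical names, and camelot (with its plotting dependencies) is imported lazily, so it is only needed when a plot is actually drawn.

```python
PLOT_KINDS = {
    "contour", "grid", "joint", "line",
    "network_table_search", "text", "textedge",
}


def plot_table(table, kind="text", filename=None):
    """Check kind against the documented plot methods, then plot."""
    if kind not in PLOT_KINDS:
        raise ValueError(
            f"unknown kind {kind!r}; expected one of {sorted(PLOT_KINDS)}"
        )
    import camelot  # imported lazily; requires camelot to be installed
    return camelot.plot(table, kind=kind, filename=filename)
```

Whether a given kind produces a useful plot may depend on the flavor used to extract the table; this sketch does not check that.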