API Reference
Main Interface
- camelot.read_pdf(filepath: str | IO | Path, pages='1', password=None, flavor='lattice', suppress_stdout=False, layout_kwargs=None, **kwargs)
Read PDF and return extracted tables.
Note: kwargs annotated with ^ can only be used with flavor=’stream’ and kwargs annotated with * can only be used with flavor=’lattice’.
- Parameters:
filepath (str, Path, IO) – Filepath or URL of the PDF file.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
password (str, optional (default: None)) – Password for decryption.
flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.
suppress_stdout (bool, optional (default: False)) – Suppress logs and warnings.
layout_kwargs (dict, optional (default: {})) – A dict of pdfminer.layout.LAParams kwargs.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns^ (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
row_tol^ (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol^ (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
process_background* (bool, optional (default: False)) – Process background lines.
line_scale* (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.
copy_text* (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text* (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
line_tol* (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol* (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize* (int, optional (default: 15)) – Size of the pixel neighborhood used to calculate a threshold value for each pixel: 3, 5, 7, and so on. For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant* (int, optional (default: -2)) – Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well. For more information, refer to OpenCV’s adaptiveThreshold.
iterations* (int, optional (default: 0)) – Number of times erosion/dilation is applied. For more information, refer to OpenCV’s dilate.
resolution* (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
- Returns:
tables – List of tables found in PDF.
- Return type:
camelot.core.TableList
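The ^ and * annotations in the parameter list above can be checked before a call is made. The following is a minimal sketch, not part of Camelot: `check_flavor_kwargs` and `extract` are hypothetical helpers, and the two sets simply restate the annotations from this page.

```python
# Keyword arguments marked ^ (stream only) and * (lattice only) in the list above.
STREAM_ONLY = {"columns", "row_tol", "column_tol"}
LATTICE_ONLY = {
    "process_background", "line_scale", "copy_text", "shift_text",
    "line_tol", "joint_tol", "threshold_blocksize", "threshold_constant",
    "iterations", "resolution",
}


def check_flavor_kwargs(flavor, **kwargs):
    """Return kwargs unchanged, or raise if one belongs to the other flavor."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    forbidden = STREAM_ONLY if flavor == "lattice" else LATTICE_ONLY
    misplaced = sorted(forbidden.intersection(kwargs))
    if misplaced:
        raise ValueError(f"{misplaced} cannot be used with flavor={flavor!r}")
    return kwargs


def extract(path, pages="1", flavor="lattice", **kwargs):
    """Hypothetical wrapper: validate kwargs, then call camelot.read_pdf."""
    import camelot  # imported lazily; requires Camelot to be installed

    checked = check_flavor_kwargs(flavor, **kwargs)
    return camelot.read_pdf(path, pages=pages, flavor=flavor, **checked)
```

With Camelot installed and a placeholder file such as `report.pdf`, a call like `extract("report.pdf", pages="1,4-end", flavor="stream", row_tol=10)` would pass the check, while `row_tol` combined with `flavor="lattice"` would be rejected before any parsing starts.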
Lower-Level Classes
- class camelot.handlers.PDFHandler(filepath: str | IO | Path, pages='1', password=None)
Handles all operations like temp directory creation, splitting file into single page PDFs, parsing each PDF and then removing the temp directory.
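- Parameters:
filepath (str, Path, IO) – Filepath or URL of the PDF file.
pages (str, optional (default: '1')) – Comma-separated page numbers. Example: ‘1,3,4’ or ‘1,4-end’ or ‘all’.
password (str, optional (default: None)) – Password for decryption.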
- parse(flavor='lattice', suppress_stdout=False, layout_kwargs=None, **kwargs)
Extracts tables by calling parser.get_tables on all single page PDFs.
- Parameters:
flavor (str (default: 'lattice')) – The parsing method to use (‘lattice’ or ‘stream’). Lattice is used by default.
suppress_stdout (bool (default: False)) – Suppress logs and warnings.
layout_kwargs (dict, optional (default: {})) –
A dict of pdfminer.layout.LAParams kwargs.
kwargs (dict) – See camelot.read_pdf kwargs.
- Returns:
tables – List of tables found in PDF.
- Return type:
camelot.core.TableList
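The `pages` string accepted by PDFHandler and camelot.read_pdf ('1,3,4', '1,4-end', 'all') maps to a list of page numbers once the page count is known. The sketch below illustrates that mapping only; `expand_pages` is a hypothetical helper written for this example and is not Camelot's own implementation.

```python
def expand_pages(pages, page_count):
    """Expand a spec such as '1,4-end' into a sorted list of 1-based page numbers."""
    if pages == "all":
        return list(range(1, page_count + 1))
    selected = set()
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as '4-7' or '4-end' (inclusive on both sides).
            start, end = part.split("-")
            last = page_count if end.strip() == "end" else int(end)
            selected.update(range(int(start), last + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


print(expand_pages("1,4-end", 6))
```

For a six-page document the spec '1,4-end' selects pages 1, 4, 5 and 6.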
- class camelot.parsers.Stream(table_regions=None, table_areas=None, columns=None, split_text=False, flag_size=False, strip_text='', edge_tol=50, row_tol=2, column_tol=0, **kwargs)
Stream method of parsing looks for spaces between text to parse the table.
If you want to specify columns when specifying multiple table areas, make sure that both lists have the same length.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
columns (list, optional (default: None)) – List of column x-coordinates strings where the coordinates are comma-separated.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
edge_tol (int, optional (default: 50)) – Tolerance parameter for extending textedges vertically.
row_tol (int, optional (default: 2)) – Tolerance parameter used to combine text vertically, to generate rows.
column_tol (int, optional (default: 0)) – Tolerance parameter used to combine text horizontally, to generate columns.
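Both `table_areas` and `columns` are lists of comma-separated strings, and they must have the same length when used together. A minimal sketch of building them from numbers follows; `area_string` and `stream_kwargs` are hypothetical helpers, and the coordinates are made-up values in PDF coordinate space (origin at the left-bottom, so the top y is larger than the bottom y).

```python
def area_string(x1, y1, x2, y2):
    """Format one table area: (x1, y1) is left-top, (x2, y2) is right-bottom."""
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected left-top then right-bottom in PDF coordinates")
    return f"{x1},{y1},{x2},{y2}"


def stream_kwargs(areas, columns):
    """Build matching table_areas and columns lists for flavor='stream'."""
    if len(areas) != len(columns):
        raise ValueError("table_areas and columns must have the same length")
    return {
        "table_areas": [area_string(*area) for area in areas],
        "columns": [",".join(str(x) for x in xs) for xs in columns],
    }


print(stream_kwargs([(50, 700, 550, 100)], [[120, 250, 400]]))
```

The resulting dict can be unpacked into a call such as `camelot.read_pdf(path, flavor='stream', **kwargs)`.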
- class camelot.parsers.Lattice(table_regions=None, table_areas=None, process_background=False, line_scale=15, copy_text=None, shift_text=['l', 't'], split_text=False, flag_size=False, strip_text='', line_tol=2, joint_tol=2, threshold_blocksize=15, threshold_constant=-2, iterations=0, resolution=300, backend='ghostscript', **kwargs)
Lattice method of parsing looks for lines between text to parse the table.
- Parameters:
table_regions (list, optional (default: None)) – List of page regions that may contain tables of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
table_areas (list, optional (default: None)) – List of table area strings of the form x1,y1,x2,y2 where (x1, y1) -> left-top and (x2, y2) -> right-bottom in PDF coordinate space.
process_background (bool, optional (default: False)) – Process background lines.
line_scale (int, optional (default: 15)) – Line size scaling factor. The larger the value the smaller the detected lines. Making it very large will lead to text being detected as lines.
copy_text (list, optional (default: None)) – {‘h’, ‘v’} Direction in which text in a spanning cell will be copied over.
shift_text (list, optional (default: ['l', 't'])) – {‘l’, ‘r’, ‘t’, ‘b’} Direction in which text in a spanning cell will flow.
split_text (bool, optional (default: False)) – Split text that spans across multiple cells.
flag_size (bool, optional (default: False)) – Flag text based on font size. Useful to detect super/subscripts. Adds <s></s> around flagged text.
strip_text (str, optional (default: '')) – Characters that should be stripped from a string before assigning it to a cell.
line_tol (int, optional (default: 2)) – Tolerance parameter used to merge close vertical and horizontal lines.
joint_tol (int, optional (default: 2)) – Tolerance parameter used to decide whether the detected lines and points lie close to each other.
threshold_blocksize (int, optional (default: 15)) – Size of the pixel neighborhood used to calculate a threshold value for each pixel: 3, 5, 7, and so on. For more information, refer to OpenCV’s adaptiveThreshold.
threshold_constant (int, optional (default: -2)) – Constant subtracted from the mean or weighted mean. Normally, it is positive but may be zero or negative as well. For more information, refer to OpenCV’s adaptiveThreshold.
iterations (int, optional (default: 0)) – Number of times erosion/dilation is applied. For more information, refer to OpenCV’s dilate.
resolution (int, optional (default: 300)) – Resolution used for PDF to PNG conversion.
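Several Lattice options accept only a restricted set of values: `threshold_blocksize` must be odd (3, 5, 7, and so on), `copy_text` takes 'h' and 'v', and `shift_text` takes 'l', 'r', 't' and 'b'. The sketch below checks those constraints up front; `lattice_kwargs` is a hypothetical helper, not a Camelot function.

```python
COPY_DIRECTIONS = {"h", "v"}
SHIFT_DIRECTIONS = {"l", "r", "t", "b"}


def lattice_kwargs(threshold_blocksize=15, copy_text=None,
                   shift_text=("l", "t"), **rest):
    """Validate a few Lattice options and return them as a kwargs dict."""
    # OpenCV's adaptiveThreshold needs an odd neighborhood size greater than 1.
    if threshold_blocksize < 3 or threshold_blocksize % 2 == 0:
        raise ValueError("threshold_blocksize must be odd and at least 3")
    bad_copy = set(copy_text or []) - COPY_DIRECTIONS
    bad_shift = set(shift_text) - SHIFT_DIRECTIONS
    if bad_copy or bad_shift:
        raise ValueError(f"invalid direction(s): {sorted(bad_copy | bad_shift)}")
    return {
        "threshold_blocksize": threshold_blocksize,
        "copy_text": copy_text,
        "shift_text": list(shift_text),
        **rest,
    }


print(lattice_kwargs(copy_text=["v"], line_scale=40))
```

Any other keyword (here `line_scale`) is passed through unchanged, so the result can be unpacked into a call that uses flavor='lattice'.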
Lower-Lower-Level Classes
- class camelot.core.TableList(tables)
Defines a list of camelot.core.Table objects. Each table can be accessed using its index.
- class camelot.core.Table(cols, rows)
Defines a table with coordinates relative to a left-bottom origin. (PDF coordinate space)
- Parameters:
cols (list) – List of tuples representing column x-coordinates in increasing order.
rows (list) – List of tuples representing row y-coordinates in decreasing order.
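As an illustration of the expected shapes, the sketch below turns grid-line positions into `cols` and `rows` tuples. It assumes each tuple holds the two boundaries of one column or row; `intervals` is a hypothetical helper and the coordinates are made-up values.

```python
def intervals(boundaries, reverse=False):
    """Pair up consecutive boundaries: [a, b, c] -> [(a, b), (b, c)]."""
    ordered = sorted(boundaries, reverse=reverse)
    return list(zip(ordered, ordered[1:]))


# Column x-coordinates in increasing order (left to right).
cols = intervals([0, 100, 250])
# Row y-coordinates in decreasing order (top to bottom, left-bottom origin).
rows = intervals([0, 400, 800], reverse=True)

print(cols)
print(rows)
```

Three vertical and three horizontal grid lines give two columns and two rows, that is, a 2 x 2 grid of cells.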
- df
- Type:
pandas.DataFrame
- shape
Shape of the table.
- Type:
tuple
- property data
Returns two-dimensional list of strings in table.
- property parsing_report
Returns a parsing report with %accuracy, %whitespace, table number on page and page number.
- set_edges(vertical, horizontal, joint_tol=2)
Sets a cell’s edges to True depending on whether the cell’s coordinates overlap with the line’s coordinates within a tolerance.
- Parameters:
vertical (list) – List of detected vertical lines.
horizontal (list) – List of detected horizontal lines.
- set_span()
Sets a cell’s hspan or vspan attribute to True depending on whether the cell spans horizontally or vertically.
- to_csv(path, **kwargs)
Writes Table to a comma-separated values (csv) file.
For kwargs, check pandas.DataFrame.to_csv().
- Parameters:
path (str) – Output filepath.
- to_excel(path, **kwargs)
Writes Table to an Excel file.
For kwargs, check pandas.DataFrame.to_excel().
- Parameters:
path (str) – Output filepath.
- to_html(path, **kwargs)
Writes Table to an HTML file.
For kwargs, check pandas.DataFrame.to_html().
- Parameters:
path (str) – Output filepath.
- to_json(path, **kwargs)
Writes Table to a JSON file.
For kwargs, check pandas.DataFrame.to_json().
- Parameters:
path (str) – Output filepath.
- to_markdown(path, **kwargs)
Writes Table to a Markdown file.
For kwargs, check pandas.DataFrame.to_markdown().
- Parameters:
path (str) – Output filepath.
- to_sqlite(path, **kwargs)
Writes Table to a SQLite database.
For kwargs, check pandas.DataFrame.to_sql().
- Parameters:
path (str) – Output filepath.
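Because every writer above takes `(path, **kwargs)`, the output format can be chosen from the file extension. This is a minimal sketch of that idea; `export` and the extension mapping are hypothetical conveniences, and the function works with any object that exposes the `to_*` methods listed above.

```python
from pathlib import Path

# Assumed mapping from file extension to the Table writer documented above.
WRITERS = {
    ".csv": "to_csv",
    ".xlsx": "to_excel",
    ".html": "to_html",
    ".json": "to_json",
    ".md": "to_markdown",
    ".sqlite": "to_sqlite",
}


def export(table, path, **kwargs):
    """Call the writer that matches the extension of path; return its name."""
    suffix = Path(path).suffix.lower()
    if suffix not in WRITERS:
        raise ValueError(f"no writer for extension {suffix!r}")
    method = WRITERS[suffix]
    # kwargs are forwarded untouched, as with the underlying pandas call.
    getattr(table, method)(str(path), **kwargs)
    return method
```

Given a table taken from the list returned by camelot.read_pdf, `export(table, "out.csv")` would call `table.to_csv("out.csv")`.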
- class camelot.core.Cell(x1, y1, x2, y2)
Defines a cell in a table with coordinates relative to a left-bottom origin. (PDF coordinate space)
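- Parameters:
x1 (float) – x-coordinate of left-bottom point.
y1 (float) – y-coordinate of left-bottom point.
x2 (float) – x-coordinate of right-top point.
y2 (float) – y-coordinate of right-top point.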
- lb
Tuple representing left-bottom coordinates.
- Type:
tuple
- lt
Tuple representing left-top coordinates.
- Type:
tuple
- rb
Tuple representing right-bottom coordinates.
- Type:
tuple
- rt
Tuple representing right-top coordinates.
- Type:
tuple
- text
Text assigned to cell.
- Type:
string
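The four corner attributes follow from the constructor arguments once the orientation is fixed. The sketch below assumes (x1, y1) is the left-bottom corner and (x2, y2) the right-top corner in PDF coordinate space; `CellBox` is a hypothetical stand-in used for illustration, not the Camelot class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellBox:
    """Axis-aligned box with a left-bottom origin (assumed orientation)."""
    x1: float  # left
    y1: float  # bottom
    x2: float  # right
    y2: float  # top

    @property
    def lb(self):
        return (self.x1, self.y1)

    @property
    def lt(self):
        return (self.x1, self.y2)

    @property
    def rb(self):
        return (self.x2, self.y1)

    @property
    def rt(self):
        return (self.x2, self.y2)


box = CellBox(10, 20, 110, 50)
print(box.lb, box.lt, box.rb, box.rt)
```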