How It Works

This part of the documentation gives a high-level explanation of how Camelot extracts tables from PDF files.

You can choose between two table parsing methods, Stream and Lattice. The names of these parsing methods in Camelot were inspired by Tabula.
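As a minimal sketch of how the choice is made in practice, the snippet below wraps `camelot.read_pdf` and passes the parsing method through its `flavor` argument. The wrapper name `extract_tables` and the file name `foo.pdf` are placeholders introduced for illustration, and the wrapper accepts only the two methods described on this page.

```python
def extract_tables(pdf_path, flavor="lattice", pages="1"):
    """Return the tables Camelot finds on the given pages with the chosen method."""
    # Accept only the two parsing methods described in this section.
    if flavor not in ("stream", "lattice"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    # Imported lazily so the argument check above runs even without Camelot installed.
    import camelot
    return camelot.read_pdf(pdf_path, flavor=flavor, pages=pages)


# Usage (assumes a local file named foo.pdf, which is a placeholder):
#   tables = extract_tables("foo.pdf", flavor="stream")
#   print(tables[0].df)  # each extracted table exposes a pandas DataFrame
```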

Stream

Stream can be used to parse tables that use whitespace between cells to simulate a table structure. It looks for these gaps between pieces of text to form a table representation.

It builds on PDFMiner’s ability to group characters on a page into words and sentences using margins. After getting the words on a page, Stream groups them into rows based on their y coordinates. It then guesses the number of columns in the table by calculating the mode of the number of words in each row. This mode is used to calculate x ranges for the table’s columns. Finally, Stream adds columns to this list of column x ranges based on any words that lie outside or inside the current ranges.
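The row grouping and column guess described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Camelot's actual code: words are hypothetical `(text, x0, x1, y)` tuples, the tolerance `y_tol` is an invented parameter, the column ranges are taken from the rows whose word count equals the mode, and the final step of adding extra columns is left out.

```python
from collections import Counter


def group_rows(words, y_tol=2.0):
    """Group (text, x0, x1, y) words into rows whose y values lie within y_tol."""
    rows = []
    for word in sorted(words, key=lambda w: -w[3]):  # top of the page first
        if rows and abs(rows[-1][0][3] - word[3]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order the words in each row from left to right.
    return [sorted(row, key=lambda w: w[1]) for row in rows]


def guess_columns(rows):
    """Guess column x ranges from the rows whose word count equals the mode."""
    n_cols = Counter(len(row) for row in rows).most_common(1)[0][0]
    typical = [row for row in rows if len(row) == n_cols]
    return [
        (min(r[i][1] for r in typical), max(r[i][2] for r in typical))
        for i in range(n_cols)
    ]


# Made-up sample: two three-word rows and one two-word row.
words = [
    ("Name", 10, 40, 100), ("Qty", 60, 80, 100), ("Price", 110, 140, 100),
    ("Apple", 10, 38, 88), ("3", 62, 68, 88), ("1.20", 112, 135, 88),
    ("Total", 10, 36, 76), ("3.60", 112, 135, 76),
]
columns = guess_columns(group_rows(words))
print(columns)  # [(10, 40), (60, 80), (110, 140)]
```

In this sample the mode of the row lengths is three, so the two-word "Total" row does not influence the guessed ranges.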

Note

By default, Stream treats the whole PDF page as a table, which isn’t ideal when a page contains two or more tables with different numbers of columns. Automatic table detection for Stream is in the works.

Lattice

Lattice is more deterministic in nature, and it does not rely on guesses. It can be used to parse tables that have demarcated lines between cells, and it can automatically parse multiple tables present on a page.

It starts by converting the PDF page to an image using Ghostscript, and then processes the image to get horizontal and vertical line segments by applying a set of morphological transformations (erosion and dilation) using OpenCV.
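To show why erosion followed by dilation isolates ruling lines, here is a small pure-Python sketch on a binary image stored as nested lists. It stands in for the OpenCV calls rather than reproducing them, and the helper names and the kernel length `k` are assumptions made for this example. Erosion with a 1×k kernel removes every run of set pixels shorter than k, and dilation then restores the surviving runs to their full length.

```python
def erode_h(img, k):
    """Keep a pixel only if it starts a run of k set pixels to its right."""
    w = len(img[0])
    return [[int(x + k <= w and all(row[x:x + k])) for x in range(w)] for row in img]


def dilate_h(img, k):
    """Set a pixel if any of the k pixels ending at it is set."""
    w = len(img[0])
    return [[int(any(row[max(0, x - k + 1):x + 1])) for x in range(w)] for row in img]


def open_horizontal(img, k):
    """Erode then dilate: only horizontal runs of at least k pixels survive."""
    return dilate_h(erode_h(img, k), k)


def transpose(img):
    return [list(col) for col in zip(*img)]


def open_vertical(img, k):
    """Apply the same operation to columns by transposing the image."""
    return transpose(open_horizontal(transpose(img), k))


# A tiny "page": a ruled box with a middle divider and one stray text pixel.
img = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
horizontal = open_horizontal(img, 4)  # only the top and bottom rows remain
vertical = open_vertical(img, 4)      # only columns 0, 3, and 6 remain
```

The stray pixel at row 2, column 2 belongs to no run of four pixels in either direction, so it appears in neither mask.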

Let’s see how Lattice processes the second page of this PDF, step by step.

  1. Line segments are detected.
../_images/geometry_line.png
  2. Line intersections are detected, by overlapping the detected line segments and “and”ing their pixel intensities.
../_images/geometry_joint.png
  3. Table boundaries are computed by overlapping the detected line segments again, this time by “or”ing their pixel intensities.
../_images/geometry_contour.png
  4. Since dimensions of the PDF page and its image vary, the detected table boundaries, line intersections, and line segments are scaled and translated to the PDF page’s coordinate space, and a representation of the table is created.
../_images/table.png
  5. Spanning cells are detected using the line segments and line intersections.
../_images/geometry_table.png
  6. Finally, the words found on the page are assigned to the table’s cells based on their x and y coordinates.
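The steps above can be sketched end to end on a toy example. The code below is an illustration, not Camelot's implementation: the masks are hand-written binary images, all function names are hypothetical, the coordinate conversion assumes the image origin is at the top-left and the PDF origin at the bottom-left, and spanning-cell detection is omitted.

```python
def mask_and(a, b):
    """Pixelwise AND: set only where both masks are set (line intersections)."""
    return [[p & q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def mask_or(a, b):
    """Pixelwise OR: set where either mask is set (table outline)."""
    return [[p | q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def to_pdf_coords(x, y, img_w, img_h, pdf_w, pdf_h):
    """Scale an image pixel to PDF units, flipping y for the bottom-left origin."""
    return x * pdf_w / img_w, pdf_h - y * pdf_h / img_h


def find_cell(x, y, col_edges, row_edges):
    """Return (row, col) of the cell containing the point (x, y), or None."""
    for r in range(len(row_edges) - 1):
        for c in range(len(col_edges) - 1):
            in_x = col_edges[c] <= x <= col_edges[c + 1]
            in_y = row_edges[r + 1] <= y <= row_edges[r]
            if in_x and in_y:
                return r, c
    return None


# Hand-written masks for a box with one vertical divider.
horizontal = [[1] * 7, [0] * 7, [0] * 7, [1] * 7]
vertical = [[1, 0, 0, 1, 0, 0, 1]] * 4

joints = mask_and(horizontal, vertical)   # six intersection pixels
boundary = mask_or(horizontal, vertical)  # full outline of the table

# Map intersection pixels to a hypothetical 600 x 300 PDF page.
points = [
    to_pdf_coords(x, y, 6, 3, 600, 300)
    for y, row in enumerate(joints)
    for x, value in enumerate(row)
    if value
]
col_edges = sorted({p[0] for p in points})                # left to right
row_edges = sorted({p[1] for p in points}, reverse=True)  # top to bottom

# A word centred at (450, 150) in PDF space lands in row 0, column 1.
cell = find_cell(450, 150, col_edges, row_edges)
```

Deriving the cell edges from the intersection points keeps the example short. A real table also needs the line segments themselves to decide which neighbouring cells merge into a spanning cell.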