Advanced Usage

This page covers some of the more advanced configurations for Lattice and Stream.

Process background lines

To detect line segments, Lattice needs the lines that make the table to be in the foreground. Here’s an example of a table with lines in the background:

A table with lines in background

Source: PDF

To process background lines, you can pass process_background=True.

>>> import camelot
>>> tables = camelot.read_pdf('background_lines.pdf', process_background=True)
>>> tables[1].df
State Date Halt stations Halt days Persons directly reached(in lakh) Persons trained Persons counseled Persons testedfor HIV
Delhi 1.12.2009 8 17 1.29 3,665 2,409 1,000
Rajasthan 2.12.2009 to 19.12.2009            
Gujarat 20.12.2009 to 3.1.2010 6 13 6.03 3,810 2,317 1,453
Maharashtra 4.01.2010 to 1.2.2010 13 26 1.27 5,680 9,027 4,153
Karnataka 2.2.2010 to 22.2.2010 11 19 1.80 5,741 3,658 3,183
Kerala 23.2.2010 to 11.3.2010 9 17 1.42 3,559 2,173 855
Total   47 92 11.81 22,455 19,584 10,644

Visual debugging

Note

Visual debugging using plot() requires matplotlib, which is an optional dependency. You can install it using $ pip install camelot-py[plot].

You can use the plot() method to generate a matplotlib plot of various elements that were detected on the PDF page while processing it. This can help you select table areas, column separators and debug bad table outputs, by tweaking different configuration parameters.

You can specify the type of element you want to plot using the kind keyword argument. The generated plot can be saved to a file by passing a filename keyword argument. The following plot types are supported:

  • ‘text’
  • ‘grid’
  • ‘contour’
  • ‘line’
  • ‘joint’

Note

The last three plot types can only be used with Lattice, i.e. when flavor='lattice'.

Let’s generate a plot for each type using this PDF as an example. First, let’s get all the tables out.

>>> import matplotlib.pyplot as plt
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>

text

Let’s plot all the text present on the table’s PDF page.

>>> camelot.plot(tables[0], kind='text')
>>> plt.show()
A plot of all text on a PDF page

This, as we shall later see, is very helpful with Stream for noting table areas and column separators, in case Stream does not guess them correctly.

Note

The x-y coordinates shown above change as you move your mouse cursor on the image, which can help you note coordinates.

grid

Let’s plot the table (to see if it was detected correctly or not). This plot type, along with contour, line and joint, is useful for debugging and improving the extraction output in case the table wasn’t detected correctly. (More on that later.)

>>> camelot.plot(tables[0], kind='grid')
>>> plt.show()
A plot of all tables on a PDF page

The table is perfect!

contour

Now, let’s plot all table boundaries present on the table’s PDF page.

>>> camelot.plot(tables[0], kind='contour')
>>> plt.show()
A plot of all contours on a PDF page

line

Cool, let’s plot all line segments present on the table’s PDF page.

>>> camelot.plot(tables[0], kind='line')
>>> plt.show()
A plot of all lines on a PDF page

joint

Finally, let’s plot all line intersections present on the table’s PDF page.

>>> camelot.plot(tables[0], kind='joint')
>>> plt.show()
A plot of all line intersections on a PDF page

Specify table areas

In some cases, such as when Stream does not guess the table area correctly, it can be useful to specify table boundaries. You can plot the text on this page and note the top left and bottom right coordinates of the table.

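Table areas that you want Camelot to analyze can be passed as a list of comma-separated strings to read_pdf(), using the table_areas keyword argument. Each string has the form x1,y1,x2,y2, where (x1, y1) is the top-left and (x2, y2) the bottom-right corner of the table in PDF coordinate space.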

>>> tables = camelot.read_pdf('table_areas.pdf', flavor='stream', table_areas=['316,499,566,337'])
>>> tables[0].df
  One Withholding
Payroll Period Allowance
Weekly $71.15
Biweekly 142.31
Semimonthly 154.17
Monthly 308.33
Quarterly 925.00
Semiannually 1,850.00
Annually 3,700.00
Daily or Miscellaneous 14.23
(each day of the payroll period)  
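If you note the two corners from a text plot, turning them into the string that table_areas expects is a one-liner. The helper below is only a sketch: area_string is a hypothetical name, not part of Camelot, and it assumes PDF coordinate space, where the origin is at the bottom-left of the page and the top-left corner therefore has the larger y value.

```python
def area_string(top_left, bottom_right):
    """Format two (x, y) points as an 'x1,y1,x2,y2' table area string.

    Hypothetical helper, not part of Camelot. Assumes PDF coordinate
    space, where the origin is at the bottom-left of the page, so the
    top-left corner has the larger y value.
    """
    x1, y1 = top_left
    x2, y2 = bottom_right
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected top-left first, then bottom-right")
    return ",".join(str(v) for v in (x1, y1, x2, y2))


# The area used in the example above.
areas = [area_string((316, 499), (566, 337))]
print(areas)
```

The resulting list can be passed as table_areas=areas to read_pdf(). The corner check catches the common mistake of swapping the two points.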

Specify column separators

In cases like these, where the text in adjacent columns is very close together, Camelot may guess the column separators’ coordinates incorrectly. To correct this, you can plot the text on the page, note the x coordinate of each column separator, and specify those coordinates explicitly.

You can pass the column separators as a list of comma-separated strings to read_pdf(), using the columns keyword argument.

If you pass a list with a single column separators string and do not specify a table area, the separators are applied to the whole page. When you specify a list of table areas and need column separators as well, both lists should have the same length. Each table area is then matched with the column separators string at the same index.

For example, if you have specified two table areas, table_areas=['12,54,43,23', '20,67,55,33'], and only want to specify column separators for the first table, you can pass an empty string for the second table in the column separators’ list like this, columns=['10,120,200,400', ''].
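The index-based pairing described above can be sketched in a few lines of plain Python. This is an illustration of the rule only: pair_areas_with_columns is a hypothetical helper, not a Camelot function, and Camelot does this matching for you when you pass both keyword arguments.

```python
def pair_areas_with_columns(table_areas, columns):
    """Return (area, separators) pairs matched by index.

    Hypothetical helper, not part of Camelot; it only mirrors the rule
    described above. An empty separators string becomes an empty list.
    """
    if len(table_areas) != len(columns):
        raise ValueError("table_areas and columns must have the same length")
    pairs = []
    for area, cols in zip(table_areas, columns):
        # Parse "10,120,200,400" into [10.0, 120.0, 200.0, 400.0].
        seps = [float(x) for x in cols.split(",")] if cols else []
        pairs.append((area, seps))
    return pairs


pairs = pair_areas_with_columns(
    ["12,54,43,23", "20,67,55,33"],
    ["10,120,200,400", ""],
)
for area, seps in pairs:
    print(area, seps)
```

The second table ends up with no separators, which is the effect of passing an empty string for it.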

Let’s get back to the x coordinates we got from plotting the text that exists on this PDF, and get the table out!

>>> tables = camelot.read_pdf('column_separators.pdf', flavor='stream', columns=['72,95,209,327,442,529,566,606,683'])
>>> tables[0].df
LICENSE       PREMISE          
NUMBER TYPE DBA NAME     LICENSEE NAME ADDRESS CITY ST ZIP PHONE NUMBER EXPIRES

Ah! Since PDFMiner merged the strings “NUMBER”, “TYPE” and “DBA NAME”, all of them were assigned to the same cell. Let’s see how we can fix this in the next section.

Split text along separators

To deal with cases like the output from the previous section, you can pass split_text=True to read_pdf(), which will split any strings that lie in different cells but have been assigned to a single cell (as a result of being merged together by PDFMiner).

>>> tables = camelot.read_pdf('column_separators.pdf', flavor='stream', columns=['72,95,209,327,442,529,566,606,683'], split_text=True)
>>> tables[0].df
LICENSE       PREMISE          
NUMBER TYPE DBA NAME LICENSEE NAME ADDRESS CITY ST ZIP PHONE NUMBER EXPIRES
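To build some intuition for what splitting along separators means, here is a toy model. It is not Camelot's or PDFMiner's code: the characters are represented as made-up (x, character) pairs, and split_along_separators is a hypothetical function that drops each character into a column based on where its x coordinate falls relative to the separators.

```python
from bisect import bisect_right


def split_along_separators(chars, separators):
    """Distribute (x, character) pairs into columns.

    Toy model, not Camelot's implementation. `separators` must be
    sorted x coordinates; a character at position x goes to the column
    whose index equals the number of separators at or left of x.
    """
    cells = ["" for _ in range(len(separators) + 1)]
    for x, ch in chars:
        cells[bisect_right(separators, x)] += ch
    return [cell.strip() for cell in cells]


# "NUMBER TYPE" as one merged string, with made-up x positions
# spaced 5 units apart starting at x=40.
merged = "NUMBER TYPE"
chars = [(40 + 5 * i, ch) for i, ch in enumerate(merged)]
print(split_along_separators(chars, [72]))
```

With a separator at x=72, the merged string comes apart into one piece per column instead of landing in a single cell.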

Flag superscripts and subscripts

There might be cases where you want to differentiate between the text and superscripts or subscripts, like this PDF.

A PDF with superscripts

In this case, the text that other tools return will be 24.912. This is relatively harmless when the decimal point is involved. But when it isn’t there, you’ll be left wondering why the results of your data analysis are 10x bigger!

You can solve this by passing flag_size=True, which will enclose the superscripts and subscripts with <s></s>, based on font size, as shown below.

>>> tables = camelot.read_pdf('superscript.pdf', flavor='stream', flag_size=True)
>>> tables[0].df
Karnataka 22.44 19.59
2.86 1.22
0.89
0.69
Kerala 29.03 24.91<s>2</s>
4.11 1.77
0.48
1.45
Madhya Pradesh 27.13 23.57
3.56 0.38
1.86
1.28
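Once the superscripts and subscripts are flagged, a cleaning script can drop or collect them with a regular expression. The sketch below uses only the standard library; strip_flagged and flagged_parts are hypothetical helper names, and the only assumption is the <s></s> markup shown in the output above.

```python
import re

# Matches one flagged span, capturing the text between the tags.
FLAGGED = re.compile(r"<s>(.*?)</s>")


def strip_flagged(cell):
    """Remove <s>...</s> spans from a cell's text."""
    return FLAGGED.sub("", cell)


def flagged_parts(cell):
    """Return the text of every <s>...</s> span in a cell."""
    return FLAGGED.findall(cell)


cell = "24.91<s>2</s>"
print(strip_flagged(cell))   # 24.91
print(flagged_parts(cell))   # ['2']
```

Applying strip_flagged to every cell before converting to numbers keeps a footnote marker like the 2 above out of your data.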

Control how text is grouped into rows

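You can pass row_close_tol=<+int> (a positive integer) to group text lines that lie close together vertically into the same row, as shown below. The first call uses the default, and the second passes row_close_tol=10.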

>>> tables = camelot.read_pdf('group_rows.pdf', flavor='stream')
>>> tables[0].df
Clave   Clave     Clave  
  Nombre Entidad     Nombre Municipio   Nombre Localidad
Entidad   Municipio     Localidad  
01 Aguascalientes 001 Aguascalientes   0094 Granja Adelita
01 Aguascalientes 001 Aguascalientes   0096 Agua Azul
01 Aguascalientes 001 Aguascalientes   0100 Rancho Alegre
>>> tables = camelot.read_pdf('group_rows.pdf', flavor='stream', row_close_tol=10)
>>> tables[0].df
Clave Nombre Entidad Clave   Nombre Municipio Clave Nombre Localidad
Entidad   Municipio     Localidad  
01 Aguascalientes 001 Aguascalientes   0094 Granja Adelita
01 Aguascalientes 001 Aguascalientes   0096 Agua Azul
01 Aguascalientes 001 Aguascalientes   0100 Rancho Alegre
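The effect of the tolerance can be pictured with a simplified model. This is not Camelot's actual grouping code: group_rows is a hypothetical function, and the y coordinates are made up. A value joins the current row when it lies within the tolerance of the previous value; otherwise it starts a new row.

```python
def group_rows(y_coords, row_close_tol):
    """Group y coordinates into rows, top of the page first.

    Simplified model, not Camelot's implementation: a value joins the
    current row when it is within `row_close_tol` of the previous
    value, otherwise it starts a new row.
    """
    rows = []
    previous = None
    for y in sorted(y_coords, reverse=True):
        if previous is not None and previous - y <= row_close_tol:
            rows[-1].append(y)
        else:
            rows.append([y])
        previous = y
    return rows


ys = [100, 98, 92, 80]  # made-up text baselines
print(group_rows(ys, row_close_tol=2))
print(group_rows(ys, row_close_tol=10))
```

With the larger tolerance, the text at y=92 is pulled into the first row, much like the three header lines collapsing into fewer rows in the output above.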

Detect short lines

There might be cases while using Lattice when smaller lines don’t get detected. The size of the smallest line that gets detected is calculated by dividing the PDF page’s dimensions by a scaling factor called line_size_scaling. By default, its value is 15.

As you can guess, the larger the line_size_scaling, the smaller the size of lines getting detected.

Warning

Making line_size_scaling very large (>150) will lead to text getting detected as lines.
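To get a feel for the numbers, the division described above can be worked through directly. This is a back-of-the-envelope sketch: smallest_detectable_line is a hypothetical helper, the page width of 612 units is an assumption (a US Letter page is 612 points wide), and the units Camelot works in internally may differ.

```python
def smallest_detectable_line(page_dimension, line_size_scaling=15):
    """Approximate the shortest line length that would be detected.

    Follows the description above: page dimension divided by the
    scaling factor. Units match whatever `page_dimension` is given in.
    """
    if line_size_scaling <= 0:
        raise ValueError("line_size_scaling must be positive")
    return page_dimension / line_size_scaling


# Assumed page width of 612 units (a US Letter page is 612 points wide).
for scaling in (15, 40, 150):
    print(scaling, round(smallest_detectable_line(612, scaling), 2))
```

Under these assumptions the threshold drops from about 40.8 units at the default to about 4.1 units at 150, which is small enough for strokes of text to start qualifying as lines.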

Here’s a PDF where small lines separating the headers don’t get detected with the default value of 15.

A PDF table with short lines

Let’s plot the table for this PDF.

>>> tables = camelot.read_pdf('short_lines.pdf')
>>> camelot.plot(tables[0], kind='grid')
>>> plt.show()
A plot of the PDF table with short lines

Clearly, the smaller lines separating the headers couldn’t be detected. Let’s try with line_size_scaling=40, and plot the table again.

>>> tables = camelot.read_pdf('short_lines.pdf', line_size_scaling=40)
>>> camelot.plot(tables[0], kind='grid')
>>> plt.show()
An improved plot of the PDF table with short lines

Voila! Camelot can now see those lines. Let’s get our table.

>>> tables[0].df
Investigations No. ofHHs Age/Sex/Physiological Group Preva-lence C.I* RelativePrecision Sample sizeper State
Anthropometry 2400 All …        
Clinical Examination            
History of morbidity            
Diet survey 1200 All …        
Blood Pressure # 2400 Men (≥ 18yrs) 10% 95% 20% 1728
    Women (≥ 18 yrs)       1728
Fasting blood glucose 2400 Men (≥ 18 yrs) 5% 95% 20% 1825
    Women (≥ 18 yrs)       1825
Knowledge &Practices on HTN &DM 2400 Men (≥ 18 yrs) 1728
  2400 Women (≥ 18 yrs) 1728

Shift text in spanning cells

By default, the Lattice method shifts text in spanning cells, first to the left and then to the top, as you can observe in the output table above. However, this behavior can be changed using the shift_text keyword argument. Think of it as setting the gravity for a table — it decides the direction in which the text will move and finally come to rest.

shift_text expects a list with one or more characters from the following set: ('', 'l', 'r', 't', 'b'), which are then applied in order. The default, as we discussed above, is ['l', 't'].
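The gravity analogy can be made concrete with a toy model. This is not Camelot's implementation: shift_in_span is a hypothetical function, and the spanning cell and starting position are made up. Each direction in the list is applied in order and pushes the text to the matching edge of the spanning cell.

```python
def shift_in_span(cell, span, shift_text):
    """Move a (row, col) position inside a spanning cell.

    Toy model, not Camelot's implementation. `span` is
    (top_row, left_col, bottom_row, right_col), inclusive.
    """
    row, col = cell
    top, left, bottom, right = span
    for direction in shift_text:
        if direction == "l":
            col = left
        elif direction == "r":
            col = right
        elif direction == "t":
            row = top
        elif direction == "b":
            row = bottom
        elif direction != "":
            raise ValueError(f"unknown direction: {direction!r}")
    return row, col


span = (0, 0, 2, 3)   # a made-up cell spanning 3 rows and 4 columns
start = (1, 1)        # where the text happens to sit inside the span
print(shift_in_span(start, span, ["l", "t"]))  # (0, 0)
print(shift_in_span(start, span, ["r", "b"]))  # (2, 3)
print(shift_in_span(start, span, [""]))        # (1, 1)
```

The empty string leaves the position untouched, which is why it can be used to switch shifting off.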

We’ll use the PDF from the previous example. Let’s pass shift_text=[''], which basically means that the text will experience weightlessness! (It will remain in place.)

A PDF table with short lines
>>> tables = camelot.read_pdf('short_lines.pdf', line_size_scaling=40, shift_text=[''])
>>> tables[0].df
Investigations No. ofHHs Age/Sex/Physiological Group Preva-lence C.I* RelativePrecision Sample sizeper State
Anthropometry            
Clinical Examination 2400   All …      
History of morbidity            
Diet survey 1200   All …      
    Men (≥ 18yrs)       1728
Blood Pressure # 2400 Women (≥ 18 yrs) 10% 95% 20% 1728
    Men (≥ 18 yrs)       1825
Fasting blood glucose 2400 Women (≥ 18 yrs) 5% 95% 20% 1825
Knowledge &Practices on HTN & 2400 Men (≥ 18 yrs) 1728
DM 2400 Women (≥ 18 yrs) 1728

No surprises there — it did remain in place (observe the strings “2400” and “All the available individuals”). Let’s pass shift_text=['r', 'b'] to set the gravity to right-bottom and move the text in that direction.

>>> tables = camelot.read_pdf('short_lines.pdf', line_size_scaling=40, shift_text=['r', 'b'])
>>> tables[0].df
Investigations No. ofHHs Age/Sex/Physiological Group Preva-lence C.I* RelativePrecision Sample sizeper State
Anthropometry            
Clinical Examination            
History of morbidity 2400         All …
Diet survey 1200         All …
    Men (≥ 18yrs)       1728
Blood Pressure # 2400 Women (≥ 18 yrs) 10% 95% 20% 1728
    Men (≥ 18 yrs)       1825
Fasting blood glucose 2400 Women (≥ 18 yrs) 5% 95% 20% 1825
  2400 Men (≥ 18 yrs) 1728
Knowledge &Practices on HTN &DM 2400 Women (≥ 18 yrs) 1728

Copy text in spanning cells

You can copy text in spanning cells when using Lattice, in either the horizontal or vertical direction, or both. This behavior is disabled by default.

copy_text expects a list with one or more characters from the following set: ('v', 'h'), which are then applied in order.

Let’s try it out on this PDF. First, let’s check out the output table to see if we need to use any other configuration parameters.

>>> tables = camelot.read_pdf('copy_text.pdf')
>>> tables[0].df
Sl. No. Name of State/UT Name of District Disease/ Illness No. of Cases No. of Deaths Date of start of outbreak Date of reporting Current Status
1 Kerala Kollam 1. Food Poisoning 19 0 31/12/13 03/01/14 Under control
2 Maharashtra Beed 1. Dengue & Chikungunya i 11 0 03/01/14 04/01/14 Under control
3 Odisha Kalahandi 1. Food Poisoning 42 0 02/01/14 03/01/14 Under control
4 West Bengal West Medinipur 1. Acute Diarrhoeal Disease 145 0 04/01/14 05/01/14 Under control
    Birbhum 1. Food Poisoning 199 0 31/12/13 31/12/13 Under control
    Howrah 1. Viral Hepatitis A &E 85 0 26/12/13 27/12/13 Under surveillance

We don’t need anything else. Now, let’s pass copy_text=['v'] to copy text in the vertical direction. This can save you some time by not having to add this step in your cleaning script!

>>> tables = camelot.read_pdf('copy_text.pdf', copy_text=['v'])
>>> tables[0].df
Sl. No. Name of State/UT Name of District Disease/ Illness No. of Cases No. of Deaths Date of start of outbreak Date of reporting Current Status
1 Kerala Kollam 1. Food Poisoning 19 0 31/12/13 03/01/14 Under control
2 Maharashtra Beed 1. Dengue & Chikungunya i 11 0 03/01/14 04/01/14 Under control
3 Odisha Kalahandi 1. Food Poisoning 42 0 02/01/14 03/01/14 Under control
4 West Bengal West Medinipur 1. Acute Diarrhoeal Disease 145 0 04/01/14 05/01/14 Under control
4 West Bengal Birbhum 1. Food Poisoning 199 0 31/12/13 31/12/13 Under control
4 West Bengal Howrah 1. Viral Hepatitis A &E 85 0 26/12/13 27/12/13 Under surveillance
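For comparison, here is roughly what the manual cleaning step could look like if you had to fill those cells yourself. This is a hand-written sketch in plain Python, not how Camelot implements copy_text: fill_down is a hypothetical helper, and the rows are a three-column excerpt of the table above.

```python
def fill_down(rows, columns):
    """Copy the last non-empty value downward in the given columns.

    Hand-written cleaning step shown for comparison only; it is not
    how Camelot implements copy_text.
    """
    filled = []
    last_seen = {}
    for row in rows:
        new_row = list(row)
        for c in columns:
            if new_row[c].strip():
                # Remember the most recent non-empty value per column.
                last_seen[c] = new_row[c]
            elif c in last_seen:
                new_row[c] = last_seen[c]
        filled.append(new_row)
    return filled


rows = [
    ["4", "West Bengal", "West Medinipur"],
    ["", "", "Birbhum"],
    ["", "", "Howrah"],
]
for row in fill_down(rows, columns=[0, 1]):
    print(row)
```

Note that a blind fill like this cannot tell a spanning cell from a cell that is empty on purpose, which is one reason to let copy_text handle it during extraction.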