Merge lines
Merges lines distributed along a horizontal axis more aggressively than the built-in line merger. This preprocessor solves line-recognition problems caused by poor-quality document scans, handwritten text, and other formatting issues. For example, this preprocessor solves:
- oversplit lines
- lines overlapping on the x-axis
- “jittery” lines misaligned on the y-axis
There are limitations to the combinations of parameter values you can set. For more information, see the Notes section.
Parameters
key | value | description |
---|---|---|
type (required) | mergeLines | merges lines distributed along a horizontal axis. |
directlyAdjacentThreshold (required) | number >= 0.16 | It’s usually best to leave this parameter at its default (0.16). Sensible uses the default setting to transform the separate tokens output by Google OCR into lines. This parameter specifies the fraction of line height under which to merge two adjacent lines distributed along an x-axis without a space. For example, at 0.16, this preprocessor merges two lines separated by a small gap whose width is less than 16% of the line height. Choosing a larger number merges more aggressively. |
adjacentThreshold (required) | number >= 0.6 | Corrects oversplit lines. Specifies the fraction of line height under which to merge two adjacent lines distributed along an x-axis with a space. The built-in merger uses 0.6, so choosing a larger number merges more aggressively. For an example, see the Examples section. |
yOverlapThreshold | number between 0 and 1.0. default: 1.0 | Merges lines that aren’t perfectly aligned at the same height on the page. Specifies the y overlap above which the Merge Lines preprocessor merges two adjacent lines. Y overlap is the section of the joint y-axis range of two lines that’s occupied by both lines. For example, if two lines share the same minimum and maximum y-axis values, their overlap is 1. If one line’s extent is from 0 to 10 and the other line’s extent is from 2 to 12 on the y-axis, their overlap is .667 (8 / 12). For an example, see the Examples section. |
minXGapThreshold | number in inches | Configure this parameter if two lines overlap on an x-axis. The default behavior is to merge these overlapping lines into one line. To split them instead, set a cap on the amount of allowable overlap. For example: 0 splits lines if their boundaries are touching but not overlapping; 0.1 splits lines if their boundaries overlap a little, up to 0.1 inches; 2.0 splits lines even when they overlap a lot, up to 2.0 inches. For an example, see the Examples section. |
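The two gap thresholds in the table can be sketched in Python as follows. This is an illustration of the parameter descriptions only, not Sensible's implementation. The function name `should_merge` and its arguments are hypothetical, and the sketch assumes the gap width and line height are measured in the same units.

```python
def should_merge(gap_width, line_height, threshold):
    """Hypothetical helper: decide whether two horizontally adjacent lines merge.

    Lines merge when the gap between them is narrower than the given
    fraction (threshold) of the line height.
    """
    return gap_width < threshold * line_height


# A gap of 1 unit on a line 10 units tall is 10% of the line height.
print(should_merge(1, 10, 0.16))  # True: under 16% of the line height

# A gap of 7 units is 70% of the line height.
print(should_merge(7, 10, 0.6))   # False: wider than 60% of the line height
print(should_merge(7, 10, 1.0))   # True: a larger threshold merges more aggressively
```

Raising either threshold widens the gap that still counts as "the same line", which is why larger values merge more aggressively.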
Examples
Handwriting OCR
Use the Merge Lines preprocessor to clean up OCRed handwriting text. This preprocessor is useful for Google OCR, which by default groups text into words rather than lines.
PROBLEM
Without the Merge Lines preprocessor, Google OCR oversplits the placeholder handwritten data in the example document.
For example, the phrase Name (First, Middle, Last, Suffix, Trust or Custodian) isn’t one line, but is instead split on words.
SOLUTION
CONFIG
Example document
The following image shows the example document used with this example config:
Example document | Download link |
---|
To run this example, verify that the document type uses Google OCR (click the gear icon for the Document Type and select Google):
OUTPUT
Modify this example to observe the effects of the different parameters on the output. For example:
- set "adjacentThreshold": 0.1 to see oversplit lines.
- set "adjacentThreshold": 2.0 to see aggressively merged lines.
- revert Adjacent Threshold to the original setting, then set "yOverlapThreshold": 0.2 to observe how lines with misaligned heights (like the email address) merge more aggressively.
Oversplit lines
PROBLEM
The following image shows oversplit lines. For example, Sensible splits the phrase “premium driver discount” into three lines even though the human eye perceives it as one phrase:
SOLUTION
The following example uses the Merge Lines preprocessor to fix the oversplit lines and find a discount amount for a specific vehicle.
CONFIG
Example document
The following image shows the example document used with this example config:
OUTPUT
Jittery lines on a y-axis
The following example uses the Y Overlap parameter to correct vertical misalignment or “jitter” in lines (for example, as the result of a low-quality scan or handwriting).
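As a worked illustration of the y overlap calculation described in the Parameters section, the following Python sketch computes the overlap for two line extents. The function name `y_overlap` and the 0.5 threshold are hypothetical, chosen only for illustration; this isn't Sensible's implementation.

```python
def y_overlap(a_min, a_max, b_min, b_max):
    """Hypothetical helper: fraction of the joint y-axis range occupied by both lines."""
    shared = min(a_max, b_max) - max(a_min, b_min)  # extent covered by both lines
    joint = max(a_max, b_max) - min(a_min, b_min)   # extent covered by either line
    if shared <= 0 or joint <= 0:
        return 0.0  # the lines don't overlap vertically
    return shared / joint


# Two lines with identical vertical extents overlap completely.
print(y_overlap(0, 10, 0, 10))  # 1.0

# One line spans 0 to 10 and the other 2 to 12: shared extent 8, joint extent 12.
overlap = y_overlap(0, 10, 2, 12)
print(round(overlap, 3))  # 0.667

# With an assumed threshold of 0.5, these jittery lines qualify for merging.
print(overlap > 0.5)  # True
```

Lowering the threshold lets lines with less vertical overlap merge, which is how the parameter tolerates jitter.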
Config
Example document
The following image shows the example document used with this example config:
Example document | Download link |
---|
Output
Overlapping lines on an x-axis
The following example uses the Min X Gap Threshold parameter to extract overlapping text in a poorly formatted document. In this example, the built-in behavior without a Min X Gap Threshold is to merge the overlapping lines into one line (Supplementary underinsured/uninsured motorist coverage500,000 USD Combined single limit incl. umbl).
The Min X Gap Threshold preserves the intended document formatting, which is a two-column table. By preserving this format, you can consistently use the Row method on the table in this document, as well as in other examples of this table in documents in which the lines don’t overlap.
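The split-or-merge decision for overlapping lines can be sketched in Python as follows. This is one reading of the Min X Gap Threshold description, not Sensible's implementation. The function name `keeps_lines_split` and its arguments are hypothetical, and the sketch assumes x coordinates are in inches and that the left line's right edge reaches or passes the right line's left edge.

```python
def keeps_lines_split(left_x_max, right_x_min, min_x_gap_threshold=None):
    """Hypothetical helper: True if two touching or overlapping lines stay separate."""
    if min_x_gap_threshold is None:
        # Default behavior: overlapping lines merge into one line.
        return False
    overlap = left_x_max - right_x_min  # inches of horizontal overlap (0 = touching)
    # Split only while the overlap stays within the allowed cap.
    return 0 <= overlap <= min_x_gap_threshold


print(keeps_lines_split(3.0, 2.5))       # False: no threshold set, so the lines merge
print(keeps_lines_split(3.0, 3.0, 0))    # True: boundaries touch but don't overlap
print(keeps_lines_split(3.0, 2.5, 0))    # False: 0.5 inches of overlap exceeds a cap of 0
print(keeps_lines_split(3.0, 2.5, 2.0))  # True: 0.5 inches is within the 2.0 inch cap
```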
Config
Example document
The following image shows the example document used with this example config:
Example document | Download link |
---|
Output
Notes
Because the Merge Lines preprocessor evaluates after the built-in line merger, there are limitations to the combinations of parameter values you can set:
yOverlapThreshold
In general, when you set "yOverlapThreshold": 1.0 or leave its value unspecified, set "adjacentThreshold" to 0.6 or higher.
In this situation, "directlyAdjacentThreshold" and "adjacentThreshold" have no effect if both their values are less than 0.6, because the built-in line merger has already merged lines at that threshold. In other words, the following configuration has no effect:
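The rule in this note can be expressed as a small Python check. This is a minimal sketch, not part of Sensible. The function name `thresholds_have_no_effect` is hypothetical, and the sample values 0.16 and 0.3 are illustrative assumptions chosen to fall below 0.6.

```python
BUILT_IN_ADJACENT_THRESHOLD = 0.6  # threshold the built-in line merger uses


def thresholds_have_no_effect(config):
    """Hypothetical helper: True if the gap thresholds change nothing.

    Applies only when yOverlapThreshold is 1.0 or unspecified.
    """
    if config.get("yOverlapThreshold", 1.0) != 1.0:
        return False
    return (
        config["directlyAdjacentThreshold"] < BUILT_IN_ADJACENT_THRESHOLD
        and config["adjacentThreshold"] < BUILT_IN_ADJACENT_THRESHOLD
    )


# Illustrative preprocessor config with both thresholds under 0.6 (assumed values).
no_op_config = {
    "type": "mergeLines",
    "directlyAdjacentThreshold": 0.16,
    "adjacentThreshold": 0.3,
}
print(thresholds_have_no_effect(no_op_config))  # True

# Raising adjacentThreshold to 0.6 or higher makes the preprocessor meaningful.
print(thresholds_have_no_effect({**no_op_config, "adjacentThreshold": 1.2}))  # False
```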