In this tutorial, you’ll learn to extract data from a set of similar documents using SenseML, a layout-based query language. You’ll write JSON to tell Sensible which data to extract from an example document, using what you know about the document’s layout. SenseML uses a mix of techniques, including machine learning, heuristics, and rules, to extract your target information.
You can then save your descriptions as a “config.” Publish your config to automate extracting from similar documents.
Use this tutorial if you want a guided tour of SenseML concepts and the Sensible app. Or see the following links:
sensible_instruct_basics
document type.
Let’s get started with SenseML!
If you can write basic SQL queries, you can write SenseML queries. SenseML shields you from the underlying complexities of PDFs, so you can write queries that are visually and logically clear to a human programmer.
In this tutorial, you’ll:
Download the following document:
| Example document | Download link |
As the following screenshot shows, click the auto_insurance_quote document type you created, click the Reference documents tab, and click Upload document:
In the file upload dialog, choose the generic car insurance quote you downloaded in a previous step.
For this tutorial, you’ll extract these fields:
The following image shows this example in the Sensible app:
You should see the following extracted data in the right pane:
Congratulations! You created your first config and extracted your first document data. If you want to process car insurance quotes generated by a different company, you can create a new config and upload a new reference document.
This guide focuses on layout-based document extraction, which works as follows:
id as the key in the key/value JSON output. For more information, see Field.
This config uses three types of layout-based methods:
Type of method | Explanation | Description |
---|---|---|
Layout | How it works: label method | Grab info immediately proximate to labeling text. |
Layout | How it works: row method | Grab info from a cell in a row. |
Layout | How it works: box method | Grab info from a box. |
This config also uses one natural-language, or AI-powered, method, to demonstrate that you can combine layout-based and natural-language methods in the same config:
Type of method | Explanation | Description |
---|---|---|
Natural-language | How it works: query method | Ask a free-text question about simple information in the document. |
The easiest way to start extracting simple information is to ask a natural-language question.
For example, to extract the bodily injury liability:
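A minimal sketch of such a config, assuming the Query Group method takes a list of queries that each pair an output ID with a free-text description. The query IDs and wording here are illustrative, so check the Query Group reference for the exact parameters:

```json
{
  "fields": [
    {
      "method": {
        "id": "queryGroup",
        "queries": [
          {
            "id": "bodily_injury_premium",
            "description": "bodily injury premium",
            "type": "currency"
          },
          {
            "id": "customer_service_phone",
            "description": "insurer's customer service phone number"
          }
        ]
      }
    }
  ]
}
```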
The config uses the Query Group method to query for the bodily injury premium. You can group together other queries if the answers are located within a page or two of each other in the document. For example, in the group, the config also queries for the insurer's customer service phone number.
This config returns:
Try it out: change one of the questions to "street address for the Anyco insurance company" and see what you get. For easy authoring, try out this method in Sensible’s visual authoring tool.
LLM-based methods such as the Query Group method can run up against limitations with complex document formatting. In such cases, combine LLM-based methods with layout-based methods in the same document extraction configuration.
Let’s look next at several simple layout-based methods.
To extract the policy period from the document:
The config uses the Label method:
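A minimal sketch of such a field, built from the parameters explained below. The field ID `policy_period` matches the name this tutorial uses later for the same field:

```json
{
  "fields": [
    {
      "id": "policy_period",
      "anchor": "policy period",
      "method": {
        "id": "label",
        "position": "right"
      }
    }
  ]
}
```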
This describes the layout of the data to extract relative to the anchor:
- The anchor ("policy period") is text that’s pretty close to the text to extract, so it can serve as a “label” for that text ("id": "label").
- The text to extract is to the right of the label ("position": "right").

This config returns:
You can extract text to the right, left, above, or below a label. For example, how would you use a label to extract the driver’s name? Try it out.
See those gray boxes around the text in the following image?
Each gray box shows the boundaries for a “line.” Sensible recognizes lines using whitespace and other factors, so several “lines” can occupy the same height on the page.
The Label method can operate in a single line, or on consecutive lines. Here’s a question: for the preceding image, can you use the Label method to anchor on “Bodily injury” and return “$25,000 each”? Try it out:
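One way to try this, as a sketch. The field ID is hypothetical, and "position": "right" assumes the dollar amount sits to the right of the anchor text:

```json
{
  "fields": [
    {
      "id": "bodily_injury_limit",
      "anchor": "bodily injury",
      "method": {
        "id": "label",
        "position": "right"
      }
    }
  ]
}
```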
This returns null, because the Label method works for text in the same line or in proximate lines. In this case, the problem is that the gap between the two lines of text is more than 0.2 inches:
Instead, take a look at the purpose-built Row method to extract text in a table.
To extract the comprehensive premium of $150:
The config uses the Row method:
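A minimal sketch of such a field, with parameter values copied from the explanation below. The field ID `comprehensive_premium` is the name the validation tests later in this tutorial refer to. If the values in your document sit on the other side of the anchor, flip `position`:

```json
{
  "fields": [
    {
      "id": "comprehensive_premium",
      "anchor": "comprehensive",
      "type": "currency",
      "method": {
        "id": "row",
        "position": "left",
        "tiebreaker": "second"
      }
    }
  ]
}
```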
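This describes the data to extract:
- The anchor ("comprehensive") is part of a row of lines ("id": "row").
- The value to extract is a currency ("type": "currency"). For other data types you can define, see Field query object.
- The value to extract is the second match in the row ("tiebreaker": "second"). Use tiebreakers to select lines in rows, for example minimum and maximum values (< and >).
- The row’s position relative to the anchor is set with the Position parameter ("position":"left").

This returns: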
But wait! Why didn’t "tiebreaker": "second" select $250, since $250 is the second line after the anchor (the first line is ............)?
The reason is that "tiebreaker": "second" evaluates after the data type specified in the field, "type": "currency". Instead of looking for the second line after the anchor in general, Sensible looks for the second line that contains a currency. Convenient, right?
In the app, you can visually inspect anchors and methods by looking at their color coding:
To continue the Row method example from the previous section, in the following image the orange box shows that “Comprehensive” is the anchor line:
The dotted blue boxes show you that the Row method matches all the lines in the row after the anchor, but then narrows down the actual output to $150 using "tiebreaker": "second".
To extract the policy number from this document:
The config uses the Box method:
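A minimal sketch of such a field, assuming the expanded anchor form nests the match text and match type under a `match` object. The field ID `policy_number` is the name the validation tests later in this tutorial refer to:

```json
{
  "fields": [
    {
      "id": "policy_number",
      "anchor": {
        "match": {
          "text": "policy number",
          "type": "startsWith"
        }
      },
      "method": {
        "id": "box"
      }
    }
  ]
}
```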
This describes the data to extract:
- The text to extract is in a box ("id": "box").
- The anchor is a line in that box that starts with the text policy number ("type": "startsWith"). You can write a simpler string anchor as "anchor":"policy number", or you can expand to complex anchors. For more information, see Anchor object.

This returns:
Note: Sensible extracts the box contents, but not the anchor itself. By default, Sensible returns method results, not anchor results.
You can get more advanced with this auto insurance config. For example:
xRangeFilter parameter in the Document Range method to capture the limits.
"match":"all" anchor coupled with a Passthrough method, or the Regex method.
To check out other methods, see Methods.
Before integrating the config with an application and writing validation tests against it, double check the config by uploading another quote.
Repeat the steps in the previous section to upload a second generic car insurance quote:
| auto_insurance_anyco_2 | Download link |
Click the anyco config, select the “auto_insurance_anyco_2” document, and look at the output. Unlike the first document, the policy period takes up two lines, so Sensible misses the end year (2021):
That seems like sloppy document formatting, but let’s work with it. There are several options for capturing the policy period reliably, including:
Alternative 1: Document Range method
You can use the Document Range method to extract the policy period. This method extracts succeeding lines of text after an anchor. You need to configure some optional parameters, because the Document Range method by default discards anchor lines. Since the date range is part of the anchor line (the line containing "policy period"), you need to specify to:
Try it out by replacing your existing policy_period field with this example:
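A sketch of what that field might look like. `includeAnchor`, `stop`, and `wordFilters` are assumptions about the Document Range method's parameter names, and the `stop` value is a placeholder for whatever line follows the policy period in your document. Confirm all of them against the Document Range reference:

```json
{
  "id": "policy_period",
  "anchor": "policy period",
  "method": {
    "id": "documentRange",
    "includeAnchor": true,
    "wordFilters": ["policy period"],
    "stop": "<text of the first line after the policy period>"
  }
}
```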
Alternative 2: Region method
You can use the Region method to extract the policy period. A region is a rectangular space defined by coordinates relative to the anchor.
Replace the existing policy_period field with the following field in the Sensible app:
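A sketch of a Region-method field. The Start, Height, and Word Filters parameters are the ones named in the explanation below; `offsetX`, `offsetY`, and `width` are assumed parameter names, and every number is a placeholder to tune visually, not a value from the original config:

```json
{
  "id": "policy_period",
  "anchor": "policy period",
  "method": {
    "id": "region",
    "start": "left",
    "offsetX": 0,
    "offsetY": -0.1,
    "width": 4,
    "height": 0.6,
    "wordFilters": ["policy period"]
  }
}
```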
This field defines a region in inches relative to the anchor. Since the region overlaps the anchor, specify a Word Filters parameter to remove the anchor text in the output. See the green box representing the region in the editor? This box dynamically resizes as you adjust the region parameters (such as the Height and Start parameters), so you can visually tweak the region till you’re satisfied.
Let’s double check that this region also works with the first document:
Yes, it works too.
In a production scenario, continue testing documents until you have confidence your configs work with the document type you’ve defined. Then, write tests to validate the extractions in production.
When you’re ready to integrate with your application, enable using the config with the Sensible SDKs or API by taking the following steps:
env=development to test the integration before you go to production.
In a previous section, you tested a couple of documents manually. Now it’s time to scale up and quality-control the extractions by writing tests that run for all API extractions in a doc type.
Use JsonLogic to validate that the extracted information makes sense for the car insurance document:
Test that the property damage liability premium is cheaper than the comprehensive premium:
{"<":[{"var":"property_liability_premium.value"},{"var":"comprehensive_premium.value"}]}
Test that the policy number is a nine-digit number:
{"match":[{"var":"policy_number.value"},"\\d{9}"]}
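As a rough sketch, each test can be thought of as a JsonLogic condition paired with a description and a severity, which is what lets the API response report errors and warnings. The `description`, `severity`, and `condition` keys below are assumed names for illustration, and the severities are arbitrary choices. The app's validation form may label them differently:

```json
[
  {
    "description": "Property damage liability premium is cheaper than the comprehensive premium",
    "severity": "warning",
    "condition": {
      "<": [
        { "var": "property_liability_premium.value" },
        { "var": "comprehensive_premium.value" }
      ]
    }
  },
  {
    "description": "Policy number is a nine-digit number",
    "severity": "error",
    "condition": {
      "match": [{ "var": "policy_number.value" }, "\\d{9}"]
    }
  }
]
```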
To add these tests:
| auto_insurance_anyco_3 | Download link |
You should receive a response with errors and warnings in the Validations array, as shown in the following API response excerpt: