Validating extractions
Control the quality of the data extractions in a document type by writing validations using JsonLogic:
- Test extracted fields using Boolean, logic, numeric, array, string, and other operations.
- If Sensible extracted a field from OCR’d text, test the confidence score for the field’s anchor and value as a measure of the quality of the text images. For example, test that text in a scanned document isn’t blurry or illegible.
Then write your own logic based on the validations, for example:
- pass a document extraction automatically through your pipeline if no error validations fail and no more than 10% of warning validations fail
- flag a document extraction for human review if 5% or more of error validations fail
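The routing rules in the preceding list can be sketched in a few lines of Python. This is a minimal illustration, not Sensible's API: the `ValidationTally` class, its field names, and the `route_extraction` function are hypothetical, and the sketch assumes you have already counted how many error and warning validations ran and how many failed. The source doesn't say what to do when some errors fail but fewer than 5%, so the sketch falls back to human review as a conservative assumption.

```python
from dataclasses import dataclass


@dataclass
class ValidationTally:
    """Counts for one extraction. All names here are hypothetical."""
    errors_total: int
    errors_failed: int
    warnings_total: int
    warnings_failed: int


def failure_rate(failed: int, total: int) -> float:
    """Return the fraction of failed validations, or 0.0 if none ran."""
    return failed / total if total else 0.0


def route_extraction(
    tally: ValidationTally,
    max_warning_rate: float = 0.10,
    review_error_rate: float = 0.05,
) -> str:
    """Decide whether an extraction passes automatically or needs review."""
    error_rate = failure_rate(tally.errors_failed, tally.errors_total)
    warning_rate = failure_rate(tally.warnings_failed, tally.warnings_total)

    # Flag for human review if 5% or more of error validations fail.
    if error_rate >= review_error_rate:
        return "human_review"
    # Pass automatically if no errors fail and warnings stay within budget.
    if tally.errors_failed == 0 and warning_rate <= max_warning_rate:
        return "auto_pass"
    # Not covered by the rules above: assume review is the safer default.
    return "human_review"
```

For example, `route_extraction(ValidationTally(20, 0, 10, 1))` passes the extraction automatically, because no errors fail and exactly 10% of warnings fail.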
Sensible uses validation errors to calculate coverage for an extraction.
Create validations
Sensible app
To create validations in the Sensible app:
- Click the document type.
- Click Create validation.
- Enter the parameters for the validation.
- Click Create.
Parameters
A validation has the following parameters:
id | value | notes |
---|---|---|
description (required) | string | A description of the test |
severity (required) | error, warning, skipped | The severity of the failing test. |
prerequisite fields | array | Use this parameter to generate skipped messages instead of errors when optional extracted fields are null. For example, if a missing broker’s email address doesn’t greatly affect the quality of your extraction, then write a condition to verify that broker.email is properly formatted, but specify ["broker\\.email"] in this parameter to skip the verification if the email is null. For an example, see Validation 3 in the Examples section. Double escape any dots in the field keys (for example, delivery\\.zip\\.code). |
condition (required) | JsonLogic object | Tests extracted fields using Boolean, logic, numeric, array, string, and other operations. Supports all JsonLogic operations and extends them with Sensible operations. For the list of Sensible operations, and for more information about syntax, see the Custom Computation method. |
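To make the parameters concrete, the following sketch builds one validation as a Python dictionary and serializes it to JSON. It is an illustration under stated assumptions rather than the exact Sensible schema: the key names `prerequisite_fields`, `description`, `severity`, and `condition` simply mirror the table above, the `.value` suffix on the variable path is assumed, and the argument order of the `match` operation and the email regular expression are hypothetical. The point it shows is the double escaping: the field key `broker.email` appears as `broker\\.email` in the serialized JSON.

```python
import json

# Hypothetical validation definition; key names mirror the parameter table
# and may differ from the schema Sensible actually accepts.
validation = {
    "description": "Broker's email is in string@string format",
    "severity": "warning",
    # In Python source, "\\." is one backslash plus a dot. Serialized to
    # JSON, it becomes the double-escaped form broker\\.email.
    "prerequisite_fields": ["broker\\.email"],
    # Assumed shape for the Sensible match operation: [string, regex].
    "condition": {
        "match": [{"var": "broker\\.email.value"}, "^\\S+@\\S+$"]
    },
}

serialized = json.dumps(validation, indent=2)
print(serialized)
```

Parsing the serialized text back with `json.loads` returns the same dictionary, so the escaping round-trips without loss.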
Examples
Say that you have a document type for scanned sales quotes, called “sales_quotes”, with configs for
- company_A
- company_B
- company_C
You test sales quote extractions from all the companies with the following validations:
Validation 1
- Description: If OCR’d, the source text for the quoted rate value is a high-quality, unblurred image.
- Severity: warning
- Condition:
Notes: Since some sales quotes for company_A are scanned documents, check if the field came from OCR’d text. If it was OCR’d (confidence score is not null), then test that it has a high OCR confidence score for both the anchor text and the extracted value text. This validation requires that you set a high verbosity setting in the SenseML configuration.
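The logic these notes describe can be sketched in Python as follows. This is not the JsonLogic condition itself, only an equivalent check under assumptions: the keys `anchor_confidence` and `value_confidence`, the function name `ocr_quality_ok`, and the 0.9 threshold are all hypothetical, and the real names depend on the verbose extraction output.

```python
from typing import Optional


def ocr_quality_ok(field: dict, min_confidence: float = 0.9) -> bool:
    """Return True if the field wasn't OCR'd or if its OCR scores are high.

    The dictionary keys and the default threshold are hypothetical.
    """
    anchor: Optional[float] = field.get("anchor_confidence")
    value: Optional[float] = field.get("value_confidence")

    # A null confidence score means the text didn't come from OCR,
    # so there is no image quality to test.
    scores = [score for score in (anchor, value) if score is not None]
    if not scores:
        return True

    # If OCR'd, every available score must meet the threshold.
    return all(score >= min_confidence for score in scores)
```

With these assumptions, a field with scores of 0.98 and 0.95 passes, a field with a value score of 0.42 fails, and a field with no scores passes because it wasn't OCR’d.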
Validation 2
- Description: The quoted rate value isn’t null
- Severity: error
- Condition:
Notes: Tests that the zip_code is a 5-digit number if the country field equals USA, or 6 alphanumeric characters if the country field equals Canada. Uses a Sensible operation (match) to test regular expressions.
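The zip code rule in these notes can be expressed with regular expressions. The following Python sketch mirrors that rule outside of JsonLogic; the function name `zip_code_is_valid`, the `ZIP_PATTERNS` table, and the choice to pass countries without a rule are assumptions made for illustration.

```python
import re

# One pattern per country, following the rule in the notes:
# 5 digits for USA, 6 alphanumeric characters for Canada.
ZIP_PATTERNS = {
    "USA": r"[0-9]{5}",
    "Canada": r"[A-Za-z0-9]{6}",
}


def zip_code_is_valid(country: str, zip_code: str) -> bool:
    """Check a zip code against the pattern for its country."""
    pattern = ZIP_PATTERNS.get(country)
    if pattern is None:
        # Assumption: with no rule for the country, there is nothing to fail.
        return True
    # fullmatch requires the whole string to match, so extra digits fail.
    return re.fullmatch(pattern, zip_code) is not None
```

Because `re.fullmatch` anchors the pattern at both ends, a 17-digit value for a USA address fails the check, while `12345` passes.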
Validations output
For an example of the output of the preceding validations, see the following extraction excerpt and validation output:
Extraction excerpt
Validations output
For the preceding extraction excerpt, Sensible outputs the following validations:
- Validation 4: Sensible skips the broker email validation because the prerequisite field broker.email is null.
- Validation 5: Fails because zip_code is 17 digits.