OCR
When you extract document data with Sensible, it automatically performs optical character recognition (OCR) on the document for you, except in advanced cases. If the document doesn’t require OCR, Sensible extracts the embedded text directly from the document to optimize performance.
For advanced cases, you can configure how Sensible OCRs documents using the following parameters:
| Option | Configurable for | Notes |
|---|---|---|
| OCR Level parameter | document types | Use this option to configure the criteria by which Sensible determines if a whole document requires OCR. |
| OCR preprocessor | configs | Use this option to OCR specified pages or page ranges in a document. |
| OCR Engine parameter | document types | Use this option to choose your OCR provider, for example, Amazon, Google, or Microsoft. |
For an overview of how Sensible handles OCR, see the following steps:
- Sensible converts supported Microsoft Office file types into PDFs.
- Sensible transforms the bytes of the document into raw text, and determines whether the document needs OCR:
  - If the file type is an image (for example, PNG), Sensible runs OCR for the whole document, using the OCR provider specified by the document type’s OCR Engine parameter.
  - (Configurable) If the file is a PDF, Sensible processes the file as specified by the document type’s OCR Level and OCR Engine parameters. For more information, see the preceding table.
- (Configurable) After additional intervening steps, Sensible applies your configured preprocessors, including the OCR preprocessor. This preprocessor runs for documents that don’t trigger whole-document OCR in a previous step.
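The steps above can be sketched as a small decision function. This is an illustrative model only, not Sensible's implementation or API: the names `Document`, `plan_ocr`, `pdf_needs_ocr`, and the file-type sets are all hypothetical, and `pdf_needs_ocr` stands in for whatever criteria the OCR Level parameter applies.

```python
from dataclasses import dataclass

# Hypothetical file-type sets, for illustration only.
OFFICE_TYPES = {"doc", "docx"}
IMAGE_TYPES = {"png", "jpeg", "tiff"}


@dataclass
class Document:
    file_type: str
    # Stand-in for the OCR Level criteria: True if the PDF as a whole
    # is judged to require OCR.
    pdf_needs_ocr: bool = False


def plan_ocr(doc, ocr_preprocessor_pages=()):
    """Return the ordered processing steps for a document (a sketch)."""
    steps = []
    file_type = doc.file_type.lower()

    # Step 1: supported Office files are converted into PDFs.
    if file_type in OFFICE_TYPES:
        steps.append("convert to PDF")
        file_type = "pdf"

    # Step 2: decide whether the whole document needs OCR.
    if file_type in IMAGE_TYPES:
        whole_document_ocr = True  # images always get whole-document OCR
    elif file_type == "pdf":
        whole_document_ocr = doc.pdf_needs_ocr  # governed by OCR Level
    else:
        raise ValueError(f"unsupported file type: {doc.file_type}")

    if whole_document_ocr:
        steps.append("OCR whole document")
        return steps

    steps.append("extract embedded text")

    # Step 3: the OCR preprocessor runs only when whole-document OCR
    # was not triggered in the previous step.
    if ocr_preprocessor_pages:
        steps.append(f"OCR pages {sorted(ocr_preprocessor_pages)}")
    return steps
```

For example, `plan_ocr(Document("pdf"), ocr_preprocessor_pages=[2, 0])` returns `["extract embedded text", "OCR pages [0, 2]"]`, while the same call with `pdf_needs_ocr=True` returns only `["OCR whole document"]`, because the page-level preprocessor is skipped once the whole document has been OCR'd.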
Notes
- For more information about OCR versus embedded text extraction, see Solving direct text extraction from PDFs.
- For information about extracting data from non-text images, such as photographs, charts, or illustrations, see the Query Group method’s Multimodal Engine parameter. You can use the Multimodal Engine parameter as an alternative to OCR to extract from poor-quality text images, such as handwriting.