Introduction

If you’re trying to convert a document, such as a PDF, into an Excel file, you often find tools that copy the document’s visual layout into a spreadsheet, with no meaningful relationship between the extracted text and the underlying cells.

In contrast, Sensible converts document tables, checkboxes, paragraphs, and even complex repeating section layouts into meaningfully labeled column/row pairs and linked sheets. For a list of document file types that Sensible can extract data from, see Supported file types.

This topic describes the rules Sensible uses to:

  • convert data extracted from one document, such as a PDF, into an Excel spreadsheet or CSV file.
  • combine data extracted from multiple documents into one spreadsheet

Sensible extracts the data as JSON using a SenseML configuration, then applies the rules to create spreadsheet data such as tables, labeled columns, and linked sheets.

Prerequisites

To get a document’s data into a spreadsheet, you must first:

  • Configure extractions for a document type, either by authoring an extraction configuration using Sensible, or by using Sensible’s open-source configuration library for common document types.
  • Run an extraction on a target document that’s in your configured document type

CSV files

To generate CSV files, Sensible uses the same rules described in the following sections, except that Sensible uses blank rows instead of tabbed sheets to separate extraction content.

Multi-document spreadsheet

Sensible combines multiple document extractions into one spreadsheet when:

To combine extractions, Sensible uses the same rules described in the following single-document spreadsheet rules. Sensible uses the following additional rules:

  • For all sheets other than the <fields> sheet, Sensible creates a new sheet for each document, and prefixes each sheet name with a zero-indexed document number, for example doc_0 or doc_1.
  • For the <fields>sheet, Sensible appends data for each document in a new row. For each field ID, Sensible appends data under the same column if it finds a matching field ID from another extraction in your call to the Generate Excel method or endpoint, or under a new column if it doesn’t find an matching field ID.
  • For portfolio extractions, Sensible adds metadata to the <fields> sheet, for example, the page range of the document in the portfolio.

Tips

  • To avoid manually merging the similarly named columns, Sensible recommends using the same IDs for corresponding fields across different SenseML configs in a document type. For example, if document A outputs fields car_model and car_year, and document B outputs car_model and manufacture_year, Sensible creates a spreadsheet like the following:

In the preceding example, Sensible recommends using either car_year or manufacture_year, not both.

  • To configure multi-document spreadsheet output using different rules than those described here, configure rules with Sensible’s Zapier integration. You can map single-value field output with this integration, but not multi-value field output.

Single-document spreadsheet

Overview

Sensible transforms JSON data extracted from a single document to a spreadsheet using the following rules:

  • the fields sheet lists each piece of document data that can be represented as a single-cell value. For example, an extracted total monthly mortgage dollar amount.
  • <field_id> sheets hold more complex pieces of document data. For example, an extracted table.
  • <field_id>.<index> sheets hold complex repeating document data. For example, an extracted claims loss run.

For more information, see the following sections.

fields sheet

The fields sheet lists fields and their values as key-value dictionaries, with the field ID as the column heading and value as the row. Sensible outputs a field to this sheet if the field:

  • outputs a single value. For example, the Box, Label, Row and Region methods each output a single value by default.
  • outputs a predictably short array of values that can easily be stringified in a single cell. Typically this is the result of a Type configuration, for example, the Name type always outputs an array.

Example

Sensible converts the JSON extraction output from the auto_insurance_anyco example described in the Getting started with layout-based extractions to the following spreadsheet:

Spreadsheet output

Example document

The preceding spreadsheet contains data from the following example document:

Example documentDownload link

Example configuration

See the Getting started with layout-based extractions for the SenseML configuration for this example.

JSON output

The following JSON document extraction output is the source for this spreadsheet:

JSON
{
  "policy_period": {
    "type": "string",
    "value": "April 14, 2021 - Oct 14, 2021"
  },
  "comprehensive_premium": {
    "source": "$150",
    "value": 150,
    "unit": "$",
    "type": "currency"
  },
  "property_liability_premium": {
    "source": "$10",
    "value": 10,
    "unit": "$",
    "type": "currency"
  },
  "policy_number": {
    "value": "123456789",
    "type": "string"
  }
}

<field_id> sheets

Each <field_id> sheet lists the output of a single field. Sensible outputs a field to this sheet if the field outputs multiple values. For example:

  • the table methods and other methods that output nested JSON objects.
  • methods that output arrays of unpredictable length, for example, fields with "match":"all" configured.

Example

Sensible converts the example JSON output from the example document described in the Fixed Table method to the following spreadsheet:

The preceding example shows that the fields sheet lists the corresponding sheets for fields that have complex output. In this case, you must click on the agile_risks_table_updates_monthly sheet to view the table output.

Example document

The preceding spreadsheet contains data from the following example document:

Example documentDownload link

Example configuration

See Fixed Table method for the SenseML configuration for this example.

JSON output

The following JSON extraction output is the source for this spreadsheet:

JSON
{
  "agile_risks_table": {
    "columns": [
      {
        "id": "col1_risk_description",
        "values": [
          {
            "value": "Poor task point estimation",
            "type": "string"
          },
          {
            "value": "Poor epic scope definition",
            "type": "string"
          },
          {
            "value": "Inadequate scrum master training",
            "type": "string"
          }
        ]
      },
      {
        "id": "col4_rank_last_month",
        "values": [
          {
            "source": "2",
            "value": 2,
            "type": "number"
          },
          {
            "source": "1",
            "value": 1,
            "type": "number"
          },
          {
            "source": "3",
            "value": 3,
            "type": "number"
          }
        ]
      }
    ]
  }
}

<field_id>.<index> sheets

Each <field_id>.<index> sheet lists the output of a single field that contains complex repeating data, for example, Sections output.

Example

Sensible converts the example JSON output from the example document described in the Advanced Sections nested table example topic to the following spreadsheet:

The preceding example shows that Sensible outputs nested sections in linked, indexed sheets.

Example document

The preceding spreadsheet contains data from the following example document:

Example documentDownload link

Example configuration

See the Advanced sections nested table example topic for the SenseML configuration for this example.

JSON output

The following JSON extraction output is the source for this spreadsheet:

JSON
{
  "table_columns": [
    {
      "employee_category": {
        "type": "string",
        "value": "Employees paid \u0000100k"
      },
      "employee_benefit": {
        "value": "100% of salary, max $100k",
        "type": "string"
      },
      "reduction_subtable": {
        "columns": [
          {
            "id": "col1_age",
            "values": [
              {
                "source": "65",
                "value": 65,
                "type": "number"
              },
              {
                "source": "70",
                "value": 70,
                "type": "number"
              },
              {
                "source": "75",
                "value": 75,
                "type": "number"
              }
            ]
          },
          {
            "id": "col2_reduction",
            "values": [
              {
                "source": "35%",
                "value": 35,
                "type": "percentage"
              },
              {
                "source": "60%",
                "value": 60,
                "type": "percentage"
              },
              {
                "source": "75%",
                "value": 75,
                "type": "percentage"
              }
            ]
          }
        ]
      },
      "everything_in_this_vertical_section": {
        "type": "string",
        "value": "Employees paid \u0000100k  Notes Employee benefit 100% of salary, max $100k  After a 3 month waiting period Common carrier Not included  Benefit reduction Age Reduction   Not adjusted for 65 35%   inflation 70 60%   75 75%   For more details about coverage and benefits, see the following sections."
      }
    },
    {
      "employee_category": {
        "type": "string",
        "value": "All other employees"
      },
      "employee_benefit": {
        "value": "50% of salary, max $50k",
        "type": "string"
      },
      "reduction_subtable": {
        "columns": [
          {
            "id": "col1_age",
            "values": [
              {
                "source": "65",
                "value": 65,
                "type": "number"
              },
              {
                "source": "70",
                "value": 70,
                "type": "number"
              },
              {
                "source": "75",
                "value": 75,
                "type": "number"
              }
            ]
          },
          {
            "id": "col2_reduction",
            "values": [
              {
                "source": "35%",
                "value": 35,
                "type": "percentage"
              },
              {
                "source": "60%",
                "value": 60,
                "type": "percentage"
              },
              {
                "source": "75%",
                "value": 75,
                "type": "percentage"
              }
            ]
          }
        ]
      },
      "everything_in_this_vertical_section": {
        "type": "string",
        "value": "All other employees Notes Employee benefit  50% of salary, max $50k After a 3 month waiting period Common carrier  Not included Benefit reduction   Age Reduction Not adjusted for   65 35% inflation   70 60%   75 75% For more details about coverage and benefits, see the following sections."
      }
    }
  ]
}

Next