Extract structured data from email

Order confirmations, invoices, shipping updates, and job applications all arrive as prose written for humans, not as clean JSON. Pulling the order number, total, and ship date out of a thousand differently-formatted messages is where rule-based parsers fall apart: every merchant phrases it differently, and a regex that works for one breaks on the next.

A language model handles that variety well. This recipe fetches the full message, hands it to a model with a schema describing the fields you want, and gets back typed data you can write straight to a database. It also covers pulling fields out of PDF and image attachments.

How do you extract structured data from an email with AI?

You extract structured data in three steps: fetch the full message body with the Messages API, pass it to a language model constrained to a JSON schema, then validate the typed result before storing it. The schema forces the model to return the same fields every time, whether the source is an order email or an invoice.

One fetch path covers all 6 providers. Unlike a regex parser, the model reads meaning rather than matching patterns, so it survives layout changes and new senders. Your job is to constrain its output and check it, not to anticipate every format.

Fetch the full message body, not the snippet

Extraction needs the complete text. When you already have a message ID, from a webhook or a prior list call, fetch it directly with GET /v3/grants/{grant_id}/messages/{message_id}. Read the full body field rather than the snippet, which is only the first 100 characters and drops the order total that sits halfway down the email. The response also returns standard fields like sender, recipients, subject, and date; pass fields=include_headers if your extraction also needs the raw headers.

The function below fetches one message and strips the HTML to plain text before extraction. Models read plain text more reliably than raw HTML, and stripping tags also cuts the token count by roughly 40% on a typical marketing-styled email. Keep the sender and subject, since they often carry the merchant name the body omits.

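import os
import requests
from bs4 import BeautifulSoup

NYLAS = "https://api.us.nylas.com"
HEADERS = {"Authorization": f"Bearer {os.environ['NYLAS_API_KEY']}"}

def fetch_text(grant_id, message_id):
    """Fetch one message and return (subject, plain-text body)."""
    r = requests.get(
        f"{NYLAS}/v3/grants/{grant_id}/messages/{message_id}",
        headers=HEADERS,
        timeout=30,  # fail fast instead of hanging on a stalled connection
    )
    r.raise_for_status()
    msg = r.json()["data"]
    # The body is HTML; collapse it to whitespace-separated plain text.
    text = BeautifulSoup(msg["body"], "html.parser").get_text(" ", strip=True)
    # The sender is in msg["from"] if you want to pass it along as well.
    return msg.get("subject", ""), text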
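
If you would rather not add a parsing dependency, the same stripping step can be sketched with the standard library alone. The version below is an illustration, not part of the recipe's own code: the names `_TextOnly` and `html_to_text` are hypothetical, and it makes the extra assumption that the contents of `<script>` and `<style>` tags should be dropped.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, separated by single spaces."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling `html_to_text(msg["body"])` in place of the BeautifulSoup line should give comparable text for typical mail, though the two parsers may differ on malformed markup.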

Define the fields you want as a JSON schema

A schema is what turns a chatty model into a parser. Describe each field, its type, and whether it’s required, then pass that schema to the model so the response is constrained to that shape. Mark a field nullable when it may be absent, because forcing a value is what makes a model invent one. A focused schema of 5 to 10 fields extracts more accurately than a sprawling one.

The schema below targets an order confirmation. Typing total as a number rather than a string means the model returns 49.99, not "$49.99", so you skip a parsing step downstream. Setting additionalProperties: false blocks the model from padding the object with fields you didn’t ask for.

ORDER_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["order_number", "total", "currency"],
    "properties": {
        "order_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
        "ship_date": {"type": ["string", "null"], "format": "date"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
}

Extract the fields with the model

With text and schema in hand, the extraction is one model call that returns JSON matching the schema. Use a temperature of 0 so the same email always yields the same fields, and keep the prompt short because the schema, not the prose, does the constraining. Extracting from one order email takes about 2 seconds and costs a fraction of a cent with GPT-4o-mini.

The call below pins the model to the order schema and returns a Python dict. Passing the schema through response_format is stronger than asking for JSON in the prompt: with "strict": True set on the json_schema object, the model is decoded against the schema rather than trusted to follow instructions. Strict mode requires every property to be listed in required, so express optional fields as nullable types instead of leaving them out. Always wrap the parse in a try/except, because a truncated response is still possible on very long messages.

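import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract(subject, text, schema):
    """Return the extracted fields as a dict, or raise ValueError."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        # Adding "strict": True to json_schema enforces the schema during
        # decoding; it requires every property to be listed in "required".
        response_format={"type": "json_schema",
                         "json_schema": {"name": "order", "schema": schema}},
        messages=[
            {"role": "system", "content": "Extract the fields. Use null when absent."},
            {"role": "user", "content": f"Subject: {subject}\n\n{text}"},
        ],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except (TypeError, json.JSONDecodeError) as exc:
        # A truncated or empty reply is not valid JSON.
        raise ValueError("model reply was not valid JSON") from exc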
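
One way to make the truncation case explicit is to check the reply's finish_reason before parsing, since a value of "length" means the output stopped at the token limit. The helper below is a sketch under that assumption; `parse_reply` and `ExtractionError` are hypothetical names introduced here for illustration.

```python
import json


class ExtractionError(ValueError):
    """Raised when the model's reply cannot be used as structured data."""


def parse_reply(content, finish_reason):
    """Turn a raw model reply into a dict, or raise ExtractionError."""
    # "length" means the reply hit the token limit and is likely cut off.
    if finish_reason == "length":
        raise ExtractionError("response truncated at the token limit")
    if not content:
        raise ExtractionError("empty response")
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ExtractionError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ExtractionError("expected a JSON object")
    return data
```

With the OpenAI client, the two arguments would come from `resp.choices[0].message.content` and `resp.choices[0].finish_reason`. Catching `ExtractionError` at the call site lets you retry with a larger token limit or route the message to review.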

Extract data from PDF and image attachments

Plenty of the data you want lives in an attached invoice, not the email body. Each message lists its attachments with an id, a content_type, and a size, and you download the bytes with GET /v3/grants/{grant_id}/attachments/{attachment_id}/download?message_id={message_id}. The download returns the raw file as binary, so a 2 MB scanned invoice comes back ready to hand to a model.

Route the file by type. For a PDF with a text layer, extract the text and send it through the same extraction call as the body; send an image directly to a vision-capable model, which reads a scanned invoice the same way it reads body text. The download endpoint takes the attachment_id in the path and the message_id as a query parameter. For the inline-versus-download mechanics, see Download email attachments.

def download_attachment(grant_id, message_id, attachment_id):
    """Download one attachment and return its raw bytes."""
    r = requests.get(
        f"{NYLAS}/v3/grants/{grant_id}/attachments/{attachment_id}/download",
        headers=HEADERS,
        params={"message_id": message_id},
        timeout=60,  # attachments can be large; allow more time than a message fetch
    )
    r.raise_for_status()
    return r.content  # bytes: PDF or image, hand to a vision/OCR model
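
The routing step can be sketched as two small helpers: one that maps an attachment's content_type to a handling path, and one that wraps image bytes in a chat message for a vision-capable model. Both function names are hypothetical, and the list of accepted image types is an assumption you should check against your model's documentation. The message shape follows the OpenAI Chat Completions image input format, which accepts a base64 data URL; PDF text extraction is left to whichever PDF library you already use.

```python
import base64

# Assumed set of image types the vision model accepts.
IMAGE_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}


def route_attachment(content_type):
    """Return 'pdf', 'image', or 'skip' for an attachment's MIME type."""
    # Drop parameters such as '; name=invoice.pdf' and normalize case.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime == "application/pdf":
        return "pdf"
    if mime in IMAGE_TYPES:
        return "image"
    return "skip"


def image_message(data, content_type, prompt="Extract the invoice fields."):
    """Build a user message that carries the image as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{content_type};base64,{encoded}"}},
        ],
    }
```

The dict returned by `image_message` can take the place of the user message in the extraction call, with the same schema passed through response_format.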

Validate every extracted field

A model is a probabilistic parser, so treat its output as untrusted input. Check types, ranges, and formats before the data touches your database: an order_number that doesn’t match your known pattern, or a total of 0, is a signal the extraction missed. Rejecting on a failed check beats storing a confident-looking wrong value, which is far harder to catch later.

Where a field has a strict format, validate it deterministically. Dates, currency codes, and email addresses have well-defined shapes, so a regex or a date parser confirms the model’s answer for free. Roughly 2% to 5% of extractions on messy real-world mail need a human, so flag low-confidence results for review rather than dropping them silently.
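
The checks described above might look like the sketch below for the order schema. The order-number pattern is a made-up placeholder, so substitute the format your merchants actually use; `validate_order` is likewise a hypothetical name. The function returns a list of failed fields, so an empty list means the record passed.

```python
import re
from datetime import date

# Placeholder pattern: 4 to 32 uppercase letters, digits, or hyphens.
ORDER_RE = re.compile(r"[A-Z0-9][A-Z0-9-]{3,31}")
# Shape of an ISO 4217 code; this does not check the code is assigned.
CURRENCY_RE = re.compile(r"[A-Z]{3}")


def validate_order(data):
    """Return the names of fields that failed validation (empty if all pass)."""
    problems = []

    order_number = data.get("order_number")
    if not isinstance(order_number, str) or not ORDER_RE.fullmatch(order_number):
        problems.append("order_number")

    total = data.get("total")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total <= 0:
        problems.append("total")

    currency = data.get("currency")
    if not isinstance(currency, str) or not CURRENCY_RE.fullmatch(currency):
        problems.append("currency")

    ship_date = data.get("ship_date")
    if ship_date is not None:  # null is allowed; a malformed date is not
        try:
            date.fromisoformat(ship_date)
        except (TypeError, ValueError):
            problems.append("ship_date")

    return problems
```

A non-empty result is a natural trigger for the review queue: store the raw extraction alongside the list of failed fields so a reviewer can see what the model returned.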

Things to know about AI extraction

Models hallucinate most when a required field is missing, so make optional fields nullable and tell the model to use null. Marking even 1 field nullable removes the most common class of fabricated value. For fields with an exact pattern, like an email address or an ISO date, a deterministic extractor is cheaper and more reliable than a model, so reserve the model for the free-form fields a regex can’t reach.
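
For the exact-pattern fields, a deterministic pass could be as small as the sketch below. The function names are hypothetical, and the email pattern is deliberately simple rather than a full address grammar. Candidate dates are confirmed with a real date parser so that a string with the right shape but an impossible value is discarded.

```python
import re
from datetime import date

ISO_DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
# Simplified address pattern; good enough for typical mail, not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_iso_dates(text):
    """Return every valid YYYY-MM-DD date in the text as a date object."""
    found = []
    for raw in ISO_DATE_RE.findall(text):
        try:
            found.append(date.fromisoformat(raw))
        except ValueError:
            pass  # right shape, impossible date (for example month 13)
    return found


def find_emails(text):
    """Return every email-like string in the text."""
    return EMAIL_RE.findall(text)
```

Running these before the model call also gives you values to cross-check against the model's answer for the same fields.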

Privacy applies the same way it does to any inbox integration. Invoices and applications carry personal data, so decide whether full bodies and attachments may leave your infrastructure or need a local model. The trust-boundary and local-model options are covered in Connect an LLM to a user’s inbox.

What’s next

Connect an LLM to a user’s inbox for the fetch-and-act foundation
Summarize email threads with AI for multi-message conversations
Download email attachments for the attachment download paths
Read a single message or thread for the full message fetch
Messages API reference for every message and attachment field