# Extract structured data from email

Source: https://developer.nylas.com/docs/cookbook/ai/extract-data-from-email/

Order confirmations, invoices, shipping updates, and job applications all arrive as prose written for humans, not as clean JSON. Pulling the order number, total, and ship date out of a thousand differently-formatted messages is where rule-based parsers fall apart: every merchant phrases it differently, and a regex that works for one breaks on the next.

A language model handles that variety well. This recipe fetches the full message, hands it to a model with a schema describing the fields you want, and gets back typed data you can write straight to a database. It also covers pulling fields out of PDF and image attachments.

## How do you extract structured data from an email with AI?

You extract structured data in three steps: fetch the full message body with the [Messages API](/docs/reference/api/messages/), pass it to a language model constrained to a JSON schema, then validate the typed result before storing it. The schema forces the model to return the same fields every time, whether the source is an order email or an invoice.

One fetch path covers all 6 providers. Unlike a regex parser, the model reads meaning rather than matching patterns, so it survives layout changes and new senders. Your job is to constrain its output and check it, not to anticipate every format.

## Fetch the full message body, not the snippet

Extraction needs the complete text. When you already have a message ID, from a webhook or a prior list call, fetch it directly with `GET /v3/grants/{grant_id}/messages/{message_id}`. Read the full `body` field rather than the `snippet`, which is only the first 100 characters and drops the order total that sits halfway down the email. The response also returns standard fields like sender, recipients, subject, and date; pass `fields=include_headers` if your extraction also needs the raw headers.

The function below fetches one message and strips the HTML to plain text before extraction. Models read plain text more reliably than raw HTML, and stripping tags also cuts the token count by roughly 40% on a typical marketing-styled email. Keep the sender and subject, since they often carry the merchant name the body omits.

```python
import os

import requests
from bs4 import BeautifulSoup

NYLAS = "https://api.us.nylas.com"
HEADERS = {"Authorization": f"Bearer {os.environ['NYLAS_API_KEY']}"}

def fetch_text(grant_id, message_id):
    """Fetch one message and return (subject, plain-text body)."""
    r = requests.get(
        f"{NYLAS}/v3/grants/{grant_id}/messages/{message_id}",
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    msg = r.json()["data"]
    # Strip tags: the model reads plain text more reliably than raw HTML.
    text = BeautifulSoup(msg["body"], "html.parser").get_text(" ", strip=True)
    return msg["subject"], text
```

## Define the fields you want as a JSON schema

A schema is what turns a chatty model into a parser. Describe each field, its type, and whether it's required, then pass that schema to the model so the response is guaranteed to match. Mark a field nullable when it may be absent, because forcing a value is what makes a model invent one. A focused schema of 5 to 10 fields extracts more accurately than a sprawling one.

The schema below targets an order confirmation. Typing `total` as a number rather than a string means the model returns `49.99`, not `"$49.99"`, so you skip a parsing step downstream. Setting `additionalProperties: false` blocks the model from padding the object with fields you didn't ask for.

```python
ORDER_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["order_number", "total", "currency"],
    "properties": {
        "order_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
        "ship_date": {"type": ["string", "null"], "format": "date"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
}
```
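If you enable OpenAI's strict mode on the schema (an assumption; the recipe's own call does not set it), the API expects every property to appear in `required`, with optional fields expressed as nullable types. The helper below is a minimal sketch of that conversion. `to_strict_schema` is a hypothetical name, and it only handles a flat object schema like the one above.

```python
import copy

def to_strict_schema(schema):
    """Return a copy of a flat object schema with every property required.

    Properties that were optional become nullable, so the model can still
    answer null instead of inventing a value.
    """
    strict = copy.deepcopy(schema)  # leave the caller's schema untouched
    properties = strict.get("properties", {})
    already_required = set(strict.get("required", []))
    for name, prop in properties.items():
        if name in already_required:
            continue
        types = prop.get("type")
        if types is None:
            continue
        types = [types] if isinstance(types, str) else list(types)
        if "null" not in types:
            types.append("null")
        prop["type"] = types
    strict["required"] = list(properties)
    strict["additionalProperties"] = False
    return strict

# Hypothetical miniature schema, used only to show the transformation.
example = {
    "type": "object",
    "required": ["order_number"],
    "properties": {
        "order_number": {"type": "string"},
        "items": {"type": "array", "items": {"type": "string"}},
    },
}
strict_example = to_strict_schema(example)
```

The converted schema would then be passed with `"strict": True` next to `name` and `schema` in the `json_schema` object.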

## Extract the fields with the model

With text and schema in hand, the extraction is one model call that returns JSON matching the schema. Use a temperature of 0 so the same email always yields the same fields, and keep the prompt short because the schema, not the prose, does the constraining. Extracting from one order email takes about 2 seconds and costs a fraction of a cent with GPT-4o-mini.

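The call below pins the model to the order schema and returns a Python dict. Passing the schema through `response_format` is stronger than asking for JSON in the prompt, since the output is constrained by the schema rather than left to the model following instructions. In OpenAI's API that constraint is only enforced when the `json_schema` object sets `strict: true`, which in turn expects every property to be listed in `required`; without it, treat schema adherence as best effort. Always wrap the parse in a try/except, because a truncated response is still possible on very long messages.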

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract(subject, text, schema):
    """Run one schema-constrained extraction and return the parsed dict."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_schema",
                         "json_schema": {"name": "order", "schema": schema}},
        messages=[
            {"role": "system", "content": "Extract the fields. Use null when absent."},
            {"role": "user", "content": f"Subject: {subject}\n\n{text}"},
        ],
    )
    # The content is a JSON string; json.loads raises if it was cut off.
    return json.loads(resp.choices[0].message.content)
```
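The try/except mentioned above can be sketched as a small wrapper around the parse. `parse_extraction` is a hypothetical helper name. It assumes you pass it the message content and the `finish_reason` from the first choice of the response, where a value of `"length"` means the output hit the token limit.

```python
import json

def parse_extraction(raw, finish_reason="stop"):
    """Return the parsed dict, or None when the response is unusable."""
    if finish_reason == "length":
        # The output was cut off at the token limit, so the JSON is incomplete.
        return None
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or no content at all (raw is None).
        return None
    # The schema describes an object, so anything else is a failed extraction.
    return data if isinstance(data, dict) else None
```

Returning `None` instead of raising lets the caller route the message to a retry or a review queue rather than crash a batch job.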

## Extract data from PDF and image attachments

Plenty of the data you want lives in an attached invoice, not the email body. Each message lists its attachments with an `id`, a `content_type`, and a `size`, and you download the bytes with `GET /v3/grants/{grant_id}/attachments/{attachment_id}/download?message_id={message_id}`. The download returns the raw file as binary, so a 2 MB scanned invoice comes back ready to hand to a model.

Route the file by type. For a PDF with a text layer, extract the text and run it through the same schema-constrained call as the body; for an image, send the bytes directly to a vision-capable model, which reads a scanned invoice the same way it reads body text. The download endpoint takes the `attachment_id` in the path and the `message_id` as a query parameter. For the inline-versus-download mechanics, see [Download email attachments](/docs/cookbook/email/attachments/download-attachments/).

```python
def download_attachment(grant_id, message_id, attachment_id):
    r = requests.get(
        f"{NYLAS}/v3/grants/{grant_id}/attachments/{attachment_id}/download",
        headers=HEADERS,
        params={"message_id": message_id},
    )
    r.raise_for_status()
    return r.content  # bytes: PDF or image, hand to a vision/OCR model
```
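Routing by type can be sketched from the attachment's `content_type`. The helper names and the set of image types below are assumptions for illustration. The image part follows the `image_url` content format that OpenAI's Chat Completions API accepts with a base64 data URL.

```python
import base64

# Assumed set of image types to send to a vision-capable model.
VISION_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}

def route_attachment(content_type):
    """Classify an attachment as 'pdf', 'image', or 'skip'."""
    # Drop parameters such as '; name="invoice.pdf"' and normalize case.
    main_type = content_type.split(";")[0].strip().lower()
    if main_type == "application/pdf":
        return "pdf"
    if main_type in VISION_TYPES:
        return "image"
    return "skip"

def image_message_part(image_bytes, content_type):
    """Wrap raw image bytes as a data-URL content part for a vision model."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{content_type};base64,{encoded}"},
    }
```

The returned part goes in the `content` list of a user message, next to a text part that carries the extraction instruction.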

## Validate every extracted field

A model is a probabilistic parser, so treat its output as untrusted input. Check types, ranges, and formats before the data touches your database: an `order_number` that doesn't match your known pattern, or a `total` of 0, is a signal the extraction missed. Rejecting on a failed check beats storing a confident-looking wrong value, which is far harder to catch later.

Where a field has a strict format, validate it deterministically. Dates, currency codes, and email addresses have well-defined shapes, so a regex or a date parser confirms the model's answer for free. Roughly 2% to 5% of extractions on messy real-world mail need a human, so flag low-confidence results for review rather than dropping them silently.
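The checks described above can be sketched as one function that returns the names of the fields that failed. The order-number pattern here is hypothetical, so swap in the format your merchants actually use. The currency check only confirms the three-letter shape, not that the code exists.

```python
import re
from datetime import date

ORDER_NUMBER_RE = re.compile(r"[A-Z0-9-]{6,20}")  # hypothetical pattern
CURRENCY_RE = re.compile(r"[A-Z]{3}")             # shape of an ISO 4217 code

def validate_order(data):
    """Return a list of field names that failed validation (empty if clean)."""
    problems = []

    order_number = data.get("order_number")
    if not isinstance(order_number, str) or not ORDER_NUMBER_RE.fullmatch(order_number):
        problems.append("order_number")

    total = data.get("total")
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total <= 0:
        problems.append("total")

    currency = data.get("currency")
    if not isinstance(currency, str) or not CURRENCY_RE.fullmatch(currency):
        problems.append("currency")

    ship_date = data.get("ship_date")
    if ship_date is not None:  # null is allowed by the schema
        try:
            date.fromisoformat(ship_date)
        except (TypeError, ValueError):
            problems.append("ship_date")

    return problems
```

A non-empty result is the signal to send the message to review instead of storing the row.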

## Things to know about AI extraction

Models hallucinate most when a required field has no value in the source, so make optional fields nullable and tell the model to use null. Marking even one such field nullable removes the most common class of fabricated value. For fields with an exact pattern, like an email address or an ISO date, a deterministic extractor is cheaper and more reliable than a model, so reserve the model for the free-form fields a regex can't reach.
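As a rough sketch of the deterministic route, the two helpers below pull ISO dates and email addresses straight from the text with no model call. The function names are hypothetical, and the email pattern is a deliberately simple approximation rather than a full address grammar.

```python
import re
from datetime import date

ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_iso_dates(text):
    """Return every YYYY-MM-DD string in the text that is a real calendar date."""
    found = []
    for match in ISO_DATE_RE.finditer(text):
        try:
            found.append(date.fromisoformat(match.group(0)))
        except ValueError:
            # Right shape, impossible date (for example month 13): skip it.
            continue
    return found

def find_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)
```

Fields found this way can be filled in before the model call, which leaves the schema smaller and the prompt cheaper.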

Privacy applies the same way it does to any inbox integration. Invoices and applications carry personal data, so decide whether full bodies and attachments may leave your infrastructure or need a local model. The trust-boundary and local-model options are covered in [Connect an LLM to a user's inbox](/docs/cookbook/ai/connect-llm-to-inbox/).

## What's next

- [Connect an LLM to a user's inbox](/docs/cookbook/ai/connect-llm-to-inbox/) for the fetch-and-act foundation
- [Summarize email threads with AI](/docs/cookbook/ai/summarize-email-threads/) for multi-message conversations
- [Download email attachments](/docs/cookbook/email/attachments/download-attachments/) for the attachment download paths
- [Read a single message or thread](/docs/cookbook/email/get-message-thread/) for the full message fetch
- [Messages API reference](/docs/reference/api/messages/) for every message and attachment field