# Parse email signatures for contact enrichment

Source: https://developer.nylas.com/docs/cookbook/agents/signature-enrichment/

Email signatures are structured data masquerading as prose. Roughly 82% of business email contains a signature with at least name and title. Most platforms ignore it; with a few hundred lines of regex you can extract titles, phone numbers, LinkedIn URLs, websites, and company affiliations and ship them straight into your CRM.

This recipe makes the case for regex over an LLM (deterministic and free) and shows the cross-referencing trick that lifts accuracy from "decent" to "production-usable."

## Why regex, not an LLM

For unstructured prose, an LLM wins. Signatures aren't unstructured; they're *predictably* structured. They're typically 3–6 lines, separated from the body by `-- ` (two hyphens followed by a space, on a line of their own, per RFC 3676), with field types drawn from a small set: name, title, company, phone, email, URL, social handle. A regex catches >95% of well-formed signatures, runs in microseconds, and costs nothing per message.

The case for an LLM fallback exists, but only as the last 5%. Skip it for the first version.

## Detect the signature boundary

```python
import re

# Patterns are tried in order; the first one that matches marks the boundary.
SIG_DELIMITERS = [
    r"\n--\s*\n",                # RFC 3676 standard
    r"\nSent from my (iPhone|iPad|Android)",
    r"\nGet Outlook for iOS",
    r"\nThanks?,?\s*\n",
    r"\nBest,?\s*\n",
    r"\nRegards,?\s*\n",
    r"\nCheers,?\s*\n",
]

def split_signature(body: str) -> tuple[str, str]:
    for pat in SIG_DELIMITERS:
        m = re.search(pat, body)
        if m:
            return body[:m.start()], body[m.end():]
    return body, ""
```

You'll miss inline signatures with no delimiter — but they're a small minority and the cross-referencing step (below) backfills the gaps.
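As a quick illustration of the boundary detection, the sketch below applies only the RFC 3676 pattern to a made-up message. `SAMPLE_BODY` and its contents are hypothetical and exist only for this example.

```python
import re

# Hypothetical message body, invented for illustration.
SAMPLE_BODY = (
    "Hi Sam,\n\n"
    "Happy to chat Thursday.\n\n"
    "-- \n"
    "Jane Doe\n"
    "VP Engineering, Acme Corp\n"
    "+1 (415) 555-0132\n"
)

# Same pattern as the first entry in SIG_DELIMITERS.
RFC_DELIMITER = re.compile(r"\n--\s*\n")

m = RFC_DELIMITER.search(SAMPLE_BODY)
# Everything before the delimiter is the body; everything after is the signature.
body, signature = SAMPLE_BODY[:m.start()], SAMPLE_BODY[m.end():]

print(repr(signature))
```

The delimiter line itself is dropped from both halves, so the signature starts directly at the name line.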

## Extract the fields

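```python
def _first(pattern: str, text: str) -> str | None:
    """Return the first match as a string, or None if the pattern is absent."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

def extract(sig: str) -> dict:
    return {
        "phone": _first(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", sig),
        "linkedin": _first(r"linkedin\.com/in/[\w-]+", sig),
        # Skip LinkedIn URLs here; they are captured by the field above.
        "website": _first(r"https?://(?!(?:[\w-]+\.)*linkedin\.com)[\w./-]+", sig),
        "title": extract_title(sig),
        "company": extract_company(sig),
    }
```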

`extract_title` and `extract_company` deserve their own functions because they need a keyword vocabulary:

```python
TITLE_KEYWORDS = {
    "C-suite":  ["CEO", "CTO", "CFO", "COO", "CIO", "CMO"],
    "VP":       ["VP", "Vice President"],
    "Director": ["Director", "Head of"],
    "Manager":  ["Manager", "Lead"],
    "IC":       ["Engineer", "Designer", "Analyst", "Specialist"],
}

def extract_title(sig: str) -> dict | None:
    for tier, keywords in TITLE_KEYWORDS.items():
        for kw in keywords:
            # The trailing \b keeps "Lead" from matching inside "Leadership".
            m = re.search(rf"\b({re.escape(kw)}\b[^\n,]*)", sig, re.IGNORECASE)
            if m:
                return {"raw": m.group(1).strip(), "tier": tier}
    return None
```

The tier classification is what makes this useful for sales outreach — you want "C-suite" as a separate signal from "raw title text".
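The recipe does not show `extract_company`, so here is one possible sketch. It assumes the company line ends in a legal-entity suffix; the `COMPANY_SUFFIXES` vocabulary and the regex are illustrative assumptions, not part of the original recipe, and will miss companies that omit a suffix.

```python
import re
from typing import Optional

# Assumed vocabulary of legal-entity suffixes; extend it for your own inbox.
COMPANY_SUFFIXES = ["Inc", "LLC", "Ltd", "Corp", "Corporation", "GmbH"]

_SUFFIX_ALT = "|".join(re.escape(s) for s in COMPANY_SUFFIXES)
# One or more capitalized words on a single line, followed by a known suffix.
_COMPANY_RE = re.compile(
    rf"\b([A-Z][\w&'-]*(?:[ \t]+[A-Z][\w&'-]*)*,?[ \t]+(?:{_SUFFIX_ALT}))\b\.?"
)

def extract_company(sig: str) -> Optional[str]:
    """Return the first capitalized phrase that ends in a known suffix."""
    m = _COMPANY_RE.search(sig)
    return m.group(1).strip() if m else None

print(extract_company("Jane Doe\nVP Engineering, Acme Widgets Inc.\n+1 (415) 555-0132"))
```

When no suffix is present, a reasonable fallback (also an assumption) is to derive the company from the sender's email domain.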

## Cross-reference for accuracy

Single emails give incomplete signatures. The "Sent from my iPhone" reply has nothing. The thank-you note has just a name. The mid-thread message has the full block.

Pull the last N messages from the same sender, extract the signature from each, and merge:

```python
def enrich(sender_email: str, n: int = 3) -> dict:
    messages = list_messages_from(sender_email, limit=n)
    signatures = [split_signature(m["body"])[1] for m in messages]
    fields = [extract(s) for s in signatures]
    return merge_fields(fields)   # take the most complete value for each key
```

The lift is large: per the original analysis, single-message extraction nets ~67% completeness across the five fields; three-message cross-reference hits ~91%.
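`merge_fields` is not defined in the recipe. A minimal sketch follows, assuming "most complete" means the longest non-empty value per key and that titles arrive as the `{"raw": ..., "tier": ...}` dict returned by `extract_title`. The `_completeness` helper is hypothetical.

```python
from typing import Any

def _completeness(value: Any) -> int:
    """Score a candidate value; longer text is treated as more complete."""
    if value is None:
        return 0
    if isinstance(value, dict):  # e.g. the {"raw": ..., "tier": ...} title
        return len(str(value.get("raw", "")))
    return len(str(value))

def merge_fields(fields: list[dict]) -> dict:
    """Keep, for each key, the highest-scoring value seen across messages."""
    merged: dict[str, Any] = {}
    for record in fields:
        for key, value in record.items():
            if _completeness(value) > _completeness(merged.get(key)):
                merged[key] = value
            else:
                merged.setdefault(key, None)  # keep the key even if empty
    return merged

sample = [
    {"phone": None, "title": {"raw": "VP", "tier": "VP"}, "company": None},
    {"phone": "415-555-0132", "title": {"raw": "VP Engineering", "tier": "VP"}, "company": None},
    {"phone": None, "title": None, "company": "Acme Corp"},
]
print(merge_fields(sample))
```

On ties the earlier message wins. A majority vote per key would be a sensible alternative if senders change jobs often.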

`list_messages_from` is straightforward via the CLI:

```python
import json
import subprocess

def list_messages_from(email: str, limit: int = 3) -> list[dict]:
    out = subprocess.run(
        ["nylas", "email", "search", f"from:{email}", "--limit", str(limit), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)
```

## Bonus: DNS-derived intelligence

The sender's email domain reveals more than the signature does:

- **MX records** — Google Workspace vs. Microsoft 365 vs. self-hosted (sales-relevant signal)
- **SPF records** — what tools the company integrates (SendGrid, Salesforce, Mailgun)
- **DMARC** — email-security maturity (sometimes a buying signal in security tooling)

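```python
import dns.resolver

def _lookup(name: str, rtype: str) -> list:
    """Return the answer records, or [] when the name or record type is missing."""
    try:
        return list(dns.resolver.resolve(name, rtype))
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return []

def domain_intel(domain: str) -> dict:
    return {
        "mx":    [r.exchange.to_text() for r in _lookup(domain, "MX")],
        "spf":   [r.to_text() for r in _lookup(domain, "TXT") if "v=spf1" in r.to_text()],
        "dmarc": [r.to_text() for r in _lookup(f"_dmarc.{domain}", "TXT")],
    }
```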

These three queries enrich every contact for free without touching the email body.
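To turn a raw SPF string into the "what tools the company integrates" signal, one option is to map `include:` targets to vendor names. The sketch below assumes the `SPF_VENDORS` table, which lists commonly published include hosts; treat it as a hypothetical starting point and verify each entry against the vendor's own documentation.

```python
# Assumed mapping from SPF include targets to vendors; verify before relying on it.
SPF_VENDORS = {
    "sendgrid.net": "SendGrid",
    "mailgun.org": "Mailgun",
    "_spf.salesforce.com": "Salesforce",
    "_spf.google.com": "Google Workspace",
    "spf.protection.outlook.com": "Microsoft 365",
}

def vendors_from_spf(spf_record: str) -> list[str]:
    """List known vendors referenced by include: mechanisms, in record order."""
    # TXT records may arrive quoted and split into several quoted chunks.
    text = spf_record.replace('" "', "").strip('"')
    found: list[str] = []
    for token in text.split():
        mechanism = token.lower().lstrip("+-~?")  # drop an optional qualifier
        if not mechanism.startswith("include:"):
            continue
        target = mechanism.split(":", 1)[1].rstrip(".")
        for host, vendor in SPF_VENDORS.items():
            if (target == host or target.endswith("." + host)) and vendor not in found:
                found.append(vendor)
    return found

print(vendors_from_spf('"v=spf1 include:_spf.google.com include:sendgrid.net ~all"'))
```

Unknown include targets are ignored rather than guessed at, which keeps false vendor attributions out of the CRM.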

## Things to know

- **GDPR / privacy.** The data is in the email; you have it because the sender sent it. But surfacing inferred attributes (job tier, sales-readiness) into a CRM is a different processing context. Document it in your privacy notice.
- **International phone formats.** The regex above is North America-leaning. Add a pattern for E.164 (`\+[1-9]\d{1,14}`, matched after stripping spaces and punctuation) and country-specific shapes if your inbox has international correspondents.
- **LinkedIn deprecated `/pub/` URLs.** Match `/in/` only — the `/pub/` shape was retired years ago.
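For the international phone point above, a minimal sketch of E.164 handling: strip the usual separators, then validate the compact form. The `to_e164` helper is hypothetical. It only accepts numbers already written with a leading `+` and country code, and does not try to infer a country for national-format numbers.

```python
import re
from typing import Optional

# E.164: a "+", a non-zero leading digit, and at most 15 digits in total.
E164_RE = re.compile(r"^\+[1-9]\d{1,14}$")

def to_e164(raw: str) -> Optional[str]:
    """Strip common separators and return the number if it is valid E.164."""
    compact = re.sub(r"[\s().-]", "", raw)
    return compact if E164_RE.match(compact) else None

print(to_e164("+44 20 7946 0958"))
```

Storing the compact form also makes it easier to deduplicate the same number written in different styles across messages.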

## Next steps

- [Map communication patterns between organizations](/docs/cookbook/agents/communication-patterns/)
- [Email triage agent](/docs/cookbook/agents/email-triage-agent/)
- [Email recipes (API)](/docs/cookbook/email/messages/list-messages-google/)