Parse email signatures for contact enrichment

Email signatures are structured data masquerading as prose. Roughly 82% of business email contains a signature with at least name and title. Most platforms ignore it; with a few hundred lines of regex you can extract titles, phone numbers, LinkedIn URLs, websites, and company affiliations and ship them straight into your CRM.

This recipe argues for regex over LLM (deterministic + free) and shows the cross-referencing trick that lifts accuracy from “decent” to “production-usable”.

Why regex, not an LLM

For unstructured prose, an LLM wins. Signatures aren’t unstructured; they’re predictably structured. They’re typically 3–6 lines, conventionally separated from the body by a line containing only “-- ” (two hyphens and a trailing space, per RFC 3676), with field types drawn from a small set: name, title, company, phone, email, URL, social handle. A regex catches >95% of well-formed signatures, runs in microseconds, and costs nothing per message.

The case for an LLM fallback exists, but only as the last 5%. Skip it for the first version.

Detect the signature boundary

import re

SIG_DELIMITERS = [
    r"\n--\s*\n",                # RFC 3676 standard
    r"\nSent from my (iPhone|iPad|Android)",
    r"\nGet Outlook for iOS",
    r"\nThanks?,?\s*\n",
    r"\nBest,?\s*\n",
    r"\nRegards,?\s*\n",
    r"\nCheers,?\s*\n",
]

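def split_signature(body: str) -> tuple[str, str]:
    """Return (message_text, signature); signature is "" if no delimiter is found."""
    # Patterns are tried in list order, so the RFC delimiter takes priority over sign-offs.
    for pat in SIG_DELIMITERS:
        m = re.search(pat, body)
        if m:
            # Everything after the delimiter is treated as the signature block.
            return body[:m.start()], body[m.end():]
    return body, ""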

You’ll miss inline signatures with no delimiter — but they’re a small minority and the cross-referencing step (below) backfills the gaps.
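
One caveat with trying patterns in list order: a “-- ” delimiter inside forwarded or quoted text further down the body can win over the sender’s own sign-off. A minimal sketch of an alternative that splits at whichever delimiter occurs earliest; split_signature_earliest and the condensed DELIMS list are illustrative names, not part of the recipe above:

```python
import re

# Illustrative, condensed version of the delimiter list, compiled once.
DELIMS = [re.compile(p) for p in (
    r"\n--\s*\n",                                   # RFC 3676 separator
    r"\nSent from my (?:iPhone|iPad|Android)",      # mobile client footers
    r"\n(?:Thanks?|Best|Regards|Cheers),?\s*\n",    # common sign-offs
)]

def split_signature_earliest(body: str) -> tuple[str, str]:
    """Split at the delimiter that appears first in the body."""
    matches = [m for m in (p.search(body) for p in DELIMS) if m]
    if not matches:
        return body, ""
    first = min(matches, key=lambda m: m.start())
    return body[:first.start()], body[first.end():]
```

The remainder can still contain quoted text from earlier messages, so trimming at reply markers would be a separate step.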

Extract the fields

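def _first(pattern: str, text: str) -> str | None:
    # re.search returns a Match object; store the matched text instead.
    m = re.search(pattern, text)
    return m.group(0) if m else None

def extract(sig: str) -> dict:
    return {
        "phone": _first(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", sig),
        "linkedin": _first(r"linkedin\.com/in/[\w-]+", sig),
        # Skip only URLs whose host is linkedin.com; other URLs on the same line still count.
        "website": _first(r"https?://(?!(?:[\w-]+\.)*linkedin\.com)[\w./-]+", sig),
        "title": extract_title(sig),
        "company": extract_company(sig),
    }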

extract_title and extract_company deserve their own functions because they need a keyword vocabulary:

TITLE_KEYWORDS = {
    "C-suite":  ["CEO", "CTO", "CFO", "COO", "CIO", "CMO"],
    "VP":       ["VP", "Vice President"],
    "Director": ["Director", "Head of"],
    "Manager":  ["Manager", "Lead"],
    "IC":       ["Engineer", "Designer", "Analyst", "Specialist"],
}

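def extract_title(sig: str) -> dict | None:
    # Tiers are checked in dict order, so "VP of Engineering" is tagged VP, not IC.
    for tier, keywords in TITLE_KEYWORDS.items():
        for kw in keywords:
            # The trailing \b stops "COO" matching "Cooper" and "Lead" matching "Leading".
            m = re.search(rf"\b({re.escape(kw)}\b[^\n,]*)", sig, re.IGNORECASE)
            if m:
                return {"raw": m.group(1).strip(), "tier": tier}
    return None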

The tier classification is what makes this useful for sales outreach — you want “C-suite” as a separate signal from “raw title text”.

Cross-reference for accuracy

Single emails give incomplete signatures. The “Sent from my iPhone” reply has nothing. The thank-you note has just a name. The mid-thread message has the full block.

Pull the last N messages from the same sender, extract the signature from each, and merge:

def enrich(sender_email: str, n: int = 3) -> dict:
    messages = list_messages_from(sender_email, limit=n)
    signatures = [split_signature(m["body"])[1] for m in messages]
    fields = [extract(s) for s in signatures]
    return merge_fields(fields)   # take the most complete value for each key

The lift is large: per the original analysis, single-message extraction nets ~67% completeness across the five fields; three-message cross-reference hits ~91%.
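
merge_fields is not shown above. A minimal sketch of “take the most complete value for each key”, assuming each extracted value is a string, a dict with a "raw" entry (as extract_title returns), or None, and treating longer text as more complete; the scoring rule is an assumption:

```python
from typing import Any

def _completeness(value: Any) -> int:
    """Score a candidate: missing values score 0, longer text scores higher."""
    if not value:
        return 0
    if isinstance(value, dict):  # e.g. {"raw": "CTO", "tier": "C-suite"}
        return len(str(value.get("raw", "")))
    return len(str(value))

def merge_fields(fields: list[dict]) -> dict:
    """For each key, keep the highest-scoring value seen across messages."""
    merged: dict = {}
    for record in fields:
        for key, value in record.items():
            # The first record seeds every key; later ones must be strictly better.
            if key not in merged or _completeness(value) > _completeness(merged[key]):
                merged[key] = value
    return merged
```

Length is a crude proxy. Majority voting across messages would be an alternative when values conflict rather than merely differ in detail.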

list_messages_from is straightforward via the CLI:

import json
import subprocess

def list_messages_from(email: str, limit: int = 3) -> list[dict]:
    # Shell out to the CLI and parse its JSON output; check=True raises on a non-zero exit.
    out = subprocess.run(
        ["nylas", "email", "search", f"from:{email}", "--limit", str(limit), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

Bonus: DNS-derived intelligence

The sender’s email domain reveals more than the signature does:

MX records — Google Workspace vs. Microsoft 365 vs. self-hosted (sales-relevant signal)
SPF records — what tools the company integrates (SendGrid, Salesforce, Mailgun)
DMARC — email-security maturity (sometimes a buying signal in security tooling)

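import dns.exception
import dns.resolver

def _lookup(name: str, rtype: str) -> list:
    # Missing records (NXDOMAIN, no answer, timeout) are common; treat them as empty.
    try:
        return list(dns.resolver.resolve(name, rtype))
    except dns.exception.DNSException:
        return []

def domain_intel(domain: str) -> dict:
    return {
        "mx":    [r.exchange.to_text() for r in _lookup(domain, "MX")],
        "spf":   [r.to_text() for r in _lookup(domain, "TXT") if "v=spf1" in r.to_text()],
        "dmarc": [r.to_text() for r in _lookup(f"_dmarc.{domain}", "TXT")],
    }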

These three queries enrich every contact for free without touching the email body.
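
To turn the raw records into the signals listed above, a small classifier is enough. The sketch below assumes MX hosts and SPF strings shaped like the output of domain_intel; the suffix tables are illustrative and far from exhaustive, and classify_mx and spf_tools are hypothetical helper names:

```python
# Illustrative lookup tables: hostname suffix -> label.
MX_PROVIDERS = {
    "google.com": "Google Workspace",
    "googlemail.com": "Google Workspace",
    "outlook.com": "Microsoft 365",
}
SPF_TOOLS = {
    "sendgrid.net": "SendGrid",
    "mailgun.org": "Mailgun",
    "salesforce.com": "Salesforce",
}

def _matches(host: str, suffix: str) -> bool:
    """True if host equals the suffix or is a subdomain of it."""
    return host == suffix or host.endswith("." + suffix)

def classify_mx(mx_hosts: list[str]) -> str:
    """Label the mail provider from MX hostnames."""
    for host in mx_hosts:
        name = host.rstrip(".").lower()  # MX names usually carry a trailing dot
        for suffix, provider in MX_PROVIDERS.items():
            if _matches(name, suffix):
                return provider
    return "other/self-hosted" if mx_hosts else "no MX"

def spf_tools(spf_records: list[str]) -> list[str]:
    """List known tools named in SPF include: mechanisms, in order seen."""
    found: list[str] = []
    for record in spf_records:
        for token in record.strip('"').split():
            if not token.startswith("include:"):
                continue
            target = token[len("include:"):].lower()
            for suffix, tool in SPF_TOOLS.items():
                if _matches(target, suffix) and tool not in found:
                    found.append(tool)
    return found
```

Unknown MX hosts fall into a single “other/self-hosted” bucket; hosted providers missing from the table will land there too.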

Things to know

GDPR / privacy. The data is in the email; you have it because the sender sent it. But surfacing inferred attributes (job tier, sales-readiness) into a CRM is a different processing context. Document it in your privacy notice.
International phone formats. The regex above is North America-leaning. Add patterns for E.164 (\+\d{6,15}) and country-specific shapes if your inbox has international correspondents.
LinkedIn deprecated /pub/ URLs. Match /in/ only — the /pub/ shape was retired years ago.