Email signatures are structured data masquerading as prose. Roughly 82% of business email contains a signature with at least name and title. Most platforms ignore it; with a few hundred lines of regex you can extract titles, phone numbers, LinkedIn URLs, websites, and company affiliations and ship them straight into your CRM.
This recipe argues for regex over LLM (deterministic + free) and shows the cross-referencing trick that lifts accuracy from “decent” to “production-usable”.
Why regex, not an LLM
For unstructured prose, an LLM wins. Signatures aren’t unstructured; they’re predictably structured. They’re typically 3–6 lines, conventionally separated from the body by a “-- ” line (dash, dash, space; the convention documented in RFC 3676), with field types drawn from a small set: name, title, company, phone, email, URL, social handle. A regex catches >95% of well-formed signatures, runs in microseconds, and costs nothing per message.
The case for an LLM fallback exists, but only as the last 5%. Skip it for the first version.
Detect the signature boundary
    import re

    # Patterns are tried in priority order; the first one that matches wins.
    SIG_DELIMITERS = [
        r"\n--\s*\n",  # RFC 3676 standard
        r"\nSent from my (iPhone|iPad|Android)",
        r"\nGet Outlook for iOS",
        r"\nThanks?,?\s*\n",
        r"\nBest,?\s*\n",
        r"\nRegards,?\s*\n",
        r"\nCheers,?\s*\n",
    ]

    def split_signature(body: str) -> tuple[str, str]:
        """Return (message_text, signature); signature is "" when no delimiter is found."""
        for pat in SIG_DELIMITERS:
            m = re.search(pat, body)
            if m:
                return body[:m.start()], body[m.end():]
        return body, ""

You’ll miss inline signatures with no delimiter, but they’re a small minority and the cross-referencing step (below) backfills the gaps.
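For the inline signatures that carry no delimiter, one possible fallback is a trailing-block heuristic. The sketch below is an assumption layered on top of the recipe, not part of it: `guess_inline_signature`, `CONTACT_HINT`, and the `max_lines` cutoff are hypothetical names and values. It treats the last paragraph of the message as a signature only when one of its lines contains a phone number, URL, or email address.

```python
import re

# Hypothetical heuristic: a line "looks like" signature content when it
# carries a phone number, a URL, or an email address.
CONTACT_HINT = re.compile(
    r"(\+?\d[\d\s().-]{7,}\d)"             # phone-like run of digits
    r"|(https?://\S+)"                     # URL
    r"|(\b[\w.+-]+@[\w-]+\.[\w.-]+\b)"     # email address
)

def guess_inline_signature(body: str, max_lines: int = 6) -> tuple[str, str]:
    """Split off the trailing paragraph (up to max_lines) if it has contact data."""
    lines = body.rstrip().split("\n")
    start = len(lines)
    # Walk backwards over the trailing run of non-blank lines.
    while start > 0 and len(lines) - start < max_lines and lines[start - 1].strip():
        start -= 1
    tail = lines[start:]
    if any(CONTACT_HINT.search(line) for line in tail):
        return "\n".join(lines[:start]).rstrip(), "\n".join(tail)
    return body, ""
```

It is deliberately conservative: a closing paragraph with no contact data is left in the body, so a bare sign-off is never mistaken for a signature block.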
Extract the fields
    def _first(pattern: str, text: str) -> str | None:
        # Return the matched text, not the re.Match object, so values are CRM-ready.
        m = re.search(pattern, text)
        return m.group(0) if m else None

    def extract(sig: str) -> dict:
        return {
            "phone": _first(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", sig),
            "linkedin": _first(r"linkedin\.com/in/[\w-]+", sig),
            # Skip LinkedIn URLs here; they are captured by the field above.
            "website": _first(r"https?://(?!\S*linkedin\.com)[\w./-]+", sig),
            "title": extract_title(sig),
            "company": extract_company(sig),
        }

extract_title and extract_company deserve their own functions because they need a keyword vocabulary:
    # Tiers are checked top to bottom, so the most senior match wins.
    TITLE_KEYWORDS = {
        "C-suite": ["CEO", "CTO", "CFO", "COO", "CIO", "CMO"],
        "VP": ["VP", "Vice President"],
        "Director": ["Director", "Head of"],
        "Manager": ["Manager", "Lead"],
        "IC": ["Engineer", "Designer", "Analyst", "Specialist"],
    }

    def extract_title(sig: str) -> dict | None:
        for tier, keywords in TITLE_KEYWORDS.items():
            for kw in keywords:
                # The trailing \b keeps "COO" from matching "Cooper" and "Lead" from matching "Leadership".
                m = re.search(rf"\b({re.escape(kw)}\b[^\n,]*)", sig, re.IGNORECASE)
                if m:
                    return {"raw": m.group(1).strip(), "tier": tier}
        return None

The tier classification is what makes this useful for sales outreach: you want “C-suite” as a separate signal from “raw title text”.
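extract_company is referenced above but not shown. A minimal sketch of what it might look like follows, under two assumptions that are not from the original recipe: the company appears either after “at” or “@” on the title line, or on its own line ending in a legal suffix. The suffix list in `SUFFIX_RE` is illustrative and should be extended for your market.

```python
import re
from typing import Optional

# Assumed vocabulary of legal suffixes; illustrative, not exhaustive.
SUFFIX_RE = re.compile(r"\b(?:Inc|LLC|Ltd|GmbH|Corp|PLC)\.?$")

def extract_company(sig: str) -> Optional[str]:
    """Best-effort company name from a signature block, or None."""
    for line in sig.splitlines():
        line = line.strip()
        # Shape 1: "VP of Sales at Acme Widgets" or "VP of Sales @ Acme Widgets".
        m = re.search(r"\s(?:at|@)\s+(\S.*)$", line)
        if m and "@" not in m.group(1):
            return m.group(1).strip()
        # Shape 2: a line that ends in a legal suffix, e.g. "Initech, Inc."
        if SUFFIX_RE.search(line):
            return line
    return None
```

Requiring whitespace on both sides of “@” keeps email addresses from being read as “Title @ Company”.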
Cross-reference for accuracy
Single emails give incomplete signatures. The “Sent from my iPhone” reply has nothing. The thank-you note has just a name. The mid-thread message has the full block.
Pull the last N messages from the same sender, extract the signature from each, and merge:
    def enrich(sender_email: str, n: int = 3) -> dict:
        messages = list_messages_from(sender_email, limit=n)
        signatures = [split_signature(m["body"])[1] for m in messages]
        fields = [extract(s) for s in signatures]
        return merge_fields(fields)  # take the most complete value for each key

The lift is large: per the original analysis, single-message extraction nets ~67% completeness across the five fields; three-message cross-reference hits ~91%.
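merge_fields is not defined in the recipe. One way to read “most complete value” is “longest non-empty value”, and the sketch below takes that reading as an assumption. It also assumes each extracted value is a string, a dict with a `raw` key (the title shape), or None; the `_completeness` helper is hypothetical.

```python
from typing import Any

def _completeness(value: Any) -> int:
    """Score a field value; None scores 0, longer text scores higher."""
    if value is None:
        return 0
    if isinstance(value, dict):
        # Title dicts are scored by the length of their raw text.
        return len(str(value.get("raw", "")))
    return len(str(value))

def merge_fields(fields: list[dict]) -> dict:
    """Merge per-message extractions, keeping the highest-scoring value per key."""
    merged: dict = {}
    for record in fields:
        for key, value in record.items():
            merged.setdefault(key, None)  # keep the key even if every value is None
            if _completeness(value) > _completeness(merged[key]):
                merged[key] = value
    return merged
```

Length is a crude proxy. If a sender changes jobs between messages, preferring the most recent non-empty value may be the better rule.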
list_messages_from is straightforward via the CLI:
    import json
    import subprocess

    def list_messages_from(email: str, limit: int = 3) -> list[dict]:
        # check=True raises CalledProcessError if the CLI exits non-zero
        out = subprocess.run(
            ["nylas", "email", "search", f"from:{email}", "--limit", str(limit), "--json"],
            capture_output=True,
            text=True,
            check=True,
        )
        return json.loads(out.stdout)

Bonus: DNS-derived intelligence
The sender’s email domain reveals more than the signature does:
- MX records — Google Workspace vs. Microsoft 365 vs. self-hosted (sales-relevant signal)
- SPF records — what tools the company integrates (SendGrid, Salesforce, Mailgun)
- DMARC — email-security maturity (sometimes a buying signal in security tooling)
    import dns.exception
    import dns.resolver

    def _lookup(name: str, rtype: str) -> list:
        # dnspython raises when a record is missing; treat that as "no data"
        try:
            return list(dns.resolver.resolve(name, rtype))
        except dns.exception.DNSException:
            return []

    def domain_intel(domain: str) -> dict:
        return {
            "mx": [r.exchange.to_text() for r in _lookup(domain, "MX")],
            "spf": [r.to_text() for r in _lookup(domain, "TXT") if "v=spf1" in r.to_text()],
            "dmarc": [r.to_text() for r in _lookup(f"_dmarc.{domain}", "TXT")],
        }

These three queries enrich every contact for free without touching the email body.
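To turn a raw SPF record into the “what tools the company integrates” signal, you can map its include: targets to vendors. The sketch below is illustrative: `SPF_VENDORS` and `spf_vendors` are hypothetical names, and the include targets in the table are assumptions that should be checked against each vendor’s current documentation before you rely on them.

```python
import re

# Illustrative mapping from SPF include targets to vendors (assumed; verify).
SPF_VENDORS = {
    "_spf.google.com": "Google Workspace",
    "spf.protection.outlook.com": "Microsoft 365",
    "sendgrid.net": "SendGrid",
    "mailgun.org": "Mailgun",
    "_spf.salesforce.com": "Salesforce",
}

def spf_vendors(spf_record: str) -> list[str]:
    """Return the known vendors named by include: mechanisms, in record order."""
    # TXT records often arrive wrapped in double quotes, so stop at quotes too.
    includes = re.findall(r"\binclude:([^\s\"]+)", spf_record)
    found: list[str] = []
    for target in includes:
        vendor = SPF_VENDORS.get(target.lower().rstrip("."))
        if vendor and vendor not in found:
            found.append(vendor)
    return found
```

Unknown include targets are ignored here. Logging them is a cheap way to find vendors worth adding to the table.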
Things to know
- GDPR / privacy. The data is in the email; you have it because the sender sent it. But surfacing inferred attributes (job tier, sales-readiness) into a CRM is a different processing context. Document it in your privacy notice.
- International phone formats. The regex above is North America-leaning. Add patterns for E.164 (\+\d{6,15}) and country-specific shapes if your inbox has international correspondents.
- LinkedIn deprecated /pub/ URLs. Match /in/ only; the /pub/ shape was retired years ago.
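For the international phone note, a minimal sketch of an E.164-style matcher follows. It reuses the \+\d{6,15} pattern from the bullet above; the helper name `find_international_phone` and the separator set it tolerates (spaces, dots, hyphens, parentheses) are assumptions. It accepts only numbers written with a leading “+” and does not validate country codes.

```python
import re
from typing import Optional

E164_RE = re.compile(r"\+\d{6,15}")  # the pattern from the note above

def find_international_phone(sig: str) -> Optional[str]:
    """Return the first '+'-prefixed number, normalized to digits only, or None."""
    # Candidate: '+' followed by digits and same-line separators.
    for m in re.finditer(r"\+[\d ().-]{6,24}", sig):
        digits = "+" + re.sub(r"\D", "", m.group(0))
        if E164_RE.fullmatch(digits):
            return digits
    return None
```

Normalizing before validating means “+44 20 7946 0958” and “+44-20-7946-0958” end up as the same CRM value.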