
Build an AI email triage agent

A triage agent does what you’d ask an over-caffeinated EA to do: open each unread message, decide whether it needs you in the next hour, the next day, the next week, or never, draft replies for the urgent ones, and bulk-archive the rest. The pattern below runs as a cron job every fifteen minutes and costs roughly $0.002 per 100 emails using GPT-4o-mini for classification.

Most of the work happens in two prompts and three CLI calls.

| Bucket | Meaning | Action |
|--------|---------|--------|
| URGENT | Production incident, executive ask | Draft a reply within the hour |
| ACTION | Code review, meeting follow-up | Draft a reply same-day |
| FYI | Status update, FYI thread | Leave it alone |
| NOISE | Newsletter, marketing, automated alert | Archive |

Four buckets is the right number. Three loses fidelity (everything ends up in “important”); with five, the model starts confusing adjacent categories.

Run it with temperature=0 and max_tokens=10 — you want a deterministic answer and a single category token, not a paragraph. The model gets sender + subject + a 200-char snippet, not the full body. That’s plenty for >90% accuracy and it keeps the per-email cost trivial.
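Assembling that input is a plain slice over the Nylas message fields the script below reads (`payload` is my name for the helper, not part of the script):

```python
def payload(msg, max_snippet=200):
    """Build the minimal classification input: sender, subject, truncated snippet."""
    return {
        "sender": msg["from"][0]["email"],
        "subject": msg["subject"],
        "snippet": msg["snippet"][:max_snippet],
    }
```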

The classification prompt:

```text
You triage email into one of four categories:

URGENT — production incidents, executive requests; reply within 1 hour
ACTION — code reviews, meeting follow-ups; reply same day
FYI — informational, no response needed
NOISE — newsletters, marketing, automated notifications

From: {sender}
Subject: {subject}
Snippet: {snippet}

Return ONLY the category name. Nothing else.
```
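Stored as a template string, this is the `CLASSIFY_PROMPT` constant the script formats per message (a direct transcription of the prompt above):

```python
CLASSIFY_PROMPT = """You triage email into one of four categories:

URGENT — production incidents, executive requests; reply within 1 hour
ACTION — code reviews, meeting follow-ups; reply same day
FYI — informational, no response needed
NOISE — newsletters, marketing, automated notifications

From: {sender}
Subject: {subject}
Snippet: {snippet}

Return ONLY the category name. Nothing else."""
```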

Always validate the output against the four valid strings — LLMs occasionally invent a category. Fall back to FYI on anything unrecognized.
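A slightly more defensive version of that check also absorbs case and whitespace drift before falling back (a sketch; `normalize` is my helper name, not part of the script below):

```python
VALID = {"URGENT", "ACTION", "FYI", "NOISE"}

def normalize(raw: str) -> str:
    """Map raw model output onto a valid bucket; default to FYI."""
    cat = raw.strip().upper()
    return cat if cat in VALID else "FYI"
```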

```python
import json
import subprocess

from openai import OpenAI

# CLASSIFY_PROMPT and DRAFT_PROMPT are the template strings shown in this post.
VALID = {"URGENT", "ACTION", "FYI", "NOISE"}
client = OpenAI()


def fetch_unread(limit=50):
    """Pull unread messages via the Nylas CLI as parsed JSON."""
    out = subprocess.run(
        ["nylas", "email", "list", "--unread", "--limit", str(limit), "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)


def classify(msg):
    """Bucket one message with a single cheap, deterministic call."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        max_tokens=10,
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(
            sender=msg["from"][0]["email"],
            subject=msg["subject"],
            snippet=msg["snippet"][:200],
        )}],
    )
    cat = resp.choices[0].message.content.strip()
    return cat if cat in VALID else "FYI"  # fall back on anything unrecognized


def draft_reply(msg):
    """Draft (never send) a reply with the stronger model."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.7,
        messages=[{"role": "user", "content": DRAFT_PROMPT.format(
            sender=msg["from"][0]["email"],
            subject=msg["subject"],
            body=msg["body"],
        )}],
    )
    body = resp.choices[0].message.content
    subprocess.run(
        ["nylas", "email", "draft",
         "--to", msg["from"][0]["email"],
         "--subject", "Re: " + msg["subject"],
         "--body", body, "--json"],
        check=True,
    )


def archive(msg):
    subprocess.run(["nylas", "email", "archive", msg["id"]], check=True)


for msg in fetch_unread():
    cat = classify(msg)
    if cat in ("URGENT", "ACTION"):
        draft_reply(msg)
    elif cat == "NOISE":
        archive(msg)
    # FYI: do nothing
```

Drafts land in your drafts folder — the agent never sends. You review, edit, and hit send (or not).

The drafting prompt:

```text
Write a short, professional reply to this email. Three sentences max. Be direct.

From: {sender}
Subject: {subject}
Body: {body}

Reply:
```
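As with classification, the script expects this as a template string; `DRAFT_PROMPT` is a direct transcription:

```python
DRAFT_PROMPT = """Write a short, professional reply to this email. Three sentences max. Be direct.

From: {sender}
Subject: {subject}
Body: {body}

Reply:"""
```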

Higher temperature here (0.7) lets the model produce natural prose. The “three sentences max” is load-bearing — without it, you’ll get drafts that read like a politely overcompensating intern.

```
*/15 * * * * /usr/bin/python3 /opt/triage/triage.py >> /var/log/triage.log 2>&1
```
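One guard worth adding: if a run ever stalls past fifteen minutes (API timeout, big backlog), the next one starts on top of it. Wrapping the job in `flock -n` (from util-linux) turns overlapping runs into no-ops; the lock-file path is arbitrary:

```
*/15 * * * * flock -n /tmp/triage.lock /usr/bin/python3 /opt/triage/triage.py >> /var/log/triage.log 2>&1
```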

The whole thing is idempotent — only --unread messages are pulled, and once you read a draft (or the original) it falls out of subsequent runs.

GPT-4o-mini classification: ~$0.15 per 1M input tokens. A 200-char snippet plus the prompt is 150 tokens. 100 emails ≈ 15K tokens ≈ $0.002. Drafting uses GPT-4o ($2.50 / 1M input) but only on the URGENT + ACTION subset — typically under 20% of the inbox. A heavy day at 200 unread emails costs roughly a nickel.
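Spelled out as arithmetic (the ~500 input tokens per draft is my assumption; the other numbers are from above):

```python
MINI_INPUT = 0.15 / 1_000_000   # $/input token, GPT-4o-mini (classification)
FULL_INPUT = 2.50 / 1_000_000   # $/input token, GPT-4o (drafting)

emails = 200                                         # a heavy day
classify_cost = emails * 150 * MINI_INPUT            # every unread email gets classified
draft_cost = int(emails * 0.20) * 500 * FULL_INPUT   # ~20% drafted, ~500 input tokens each (assumption)
total = classify_cost + draft_cost                   # lands around $0.05: "roughly a nickel"
```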

For mail you don’t want hitting OpenAI or Anthropic, point the client at a local Ollama server instead:

```python
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# model="llama3.1" or similar
```

Llama 3.1 classifies almost as well as GPT-4o-mini for this task. Drafting quality drops noticeably with smaller models — keep that on a hosted LLM unless you’re running a 70B+ parameter local model.

  • Only --unread. The agent never re-classifies what it has already handled: reviewing a draft doesn’t mark the original message unread, so the next cron run skips it.
  • No auto-send. Always land in drafts. The cost of a wrong send (to the wrong person, at the wrong tone) is way higher than the friction of one extra click.
  • Tune per inbox. Eng inboxes hit URGENT differently than sales inboxes. Customize the four categories and the prompt to your context.