Notifications are the boring-looking subsystem that every product eventually needs and that most teams underestimate. This article walks through the architecture of a notifications service that reliably handles millions of messages a day across multiple channels.
The scope
“Send a notification” is actually many things:
- Email
- Web push to browsers
- Mobile push to iOS/Android
- SMS
- In-app notification feeds
- Webhook to partner systems
Each channel has its own API, rate limits, failure modes, and latency expectations. A good notification service abstracts them behind a unified internal interface.
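That unified internal interface can be sketched as a narrow contract every channel implements. This is a minimal illustration, not code from the article; the names (`ChannelSender`, `DeliveryResult`, `SmsSender`) are hypothetical:

```java
import java.util.Map;

public class ChannelExample {
    // Every channel implements the same narrow contract, so the dispatcher
    // never needs to know about Twilio vs APNs vs SMTP specifics.
    public interface ChannelSender {
        String channelName();
        DeliveryResult send(String userId, Map<String, String> payload);
    }

    public record DeliveryResult(boolean accepted, String providerMessageId) {}

    // A stub SMS channel showing the shape of an implementation.
    public static class SmsSender implements ChannelSender {
        public String channelName() { return "sms"; }
        public DeliveryResult send(String userId, Map<String, String> payload) {
            // Real code would call the SMS provider here and map its response.
            return new DeliveryResult(true, "sms-" + userId);
        }
    }
}
```

Rate limits, retries, and provider quirks then live behind each implementation rather than in business code.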
Architecture
[upstream services]
│
▼ event: "user-123 got a new message"
┌─────────────────────┐
│ Notification API │ applies user prefs, templates, dedup
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Dispatch Queue │ per-channel Kafka topics
└─────────────────────┘
│
├──▶ Email Worker ──▶ Email provider (SES, SendGrid)
├──▶ SMS Worker ──▶ SMS provider (Twilio)
├──▶ Push Worker ──▶ FCM / APNs
└──▶ InApp Worker ──▶ DB + websocket fanout

Core ideas: upstream publishes intent (“user got a message”); the notification service decides channels and routes to per-channel workers.
The input contract
Upstream services shouldn’t know about channels. They publish semantic events:
{
  "event": "message.received",
  "user_id": "u-123",
  "data": { "sender_name": "Alex", "preview": "Hey, can we meet..." },
  "idempotency_key": "msg-456-event"
}

The notification service handles:
- Look up the user’s channel preferences (email yes, SMS off, push yes)
- Apply rate limits / bundling (don’t send 50 pushes when 50 messages arrive in 5 minutes)
- Pick templates per channel
- Deduplicate by idempotency_key
- Emit per-channel dispatch messages
Upstream services don’t care about any of this. They just emit events.
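The decision pipeline above can be sketched in a few lines. The in-memory maps stand in for the preferences store and dedup store, and all names here are illustrative:

```java
import java.util.*;

public class NotificationPipeline {
    // In-memory stand-ins for the prefs store and dedup store (illustrative only).
    private final Map<String, Set<String>> prefs = new HashMap<>();   // userId -> enabled channels
    private final Set<String> seenKeys = new HashSet<>();             // processed idempotency keys

    public void setPrefs(String userId, Set<String> channels) { prefs.put(userId, channels); }

    // Turns one semantic event into zero or more per-channel dispatches.
    public List<String> handle(String eventType, String userId, String idempotencyKey) {
        if (!seenKeys.add(idempotencyKey)) return List.of();          // duplicate: drop silently
        List<String> dispatches = new ArrayList<>();
        for (String channel : prefs.getOrDefault(userId, Set.of())) {
            dispatches.add(channel + ":" + eventType + ":" + userId); // would enqueue to Kafka
        }
        return dispatches;
    }
}
```

Note the dedup check comes first: a retried upstream event produces zero dispatches, not a second round of notifications.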
User preferences
A core responsibility. Users configure per event type:
- Which channels
- Quiet hours
- Digest (immediate, hourly, daily)
- Language
Represent as a preferences table, queried on every notification. Cache aggressively — prefs don’t change often.
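A preferences row might look like the record below, with quiet hours being the one fiddly part (windows like 22:00–07:00 cross midnight). Field names are illustrative, not a prescribed schema:

```java
import java.time.LocalTime;

public class Preferences {
    // One row per (user, event type); fields mirror the list above.
    public record Pref(boolean emailOn, boolean pushOn, boolean smsOn,
                       LocalTime quietStart, LocalTime quietEnd,
                       String digest, String language) {}

    // True if 'now' falls inside the user's quiet hours (handles overnight windows).
    public static boolean inQuietHours(Pref p, LocalTime now) {
        if (p.quietStart().isBefore(p.quietEnd())) {
            return !now.isBefore(p.quietStart()) && now.isBefore(p.quietEnd());
        }
        // Window wraps past midnight, e.g. 22:00-07:00.
        return !now.isBefore(p.quietStart()) || now.isBefore(p.quietEnd());
    }
}
```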
Template system
Templates per (event_type, channel, language):
templates:
  - event: message.received
    channel: email
    language: en
    subject: "{sender_name} sent you a message"
    body: |
      {sender_name} says: "{preview}"
      Reply at https://app.example.com/messages
  - event: message.received
    channel: push
    language: en
    title: "{sender_name}"
    body: "{preview}"

Rendering applies data + localization + personalization. Test templates in CI — bad template = bad notification.
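The rendering step is simple substitution of `{placeholder}` tokens from the event's `data` map. A minimal sketch; failing loudly on a missing key is deliberate, since that is exactly what the CI tests should catch:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateRenderer {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Substitutes {name} placeholders from the event's data map; unknown keys
    // throw, so a broken template fails in CI rather than in front of users.
    public static String render(String template, Map<String, String> data) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = data.get(m.group(1));
            if (value == null) throw new IllegalArgumentException("missing key: " + m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```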
Idempotency
The backbone. Every dispatch has an idempotency key. Dedupe on receipt; don’t send twice.
dispatch_idempotency_key = hash(event.idempotency_key + channel + user_id)

Store recently-processed keys (TTL 24-48 hours). On retry, drop duplicates silently.
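Sketching that formula plus the TTL'd key store; in production the map would be a Redis SET with an expiry, and the SHA-256 choice here is just one reasonable option:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DedupStore {
    // key -> expiry epoch millis; a Redis SET with TTL replaces this in production.
    private final Map<String, Long> seen = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public DedupStore(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Per-channel dispatch key, as in the formula above.
    public static String dispatchKey(String eventKey, String channel, String userId) {
        try {
            MessageDigest d = MessageDigest.getInstance("SHA-256");
            byte[] hash = d.digest((eventKey + channel + userId).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true only the first time a key is seen within the TTL window.
    public boolean firstSeen(String key, long nowMillis) {
        Long expiry = seen.get(key);
        if (expiry != null && expiry > nowMillis) return false; // duplicate: drop silently
        seen.put(key, nowMillis + ttlMillis);
        return true;
    }
}
```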
Provider abstraction
Email via SES, SendGrid, Mailgun — abstract behind an interface:
public interface EmailProvider {
    DeliveryResult send(EmailMessage msg);
}

Switch providers without touching business code. Run multiple providers for resilience — a primary plus a fallback for when the primary errors.
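The primary-plus-fallback pattern is a thin wrapper over that interface. A minimal sketch; the message/result record shapes are assumptions, and real code would also record which provider succeeded for observability:

```java
public class FailoverEmail {
    public record EmailMessage(String to, String subject, String body) {}
    public record DeliveryResult(boolean ok, String provider) {}

    public interface EmailProvider {
        DeliveryResult send(EmailMessage msg);
    }

    // Tries the primary; on failure, retries once on the fallback provider.
    public static DeliveryResult sendWithFallback(EmailProvider primary,
                                                  EmailProvider fallback,
                                                  EmailMessage msg) {
        try {
            return primary.send(msg);
        } catch (RuntimeException e) {
            return fallback.send(msg);
        }
    }
}
```

Because both providers sit behind the same interface, the failover logic stays channel-generic and the business code never changes.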
Delivery guarantees
Per channel:
- Push / SMS / Email — at-least-once. Retries can produce duplicates; deduplicate before handing off to the provider, and use provider-side idempotency features where they exist.
- In-app — exactly-once per user (persist to DB, fanout via websocket). Web clients reconcile on reconnect.
User-visible duplicates from at-least-once delivery are rare if idempotency keys are applied correctly. Accept the small risk of a rare duplicate over the certainty of occasionally missing notifications.
Rate limits
Per user per channel. Burst limits. Quiet hours. Aggregation:
User has 20 new messages in 5 minutes.
Rule: at most 1 push notification per minute per event type.
Action: send the first push; bundle the other 19 into a digest "You have 20 new messages".

Implement via per-user-per-channel buckets (Redis counters with TTL). Delay or bundle rather than drop.
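Those buckets can be sketched as a fixed-window counter keyed by (user, channel, event type). An in-memory illustration of what the Redis INCR-with-TTL pattern does; a `false` return means "bundle into a digest", not "drop":

```java
import java.util.HashMap;
import java.util.Map;

public class NotificationRateLimiter {
    // bucket key -> {window start millis, count}; Redis INCR + EXPIRE in production.
    private final Map<String, long[]> buckets = new HashMap<>();
    private final long windowMillis;
    private final int maxPerWindow;

    public NotificationRateLimiter(long windowMillis, int maxPerWindow) {
        this.windowMillis = windowMillis;
        this.maxPerWindow = maxPerWindow;
    }

    // True if this send is allowed now; false means "bundle", not "drop".
    public boolean allow(String userId, String channel, String eventType, long nowMillis) {
        String key = userId + ":" + channel + ":" + eventType;
        long[] bucket = buckets.get(key);
        if (bucket == null || nowMillis - bucket[0] >= windowMillis) {
            buckets.put(key, new long[]{nowMillis, 1}); // fresh window
            return true;
        }
        if (bucket[1] >= maxPerWindow) return false;    // over the limit this window
        bucket[1]++;
        return true;
    }
}
```

With a 60-second window and `maxPerWindow = 1`, this implements exactly the "at most 1 push per minute per event type" rule above.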
Digest / aggregation
For users who chose daily/weekly digest, hold events and send summary on schedule:
Daily digest table:
  user_id, pending_events (JSON array), scheduled_at

A cron job scans pending rows, renders the summary, dispatches it, and clears the buffer.
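The accumulate-then-flush cycle can be sketched in memory (the table above would back this in production; the summary wording is illustrative):

```java
import java.util.*;

public class DigestBuffer {
    // userId -> buffered events; the "Daily digest table" in production.
    private final Map<String, List<String>> pending = new HashMap<>();

    public void add(String userId, String event) {
        pending.computeIfAbsent(userId, k -> new ArrayList<>()).add(event);
    }

    // Called by the scheduled job: render one summary and clear the buffer.
    public Optional<String> flush(String userId) {
        List<String> events = pending.remove(userId);
        if (events == null || events.isEmpty()) return Optional.empty();
        return Optional.of("You have " + events.size() + " new notifications");
    }
}
```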
Webhooks (outgoing)
For partner integrations:
- HTTP POST to their URL
- Retry with exponential backoff
- Dead-letter after N attempts
- Sign the request (HMAC) so they can verify
- Timeout aggressively (don’t let slow partners degrade service)
Partners fail in creative ways. Track failure rate per partner; disable after sustained failures (alert them).
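The HMAC signing step looks like this in practice: sign the raw request body with the shared secret and send the hex digest in a header, so the partner can recompute and compare before trusting the payload. A sketch using the JDK's `javax.crypto.Mac`; any header name is your own convention:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class WebhookSigner {
    // HMAC-SHA256 over the raw body; the partner recomputes this with the
    // shared secret and compares before trusting the payload.
    public static String signature(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(body.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Sign the exact bytes you send: if the partner re-serializes the JSON before verifying, signatures will mismatch even when nothing is wrong.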
Observability
Metrics that matter:
- Dispatch success rate per channel
- End-to-end delivery latency (event → delivered)
- Provider error rate (alert on spikes)
- Queue depth (a growing backlog means workers are falling behind)
- Cost per channel (SMS and email have real $)
Tracing: each notification should trace from upstream event → decision → dispatch → delivery receipt.
Deliverability (email specifically)
Email has its own deep world:
- SPF, DKIM, DMARC setup
- Bounce handling (hard vs soft)
- Suppression lists (don’t email addresses that bounced)
- Reputation monitoring
- Inbox vs spam folder analysis
Your email provider (SES, SendGrid, etc.) handles much of this, but it's still your job to process bounce notifications, maintain the suppression list, and honor unsubscribes.
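The suppression-list check is small but non-negotiable: record hard bounces from the provider's webhook and consult the list before every email dispatch. A minimal in-memory sketch (production would persist this and also track complaints and unsubscribes):

```java
import java.util.HashSet;
import java.util.Set;

public class SuppressionList {
    private final Set<String> suppressed = new HashSet<>();

    // Called from the provider's bounce/complaint webhook.
    public void recordHardBounce(String address) {
        suppressed.add(address.toLowerCase());
    }

    // Checked before every email dispatch; sending to bounced
    // addresses damages sender reputation for all your mail.
    public boolean canSend(String address) {
        return !suppressed.contains(address.toLowerCase());
    }
}
```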
Scale notes
At 10 million notifications / day:
- That is roughly 115 per second on average; a typical setup handles it on modest infrastructure
- Per-channel workers scale independently
- DB writes (delivery history) become the dominant load; consider an append-only table plus a separate analytical store
- Cache user prefs aggressively; they’re read on every notification
At 100 million+ / day, optimize further: batch API calls to providers, shard by user, and shorten delivery-history retention.
Closing note
Notifications are one of those “easy” features that reveal their complexity over time. Done right, they delight users and boost engagement. Done wrong, they cause unsubscribes and abuse reports that hurt the whole sending infrastructure. Invest in preferences, templates, idempotency, and rate limits from the start — retrofitting them into a shipping system is painful. The best notification services feel invisible; users get the right thing at the right time and nothing else.