Notifications are the boring-looking subsystem that every product eventually needs and that most teams underestimate. This article walks through the architecture of a notifications service that reliably handles millions of messages a day across multiple channels.
The scope
“Send a notification” is actually many things:
- Email
- Web push to browsers
- Mobile push to iOS/Android
- SMS
- In-app notification feeds
- Webhook to partner systems
Each channel has its own API, rate limits, failure modes, and latency expectations. A good notification service abstracts them behind a unified internal interface.
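That unified internal interface can be sketched as a narrow contract every channel implements. This is a minimal illustration, not code from the article; the names (`ChannelSender`, `DeliveryResult`, `SmsSender`) are hypothetical:

```java
import java.util.Map;

public class ChannelExample {
    // Every channel implements the same narrow contract, so the dispatcher
    // never needs to know about Twilio vs APNs vs SMTP specifics.
    public interface ChannelSender {
        String channelName();
        DeliveryResult send(String userId, Map<String, String> payload);
    }

    public record DeliveryResult(boolean accepted, String providerMessageId) {}

    // A stub SMS channel showing the shape of an implementation.
    public static class SmsSender implements ChannelSender {
        public String channelName() { return "sms"; }
        public DeliveryResult send(String userId, Map<String, String> payload) {
            // Real code would call the SMS provider here and map its response.
            return new DeliveryResult(true, "sms-" + userId);
        }
    }
}
```

Rate limits, retries, and provider quirks then live behind each implementation rather than in business code.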
Architecture
[upstream services]
│
▼ event: "user-123 got a new message"
┌─────────────────────┐
│ Notification API │ applies user prefs, templates, dedup
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Dispatch Queue │ per-channel Kafka topics
└─────────────────────┘
│
├──▶ Email Worker ──▶ Email provider (SES, SendGrid)
├──▶ SMS Worker ──▶ SMS provider (Twilio)
├──▶ Push Worker ──▶ FCM / APNs
└──▶ InApp Worker ──▶ DB + websocket fanout

Core ideas: upstream publishes intent (“user got a message”); the notification service decides channels and routes to per-channel workers.
The input contract
Upstream services shouldn’t know about channels. They publish semantic events:
{
  "event": "message.received",
  "user_id": "u-123",
  "data": { "sender_name": "Alex", "preview": "Hey, can we meet..." },
  "idempotency_key": "msg-456-event"
}

The notification service handles:
- Look up the user’s channel preferences (email yes, SMS off, push yes)
- Apply rate limits / bundling (don’t send 50 pushes when 50 messages arrive in 5 minutes)
- Pick templates per channel
- Deduplicate by idempotency_key
- Emit per-channel dispatch messages
Upstream services don’t care about any of this. They just emit events.
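The decision pipeline above can be sketched in a few lines. The in-memory maps stand in for the preferences store and dedup store, and all names here are illustrative:

```java
import java.util.*;

public class NotificationPipeline {
    // In-memory stand-ins for the prefs store and dedup store (illustrative only).
    private final Map<String, Set<String>> prefs = new HashMap<>();   // userId -> enabled channels
    private final Set<String> seenKeys = new HashSet<>();             // processed idempotency keys

    public void setPrefs(String userId, Set<String> channels) { prefs.put(userId, channels); }

    // Turns one semantic event into zero or more per-channel dispatches.
    public List<String> handle(String eventType, String userId, String idempotencyKey) {
        if (!seenKeys.add(idempotencyKey)) return List.of();          // duplicate: drop silently
        List<String> dispatches = new ArrayList<>();
        for (String channel : prefs.getOrDefault(userId, Set.of())) {
            dispatches.add(channel + ":" + eventType + ":" + userId); // would enqueue to Kafka
        }
        return dispatches;
    }
}
```

Note the dedup check comes first: a retried upstream event produces zero dispatches, not a second round of notifications.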
User preferences
A core responsibility. Users configure per event type:
- Which channels
- Quiet hours
- Digest (immediate, hourly, daily)
- Language
Represent as a preferences table, queried on every notification. Cache aggressively — prefs don’t change often.
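A preferences row might look like the record below, with quiet hours being the one fiddly part (windows like 22:00–07:00 cross midnight). Field names are illustrative, not a prescribed schema:

```java
import java.time.LocalTime;

public class Preferences {
    // One row per (user, event type); fields mirror the list above.
    public record Pref(boolean emailOn, boolean pushOn, boolean smsOn,
                       LocalTime quietStart, LocalTime quietEnd,
                       String digest, String language) {}

    // True if 'now' falls inside the user's quiet hours (handles overnight windows).
    public static boolean inQuietHours(Pref p, LocalTime now) {
        if (p.quietStart().isBefore(p.quietEnd())) {
            return !now.isBefore(p.quietStart()) && now.isBefore(p.quietEnd());
        }
        // Window wraps past midnight, e.g. 22:00-07:00.
        return !now.isBefore(p.quietStart()) || now.isBefore(p.quietEnd());
    }
}
```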
Template system
Templates per (event_type, channel, language):
templates:
  - event: message.received
    channel: email
    language: en
    subject: "{sender_name} sent you a message"
    body: |
      {sender_name} says: "{preview}"
      Reply at https://app.example.com/messages
  - event: message.received
    channel: push
    language: en
    title: "{sender_name}"
    body: "{preview}"

Rendering applies data + localization + personalization. Test templates in CI — bad template = bad notification.
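The rendering step is simple substitution of `{placeholder}` tokens from the event's `data` map. A minimal sketch; failing loudly on a missing key is deliberate, since that is exactly what the CI tests should catch:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateRenderer {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Substitutes {name} placeholders from the event's data map; unknown keys
    // throw, so a broken template fails in CI rather than in front of users.
    public static String render(String template, Map<String, String> data) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = data.get(m.group(1));
            if (value == null) throw new IllegalArgumentException("missing key: " + m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```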
Idempotency
The backbone. Every dispatch has an idempotency key. Dedupe on receipt; don’t send twice.
dispatch_idempotency_key = hash(event.idempotency_key + channel + user_id)

Store recently-processed keys (TTL 24-48 hours). On retry, drop duplicates silently.
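Sketching that formula plus the TTL'd key store; in production the map would be a Redis SET with an expiry, and the SHA-256 choice here is just one reasonable option:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DedupStore {
    // key -> expiry epoch millis; a Redis SET with TTL replaces this in production.
    private final Map<String, Long> seen = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public DedupStore(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Per-channel dispatch key, as in the formula above.
    public static String dispatchKey(String eventKey, String channel, String userId) {
        try {
            MessageDigest d = MessageDigest.getInstance("SHA-256");
            byte[] hash = d.digest((eventKey + channel + userId).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true only the first time a key is seen within the TTL window.
    public boolean firstSeen(String key, long nowMillis) {
        Long expiry = seen.get(key);
        if (expiry != null && expiry > nowMillis) return false; // duplicate: drop silently
        seen.put(key, nowMillis + ttlMillis);
        return true;
    }
}
```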
Provider abstraction
Email via SES, SendGrid, Mailgun — abstract behind an interface:
public interface EmailProvider {
    DeliveryResult send(EmailMessage msg);
}

Switch providers without touching business code. Run multiple providers for resilience — a primary plus a fallback for when the primary errors.
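The primary-plus-fallback pattern is a thin wrapper over that interface. A minimal sketch; the message/result record shapes are assumptions, and real code would also record which provider succeeded for observability:

```java
public class FailoverEmail {
    public record EmailMessage(String to, String subject, String body) {}
    public record DeliveryResult(boolean ok, String provider) {}

    public interface EmailProvider {
        DeliveryResult send(EmailMessage msg);
    }

    // Tries the primary; on failure, retries once on the fallback provider.
    public static DeliveryResult sendWithFallback(EmailProvider primary,
                                                  EmailProvider fallback,
                                                  EmailMessage msg) {
        try {
            return primary.send(msg);
        } catch (RuntimeException e) {
            return fallback.send(msg);
        }
    }
}
```

Because both providers sit behind the same interface, the failover logic stays channel-generic and the business code never changes.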
Delivery guarantees
Per channel:
- Push / SMS / Email — at-least-once. Retries can produce duplicates; deduplicate before handing off to the provider, and use provider-side idempotency features where they exist.
- In-app — exactly-once per user (persist to DB, fanout via websocket). Web clients reconcile on reconnect.
User-visible duplicates from at-least-once delivery are rare if idempotency keys are applied correctly. Accept the small risk of a rare duplicate over the certainty of occasionally missing notifications.
Rate limits
Per user per channel. Burst limits. Quiet hours. Aggregation:
User has 20 new messages in 5 minutes.
Rule: at most 1 push notification per minute per event type.
Action: send the first push; bundle the other 19 into a digest "You have 20 new messages".

Implement via per-user-per-channel buckets (Redis counters with TTL). Delay or bundle rather than drop.
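Those buckets can be sketched as a fixed-window counter keyed by (user, channel, event type). An in-memory illustration of what the Redis INCR-with-TTL pattern does; a `false` return means "bundle into a digest", not "drop":

```java
import java.util.HashMap;
import java.util.Map;

public class NotificationRateLimiter {
    // bucket key -> {window start millis, count}; Redis INCR + EXPIRE in production.
    private final Map<String, long[]> buckets = new HashMap<>();
    private final long windowMillis;
    private final int maxPerWindow;

    public NotificationRateLimiter(long windowMillis, int maxPerWindow) {
        this.windowMillis = windowMillis;
        this.maxPerWindow = maxPerWindow;
    }

    // True if this send is allowed now; false means "bundle", not "drop".
    public boolean allow(String userId, String channel, String eventType, long nowMillis) {
        String key = userId + ":" + channel + ":" + eventType;
        long[] bucket = buckets.get(key);
        if (bucket == null || nowMillis - bucket[0] >= windowMillis) {
            buckets.put(key, new long[]{nowMillis, 1}); // fresh window
            return true;
        }
        if (bucket[1] >= maxPerWindow) return false;    // over the limit this window
        bucket[1]++;
        return true;
    }
}
```

With a 60-second window and `maxPerWindow = 1`, this implements exactly the "at most 1 push per minute per event type" rule above.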
Digest / aggregation
For users who chose daily/weekly digest, hold events and send summary on schedule:
Daily digest table:
  user_id, pending_events (JSON array), scheduled_at

A cron job scans pending rows, renders the summary, dispatches it, and clears the buffer.
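The accumulate-then-flush cycle can be sketched in memory (the table above would back this in production; the summary wording is illustrative):

```java
import java.util.*;

public class DigestBuffer {
    // userId -> buffered events; the "Daily digest table" in production.
    private final Map<String, List<String>> pending = new HashMap<>();

    public void add(String userId, String event) {
        pending.computeIfAbsent(userId, k -> new ArrayList<>()).add(event);
    }

    // Called by the scheduled job: render one summary and clear the buffer.
    public Optional<String> flush(String userId) {
        List<String> events = pending.remove(userId);
        if (events == null || events.isEmpty()) return Optional.empty();
        return Optional.of("You have " + events.size() + " new notifications");
    }
}
```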
Webhooks (outgoing)
For partner integrations:
- HTTP POST to their URL
- Retry with exponential backoff
- Dead-letter after N attempts
- Sign the request (HMAC) so they can verify
- Timeout aggressively (don’t let slow partners degrade service)
Partners fail in creative ways. Track failure rate per partner; disable after sustained failures (alert them).
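The HMAC signing step looks like this in practice: sign the raw request body with the shared secret and send the hex digest in a header, so the partner can recompute and compare before trusting the payload. A sketch using the JDK's `javax.crypto.Mac`; any header name is your own convention:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class WebhookSigner {
    // HMAC-SHA256 over the raw body; the partner recomputes this with the
    // shared secret and compares before trusting the payload.
    public static String signature(String secret, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(body.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Sign the exact bytes you send: if the partner re-serializes the JSON before verifying, signatures will mismatch even when nothing is wrong.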
Observability
Metrics that matter:
- Dispatch success rate per channel
- End-to-end delivery latency (event → delivered)
- Provider error rate (alert on spikes)
- Queue depth (a growing backlog means workers are falling behind)
- Cost per channel (SMS and email have real $)
Tracing: each notification should trace from upstream event → decision → dispatch → delivery receipt.
Deliverability (email specifically)
Email has its own deep world:
- SPF, DKIM, DMARC setup
- Bounce handling (hard vs soft)
- Suppression lists (don’t email addresses that bounced)
- Reputation monitoring
- Inbox vs spam folder analysis
Your email provider (SES, SendGrid, etc.) handles much of this, but it's still your job to process bounce notifications, maintain the suppression list, and honor unsubscribes.
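The suppression-list check is small but non-negotiable: record hard bounces from the provider's webhook and consult the list before every email dispatch. A minimal in-memory sketch (production would persist this and also track complaints and unsubscribes):

```java
import java.util.HashSet;
import java.util.Set;

public class SuppressionList {
    private final Set<String> suppressed = new HashSet<>();

    // Called from the provider's bounce/complaint webhook.
    public void recordHardBounce(String address) {
        suppressed.add(address.toLowerCase());
    }

    // Checked before every email dispatch; sending to bounced
    // addresses damages sender reputation for all your mail.
    public boolean canSend(String address) {
        return !suppressed.contains(address.toLowerCase());
    }
}
```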
Scale notes
At 10 million notifications / day:
- That is roughly 115 per second on average; a typical setup handles it on modest infrastructure
- Per-channel workers scale independently
- DB writes (delivery history) become the dominant load; consider an append-only table plus a separate analytical store
- Cache user prefs aggressively; they’re read on every notification
At 100 million+ / day, optimize further: batch API calls to providers, shard by user, and shorten delivery-history retention.
Closing note
Notifications are one of those “easy” features that reveal their complexity over time. Done right, they delight users and boost engagement. Done wrong, they cause unsubscribes and abuse reports that hurt the whole sending infrastructure. Invest in preferences, templates, idempotency, and rate limits from the start — retrofitting them into a shipping system is painful. The best notification services feel invisible; users get the right thing at the right time and nothing else.