Most debugging articles stop at “add more logs.” That’s the easy advice. The hard, useful advice is which logs, at what level, with what structure, and what to do when the bug happens only once a week in production. This article is a field guide to logging and debugging in real Java systems — specifically the patterns that make a difference when you’re the one paged at 3 AM, staring at a terminal, trying to reconstruct what happened.
The central idea: logs are for investigation, not surveillance
A lot of teams treat logs as a dumping ground. Every function logs its entry and exit. Every HTTP call logs request and response. Every loop iteration logs progress. The log volume is enormous, the signal-to-noise ratio is terrible, and when something goes wrong the useful lines are buried.
The reframe: logs are structured data for people investigating incidents. That changes everything. You log not because something happened, but because someone will ask a specific question later, and this line will help answer it.
With that frame, most of the rules below fall out naturally.
Rule 1 — Structured logs or nothing
Free-text logs don’t scale past one developer’s terminal. Every log line should be JSON, one event per line:
{
  "ts": "2026-04-22T14:22:03.491Z",
  "level": "INFO",
  "service": "orders",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "event": "order.placed",
  "order_id": "o-8421",
  "customer_id": "c-12",
  "amount_cents": 4499,
  "currency": "USD"
}
Every field is queryable in Loki or Elasticsearch. You can answer "all orders over $100 placed by tier-1 customers in the last hour" in one query, without regex, without scanning.
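On the producing side, a minimal sketch using logstash-logback-encoder's StructuredArguments (the same kv(...) helper the later examples use); the field values here just mirror the JSON above:

import static net.logstash.logback.argument.StructuredArguments.kv;

// Each kv(...) becomes a top-level JSON field in the encoded line.
// ts and level come from the encoder; trace_id/span_id come from the MDC.
// Whether the event name lives in "message" or a dedicated "event" field is
// an encoder mapping choice; add kv("event", "order.placed") if you want it explicit.
log.info("order.placed",
    kv("order_id", "o-8421"),
    kv("customer_id", "c-12"),
    kv("amount_cents", 4499),
    kv("currency", "USD"));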
Logback setup is one file:
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
<includeMdcKeyName>trace_id</includeMdcKeyName>
<includeMdcKeyName>span_id</includeMdcKeyName>
<includeMdcKeyName>customer_id</includeMdcKeyName>
<timeZone>UTC</timeZone>
</encoder>
</appender>One-time cost, permanent benefit. If your project still ships System.out.println-style logs in 2026, that’s the first thing to fix.
Rule 2 — Log levels with intent
Most log noise comes from level confusion. The levels should mean something specific:
- ERROR — something broke, a human needs to investigate. In aggregate, this is alert-worthy. A line at ERROR that fires every minute is a bug.
- WARN — unexpected but handled. The circuit breaker opened, a retry succeeded, a deprecated endpoint was called. If it’s fine to ignore, it’s not a warn.
- INFO — meaningful business events. Order placed, payment settled, user signed up. Low-volume, durable, the story of what the system did.
- DEBUG — developer-level detail. Off in prod by default. Switch on for a specific service when investigating.
- TRACE — near-useless in practice. Method-level, usually better achieved with a profiler.
The test for whether a log line is at the right level: if this line fires at 2 AM, is the response “wake someone up,” “ticket tomorrow,” or “ignore”? Align the level accordingly.
A common anti-pattern — logging expected negative outcomes at ERROR:
// WRONG
try {
    User user = userService.findById(id);
    return ResponseEntity.ok(user);
} catch (UserNotFoundException e) {
    log.error("User not found: {}", id, e);
    return ResponseEntity.notFound().build();
}

// RIGHT
User user = userService.findOrNull(id);
if (user == null) {
    log.debug("User lookup miss: {}", id);
    return ResponseEntity.notFound().build();
}
return ResponseEntity.ok(user);
"User not found" is a valid outcome for a lookup. Logging it at ERROR inflates the error rate, desensitizes the alerts, and buries real problems.
Rule 3 — Log events, not narratives
Bad logging describes what the code is doing in prose:
Starting to process order
Validating order
Looking up customer
Customer found
Validating items
Items OK
Starting payment
Payment succeeded
Finalizing order
Order finalized
Ten lines that together tell you nothing you couldn't deduce from the code. Good logging emits events — named, structured, queryable:
{"event": "order.placed", "order_id": "o-8421", "items": 3, "amount_cents": 4499}
{"event": "payment.charged", "order_id": "o-8421", "payment_id": "p-221", "provider": "stripe"}Two lines, both searchable as facts. “Show me all orders placed in the last hour” = event:order.placed AND ts:[now-1h TO now]. Done.
The rule: every log line represents a thing that happened, not the fact that the code reached this line.
Rule 4 — Context over content
When you log an event, the hardest question to answer later is usually “for which request / user / order?” Put the identifiers in the log, not just the message.
Use MDC (Mapped Diagnostic Context) to attach context once per request:
@Component
public class RequestContextFilter extends OncePerRequestFilter {
    @Override
    protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res, FilterChain chain)
            throws ServletException, IOException {
        try {
            MDC.put("request_id", UUID.randomUUID().toString());
            MDC.put("customer_id", extractCustomerId(req));
            chain.doFilter(req, res);
        } finally {
            MDC.clear();
        }
    }
}
Every subsequent log line in that request automatically carries request_id and customer_id. A single query filters the entire lifecycle of one request across all services — because the trace_id joins them.
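One caveat worth knowing: MDC is thread-local, so this context does not follow work handed to a thread pool. A minimal sketch of carrying it across, assuming you control the submission point (executor and handle(order) are illustrative):

import org.slf4j.MDC;
import java.util.Map;

// Capture the MDC on the submitting thread, restore it on the worker.
Map<String, String> ctx = MDC.getCopyOfContextMap(); // may be null if the MDC is empty
executor.submit(() -> {
    if (ctx != null) MDC.setContextMap(ctx);
    try {
        handle(order);  // hypothetical unit of work
    } finally {
        MDC.clear();
    }
});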
Rule 5 — Never log secrets, PII, or large payloads
The incidents that make it to the evening news are often about this. Three hard rules:
No secrets. API keys, tokens, passwords, session IDs. Not even “for debugging, I’ll remove it later.” A redacting log filter is cheaper than an incident:
public class SecretsRedactor extends ClassicConverter {
    private static final Pattern AUTH = Pattern.compile("(Bearer|Basic)\\s+\\S+", Pattern.CASE_INSENSITIVE);

    @Override
    public String convert(ILoggingEvent e) {
        return AUTH.matcher(e.getFormattedMessage()).replaceAll("$1 [REDACTED]");
    }
}
No raw PII. Emails, phone numbers, national IDs. Log a stable hash or a last-4 pattern if you need to correlate:
log.info("email.sent", kv("recipient_hash", hash(email)), kv("template", "order_confirmation"));No full request / response bodies. A well-intentioned “log every request body for debugging” captures credit card numbers, personal messages, and puts them in Elasticsearch. Log the fact, log the size, log the structure, but not the content.
Review this on every PR that adds logging. This is one area where “we’ll clean it up later” never happens before the audit.
Rule 6 — Correlate logs with traces
In distributed systems, one user request touches 5–20 services. Without correlation, you have five haystacks instead of one. With it, you have a single thread connecting them.
The standard: every log line carries trace_id and span_id. OpenTelemetry’s Java agent does this automatically:
{"ts":"...","level":"INFO","service":"orders","trace_id":"4bf9...","span_id":"00f0...","event":"order.placed", ...}
{"ts":"...","level":"INFO","service":"payments","trace_id":"4bf9...","span_id":"e8a2...","event":"payment.charged", ...}Same trace_id across services. In Grafana, one click on a trace filters logs to exactly that request’s lifecycle, across every service.
If your logs don’t include trace_id, fix that first, before any other logging work. It’s the single largest multiplier on debugging speed.
Rule 7 — Log once per event, at the right boundary
A common mistake is logging the same fact at every layer — controller, service, repository, and the HTTP client all log “placing order”. Now the same event appears 4 times in different shapes.
The rule: log at the boundary where the fact becomes final. An HTTP request gets logged when the response is being returned (by a servlet filter / interceptor). An external API call gets logged when it completes (by the client). A domain operation gets logged when it commits (by the service method). Everything else is redundant.
A generic access-log interceptor is usually enough for HTTP:
@Component
public class AccessLogInterceptor implements HandlerInterceptor {
    @Override
    public void afterCompletion(HttpServletRequest req, HttpServletResponse res, Object h, Exception ex) {
        log.info("http.request",
            kv("method", req.getMethod()),
            kv("path", maskPath(req.getRequestURI())),
            kv("status", res.getStatus()),
            kv("duration_ms", durationFrom(req)),
            kv("user_agent", req.getHeader("User-Agent")));
    }
}
One line per request, every field queryable. No per-layer duplication.
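One registration detail, since interceptors don't pick themselves up; a sketch with Spring MVC's WebMvcConfigurer:

import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class WebConfig implements WebMvcConfigurer {
    private final AccessLogInterceptor accessLog;

    public WebConfig(AccessLogInterceptor accessLog) {
        this.accessLog = accessLog;
    }

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        registry.addInterceptor(accessLog); // one interceptor, every HTTP request logged once
    }
}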
Rule 8 — Log exceptions with their full context
Exception logs are the most useful in incidents and the most often done wrong. Two patterns for getting them right:
Log the exception once, at the boundary that handles it. Not at every layer that catches-and-rethrows:
// Rethrowing catches: don't log
try {
    return orderService.place(req);
} catch (InventoryException e) {
    throw new OrderFailedException("inventory", e);
}

// Boundary that handles: log with context
@ExceptionHandler(OrderFailedException.class)
public ResponseEntity<Error> handle(OrderFailedException e) {
    log.error("order.failed",
        kv("reason", e.getReason()),
        kv("request_id", MDC.get("request_id")),
        e);
    return ResponseEntity.status(502).body(Error.of(e.getReason()));
}
Include the cause chain, not just the top-level message. The java.util.concurrent.ExecutionException that wraps the real failure is almost never what you actually need; the root cause is. Pass the throwable itself to the logger: SLF4J prints the full cause chain when the exception is the last argument, as in log.error(msg, args, throwable), so prefer that over logging e.getMessage().
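If you also want the root cause as its own queryable field rather than buried in the stack trace, a small helper does it; the root_cause field name is a convention I'm assuming here, not anything standard:

// Walk the cause chain to its end; the self-reference check guards against cycles.
static Throwable rootCause(Throwable t) {
    Throwable cur = t;
    while (cur.getCause() != null && cur.getCause() != cur) {
        cur = cur.getCause();
    }
    return cur;
}

Then kv("root_cause", rootCause(e).getClass().getName()) makes "all failures rooted in SocketTimeoutException" a one-line query.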
The debugging playbook
Logging is the foundation. But when a bug lands, you still need the techniques to find it.
Step 1 — Reproduce at the right scale
Most bugs that seem mysterious in production are reproducible in a simpler environment, but only if you reproduce the right thing:
- If it’s a concurrency bug, test with real concurrency (not a single-threaded test)
- If it’s a data-size bug, seed a realistic amount of data
- If it’s a timing bug, involve real network latency (Toxiproxy is great for this)
- If it’s a state bug, reproduce the state, not just the trigger
Many hours of “mysterious production bug” become one-hour bugs once the reproduction environment matches production along the relevant axis.
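For the concurrency case specifically, the harness matters more than the assertion: start all threads on a latch so they genuinely overlap. A sketch, where inventoryService.reserve is a hypothetical call under test:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Fire N threads at the same code path simultaneously; any worker exception
// propagates out of Future.get(), so the repro fails loudly.
static void hammer(int threads, Callable<?> action) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    List<Future<?>> results = new ArrayList<>();
    for (int i = 0; i < threads; i++) {
        results.add(pool.submit(() -> {
            start.await();        // every thread parks here...
            return action.call(); // ...then all hit the code at once
        }));
    }
    start.countDown(); // fire
    for (Future<?> f : results) f.get();
    pool.shutdown();
}

// e.g. hammer(32, () -> inventoryService.reserve("sku-1", 1)); // hypothetical call under test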
Step 2 — Use the trace, not the theory
When you get paged, the instinct is to form a theory of what went wrong and start investigating. The faster path is: pull up the trace for a failing request, read it, let it tell you where time went or where the error originated.
A trace collapses thousands of lines of logs into a picture:
/checkout [2,847 ms] ← slow
├── auth [12 ms]
├── cart [34 ms]
├── pricing [180 ms]
├── payments [2,510 ms] ← guilty
│ └── stripe.charge [2,502 ms] ← real culprit
└── orders.persist [95 ms]
No theorizing needed. Stripe is slow, root cause in 30 seconds.
Step 3 — Differential investigation
When something changed (“it was working yesterday”), the most powerful tool is comparing what changed against what used to work:
- What deployed in the last 24 hours?
- What config changed?
- What traffic pattern is new?
- What downstream changed?
A deploy marker on every dashboard, a config change log, and a feature flag audit log turn “something changed” from a mystery into a query.
Step 4 — Use the runtime, not just the code
Java gives you access to live production state that many engineers never use:
Heap dumps. When memory is suspect:
jcmd <pid> GC.heap_dump /tmp/heap.hprof
Open in Eclipse MAT, follow the dominator tree, find the unexpected retention. An actual case I've seen: a 4 GB heap dominated by a single ConcurrentHashMap used as a cache that was never bounded.
Thread dumps. When latency is suspect or threads are stuck:
jcmd <pid> Thread.print
Repeat 3 times at 5-second intervals. Any thread in the same stack across all three dumps is stuck. You'll find the exact line.
Java Flight Recorder. When CPU is suspect and you can’t attach a profiler:
jcmd <pid> JFR.start duration=60s filename=/tmp/prod.jfr settings=profile
Overhead is around a percent or two with the profile settings, negligible for most services. Open in JDK Mission Control, look at the flame graph.
Running JFR continuously in production with a rotating 1-hour window costs nothing and means you always have the data when you need it:
-XX:StartFlightRecording=name=prod,filename=/var/log/jfr/rec.jfr,maxage=1h,maxsize=100m,settings=profile
Step 5 — Bisect, don't theorize
When you have a bug that was introduced in the last N commits:
git bisect start
git bisect bad HEAD
git bisect good v1.4.0
# repeat: test, git bisect good/bad
Twenty commits → 5 tests. The computer picks the middle; you say good or bad; it converges. More reliable than guessing at which change probably caused it.
Debugging production specifically
Some bugs only show up in production. The techniques that work:
Targeted logging via dynamic config. Ship a feature flag that increases log level for a specific class or package. When the bug happens, flip the flag for 10 minutes, capture details, flip it off. Don’t deploy — that changes the system.
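With Logback, the flip itself is a few lines; a sketch, where the flag-watching mechanism is whatever your config system already provides:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;

// Called from your config/flag listener; takes effect immediately, no restart.
public static void setLogLevel(String loggerName, String level) {
    LoggerContext ctx = (LoggerContext) LoggerFactory.getILoggerFactory();
    ctx.getLogger(loggerName).setLevel(Level.toLevel(level));
}

// e.g. setLogLevel("com.example.payments", "DEBUG") for 10 minutes, then back to "INFO"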
Sampling-based debug logs. Log full request/response bodies for 1% of traffic. When a bug is reported with a request ID, there’s a 1-in-100 chance you already captured it. For rare bugs, increase the sample rate temporarily on the specific route.
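A sketch of the sampling guard; redact and bodyOf stand in for your own helpers, and the redaction is non-negotiable given Rule 5:

import java.util.concurrent.ThreadLocalRandom;

int sampleRate = 100; // 1-in-100; raise per-route when chasing a specific bug
if (ThreadLocalRandom.current().nextInt(sampleRate) == 0) {
    log.info("http.body.sample",
        kv("request_id", MDC.get("request_id")),
        kv("body", redact(bodyOf(req)))); // placeholder helpers
}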
Canary cohorts. Roll out the new version to 1% of traffic; compare error rates, latency histograms, and specific event counts between old and new cohorts. Catches regressions that show up only under real load.
On-call-driven instrumentation. After every incident, the action item list includes “what log / metric / alert would have let us find this in 2 minutes instead of 20?” Ship that. Over time, your observability converges on real needs, not speculation.
Anti-patterns that make debugging harder
A short list of habits to avoid:
- Catching Exception and logging a generic message. Real stack trace lost forever. Let it propagate to a boundary that handles it meaningfully.
- Log, then throw, then log again. Same error appears three times at different stack depths. Either log at the boundary or rethrow without logging.
- Building log messages with string concatenation. log.info("user " + id + " did X") builds the string even when the level is disabled. Use parameterized logging: log.info("user {} did X", id).
- Ad-hoc println / System.err. Bypasses the logging framework, doesn't carry context, doesn't get shipped to the aggregator.
- Logging in tight loops. One line per row on a 1-million-row batch. Aggregate: log start, log every Nth row or every N seconds, log end (see the sketch after this list).
- Sensitive data “just for this PR.” Gets merged, gets shipped, surfaces in an audit. Redact at the source, not after the fact.
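The tight-loop fix from the list above, sketched; Row, process, and the 10-second threshold are illustrative:

import java.util.concurrent.TimeUnit;

long rows = 0;
long lastLog = System.nanoTime();
log.info("batch.start", kv("job", "nightly-import"));
for (Row row : batch) {
    process(row); // hypothetical per-row work
    rows++;
    // Log every 10 seconds instead of every row.
    if (System.nanoTime() - lastLog > TimeUnit.SECONDS.toNanos(10)) {
        log.info("batch.progress", kv("rows", rows));
        lastLog = System.nanoTime();
    }
}
log.info("batch.end", kv("rows", rows));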
Making debugging faster across the team
The tech is only half. The other half is team practices:
- Every incident produces a logging/observability action item. What log line would have made this 10× faster? Add it.
- Runbooks include debugging steps, not just recovery. “How to tell if it’s DB vs upstream”, “how to get a thread dump”, “where logs for service X live”. Checked-in markdown.
- Everyone can run the tools. If only one engineer knows how to take a heap dump, you have one engineer, not a team, for that kind of incident.
- Post-incident reviews are blameless but specific. “We didn’t see it” → specific fix. “We saw it too late” → specific alert. “We couldn’t find the cause” → specific instrumentation.
Checklist: is your logging & debugging setup production-grade?
- All logs are JSON, one event per line
- trace_id and span_id present on every log line
- MDC used for request-scoped context (request_id, customer_id)
- Log levels follow intent (ERROR = actionable, WARN = unexpected-but-handled)
- No secrets, no raw PII, no full bodies in logs (redactor verified)
- Access logging at one boundary, not per layer
- Exception logging at handling boundary, with full cause chain
- OpenTelemetry auto-instrumentation on HTTP, JDBC, Kafka
- Grafana / Kibana filters easily by trace_id across services
- JFR running continuously with rolling window
- Heap dump on OOM enabled (-XX:+HeapDumpOnOutOfMemoryError)
- Dynamic log-level control via config (no redeploy to debug)
- Sampling-based body capture for hard-to-reproduce bugs
- Runbook per service with debugging steps, not just recovery
- Every incident produces a concrete observability action item
Closing thought
The teams that debug production fastest aren’t the ones with the most logs. They’re the ones whose logs answer the questions investigators actually ask. Every log line has an implicit promise: “when something goes wrong, I will help.” Most log lines in most codebases break that promise. The discipline of logging — what to include, what to omit, where to put it, how to structure it — isn’t glamorous, but it compounds. The system you build now determines how long your 3 AM incidents take a year from now.