Postmortem and Incident Analysis Summary - August 16th, 2025
Resolved
Aug 16 at 06:10pm CEST
Root Cause Identified: Connection Tracking Table Exhaustion
Primary Issue:
Starting at approximately 19:05:58 UTC, the system experienced a severe netfilter connection tracking table exhaustion, indicated by the kernel message:
nf_conntrack: nf_conntrack: table full, dropping packet
Key Findings:
Connection Tracking Limits:
- Maximum connections allowed: 4,096 (nf_conntrack_max)
- Current connections (now): 132 (normal level)
- During the incident: table was completely full
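The figures above can be re-checked on the host at any time; a minimal sketch, assuming the standard netfilter sysctl/procfs paths on a modern Linux kernel (the live-read commands are shown as comments, and the utilization math uses the values reported in this incident):

```shell
# Live readings (run these on the affected host):
#   sysctl net.netfilter.nf_conntrack_max           -> table ceiling (4,096 here)
#   cat /proc/sys/net/netfilter/nf_conntrack_count  -> current entry count
# Utilization from the values reported above (132 of 4,096), integer math:
max=4096
count=132
echo "conntrack utilization: $(( count * 100 / max ))%"   # prints "conntrack utilization: 3%"
```

During the incident the same ratio would have read 100%, at which point the kernel starts dropping packets for new flows.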
Timeline of Events:
- Early morning (01:03:36): First Vector TCP connection errors appeared
- Throughout the day: Intermittent Vector host metrics collection failures
- 19:05:58: Critical threshold reached - connection tracking table full
- 19:05:58 to ~20:00: Sustained period of packet drops and connection failures
- 98 total occurrences of "table full" messages logged
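The occurrence count above can be reproduced from the kernel log; a sketch, assuming systemd-journald is in use (the journalctl query is shown as a comment, and the same grep filter is demonstrated on fabricated sample lines, not real log data):

```shell
# On the host, the incident-day count came from something like:
#   journalctl -k --since "2025-08-16" --until "2025-08-17" | grep -c 'table full, dropping packet'
# The same filter demonstrated on sample lines (fabricated examples):
printf '%s\n' \
  'kernel: nf_conntrack: nf_conntrack: table full, dropping packet' \
  'kernel: nf_conntrack: nf_conntrack: table full, dropping packet' \
  'vector: could not parse netlink response: Decode error' \
  | grep -c 'table full, dropping packet'   # prints 2
```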
Impact on Services:
- Vector logging system: Failed to load TCP connection info with various error patterns:
Invalid message length: 524293
Invalid message length: 3012335950
Invalid message length: 992969696
Could not parse netlink response: Decode error
- Nginx reverse proxy: IPv6 upstream connection failures
- HTTP monitoring: Connection closures and retry failures
Attack Pattern:
- 814 SSH connection attempts on August 16th (compared to 1,320 on August 15th)
- High volume of UFW-blocked connection attempts from various IP addresses
- Consistent brute-force attempts against SSH service
Trigger Analysis:
The incident was triggered by a combination of:
1. Sustained brute-force attacks creating many short-lived connections
2. Vector metrics collection attempting to read TCP connection information
3. Normal web traffic through the nginx reverse proxy
4. Low connection tracking table limit (4,096) insufficient for the traffic load
Resolution:
The incident appears to have resolved naturally as the attack volume decreased and connections timed out, freeing up space in the connection tracking table.
Recommendations:
1. Increase nf_conntrack_max to handle higher connection volumes
2. Implement fail2ban or similar to automatically block brute-force attempts
3. Consider tuning connection timeout values
4. Monitor connection tracking table utilization
5. Review Vector configuration for more resilient TCP metrics collection
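Recommendations 1 and 3 can be captured in a sysctl drop-in. The path and values below are illustrative starting points only, not measured guidance for this host:

```
# /etc/sysctl.d/99-conntrack.conf (hypothetical file; values are illustrative)
# Raise the table ceiling well above the observed 4,096
net.netfilter.nf_conntrack_max = 262144
# Shorten the established-TCP timeout from the 5-day (432,000 s) kernel default
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
```

Applying with `sudo sysctl --system` makes the change take effect immediately and persist across reboots.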
This was a network resource exhaustion incident rather than a security breach, caused by the system's connection tracking table becoming overwhelmed by legitimate traffic combined with attack attempts.