Postmortem and Incident Analysis Summary - August 16th, 2025
Resolved
Aug 16 at 06:10pm CEST
Root Cause Identified: Connection Tracking Table Exhaustion
Primary Issue:
Starting at approximately 19:05:58 UTC, the system experienced a severe netfilter connection tracking table exhaustion, indicated by the kernel message:
nf_conntrack: nf_conntrack: table full, dropping packet
Key Findings:
Connection Tracking Limits:
- Maximum connections allowed: 4,096 (nf_conntrack_max)
- Current connections (now): 132 (normal level)
- During the incident: table was completely full
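The figures above can be re-checked on the host at any time; a minimal sketch, assuming the standard netfilter sysctl/procfs paths on a modern Linux kernel (the live-read commands are shown as comments, and the utilization math uses the values reported in this incident):

```shell
# Live readings (run these on the affected host):
#   sysctl net.netfilter.nf_conntrack_max           -> table ceiling (4,096 here)
#   cat /proc/sys/net/netfilter/nf_conntrack_count  -> current entry count
# Utilization from the values reported above (132 of 4,096), integer math:
max=4096
count=132
echo "conntrack utilization: $(( count * 100 / max ))%"   # prints "conntrack utilization: 3%"
```

During the incident the same ratio would have read 100%, at which point the kernel starts dropping packets for new flows.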
Timeline of Events:
- Early morning (01:03:36): First Vector TCP connection errors appeared
- Throughout the day: Intermittent Vector host metrics collection failures
- 19:05:58: Critical threshold reached - connection tracking table full
- 19:05:58 to ~20:00: Sustained period of packet drops and connection failures
- 98 total occurrences of "table full" messages logged
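The occurrence count above can be reproduced from the kernel log; a sketch, assuming systemd-journald is in use (the journalctl query is shown as a comment, and the same grep filter is demonstrated on fabricated sample lines, not real log data):

```shell
# On the host, the incident-day count came from something like:
#   journalctl -k --since "2025-08-16" --until "2025-08-17" | grep -c 'table full, dropping packet'
# The same filter demonstrated on sample lines (fabricated examples):
printf '%s\n' \
  'kernel: nf_conntrack: nf_conntrack: table full, dropping packet' \
  'kernel: nf_conntrack: nf_conntrack: table full, dropping packet' \
  'vector: could not parse netlink response: Decode error' \
  | grep -c 'table full, dropping packet'   # prints 2
```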
Impact on Services:
- Vector logging system: Failed to load TCP connection info with various error patterns:
Invalid message length: 524293
Invalid message length: 3012335950
Invalid message length: 992969696
Could not parse netlink response: Decode error
- Nginx reverse proxy: IPv6 upstream connection failures
- HTTP monitoring: Connection closures and retry failures
Attack Pattern:
- 814 SSH connection attempts on August 16th (compared to 1,320 on August 15th)
- High volume of UFW-blocked connection attempts from various IP addresses
- Consistent brute-force attempts against SSH service
Trigger Analysis:
The incident was triggered by a combination of:
1. Sustained brute-force attacks creating many short-lived connections
2. Vector metrics collection attempting to read TCP connection information
3. Normal web traffic through the nginx reverse proxy
4. Low connection tracking table limit (4,096) insufficient for the traffic load
Resolution:
The incident appears to have resolved naturally as the attack volume decreased and connections timed out, freeing up space in the connection tracking table.
Recommendations:
1. Increase nf_conntrack_max to handle higher connection volumes
2. Implement fail2ban or similar to automatically block brute-force attempts
3. Consider tuning connection timeout values
4. Monitor connection tracking table utilization
5. Review Vector configuration for more resilient TCP metrics collection
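Recommendations 1 and 3 can be captured in a sysctl drop-in. The path and values below are illustrative starting points only, not measured guidance for this host:

```
# /etc/sysctl.d/99-conntrack.conf (hypothetical file; values are illustrative)
# Raise the table ceiling well above the observed 4,096
net.netfilter.nf_conntrack_max = 262144
# Shorten the established-TCP timeout from the 5-day (432,000 s) kernel default
net.netfilter.nf_conntrack_tcp_timeout_established = 86400
```

Applying with `sudo sysctl --system` makes the change take effect immediately and persist across reboots.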
This was a network resource exhaustion incident rather than a security breach, caused by the system's connection tracking table becoming overwhelmed by legitimate traffic combined with attack attempts.