Observability

Observability relies on collecting and analyzing metrics, logs, and traces to confirm that a system is operating as expected and to identify and troubleshoot issues when they arise.

Monitored together, these signals show how our systems behave and help us maintain reliability, performance, and security.

Here are the main metrics to monitor for better observability:

Infrastructure Metrics

CPU Usage: Percentage of CPU used by the system.
Memory Usage: Amount of RAM used.
Disk I/O: Rate of read/write operations on the disk.
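
CPU usage is usually derived from two samples of cumulative counters rather than read directly. The sketch below shows that arithmetic, assuming the caller supplies idle and total CPU time at two points in time (on Linux, such counters are exposed in /proc/stat). The function name and parameters are illustrative, not part of any monitoring library.

```python
def cpu_usage_percent(idle_before, total_before, idle_after, total_after):
    """Return CPU usage (%) between two samples of cumulative CPU time counters."""
    total_delta = total_after - total_before
    if total_delta <= 0:
        # No time elapsed (or a counter reset): report no usage instead of dividing by zero.
        return 0.0
    idle_delta = idle_after - idle_before
    busy_delta = total_delta - idle_delta
    return 100.0 * busy_delta / total_delta
```

For example, if total time advanced by 200 ticks and idle time by 50 ticks, the CPU was busy for 150 of 200 ticks, or 75%.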

Application Performance Metrics

Latency: Time taken to process a request.
Throughput: Number of requests processed per unit of time.
Error Rate: Percentage of failed requests.
Request Rate: Number of requests received by the application per unit of time.
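
As a minimal sketch of how these four values relate, the snippet below derives them from a list of request records for one time window. The `Request` record, its field names, and the nearest-rank p95 are illustrative assumptions. Throughput is counted here as successfully processed requests, which is one common convention rather than the only one.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; the field names are assumptions."""
    duration_ms: float  # time taken to process the request
    ok: bool            # False if the request failed


def summarize(requests, window_seconds):
    """Compute request rate, throughput, error rate, and p95 latency for one window."""
    total = len(requests)
    if total == 0 or window_seconds <= 0:
        return {"request_rate": 0.0, "throughput": 0.0,
                "error_rate": 0.0, "p95_latency_ms": 0.0}
    failed = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the smallest sample with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * total)
    return {
        "request_rate": total / window_seconds,           # requests received per second
        "throughput": (total - failed) / window_seconds,  # requests processed successfully per second
        "error_rate": 100.0 * failed / total,             # percentage of failed requests
        "p95_latency_ms": durations[rank - 1],
    }
```

Reporting a high percentile alongside (or instead of) the average keeps slow outliers visible, since an average can hide a small share of very slow requests.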

Container and Orchestration Metrics

Container CPU/Memory Usage: Resource usage by containers.
Pod Status: Health and status of Kubernetes pods.
Node Health: Status and resource usage of Kubernetes nodes.

Service Metrics

Availability/Uptime: Percentage of time the service is operational.
Dependency Health: Status of external services and dependencies.
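
Availability is typically expressed as the share of a period during which the service was up. A minimal sketch of that calculation follows, assuming downtime is already measured in seconds. The function name and parameters are hypothetical.

```python
def availability_percent(period_seconds, downtime_seconds):
    """Return the percentage of the period during which the service was operational."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    # Clamp downtime to the valid range so bad input cannot yield a value outside 0..100.
    downtime = min(max(downtime_seconds, 0), period_seconds)
    return 100.0 * (period_seconds - downtime) / period_seconds
```

For a 30-day period (2,592,000 seconds), 2,592 seconds of downtime (about 43 minutes) corresponds to 99.9% availability.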

Database Metrics

Query Latency: Time taken to execute a database query.
Connection Pooling: Number of active database connections relative to the pool size.
Read/Write Operations: Number of read/write operations per second.
Cache Hit Rate: Percentage of database queries served from cache.
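
Two of these database metrics are simple ratios over counters. The sketch below computes them, assuming the application or database already exposes hit, miss, and connection counts. All names are illustrative.

```python
def cache_hit_rate(hits, misses):
    """Return the percentage of lookups served from cache."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


def pool_utilization(active_connections, pool_size):
    """Return the percentage of the connection pool currently in use."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return 100.0 * active_connections / pool_size
```

A pool that stays close to 100% utilization suggests that requests may soon have to wait for a free connection.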

Network Metrics

Latency: Time taken for a packet to travel from source to destination.
Packet Loss: Percentage of packets lost during transmission.
Bandwidth Usage: Amount of data transmitted over the network per unit of time.
Network I/O: Incoming and outgoing network traffic.
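
Packet loss and bandwidth usage can both be derived from counters sampled over an interval. The following is a rough sketch under that assumption. The function names and the cumulative byte-counter inputs are hypothetical.

```python
def packet_loss_percent(packets_sent, packets_received):
    """Return the percentage of sent packets that never arrived."""
    if packets_sent <= 0:
        return 0.0
    lost = max(packets_sent - packets_received, 0)
    return 100.0 * lost / packets_sent


def bandwidth_bits_per_second(bytes_start, bytes_end, interval_seconds):
    """Return average bandwidth usage from two readings of a cumulative byte counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 8 * (bytes_end - bytes_start) / interval_seconds
```

For example, 197 packets received out of 200 sent is a 1.5% loss, and 1,250,000 bytes transferred in 10 seconds averages 1,000,000 bits per second.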

Security Metrics

Authentication Failures: Number of failed login attempts.
Unauthorized Access Attempts: Number of attempts to access restricted resources.
Vulnerability Scans: Results from regular security scans.
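
Counts such as failed logins are most useful over a recent time window, so that a sudden burst stands out. Below is a minimal sliding-window counter, assuming failures are recorded with timestamps in non-decreasing order. The class and its methods are hypothetical, not taken from any security tool.

```python
from collections import deque


class FailureWindow:
    """Counts failed login attempts within a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one failed attempt (timestamps are assumed to arrive in order)."""
        self._timestamps.append(timestamp)

    def count(self, now):
        """Return the number of failures in the last window_seconds."""
        cutoff = now - self.window_seconds
        # Drop failures that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

An alert could then fire when `count(now)` exceeds a threshold chosen for the system in question.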

User Experience Metrics

Page Load Time: Time taken for a web page to load.
Transaction Success Rate: Percentage of successful user transactions.
User Engagement: Measures such as session duration, bounce rate, and user interactions.

Logs and Traces

Log Volume: Amount of log data generated.
Log Severity Levels: Distribution of log entries by severity (info, warning, error).
Trace Spans and Durations: Detailed trace information showing the flow and timing of requests through the system.
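
The distribution of log entries by severity can be computed with a simple counter. The sketch below assumes each log line starts with its level followed by a space (for example "ERROR disk full"). Real log formats vary, so the parsing step is an assumption.

```python
from collections import Counter

KNOWN_LEVELS = ("INFO", "WARNING", "ERROR")


def severity_distribution(log_lines):
    """Return the percentage of log entries per severity level."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<LEVEL> <message>"; unrecognized levels are grouped as OTHER.
        level = line.split(" ", 1)[0].upper()
        counts[level if level in KNOWN_LEVELS else "OTHER"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: 100.0 * n / total for level, n in counts.items()}
```

A rising share of error entries, even when total log volume is flat, is often an early sign of a problem.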