EngineeringNotes
Module 11

Monitoring & Logging

Why console.log isn't enough, and how to save your skin when servers crash.

01

The Scale Problem

When you are a small startup or working on a side project, you generally don't care much about logging and monitoring. If something breaks, you just fix it.

But for large-scale companies, it is absolutely critical. Imagine a bank's transaction server crashing for an hour: they could lose millions. They need to know immediately when something goes wrong, and exactly why.

02

Logging & Monitoring

Logging 📝

Logging refers to the practice of recording events, messages, or data points generated by software applications and systems.

  • Error Messages: Details about errors and exceptions that occur.
  • Access Logs: Records of who accessed what resources and when.
  • Audit Logs: Records for compliance and security purposes.

Monitoring 📊

Monitoring involves the continuous observation of a system to ensure it is operating correctly and efficiently. It includes tracking:

  • CPU & Memory Usage: How much processing power/RAM is being utilized.
  • Disk I/O & Space: Read/write operations and available storage.
  • Network Traffic: Data transfer rates.
  • Application Performance: Response times, throughput, and error rates.

03

Alerts

When there are logging/monitoring systems in place, you can set up alerts so you get called, messaged, Slacked, or paged when:

  • A system goes down
  • CPU usage goes above a certain point
  • Error count goes up

04

The Limits of PM2 logs

We love PM2. It keeps our Node.js applications alive.

Scenario A: App Crash

Your code throws an error. PM2 catches it and restarts the app.

Terminal
$ pm2 logs 1
Error: Database connection failed...
App-1 restarting...

✅ You can check the logs and fix the issue.

Scenario B: Machine Crash 💥

The entire EC2 instance (the machine itself) runs out of memory or has a kernel panic and crashes.

Question: Where are your logs now?

Gone. You cannot SSH into a dead machine. Even if you restart it, some logs might be lost or hard to find. You have NO idea what caused the crash.

05

The Solution: Centralized Logging

Since we can't trust the local machine to keep our logs safe forever, we need to ship them somewhere else in real-time.

Enter New Relic (or Datadog, CloudWatch, etc).

Real-time Streaming

As soon as your app logs something, it is sent to New Relic's cloud. If your machine explodes 1 second later, the log is already safe in the cloud.
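The idea can be sketched with a toy shipper: every log line is handed off the moment it is written, rather than kept only on local disk. The `send` transport below is a stand-in for a vendor's log-ingest API, not New Relic's actual client:

```javascript
// Stand-in transport; in production this would be an HTTPS call
// to the vendor's log-ingest endpoint.
const shipped = [];
function send(entry) {
  shipped.push(entry); // pretend this left the machine
}

function log(message) {
  const entry = { timestamp: Date.now(), message };
  send(entry); // shipped immediately...
  // ...so even if the process dies right after this line, the entry survives.
}

log('Database connection failed');
console.log(shipped.length); // 1
```

Real shippers batch entries for efficiency, but the principle is the same: the log's home is the cloud, not the disk of the machine that produced it.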

Alerting

You can set alerts. "If CPU usage > 90% for 5 mins, SMS the DevOps lead." This lets you react BEFORE the crash.

06

Services

There are a few ways to monitor systems.

Paid 💸

  • AWS CloudWatch: Monitoring and observability service for AWS resources.
  • Datadog: Logging and monitoring.
  • New Relic: Logging and monitoring.

In house / Self hosted 🏠

Prometheus + Grafana + Loki

We'll be going through:

  1. New Relic
  2. Prometheus + Grafana

07

Setting up New Relic (Host)

Step 1: Create Account

Go to newrelic.com and create a free account.

Step 2: Select Environment

You can install the agent on Linux, Windows, Docker, or Kubernetes.
(You can even install it on your local machine to test!)

Step 3: Installation (The Docker Way 🐳)

The standard Linux script often fails. We recommend using Docker.
* Note: This specific Docker method with host networking does NOT work on macOS.

Prerequisite: You need Docker installed on your server.
👉Guide: Install Docker on Ubuntu

  1. Select Docker as the platform.
  2. Click "Create new key" and continue.
  3. Select Installation Method: Basic.
  4. Select Docker Network: Default (Host).
    (It uses --network=host to see your system's metrics)
  5. Copy the command and run it on your Ubuntu server using sudo.

Terminal
sudo docker run -d --name newrelic-infra ...

How the Agent works

Think of the Agent as a spy living on your server.

  1. Collection: It constantly monitors your PM2 processes, CPU usage, and log files.
  2. Transmission: It packages this data and securely sends it to the New Relic cloud.
  3. Visualization: You view the beautiful graphs and logs on the New Relic dashboard.

Your Server (EC2) → New Relic Agent → New Relic Cloud

Step 4: Verify

Go to All Entities > Hosts on the New Relic dashboard. Click on your host to see live charts and monitoring data! 🚀

Wait, where is my application logic?
Right now, we are doing Host Monitoring (CPU, RAM, Disk).
To monitor your actual Node.js code (APM), you would need to install the newrelic npm package in your app. But Host monitoring is the most critical first step!

08

Bonus: The Power of NRQL ⚡

NRQL (New Relic Query Language) is the most powerful tool in your observability arsenal. It is a SQL-flavored language used to query the New Relic Database (NRDB).

Every time an event happens in your system (a request, an error, a custom event), it is stored as a JSON object in NRDB. NRQL allows you to slice and dice this data in real-time.

Why use it?

  • Deep Dive Troubleshooting: Find the exact user who experienced an error, or filter logs by a specific traceId.
  • Custom Dashboards: Create bespoke visualizations that the default UI doesn't provide.
  • Advanced Alerting: Create complex alert logic (e.g., "Alert if error rate > 1% AND latency > 500ms").
  • Business Analytics: Analyze specific feature usage, conversion rates, and business KPIs using technical events.

Example 1: Time Series Graph 📈

This creates the line charts you see on dashboards.

sql
SELECT average(cpuPercent) FROM SystemSample TIMESERIES
  • SELECT ...: Pick the metric (e.g., average(cpuPercent) or max(memoryUsedBytes)).
  • FROM ...: The table/source (e.g., SystemSample for host stats, Transaction for API calls).
  • TIMESERIES: The Magic Word. This tells NR to plot the data as a line chart over time. Without this, you just get a single number (the average over the whole time window).

Example 2: Grouping (Facet) 🥧

This creates pie charts or lists of "Top 5 ...".

sql
SELECT count(*) FROM Transaction FACET appName SINCE 1 day ago
  • count(*): Count how many times something happened (requests).
  • FACET ...: Group By. This breaks the data down by a field. Here, it shows requests per app name.
  • SINCE ...: The time range. Default is 60 minutes. You can say SINCE 1 week ago or SINCE yesterday.
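FACET is essentially SQL's GROUP BY. In plain JavaScript terms, the query above does something like this over the event objects stored in NRDB (the sample events here are made up for illustration):

```javascript
// Pretend these are Transaction events stored in NRDB.
const events = [
  { appName: 'checkout' },
  { appName: 'checkout' },
  { appName: 'auth' },
];

// SELECT count(*) FROM Transaction FACET appName
const counts = {};
for (const e of events) {
  counts[e.appName] = (counts[e.appName] || 0) + 1;
}

console.log(counts); // { checkout: 2, auth: 1 }
```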

09

Setting up Alerts

Graphs are great, but you can't stare at them 24/7. You need New Relic to scream at you when something breaks.

Understanding: Policy vs. Condition

Many people get this wrong. You cannot have an alert without a Policy.

  • 👮 Policy: The "Folder" that decides WHO gets notified (Email, Slack, etc).
  • Condition: The "Rule" (e.g., CPU > 90%). It lives inside a policy.

Step 1: Start from the Graph

The easiest way to create an alert is from a chart you already have.

  1. Go to your Host Monitoring Dashboard.
  2. Find the CPU Usage chart.
  3. Click the ... (three dots) in the top-right corner of the chart.
  4. Select Create alert condition.

Step 2: The Query (NRQL)

New Relic will automatically generate the NRQL for you based on the chart.

sql
SELECT average(cpuPercent) AS 'CPU used %' FROM SystemSample WHERE (entityGuid = '...')

Make sure the query looks correct and click Run to see the preview line.

Step 3: Fine-tune the Signal

This is where you prevent false positives.

  • Window Duration: Set to 1 minute (how we group data points).
  • Streaming Method: Select Event Flow (best for steady metrics like Host CPU).
  • Delay: Set to 2 minutes (gives data time to arrive before alerting).

Set the Threshold (The Danger Zone)
  • Create a Critical incident with Static threshold.
  • If query returns a value above 80...
  • ...for at least 5 minutes.
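The "above 80 for at least 5 minutes" rule is a duration-based threshold: it only fires when every data point in the window breaches, so a brief spike doesn't page anyone. A sketch of that logic (sample values are illustrative, one per minute):

```javascript
// Fires only when the last `windowSize` samples are ALL above the threshold.
function shouldAlert(samples, threshold, windowSize) {
  if (samples.length < windowSize) return false;
  return samples.slice(-windowSize).every(v => v > threshold);
}

const cpu = [45, 82, 85, 91, 88, 93]; // CPU %, one sample per minute
console.log(shouldAlert(cpu, 80, 5)); // true: last 5 samples all exceed 80
console.log(shouldAlert([45, 82, 85, 91, 60, 93], 80, 5)); // false: one dip resets the window
```

This is why the Delay setting matters: if late-arriving data fills the window after the fact, a naive evaluator would flap between firing and recovering.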

Step 4: Connect to Policy

This is the most critical step. Without a policy, no one gets notified.

  1. Name your alert condition (e.g., "High CPU Loading").
  2. Policy Name: Select "Create new policy" (or choose existing).
    Name it: "Production Server Policy".
  3. Issue Preference:
One issue per policy

HIGHLY RECOMMENDED. This groups all alerts (CPU, RAM, Disk) for this server into ONE big incident.
Instead of getting 50 emails, you get 1 email saying "Server is dying (CPU high, RAM full)".

10

SSH for Logs

While centralized monitoring is the goal, you will still often need to SSH into machines to check files or debug "live" issues that haven't crashed the server yet.

Terminal
# 1. Connect to the server
ssh -i my-key.pem ubuntu@1.2.3.4

# 2. View logs in real-time (tail)
pm2 logs --lines 100

# 3. Check system health
htop

11

Application Performance Monitoring (APM)

Host monitoring tells you if the server is slow. APM tells you which function in your code is slow. This is how you debug "Why is my API taking 5 seconds?".

Step 1: Get the Agent

  1. Go to APM & Services in New Relic.
  2. Click Add Data (or the + button).
  3. Select Node.js.
  4. Select On a host.

Step 2: Install & Configure

Terminal
npm install newrelic

You need to configure the agent. You can do this via a newrelic.js file matching the keys from the UI, or use environment variables in your start script.

json
"start": "NEW_RELIC_APP_NAME=my-app NEW_RELIC_LICENSE_KEY=... node dist/index.js"

Step 3: Inject the Agent

This is the most common mistake. You MUST require New Relic as the very first line of your application.

typescript
import 'newrelic'; // <-- MUST BE LINE 1

import express from 'express';
const app = express();
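Why does order matter? The agent works by wrapping ("monkey-patching") modules at load time, so it must run before anything else grabs a reference to them. A toy version of the idea, not New Relic's actual code:

```javascript
// Toy "agent": wraps a function so every call is timed.
const timings = [];
function instrument(obj, name) {
  const original = obj[name];
  obj[name] = function (...args) {
    const start = Date.now();
    const result = original.apply(this, args);
    timings.push({ name, ms: Date.now() - start });
    return result;
  };
}

// Pretend this is a framework's request handler.
const framework = { handle: () => 'ok' };

instrument(framework, 'handle'); // must happen BEFORE the app uses it
console.log(framework.handle()); // 'ok' -- behaves the same, but is now measured
console.log(timings.length);     // 1
```

If your app captured `framework.handle` before `instrument` ran, it would keep calling the unwrapped original and the agent would see nothing. That is exactly what happens when `import 'newrelic'` isn't line 1.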

Step 4: Verify

Run your app and generate some traffic.

Terminal
# Install loadtest tool
npm i -g loadtest

# Send 200 requests per second
loadtest -c 10 --rps 200 http://localhost:3000/

Check the APM & Services page in New Relic. You should see your service name and data flowing in!

12

Logs in Context (Winston)

⚠️ Why are my logs empty?

If you go to the Logs tab right now, it will be empty. This is because:

  • By default, the New Relic agent does not forward logs for security/cost reasons.
  • We must explicitly enable Log Forwarding in our configuration.

🤔 Why not just console.log?

  • 🚫 Performance Killer: In Node.js, console.log often writes synchronously (e.g., when output goes to a terminal or file). Use it too much, and your server stalls on handling requests while it writes to the screen.
  • 🚫 Unsearchable: It outputs plain text strings. You can't say "Show me error logs where userId = 123".
  • Winston is Better: It can write asynchronously (non-blocking) and outputs Structured JSON, which New Relic can index and search instantly.
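The searchability point is concrete: once each log entry is an object instead of a string, "error logs where userId = 123" becomes a trivial filter. A library-free sketch (the sample entries are made up):

```javascript
// With plain text logs, all you can do is grep strings.
// With structured logs, every entry is a queryable object.
const logs = [
  { level: 'error', message: 'Payment failed', userId: 123 },
  { level: 'info',  message: 'Login ok',       userId: 123 },
  { level: 'error', message: 'Payment failed', userId: 456 },
];

// "Show me error logs where userId = 123"
const hits = logs.filter(l => l.level === 'error' && l.userId === 123);
console.log(hits.length); // 1
```

New Relic's log UI and NRQL run this kind of filter for you across billions of entries, which is only possible because the entries are structured.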

Step 1: Enable Forwarding

Add this environment variable to your start script to turn on the firehose.

json
"start": "... NEW_RELIC_APPLICATION_LOGGING_FORWARDING_ENABLED=true ..."

Step 2: Install Winston

Terminal
npm install winston

Step 3: The Code Setup

Configure Winston to output JSON. New Relic intercepts this automatically.

javascript
require('newrelic'); // Must be the first require!
const winston = require('winston');

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.json(), // <-- Critical for New Relic
  defaultMeta: { service: 'user-service' },
  transports: [
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
  ]
});

// Add console logs for local dev only
if (process.env.NODE_ENV !== 'production') {
  logger.add(new winston.transports.Console({
    format: winston.format.simple(),
  }));
}

Metrics on logs

You can build metrics on top of log counts (especially for errors) to catch a specific error being thrown too often, or a sudden spike.

sql
SELECT count(`message`) FROM Log WHERE message LIKE '%error%'

13

Percentiles vs Averages

When measuring performance (response times, CPU usage), which metric should you use? Mean (Average) or Median?

The Example 🧮

Let's say in a 20-second interval, you receive 20 requests. Their response times (in ms) are:

[1, 2, 3, 4, 4, 4, 5, 6, 7, 9, 10, 11, 11, 11, 12, 13, 13, 13, 50, 100]

  • Average: 14.45 ms. Skewed high by the two outliers (50, 100).
  • Median (p50): 9.5 ms. A more accurate representation of the "typical" user.

Why Percentiles (p95, p99)?

The average tells you nothing about the worst experience. In our example:

  • 90% of your users had a response time of 13ms or less (p90).
  • But 10% (the outliers) waited 50ms or 100ms.

If you only look at the average (14.45), you might think "everything responds in about 14ms". You miss the fact that some users are suffering.

The Rule of Thumb:

Optimize for p95 or p99. If your p95 is good, it means 95% of your users are happy. The average is for amateurs.
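The numbers above can be reproduced in a few lines. Percentiles have several competing definitions; this sketch uses the simple nearest-rank method over the example data (note that nearest-rank gives a p50 of 9, while the median quoted above averages the 10th and 11th values to get 9.5):

```javascript
// The 20 response times from the example, already sorted ascending.
const times = [1, 2, 3, 4, 4, 4, 5, 6, 7, 9, 10, 11, 11, 11, 12, 13, 13, 13, 50, 100];

const average = times.reduce((a, b) => a + b, 0) / times.length;

// Nearest-rank percentile over a sorted array.
function percentile(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

console.log(average);                // 14.45
console.log(percentile(times, 90));  // 13  (90% of requests were this fast or faster)
console.log(percentile(times, 95));  // 50  (the outliers dominate the tail)
```

The p95 of 50ms surfaces exactly the pain the 14.45ms average hides.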

NRQL for Percentiles

sql
SELECT percentile(duration, 95, 99) FROM Transaction TIMESERIES