Monitoring & Logging
Why console.log isn't enough, and how to save your skin when servers crash.
The Scale Problem
When you are a small startup or working on a side project, you generally don't care much about logging and monitoring. If something breaks, you just fix it.
But for large-scale companies, it is absolutely critical. Imagine a bank's transaction server crashing for an hour: they could lose millions. They need to know immediately when something goes wrong, and exactly why.
Logging & Monitoring
Logging 📝
Logging refers to the practice of recording events, messages, or data points generated by software applications and systems.
- Error Messages: Details about errors and exceptions that occur.
- Access Logs: Records of who accessed what resources and when.
- Audit Logs: Records kept for compliance and security purposes.
Monitoring 📊
Monitoring involves the continuous observation of a system to ensure it is operating correctly and efficiently. It includes tracking:
- CPU & Memory Usage: How much processing power/RAM is being utilized.
- Disk I/O & Space: Read/write operations and available storage.
- Network Traffic: Data transfer rates.
- Application Performance: Response times, throughput, and error rates.
Alerts
With logging/monitoring systems in place, you can set up alerts to be called, messaged, Slacked, or paged when:
- A system goes down
- CPU usage goes above a certain point
- Error count goes up
The Limits of PM2 logs
We love PM2. It keeps our Node.js applications alive.
Scenario A: App Crash
Your code throws an error. PM2 catches it and restarts the app.
$ pm2 logs 1
Error: Database connection failed...
App-1 restarting...

✅ You can check the logs and fix the issue.
Scenario B: Machine Crash 💥
The entire EC2 instance (the machine itself) runs out of memory or has a kernel panic and crashes.
Question: Where are your logs now?
❌ Gone. You cannot SSH into a dead machine. Even if you restart it, some logs might be lost or hard to find. You have NO idea what caused the crash.
The Solution: Centralized Logging
Since we can't trust the local machine to keep our logs safe forever, we need to ship them somewhere else in real-time.
Enter New Relic (or Datadog, CloudWatch, etc).
Real-time Streaming
As soon as your app logs something, it is sent to New Relic's cloud. If your machine explodes 1 second later, the log is already safe in the cloud.
Alerting
You can set alerts. "If CPU usage > 90% for 5 mins, SMS the DevOps lead." This lets you react BEFORE the crash.
Services
There are a few ways to monitor systems.
Paid 💸
- 1. AWS CloudWatch: Monitoring and observability service for AWS resources.
- 2. Datadog: Logging and monitoring.
- 3. New Relic: Logging and monitoring.
In house / Self hosted 🏠
Prometheus + Grafana + Loki
We'll be going through:
- New Relic
- Prometheus + Grafana
Setting up New Relic (Host)
Step 1: Create Account
Go to newrelic.com and create a free account.
Step 2: Select Environment
You can install the agent on Linux, Windows, Docker, or Kubernetes.
(You can even install it on your local machine to test!)
Step 3: Installation (The Docker Way 🐳)
The standard Linux script often fails. We recommend using Docker.
* Note: This specific Docker method with host networking does NOT work on macOS.
Prerequisite: You need Docker installed on your server.
👉Guide: Install Docker on Ubuntu
- Select Docker as the platform.
- Click "Create new key" and continue.
- Select Installation Method: Basic.
- Select Docker Network: Default (Host).
(It uses --network=host so the agent can see your system's metrics.) - Copy the command and run it on your Ubuntu server using sudo.
sudo docker run -d --name newrelic-infra ...

How the Agent works
Think of the Agent as a spy living on your server.
- 1. Collection: It constantly monitors your PM2 processes, CPU usage, and log files.
- 2. Transmission: It packages this data and securely sends it to the New Relic cloud.
- 3. Visualization: You view the beautiful graphs and logs on the New Relic dashboard.
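For reference, the elided install command from Step 3 generally expands to something like the following. This is a sketch based on New Relic's Docker install flow; the UI generates the exact command with your real license key, so prefer what it gives you:

```shell
# Hypothetical sketch of the infrastructure-agent container.
# --network=host plus the read-only /host mount let the agent see
# the host machine's metrics, not just the container's own.
sudo docker run -d --name newrelic-infra \
  --network=host \
  --cap-add=SYS_PTRACE \
  --privileged \
  --pid=host \
  -v "/:/host:ro" \
  -v "/var/run/docker.sock:/var/run/docker.sock" \
  -e NRIA_LICENSE_KEY=YOUR_LICENSE_KEY \
  newrelic/infrastructure:latest
```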
Step 4: Verify
Go to All Entities > Hosts on the New Relic dashboard. Click on your host to see live charts and monitoring data! 🚀
To monitor your actual Node.js code (APM), you would need to install the newrelic npm package in your app. But Host monitoring is the most critical first step!

Bonus: The Power of NRQL ⚡
NRQL (New Relic Query Language) is the most powerful tool in your observability arsenal. It is a SQL-flavored language used to query the New Relic Database (NRDB).
Every time an event happens in your system (a request, an error, a custom event), it is stored as a JSON object in NRDB. NRQL allows you to slice and dice this data in real-time.
Why use it?
- Deep Dive Troubleshooting: Find the exact user who experienced an error, or filter logs by a specific traceId.
- Custom Dashboards: Create bespoke visualizations that the default UI doesn't provide.
- Advanced Alerting: Create complex alert logic (e.g., "Alert if error rate > 1% AND latency > 500ms").
- Business Analytics: Analyze specific feature usage, conversion rates, and business KPIs using technical events.
Example 1: Time Series Graph 📈
This creates the line charts you see on dashboards.
SELECT average(cpuPercent) FROM SystemSample TIMESERIES

- SELECT ...: Pick the metric (e.g., average(cpuPercent) or max(memoryUsedBytes)).
- FROM ...: The table/source (e.g., SystemSample for host stats, Transaction for API calls).
- TIMESERIES: The Magic Word. This tells NR to plot the data as a line chart over time. Without this, you just get a single number (the average of the whole hour).
Example 2: Grouping (Facet) 🥧
This creates pie charts or lists of "Top 5 ...".
SELECT count(*) FROM Transaction FACET appName SINCE 1 day ago

- count(*): Count how many times something happened (requests).
- FACET ...: Group By. This breaks the data down by a field. Here, it shows requests per app name.
- SINCE ...: The time range. The default is 60 minutes. You can say SINCE 1 week ago or SINCE yesterday.
Setting up Alerts
Graphs are great, but you can't stare at them 24/7. You need New Relic to scream at you when something breaks.
Understanding: Policy vs. Condition
Many people get this wrong. You cannot have an alert without a Policy.
- 👮 Policy: The "Folder" that decides WHO gets notified (Email, Slack, etc).
- ⚡ Condition: The "Rule" (e.g., CPU > 90%). It lives inside a policy.
Step 1: Start from the Graph
The easiest way to create an alert is from a chart you already have.
- Go to your Host Monitoring Dashboard.
- Find the CPU Usage chart.
- Click the ... (three dots) in the top-right corner of the chart.
- Select Create alert condition.
Step 2: The Query (NRQL)
New Relic will automatically generate the NRQL for you based on the chart.
SELECT average(cpuPercent) AS 'CPU used %' FROM SystemSample WHERE (entityGuid = '...')

Make sure the query looks correct and click Run to see the preview line.
Step 3: Fine-tune the Signal
This is where you prevent false positives.

- Create a Critical incident with a Static threshold (best for steady metrics like Host CPU).
- Trigger if the query returns a value above 80...
- ...for at least 5 minutes (this gives data time to arrive before alerting).
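The "above 80 for at least 5 minutes" logic can be sketched as a tiny evaluator. This is an illustration of the idea only (New Relic does this server-side; the names here are hypothetical):

```javascript
// Sketch: open a critical incident only when EVERY sample in the last
// 5 minutes is above the threshold, so a single spike doesn't page anyone.
const THRESHOLD = 80; // CPU %
const WINDOW_MS = 5 * 60 * 1000;

function shouldAlert(samples, now = Date.now()) {
  // samples: [{ ts: epochMillis, cpuPercent: number }]
  const recent = samples.filter((s) => now - s.ts <= WINDOW_MS);
  return recent.length > 0 && recent.every((s) => s.cpuPercent > THRESHOLD);
}

const now = Date.now();
const spiky = [
  { ts: now - 60_000, cpuPercent: 95 },
  { ts: now - 30_000, cpuPercent: 40 }, // dipped back below: no alert
];
const sustained = [
  { ts: now - 240_000, cpuPercent: 85 },
  { ts: now - 60_000, cpuPercent: 92 }, // stayed high: alert
];
console.log(shouldAlert(spiky));     // false
console.log(shouldAlert(sustained)); // true
```

This is why the "for at least 5 minutes" setting matters: it trades a few minutes of detection delay for far fewer false pages.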
Step 4: Connect to Policy
This is the most critical step. Without a policy, no one gets notified.
- Name your alert condition (e.g., "High CPU Load").
- Policy Name: Select "Create new policy" (or choose an existing one). Name it: "Production Server Policy".
- Issue Preference: HIGHLY RECOMMENDED to pick the option that groups all alerts (CPU, RAM, Disk) for this server into ONE big incident. Instead of getting 50 emails, you get 1 email saying "Server is dying (CPU high, RAM full)".
SSH for Logs
While centralized monitoring is the goal, you will still often need to SSH into machines to check files or debug "live" issues that haven't crashed the server yet.
# 1. Connect to the server
ssh -i my-key.pem ubuntu@1.2.3.4
# 2. View logs in real-time (tail)
pm2 logs --lines 100
# 3. Check system health
htop

Application Performance Monitoring (APM)
Host monitoring tells you if the server is slow. APM tells you which function in your code is slow. This is how you debug "Why is my API taking 5 seconds?".
Step 1: Get the Agent
- Go to APM & Services in New Relic.
- Click Add Data (or the + button).
- Select Node.js.
- Select On a host.
Step 2: Install & Configure
npm install newrelic

You need to configure the agent. You can do this via a newrelic.js file matching the keys from the UI, or use environment variables in your start script.

"start": "NEW_RELIC_APP_NAME=my-app NEW_RELIC_LICENSE_KEY=... node dist/index.js"

Step 3: Inject the Agent
This is the most common mistake. You MUST require New Relic as the very first line of your application.
import 'newrelic'; // <-- MUST BE LINE 1
import express from 'express';
const app = express();

Step 4: Verify
Run your app and generate some traffic.
# Install loadtest tool
npm i -g loadtest
# Send 200 requests per second
loadtest -c 10 --rps 200 http://localhost:3000/

Check the APM & Services page in New Relic. You should see your service name and data flowing in!
Logs in Context (Winston)
⚠️ Why are my logs empty?
If you go to the Logs tab right now, it will be empty. This is because:
- By default, the New Relic agent does not forward logs for security/cost reasons.
- We must explicitly enable Log Forwarding in our configuration.
🤔 Why not just console.log?
- 🚫 Performance Killer: In Node.js, console.log is a blocking (synchronous) operation. Use it too much, and your server stops handling requests while it writes to the screen.
- 🚫 Unsearchable: It outputs plain text strings. You can't say "Show me error logs where userId = 123".
- ✅ Winston is Better: It is asynchronous (non-blocking) and outputs Structured JSON, which New Relic can index and search instantly.
Step 1: Enable Forwarding
Add this environment variable to your start script to turn on the firehose.
"start": "... NEW_RELIC_APPLICATION_LOGGING_FORWARDING_ENABLED=true ..."Step 2: Install Winston
npm install winstonStep 3: The Code Setup
Configure Winston to output JSON. New Relic intercepts this automatically.
require('newrelic'); // Must be first! (use `import 'newrelic'` in ESM)
const winston = require('winston'); // or: import winston from 'winston';
const logger = winston.createLogger({
level: 'info',
format: winston.format.json(), // <-- Critical for New Relic
defaultMeta: { service: 'user-service' },
transports: [
new winston.transports.File({ filename: 'error.log', level: 'error' }),
]
});
// Add console logs for local dev only
if (process.env.NODE_ENV !== 'production') {
logger.add(new winston.transports.Console({
format: winston.format.simple(),
}));
}

Metrics on logs
You can add metrics on top of log counts (esp for errors) to catch if a certain error is being thrown too often/there is a spike.
SELECT count(`message`) FROM Log WHERE message LIKE '%error%'

Percentiles vs Averages
When measuring performance (response times, CPU usage), which metric should you use? Mean (Average) or Median?
The Example 🧮
Let's say in a 20-second interval, you receive 20 requests. Most respond quickly, but two outliers take 50 ms and 100 ms.

- Mean (average): 14.45 ms. Skewed high by the two outliers (50, 100).
- Median / percentiles: A more accurate representation of the "typical" user.
Why Percentiles (p95, p99)?
The average tells you nothing about the worst experience. In our example:
- 90% of your users had a response time of 13ms or less (p90).
- But 10% (the outliers) waited 50ms or 100ms.
If you only look at the average (14.45 ms), you might think "everything takes about 14 ms". You miss the fact that some users are suffering.
Optimize for p95 or p99. If your p95 is good, it means 95% of your users are happy. The average is for amateurs.
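To make this concrete, here is a sketch with hypothetical response times chosen to match the numbers above (mean 14.45 ms, p90 = 13 ms, outliers 50 and 100); the `percentile` helper uses the simple nearest-rank method:

```javascript
// Hypothetical response times (ms) for 20 requests: 18 fast ones,
// plus two outliers (50, 100) that drag the average up.
const times = [3, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 10, 11, 13, 13, 50, 100];

const mean = times.reduce((a, b) => a + b, 0) / times.length;

// Nearest-rank percentile: the value below which p% of samples fall.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[idx];
}

console.log(mean.toFixed(2));        // 14.45 — looks fine, hides the pain
console.log(percentile(times, 90));  // 13   — what 90% of users actually saw
console.log(percentile(times, 99));  // 100  — the worst-case tail
```

The average says "14 ms, all good" while p99 says "someone waited 100 ms" — which is exactly the signal the average destroys.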
NRQL for Percentiles
SELECT percentile(duration, 95, 99) FROM Transaction TIMESERIES