DevOps & Cloud Engineering / Lesson 3 — Networking Essentials

Networking Essentials

IPs, ports, DNS, VPCs, firewalls — the network primitives every cloud engineer needs in their bones.


Why Networking Bites Cloud Engineers Hardest

The classic developer-to-DevOps trajectory hits a wall at networking. Code on your laptop talks to localhost without any thought. Cloud systems span dozens of machines across regions, and almost every "it doesn't work" debugging session eventually traces back to a network issue.

Networking knowledge isn't optional. Every cloud platform — AWS, GCP, Azure — is fundamentally just "rentable networks with some compute attached." Understand the network primitives and the rest of cloud is logical. Skip them, and cloud feels like magic that occasionally curses at you.

This lesson is the foundation. Refer back to it when you're knee-deep in a VPC peering issue at midnight.


IP Addresses — The Basic Identity

Every device on a network needs an address. IPv4 addresses are 32 bits, written as four 8-bit numbers separated by dots:

Text
192.168.1.10
↑   ↑   ↑ ↑
each is 0-255 (8 bits)

The address has two logical parts: the network portion (which network this device is on) and the host portion (which device within that network). The split is determined by a subnet mask.

CIDR notation (most common in modern systems): 192.168.1.0/24 means "the first 24 bits identify the network, the last 8 bits identify hosts." That gives 256 addresses (256 = 2^8) in the /24 — though 2 are reserved for network and broadcast, leaving 254 usable.

Text
192.168.1.0/24          254 usable hosts
                         (network: .0, broadcast: .255)

10.0.0.0/16             65,534 usable hosts

10.0.0.0/8              16+ million addresses

Reserved (private) IP ranges — used inside private networks, NOT routable on the public internet:

Text
10.0.0.0/8              10.0.0.0    – 10.255.255.255   (16M addresses)
172.16.0.0/12           172.16.0.0  – 172.31.255.255    (1M addresses)
192.168.0.0/16          192.168.0.0 – 192.168.255.255  (65K addresses)

Cloud VPCs almost always use ranges from 10.0.0.0/8. Your home WiFi probably uses 192.168.0.0/16. Docker uses 172.17.0.0/16 by default.
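
You can sanity-check CIDR math from the shell with Python's standard ipaddress module rather than doing the binary arithmetic by hand. A minimal sketch (the addresses below are just examples):

Bash
# How many usable hosts does a /16 have? (subtract network + broadcast)
python3 -c "import ipaddress; n = ipaddress.ip_network('10.0.0.0/16'); print(n.num_addresses - 2)"
# → 65534

# Is this address inside that block?
python3 -c "import ipaddress; print(ipaddress.ip_address('10.0.1.42') in ipaddress.ip_network('10.0.0.0/16'))"
# → True

# Is a range private (RFC 1918)?
python3 -c "import ipaddress; print(ipaddress.ip_network('192.168.1.0/24').is_private)"
# → True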

A cloud server typically has at least two IP addresses:
• Private IP — assigned by the VPC, used for traffic inside the network
• Public IP (if applicable) — what the internet sees

These don't have to be related. Your EC2 instance might have private IP 10.0.1.42 and public IP 54.182.91.17. The cloud's network gateway translates between them via NAT.
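
On an EC2 instance you can see both addresses through the instance metadata service. A quick sketch using IMDSv2 (the token TTL is arbitrary; the public-ipv4 path only exists if the instance actually has a public IP):

Bash
# Get a short-lived IMDSv2 token first
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

# Private IP (assigned by the VPC)
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/local-ipv4

# Public IP (only present if the instance has one)
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/public-ipv4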

IPv6 — newer, vastly larger address space (128 bits). Modern systems support it but most cloud setups still use IPv4 for primary addressing. We'll stick to IPv4 throughout this series unless noted.


Ports — Multiplexing the IP

A single IP address can host many services. Ports distinguish them.

A port is a 16-bit number (0–65535) that identifies a specific service on a host. When you connect to example.com:80, you're asking for port 80 on that IP.

Text
A connection (a "socket pair") is identified by the 5-tuple:
   (protocol, source_ip, source_port, dest_ip, dest_port)
   
   e.g. TCP, 192.168.1.10:54321 → 142.250.190.78:443

Well-known ports (0–1023) and other common service ports:

Text
20, 21    FTP                 22     SSH
23        Telnet              25     SMTP (email send)
53        DNS                 80     HTTP
110       POP3                143    IMAP
443       HTTPS               465    SMTPS
587       SMTP submission     993    IMAPS
3306      MySQL               5432   PostgreSQL
6379      Redis               9092   Kafka
27017     MongoDB

Privileged ports (< 1024) require root to bind on Linux. That's why nginx runs as root briefly to bind port 80, then drops privileges. Or why Node.js apps usually listen on port 3000+ and a reverse proxy handles the public 80/443.

How a connection works mechanically:

Text
1. Server "binds" to a port and "listens"
   $ python -m http.server 8080
   → process now owns port 8080 on this host

2. Client picks an ephemeral port and connects
   browser → 192.168.1.50:54321 → 142.250.190.78:80
   ephemeral port (random in 32768-60999 range on Linux)

3. Both sides identify the connection by the 5-tuple
   so the OS can route incoming bytes to the right process

4. When done, both sides close the connection

Useful commands for ports:

Bash
ss -tlnp                       # listening TCP ports + which process
ss -tnp                        # established TCP connections
lsof -i :8080                  # what's using port 8080
netstat -tlnp                  # older alternative to ss
nc -zv example.com 443         # is port 443 reachable?
nmap -p 22,80,443 host         # port scan (use carefully)
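
To see the bind/listen/connect cycle end to end, a throwaway server on any Linux box is enough. A minimal sketch (port 8080 is arbitrary):

Bash
python3 -m http.server 8080 &                  # step 1: bind + listen
ss -tlnp | grep ':8080'                        # step 2: kernel shows the listener and the owning PID
curl -s -o /dev/null http://localhost:8080/    # step 3: client connects from an ephemeral port
kill %1                                        # stop the throwaway server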

Common mistake: assuming a port is closed when actually a firewall is dropping packets. From a debugging standpoint:
• Connection refused → host is up, nothing listening on that port
• Connection timeout → packets dropped (firewall, network issue)
• Connection reset → something accepted, then immediately closed
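
You can usually tell which case you're in straight from the shell. An illustrative sketch, assuming nothing listens on local port 9999 and 10.0.5.99 stands in for a host behind a firewall:

Bash
# Host up, nothing listening → fails immediately with "Connection refused"
nc -zv localhost 9999

# Packets silently dropped → hangs until the timeout
nc -zv -w 5 10.0.5.99 5432      # -w 5: give up after 5 seconds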


DNS — The Internet's Phone Book

We covered DNS in Backend Module 25 from a request-flow angle. Here we'll go deeper from a DevOps perspective — DNS records, propagation, TTLs, and the operational concerns that bite cloud engineers.

DNS is a distributed hierarchical database that maps names to records. The key record types you'll encounter:

Text
A           Maps a name to an IPv4 address
            example.com  →  192.0.2.42

AAAA        Same, but IPv6
            example.com  →  2001:db8::42

CNAME       Alias — "this name is the same as that name"
            www.example.com  →  example.com
            (cannot exist alongside other records on the same name)

MX          Mail server for this domain
            example.com  →  10 mail.example.com (priority 10)

TXT         Arbitrary text — used for SPF, DKIM, domain verification
            example.com  →  "v=spf1 include:_spf.google.com ~all"

NS          Authoritative name servers for this domain

SOA         Start of authority — domain metadata

PTR         Reverse DNS — IP to name (for verification, mostly)

SRV         Service records — host+port for a specific service
            (used by some protocols like SIP, AD, Kubernetes services)

Important: a CNAME cannot coexist with other records on the same name. This is why you can't point example.com (the apex/root) at something.cloudfront.net directly with CNAME — the apex needs SOA and NS records. Cloud providers solve this with "ALIAS" or "ANAME" records (AWS Route 53 ALIAS, Cloudflare CNAME flattening) which work at the DNS provider level.

TTLs — Time To Live, the killer concept

Every DNS record has a TTL (in seconds): "you can cache this answer for this long before re-asking." This sounds simple but causes endless DevOps pain.

Text
Low TTL (60 seconds):     Quick to update, but high query load on DNS
Medium TTL (300 seconds): Common default
High TTL (86400 = 1 day): Cheap, but takes a day for changes to propagate

Production planning rule: BEFORE you migrate (e.g., changing a domain to point at a new server), reduce the TTL to something low like 60s several days in advance. After the migration is stable, raise it back. This way the actual switchover propagates fast.
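
You can watch a resolver's cache counting down: the second column of a dig answer is the remaining TTL in seconds. Illustrative output only; your names, numbers, and IPs will differ:

Bash
dig +noall +answer example.com
# example.com.   287   IN   A   192.0.2.42
#                ↑ seconds this resolver will keep serving the cached answer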

Resolvers, recursion, propagation:

When your laptop looks up api.example.com, it asks a resolver (typically your ISP's, or 8.8.8.8 from Google, or 1.1.1.1 from Cloudflare). The resolver:
1. Checks its cache. Hit → done.
2. If no cache, asks the root → .com TLD → example.com authoritative server.
3. Caches the answer for the TTL.

"DNS propagation" is a bit of a myth — there's no global broadcast. What's actually happening: each resolver caches independently, and they update only when their cache expires. So a change you make in DNS appears INSTANTLY at resolvers that didn't have a cached entry, and might take up to TTL seconds at resolvers that did.

Operational DNS commands:

Bash
dig example.com                       # full lookup detail
dig example.com +short                # just the answer
dig example.com @8.8.8.8              # ask a specific resolver
dig MX example.com                    # specific record type
dig +trace example.com                # show the full delegation chain
host example.com                      # simple
nslookup example.com                  # also simple

Common DNS gotchas in cloud:
• You created a CNAME for api.example.com → xxx.elb.amazonaws.com but it doesn't resolve. Wait — the load balancer's DNS name might be regional and need a moment.
• You're using a private hosted zone in AWS but querying from outside the VPC. Won't resolve.
• You have two A records on the same name. DNS returns both, typically rotating the order per query (this is "round-robin DNS" — primitive load balancing).
• Negative caching: "this name doesn't exist" gets cached too, often for the SOA record's negative TTL, which can be hours (the commands below show how to check it).
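
When a record change seems "stuck", compare the authoritative answer with what resolvers still have cached. A sketch; ns1.example.com is a placeholder for whatever the NS query actually returns:

Bash
# Find the domain's authoritative name servers
dig +short example.com NS

# Ask one of them directly (bypasses all resolver caches), then ask a public resolver's cache
dig +short example.com @ns1.example.com
dig +short example.com @8.8.8.8

# The negative-caching TTL is the last field of the SOA record
dig +short example.com SOA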


VPCs — Your Own Private Network in the Cloud

A VPC (Virtual Private Cloud) is exactly what it sounds like: a logically isolated network inside the cloud provider's infrastructure, that you control. Every AWS account has at least one VPC; same with GCP. EC2 instances and other resources live inside a VPC.

A VPC has:
• A CIDR block (private IP range) — e.g. 10.0.0.0/16
• Subnets within that CIDR — segments of the address space
• Route tables — rules for where traffic goes
• Gateways — connections to the outside world
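
In AWS terms, creating those pieces by hand looks roughly like this. A sketch with placeholder IDs; in practice you'd use Terraform or CloudFormation rather than raw CLI calls:

Bash
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# Carve a /24 subnet out of it, pinned to one availability zone
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
  --cidr-block 10.0.1.0/24 --availability-zone us-east-1a

# Attach an internet gateway and route public traffic through it
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id igw-0123456789abcdef0 \
  --vpc-id vpc-0123456789abcdef0
aws ec2 create-route --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 0.0.0.0/0 --gateway-id igw-0123456789abcdef0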

Subnets — public vs private

A subnet is a slice of the VPC's CIDR, tied to one availability zone (AZ). Subnets are typically classified by their internet routing:

Text
Public subnet  — has a route to an Internet Gateway
                 instances here CAN talk to the internet
                 used for: load balancers, bastion hosts, public APIs

Private subnet — no direct route to the internet
                 instances here CAN'T be reached from the internet directly
                 used for: databases, internal services, app servers
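
The only real difference between the two is the subnet's route table. Illustrative layout; the gateway IDs are placeholders:

Text
Public subnet route table:
  10.0.0.0/16    → local        (traffic within the VPC)
  0.0.0.0/0      → igw-...      (everything else → Internet Gateway)

Private subnet route table:
  10.0.0.0/16    → local
  0.0.0.0/0      → nat-...      (everything else → NAT gateway, outbound only)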

A common 3-tier VPC layout looks like this:

Text
   VPC: 10.0.0.0/16  (in us-east-1, 3 AZs)
   ┌─────────────────────────────────────────────────────┐
   │                                                     │
   │  AZ-a                AZ-b                AZ-c       │
   │  ┌──────────┐        ┌──────────┐        ┌────────┐ │
   │  │ Public   │        │ Public   │        │ Public │ │
   │  │ subnet   │        │ subnet   │        │ subnet │ │
   │  │10.0.1.0  │        │10.0.2.0  │        │10.0.3.0│ │
   │  │ /24      │        │ /24      │        │ /24    │ │
   │  └──────────┘        └──────────┘        └────────┘ │
   │       ↑                   ↑                   ↑    │
   │  Load balancers, bastion, NAT gateway              │
   │                                                     │
   │  ┌──────────┐        ┌──────────┐        ┌────────┐ │
   │  │ App      │        │ App      │        │ App    │ │
   │  │ private  │        │ private  │        │ private│ │
   │  │10.0.11.0 │        │10.0.12.0 │        │10.0.13 │ │
   │  └──────────┘        └──────────┘        └────────┘ │
   │  EC2/ECS/Pods — talk to internet via NAT            │
   │                                                     │
   │  ┌──────────┐        ┌──────────┐        ┌────────┐ │
   │  │ Data     │        │ Data     │        │ Data   │ │
   │  │ private  │        │ private  │        │ private│ │
   │  │10.0.21.0 │        │10.0.22.0 │        │10.0.23 │ │
   │  └──────────┘        └──────────┘        └────────┘ │
   │  RDS, ElastiCache — no internet access at all       │
   │                                                     │
   └─────────────────────────────────────────────────────┘

The pattern: public-facing things in public subnets, application code in private subnets that can reach OUT to the internet via NAT but not be reached, databases in fully isolated subnets.

NAT Gateway — outbound-only internet for private subnets

A private subnet has no internet route — but your app needs to call third-party APIs, pull container images, etc. The solution is a NAT (Network Address Translation) gateway in a public subnet. Private subnet routes outbound traffic through the NAT, which forwards it to the internet using its own public IP. Inbound from the internet still doesn't work — NAT is one-way for security.

Cost gotcha: NAT gateways cost money per hour AND per GB processed. A chatty app talking to S3 through NAT can rack up surprising bills. Use VPC endpoints (covered below) to talk to AWS services without traversing the NAT.


Security Groups, NACLs, and Firewalls

Two layers of network filtering in AWS (and conceptually similar in other clouds):

Security Groups — instance-level, stateful, allow-only

A security group is a set of allow rules attached to instances (or load balancers, or RDS, etc.). Each rule says "allow traffic FROM this source ON this port FOR this protocol."

Key properties:
• ALLOW-ONLY: there are no deny rules. If no rule matches, traffic is dropped.
• STATEFUL: if you allow inbound on port 443, the response goes back automatically. You don't need a separate outbound rule.
• Default outbound: by default, all outbound is allowed.
• Sources can be CIDR blocks OR other security groups. The latter is powerful: "allow traffic from any instance in the app-sg security group" expresses an intent without hardcoding IPs.

Example: a typical web tier

Text
Web Server Security Group:
  Inbound:
    - 80, 443 from 0.0.0.0/0           (anyone can hit the website)
    - 22 from 10.0.0.0/16              (SSH only from inside the VPC)
  Outbound:
    - all traffic anywhere             (default)

Database Security Group:
  Inbound:
    - 5432 from web-server-sg          (only the web tier can connect)
  Outbound:
    - all                              (default)

This is the common "tier" pattern. The DB security group never opens anything to the internet; it only references other security groups.
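
Expressed with the AWS CLI, the database rule above might look like this. A sketch only; the group IDs are placeholders:

Bash
# Allow PostgreSQL into the DB tier, but only from instances in the web-server SG
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaaaaaaaaaaaaaa \
  --protocol tcp --port 5432 \
  --source-group sg-0bbbbbbbbbbbbbbbb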

NACLs (Network ACLs) — subnet-level, stateless, allow OR deny

NACLs are a second filter, applied at the subnet level. They're:
• Stateless: you must explicitly allow both inbound AND the response in outbound (and vice versa)
• Allow OR deny rules
• Numbered: rules evaluated in order

Most teams leave NACLs at the default "allow all" and rely entirely on security groups. NACLs are useful for blanket bans (e.g. "block this attacker IP at the subnet level"). Don't use them as your primary security control — security groups are more flexible.

A connection has to pass BOTH layers: NACL inbound, NACL outbound, AND security group rules. If any layer drops it, no connection.

Cloud firewalls vs Linux iptables

The cloud's firewall (security groups) is configured via the API and applied at the hypervisor level — your instance never sees the dropped packets. Linux's iptables/nftables runs inside the OS and gives finer control but is often redundant in a properly-configured VPC.

For most cloud workloads, you don't manage iptables directly. You configure security groups. iptables is more relevant for on-prem or container networking (Kubernetes uses iptables under the hood for service routing).


Load Balancers — Distributing Traffic

A load balancer accepts connections on behalf of a group of backend servers and distributes them. We covered this conceptually in Backend Module 33 (API Gateway / Reverse Proxy); here we focus on cloud load balancer specifics.

Three layers / types in AWS terminology (others are similar):

Text
Layer 4 (Network Load Balancer / NLB)
  Operates on TCP/UDP — doesn't understand HTTP
  Extremely fast, can handle millions of requests/sec
  Use for: non-HTTP protocols, ultra-high throughput, static IP needed

Layer 7 (Application Load Balancer / ALB)
  Understands HTTP(S) — can route by path, header, hostname
  Supports WebSockets, HTTP/2
  Use for: most web traffic — the default choice

Classic Load Balancer (CLB) — legacy, generally use ALB or NLB instead.

GCP equivalents:
  Network Load Balancer    ≈ AWS NLB
  HTTP(S) Load Balancer    ≈ AWS ALB (with global anycast — really nice)
  Internal Load Balancer   for traffic inside the VPC

How a request flows through an ALB:

Text
Client (browser)
    │
    ▼
DNS → ALB DNS name (alb-xxxxx.us-east-1.elb.amazonaws.com)
    │
    ▼
ALB receives connection
    │
    ▼
ALB applies LISTENER RULES:
  IF host = api.example.com THEN forward to target-group-api
  IF path = /admin/*        THEN forward to target-group-admin
  ELSE                            forward to target-group-web
    │
    ▼
Target group has multiple TARGETS (EC2 instances, IPs, Lambda)
ALB picks one based on routing algorithm (round-robin default)
    │
    ▼
Health checks ensure only healthy targets receive traffic
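
Those listener rules are just configuration. With the AWS CLI, a host-based rule might look like this (a sketch; the ARNs are placeholders and the priority is arbitrary):

Bash
aws elbv2 create-rule \
  --listener-arn arn:aws:elasticloadbalancing:...:listener/app/my-alb/... \
  --priority 10 \
  --conditions Field=host-header,Values=api.example.com \
  --actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...:targetgroup/api/...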

Health checks — critical concept. The load balancer periodically pings each target on a configurable path/port. Targets that fail are removed from rotation until they pass again. Without good health checks:
• A broken instance keeps receiving traffic and returning errors
• A deploying instance gets traffic before it's ready

Get health checks right or you'll have a bad time. Common pattern:
• /health returns 200 if the app is alive
• /ready returns 200 only when the app is fully initialized AND can reach its database
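
On the ALB side, the health check is target-group configuration. Tightening it might look like this sketch (the ARN is a placeholder; intervals and thresholds are illustrative):

Bash
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/web/... \
  --health-check-path /ready \
  --health-check-interval-seconds 15 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200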

TLS termination — the load balancer decrypts HTTPS, talks plain HTTP to backends. Saves CPU on backends, lets you manage certificates centrally (use AWS Certificate Manager — free, auto-renewing, integrates with ALB).

Sticky sessions vs stateless — most apps should be stateless (cookies/JWTs carry state), so any backend can serve any request. If you need stickiness (legacy apps with in-memory session state), ALB supports it via cookies. But the better long-term answer is to make your app stateless.


VPC Peering, Transit Gateway & Endpoints

As your cloud footprint grows, you'll need networks to talk to each other. Three common cloud-native patterns:

VPC Peering — direct one-to-one connection between two VPCs

Text
   VPC A (10.0.0.0/16) ◄────peering────► VPC B (10.1.0.0/16)

Once peered, instances in VPC A can reach instances in VPC B by their private IPs. CIDRs must not overlap. Peering is non-transitive: A↔B + B↔C does NOT mean A can reach C. For more than a handful of VPCs, peering becomes a maintenance nightmare ("full mesh" — N(N-1)/2 connections).

Transit Gateway (AWS) / Network Connectivity Center (GCP) — hub-and-spoke

Text
                ┌──────────────────────┐
                │  Transit Gateway     │
                └─────────┬────────────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
        ┌───▼──┐      ┌───▼──┐      ┌───▼──┐
        │ VPC  │      │ VPC  │      │ VPC  │
        │  A   │      │  B   │      │  C   │
        └──────┘      └──────┘      └──────┘

All VPCs connect to a central transit gateway. VPCs can talk to each other through it. Massively simpler at scale. Also connects to on-premises via VPN or Direct Connect.

VPC Endpoints — talk to AWS services privately

By default, your private subnet calls AWS APIs (S3, DynamoDB, etc.) by going OUT through the NAT gateway, OUT to the internet, and back IN to AWS's public APIs. That:
• Costs NAT gateway data charges
• Adds latency
• Sends sensitive data over the internet (encrypted, but still)

VPC endpoints let traffic stay inside AWS's network:

Text
Gateway Endpoints (S3, DynamoDB) — free
   Add a route in your route table: traffic for S3 goes via the endpoint
   
Interface Endpoints (most other services) — costs per hour + per GB
   Adds an ENI (network interface) in your subnet
   You connect to it as if it were the AWS service

For any production workload using S3 heavily from a private subnet: use a gateway endpoint. It's free and saves real money.
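
Creating the S3 gateway endpoint is one call plus a route-table association. A sketch; the IDs are placeholders and the service name is region-specific:

Bash
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-0123456789abcdef0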

VPN and Direct Connect — connecting to on-premises

For hybrid setups, you connect your on-premises network to the cloud:
• VPN: encrypted tunnel over the internet. Cheap, easy to set up, modest throughput.
• Direct Connect (AWS) / Cloud Interconnect (GCP): a dedicated physical fiber connection. Expensive, high throughput, low latency. Used by serious enterprises.

Both terminate at a virtual gateway in your VPC, which is then routed to relevant subnets.


What to Carry Forward

The mental model that will save you time on every cloud problem:

Text
1. Every server has a private IP. Some also have public IPs.

2. Connections are 5-tuples. To debug "X can't reach Y", ask:
   - Is Y listening on the right port?
   - Does the route table allow the path?
   - Does Y's security group allow X (by IP or by SG)?
   - Does the NACL allow it?
   - Is there a NAT/proxy in the way?

3. DNS is a hierarchical cache. Things take TTL seconds to propagate.

4. VPCs are private networks; subnets are slices; route tables decide
   what's public vs private.

5. Security groups are stateful allow-only. NACLs are stateless
   allow-or-deny. Use SGs as primary control.

6. Load balancers solve scale and HA but only when health checks
   are accurate. Get those right.

7. NAT gateways cost money. Use VPC endpoints for AWS service traffic.

When something doesn't work in the cloud, walk through this list. It's almost always one of these layers.
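
In practice, walking the layers often reduces to a handful of commands. A sketch; the port, IP, and resource IDs are placeholders:

Bash
# On Y: is anything actually listening on the port?
ss -tlnp | grep ':5432'

# From X: refused vs timeout tells you whether it's the app or the network path
nc -zv -w 5 10.0.21.15 5432

# Which security groups are attached to Y, and what do they allow?
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].SecurityGroups'
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions'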

The next lessons build on this foundation: shell scripting for automation, Git for versioning, then CI/CD pipelines that orchestrate the whole flow.


⁂ Back to all modules