Hallway troubleshooting

[image: crash]

At some point I bought this little 7” LCD screen for an FPGA project I never got started on, and it had been sitting around in a drawer till now. Turns out, combined with a wireless keyboard, it’s the perfect kit for emergency hallway troubleshooting when you mangle your network configuration and lose SSH access.


Thursday, February 06, 2025


Link Aggregation

[image: lacp1]

So after moving our server to the linen closet I realized it had a second NIC in back, which of course meant I had to use it. Now that I have control over the network infrastructure, I can set up link aggregation.

[image: lacp3]

Link aggregation allows you to bundle multiple physical connections between two devices into a single one. On Cisco IOS, you create a numbered port channel, and configure the physical interfaces to use the corresponding channel group, also specifying the mode. This process has always been annoying to me because instead of explicitly declaring the protocol – either Link Aggregation Control Protocol (LACP) or Port Aggregation Protocol (PAgP) – you imply it with the mode.
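For reference, a minimal IOS sketch of what that looks like (the interface names and trunk config are made up for illustration; the Port-channel interface gets created automatically once you apply the channel-group command):

interface range GigabitEthernet1/0/1 - 2
 channel-group 1 mode active
!
interface Port-channel1
 switchport mode trunk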

LACP’s modes are active and passive, while PAgP has a few more, though you’ll usually use auto or desirable. The only reason I can think of for these interchangeable and sometimes misleading keywords is cert-test purposes.

Fortunately, in Ubiquiti you click into this dropdown, which activates LACP (PAgP is Cisco proprietary).

[image: lacp4]

Between network devices it’s pretty easy since you just configure both sides the same way, but on the server side I finally had to reckon with one of my big blind spots – Linux networking. The configuration is scattered across so many commands and files that I don’t even really know how to write about them. But I guess I’ll try.

For many years, ifconfig was the go-to command for things like setting IP addresses and turning interfaces on and off (not to be confused with the ipconfig command in Windows). But you had to set your default gateway with the route add command. Both are now deprecated, and the best practice these days is to use ip addr and ip route. This makes googling answers a bit difficult because a good number of writeups still use ifconfig.
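Roughly the same task in both generations of tools, with a made-up interface and addresses:

# old way (net-tools, deprecated)
ifconfig eth0 192.168.1.50 netmask 255.255.255.0 up
route add default gw 192.168.1.1

# new way (iproute2)
ip addr add 192.168.1.50/24 dev eth0
ip link set eth0 up
ip route add default via 192.168.1.1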

And then for DNS you need to edit /etc/resolv.conf.
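A minimal example (the nameserver here is made up, and on a lot of modern distros this file is actually managed by systemd-resolved or NetworkManager, so hand edits can get overwritten):

# /etc/resolv.conf
nameserver 192.168.1.1
search home.lan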

By default there’s the networking service that handles all this to a certain extent, but you can also install NetworkManager, which runs alongside networking and lets you configure things with the nmcli command.
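A couple of read-only nmcli commands are handy for seeing what NetworkManager thinks is going on before touching anything:

nmcli device status
nmcli connection show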

There’s even more and I’m not even going to get into wireless, but nmcli did most of the heavy lifting here. I was going by this guide from Red Hat. Here, I think “bond” is the term for port channel or etherchannel or whatever you want to call it (I’m sure there are some very important distinctions depending on how pedantic you want to get about it).

nmcli conn add type bond con-name po1 ifname po1

The above makes a logical port arbitrarily named po1 (this is just what it would have been called on a Cisco device). Then you need to add your physical interfaces to it:

nmcli conn add type ethernet ifname eno1 master po1
nmcli conn add type ethernet ifname enp110s0 master po1

The two interfaces on the server are named, inexplicably, eno1 and enp110s0. At this point I lost connectivity to the server and had to set up the world’s most annoying crash cart (more on this in a later post).

The two NICs had link lights flashing, but I couldn’t get any layer 3 traffic through. It turned out I needed to enable LACP on the interfaces with:

nmcli conn modify po1 bond.options "mode=802.3ad,lacp_rate=slow"
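As far as I can tell the bond options only take effect when the connection activates, so bounce it and then check that the negotiation actually happened (the names match the bond created above):

nmcli conn up po1
cat /proc/net/bonding/po1    # look for the 802.3ad section and a partner MAC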

And now we have a 2-gigabit uplink for this server. I have no idea what the point of all this was. Also, it broke VMware for a minute because I needed to specify the new interface for it to use.


Sunday, January 26, 2025


Bird is not ready.

WHAT THE HELL IS BIRD.

Readiness probe failed: calico/node is not ready: BIRD is not ready: 
Error querying BIRD: unable to connect to BIRDv4 socket: 
dial unix /var/run/calico/bird.ctl: connect: connection refused

Followup: I did a bunch of network futzing on the node, and broke pretty much everything, and after I got it up again, Calico stayed broken with this error message. A reboot cleared it up and I never figured out what BIRD was, but I kind of want to remain ignorant of this one thing because it makes error querying bird/bird is not ready way funnier.


Saturday, January 25, 2025


It was DNS

❯ nslookup minecraft-0.minecraft.tristan
Server:		192.168.1.1
Address:	192.168.1.1#53

Name:	minecraft-0.minecraft.tristan
Address: 10.100.0.1

Had a fun breakthrough this morning on the cluster. I had been trying for a while to expose CoreDNS to the home network. Like anything DNS, there were a number of hurdles to clear along the way.

Until now I had been adding static domain entries at the gateway, which works fine but just feels like too much of an easy way out. Also, having to ask for a new entry every time you pop open a new service is a bit of a pain, especially when you’re just messing around and have no idea which ones you’re even going to keep in the end.

So, the first order of business is CoreDNS. It’s already dynamically providing DNS inside the cluster. It’s exposed as a ClusterIP service:

❯ kubectl -n kube-system get svc
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP
kube-dns       ClusterIP      172.17.0.10      <none>     

You can query it from your node, and it’ll give you a response. If you have a service exposed, you should be able to find a record for service_name.namespace.svc.cluster.local.

❯ nslookup minecraft-0.minecraft.svc.cluster.local 172.17.0.10
Server:		172.17.0.10
Address:	172.17.0.10#53

Name:	minecraft-0.minecraft.svc.cluster.local
Address: 172.17.236.254

Now we have a few problems. First, CoreDNS is exposed via a ClusterIP (here, 172.17.0.10), which is only accessible from inside the cluster. Second, if you query it, it will give you another 172 address, which again is no good from outside the cluster! Also, the .svc.cluster.local domain is kinda clunky.

I came across this Reddit post, which also covers MetalLB, but I just picked out what I needed for CoreDNS, namely setting up the k8s_external plugin, which is a matter of editing its configmap with kubectl -n kube-system edit cm coredns and adding:

k8s_external tristan {
    headless
}
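For context, that stanza ends up nested inside the existing Corefile block in the configmap, something like this (everything besides the k8s_external bit is just whatever your cluster already ships with, so treat the surrounding plugins as an assumption):

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    k8s_external tristan {
        headless
    }
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}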

I added the headless keyword because we have a number of StatefulSets and those are headless. The above entry adds the tristan domain to CoreDNS, and when it is queried for service_name.namespace.tristan, it will give the external IP addresses:

❯ nslookup minecraft-0.minecraft.tristan 172.17.0.10
Server:		172.17.0.10
Address:	172.17.0.10#53

Name:	minecraft-0.minecraft.tristan
Address: 10.100.0.1

Progress! However, we still have that 172 address for CoreDNS itself, so we’ll need to expose that through MetalLB:

apiVersion: v1
kind: Service
metadata:
  name: kube-dns-ext
  namespace: kube-system
  annotations:
    metallb.universe.tf/allow-shared-ip: "DNS"
spec:
  type: LoadBalancer
  ports:
  - port: 53
    name: "udp"
    targetPort: 53
    protocol: UDP
  - port: 53
    name: "tcp"
    targetPort: 53
    protocol: TCP
  selector:
    k8s-app: kube-dns
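Apply that (assuming it’s saved as kube-dns-ext.yaml) and the new service picks up an address from the MetalLB pool:

kubectl apply -f kube-dns-ext.yaml
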
❯ kubectl get svc
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP  
kube-dns       ClusterIP      172.17.0.10      <none>      
kube-dns-ext   LoadBalancer   172.17.233.207   10.100.0.5 

So, now we can query a useful IP (here, 10.100.0.5) and get useful responses!

❯ nslookup minecraft-0.minecraft.tristan 10.100.0.5
Server:		10.100.0.5
Address:	10.100.0.5#53

Name:	minecraft-0.minecraft.tristan
Address: 10.100.0.1

There was more work to do at this point though, because I needed to set up the home gateway to forward queries for the tristan domain to CoreDNS. Coincidentally, Ubiquiti just added that functionality in their latest update:

[image: dns]
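That screenshot is basically a UI for a conditional forward. Purely as an illustration (I’m not sure what the UniFi gateway actually runs under the hood), the dnsmasq equivalent would be a single line pointing the tristan domain at the MetalLB address from above:

server=/tristan/10.100.0.5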

So now we don’t even need to specify CoreDNS and can just send queries to our gateway:

❯ nslookup minecraft-0.minecraft.tristan
Server:		192.168.1.1
Address:	192.168.1.1#53

Name:	minecraft-0.minecraft.tristan
Address: 10.100.0.1

Monday, January 06, 2025


Virtual Realty

I have a confession: I skipped pretty much all of hardware virtualization and dove straight into containerization and orchestration. I had been doing just fine till now. My ENCOR Lab Manual arrived.

[image: lab manual]

I had purchased a license to Cisco Modeling Labs during Black Friday and hadn’t really used it till now. You’ve got a few options – namely bare-metal or VM. I do have a few viable machines for bare metal installation, but that would have been too easy, right?

VMware Workstation was easy enough to get running on Nunu, our shared family server. I did run into a day’s worth of issues because I was trying to get it working through SSH X11 tunneling. The CML image wanted some swap space, which, unfortunately, is disabled on Nunu for the sake of Kubernetes. The alternative was to enable reserved memory from the host, which required root permissions.

This was where I ran into the “never run a GUI as root” issue because VMware Workstation has a graphical console. I still don’t fully get what was going on, but I think that because I was SSH’ed in, pkexec was losing the DISPLAY variable. Eventually I installed TigerVNC on Nunu and my laptop and VNCed in as myself, which let me run it with the right privileges.

[image: vmware]

CML is now working properly and I finally get to do something with it. Things like these are why I try to keep as close as possible to CLI and text-based work. Whenever there’s an environment to build out and work in, I always find myself fighting with the environment instead of working with it.

Anyway, time to stop whining and get to work. I’ve got these in my .zshrc to bring it up and down:

alias cml-start='ssh nunu vmrun -T player start /home/jay/vmware/cml2_p_2.8.0-6_amd64-32/cml2_p_2.8.0-6_amd64-32.vmx nogui'
alias cml-stop='ssh nunu vmrun -T player stop /home/jay/vmware/cml2_p_2.8.0-6_amd64-32/cml2_p_2.8.0-6_amd64-32.vmx nogui'

Saturday, December 21, 2024