For truly highly available services running in Docker containers, we need an orchestration system. Docker Swarm mode (as redefined in Docker 1.13) is the simplest way to achieve redundancy: a single Docker host can be turned off without interrupting any of our services.
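A minimal swarm with that kind of redundancy is just a few commands. This is a sketch only - the IP address and hostnames are placeholders, and the join token comes from your own init output:

```shell
# On the first node: initialise swarm mode, advertising this node's own IP
docker swarm init --advertise-addr 192.168.0.81

# Print the command (including token) that other nodes run to join as managers
docker swarm join-token manager

# On each remaining node, paste the join command printed above, then verify:
docker node ls
```

With three manager nodes, any single host can go down and the remaining two keep quorum, so services are rescheduled rather than lost.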
You’re right. On my initial build, I had other problems which made me think that joining with the same token was the issue. I’ve since fixed the actual fault (in the VM hypervisor layer), and it was completely unrelated to the swarm-joining process. I’ve updated the recipe, thanks!
The registry doesn’t seem to be responding. I see this in my logs:
Oct 11 14:22:20 orange dockerd[1136]: time="2017-10-11T14:22:20.826479524-07:00" level=warning msg="Error getting v2 registry: Get https://registry-mirror.gerg.org/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Oct 11 14:22:20 orange dockerd[1136]: time="2017-10-11T14:22:20.826541119-07:00" level=info msg="Attempting next endpoint for pull after error: Get https://registry-mirror.gerg.org/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Actually, either should work… if you use http://whatever:5000, you’re using swarm to load-balance you to the registry container. If you use https://whatever, and you’ve set up traefik, you’re using swarm to load-balance you to the traefik container, which in turn sends you to the registry container. The advantage of using traefik is that you get SSL encryption, which (IIRC) stops Docker complaining when it pulls images.
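To make the two paths concrete - hostnames here are placeholders, and the insecure-registries entry is only needed for the plain-HTTP route:

```shell
# Plain-HTTP path: swarm's routing mesh balances port 5000 straight to the
# registry container. The daemon must first be told the registry is insecure,
# by adding to /etc/docker/daemon.json:
#   { "insecure-registries": ["whatever:5000"] }
docker pull whatever:5000/alpine

# TLS path: traefik terminates SSL on 443 and proxies to the registry,
# so Docker trusts it like any public registry and no daemon config is needed
docker pull whatever/alpine
```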
I’ve successfully started my swarm and keepalived is running okay, but I can’t get docker-cleanup or shepherd to start… I suspect other swarm stacks will also fail. The error both of them throw from docker stack ps is:
starting container failed: error creating external connectivity network: cannot create network c2f3e64892b7e03d577134d397393158d82600c0ac8961086cab15590255d155 (docker_gwbridge): conflicts with network f129b0ff64507350ddcf73411395380704b8a8552721f75f108f8ef4afde4fda (docker_gwbridge): networks have same bridge name
All three swarm nodes error out with the same message for both docker-cleanup and shepherd, though the network IDs differ in each case.
For reference, I’m running CentOS 7 LXCs unconfined, inside Proxmox (PVE). I’m not sure whether this has to do with running in LXC - perhaps an AppArmor issue? Although I think unconfined disables that.
Yeah, actually docker_gwbridge doesn’t exist on any of my nodes - hey, at least it’s consistent!
$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
f0cbd0808230        bridge              bridge              local
dbeda93e303f        host                host                local
1s0sro0uxdfy        ingress             overlay             swarm
4c5fb74396ec        none                null                local
But it shows up as an interface on the machine:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
link/ether 02:42:c8:7c:cb:8e brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
link/ether 22:62:be:ef:d6:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.0.82/24 brd 192.168.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.0.80/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2062:beff:feef:d6b8/64 scope link
valid_lft forever preferred_lft forever
7: vetha15837b@if5: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN
link/ether 2a:dc:0d:7e:bb:f8 brd ff:ff:ff:ff:ff:ff link-netnsid 2
8: docker_gwbridge: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
link/ether 02:42:a3:4b:01:cd brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
valid_lft forever preferred_lft forever
Shouldn’t “docker_gwbridge” exist from the creation of the swarm? Perhaps that step errored out when I built mine. I did have to load “ip_vs” for keepalived to work, but I did that on the servers hosting the LXCs.
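Your symptoms (a docker_gwbridge kernel interface left over, but no matching network in docker network ls) match the “networks have same bridge name” conflict above. One commonly suggested recovery - a sketch, assuming the node can briefly leave and rejoin the swarm - is to remove the stale bridge and let Docker recreate it:

```shell
# Remove Docker's record of the network, if it exists at all
docker network rm docker_gwbridge

# Remove the orphaned kernel bridge interface that conflicts with it
ip link delete docker_gwbridge

# Restart the daemon; docker_gwbridge is recreated when the node
# (re)joins the swarm
systemctl restart docker
```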
No change. I’ve even rebooted the hosts just to make sure all the modules and things load properly. Keepalived can get stuck, but I think that’s a separate problem. It feels like there’s something limiting LXC’s ability to set up network interfaces, etc. I also thought it might be SELinux, but I have that disabled.
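Since LXC guests share the host kernel, the modules that swarm overlay networking and keepalived rely on must be loaded on the PVE host itself. A quick check - the module names below are the usual suspects, not an exhaustive list:

```shell
# On the Proxmox host: see which of the relevant modules are already loaded
lsmod | grep -E 'ip_vs|overlay|br_netfilter'

# Load any that are missing (-a loads several modules in one call)
modprobe -a overlay br_netfilter ip_vs
```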
My next guess is to delete the Docker swarm, re-init it, and watch the output closely. This is my first experience with Docker, so I’m still learning where the logs and errors go. docker ps --no-trunc has been handy.
If that doesn’t work, and barring anyone else’s suggestions on running inside LXC, I might try a set of VMs just to confirm my theory. But I’d really like to run in LXC to (1) contain the install for portability and (2) avoid the overhead of a VM. It seems most people running Docker in PVE are either running on the host or in a VM - of the few folks I’ve found talking about LXC, none mention Docker, and mostly they just disable AppArmor.
I’ve not played with LXCs much, although I have a DayJob™ colleague who loves them under Proxmox. Yeah, I’d try running meat-and-potatoes VMs to confirm whether LXC is what’s messing with you. You can turn on debugging for Docker by editing /etc/docker/daemon.json and setting "debug": true.
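For example - note this sketch overwrites daemon.json, so merge the key into any existing config instead:

```shell
# Enable daemon debug logging, then restart to pick up the change
echo '{ "debug": true }' > /etc/docker/daemon.json
systemctl restart docker

# Debug-level messages (including network setup) now appear in the daemon log
journalctl -u docker.service -f
```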