For truly highly available services running in Docker containers, we need an orchestration system. Docker Swarm mode (as redefined in Docker 1.13) is the simplest way to achieve redundancy: a single Docker host can be turned off without interrupting any of our services.
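A minimal swarm with that kind of redundancy is just a few commands. This is a sketch only - the IP address and hostnames are placeholders, and the join token comes from your own init output:

```shell
# On the first node: initialise swarm mode, advertising this node's own IP
docker swarm init --advertise-addr 192.168.0.81

# Print the command (including token) that other nodes run to join as managers
docker swarm join-token manager

# On each remaining node, paste the join command printed above, then verify:
docker node ls
```

With three manager nodes, any single host can go down and the remaining two keep quorum, so services are rescheduled rather than lost.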
You’re right. On my initial build, I had other problems which made me think that joining with the same token was the issue. I’ve since fixed the actual fault (in the VM hypervisor layer), and it was completely unrelated to the swarm-joining process. I’ve updated the recipe, thanks!
The registry doesn’t seem to be responding. I see this in my logs:
Oct 11 14:22:20 orange dockerd[1136]: time="2017-10-11T14:22:20.826479524-07:00" level=warning msg="Error getting v2 registry: Get https://registry-mirror.gerg.org/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Oct 11 14:22:20 orange dockerd[1136]: time="2017-10-11T14:22:20.826541119-07:00" level=info msg="Attempting next endpoint for pull after error: Get https://registry-mirror.gerg.org/v2/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Actually, either should work… if you use http://whatever:5000, you’re using swarm to load-balance you to the registry container. If you use https://whatever, and you’ve set up traefik, you’re using swarm to load-balance you to the traefik container, which in turn sends you to the registry container. The advantage of using traefik is that you get SSL encryption, which (IIRC) stops Docker complaining when it pulls images.
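To make the two paths concrete - hostnames here are placeholders, and the insecure-registries entry is only needed for the plain-HTTP route:

```shell
# Plain-HTTP path: swarm's routing mesh balances port 5000 straight to the
# registry container. The daemon must first be told the registry is insecure,
# by adding to /etc/docker/daemon.json:
#   { "insecure-registries": ["whatever:5000"] }
docker pull whatever:5000/alpine

# TLS path: traefik terminates SSL on 443 and proxies to the registry,
# so Docker trusts it like any public registry and no daemon config is needed
docker pull whatever/alpine
```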
I’ve successfully started my swarm and keepalived is running okay, but I can’t get docker-cleanup or shepherd to start… I suspect other swarm stacks will also fail. The error both of them throw from docker stack ps is:
starting container failed: error creating external connectivity network: cannot create network c2f3e64892b7e03d577134d397393158d82600c0ac8961086cab15590255d155 (docker_gwbridge): conflicts with network f129b0ff64507350ddcf73411395380704b8a8552721f75f108f8ef4afde4fda (docker_gwbridge): networks have same bridge name
All three swarm nodes error out with the same message for both docker-cleanup and shepherd, though the network IDs differ in each case.
For reference, I’m running CentOS 7 LXCs unconfined, inside Proxmox (PVE). I’m not sure whether this has to do with running in LXC - perhaps an AppArmor issue? Although I think unconfined disables that.
Yeah, actually docker_gwbridge doesn’t exist on any of my nodes - hey, at least it’s consistent!
$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
f0cbd0808230        bridge              bridge              local
dbeda93e303f        host                host                local
1s0sro0uxdfy        ingress             overlay             swarm
4c5fb74396ec        none                null                local
But it shows up as an interface on the machine:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
link/ether 02:42:c8:7c:cb:8e brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
link/ether 22:62:be:ef:d6:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 192.168.0.82/24 brd 192.168.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.0.80/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2062:beff:feef:d6b8/64 scope link
valid_lft forever preferred_lft forever
7: vetha15837b@if5: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN
link/ether 2a:dc:0d:7e:bb:f8 brd ff:ff:ff:ff:ff:ff link-netnsid 2
8: docker_gwbridge: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
link/ether 02:42:a3:4b:01:cd brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
valid_lft forever preferred_lft forever
Shouldn’t “docker_gwbridge” exist from the creation of the swarm? Perhaps that step errored out when I built mine. I did have to load “ip_vs” for keepalived to work, but I did that on the servers hosting the LXCs.
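Your symptoms (a docker_gwbridge kernel interface left over, but no matching network in docker network ls) match the “networks have same bridge name” conflict above. One commonly suggested recovery - a sketch, assuming the node can briefly leave and rejoin the swarm - is to remove the stale bridge and let Docker recreate it:

```shell
# Remove Docker's record of the network, if it exists at all
docker network rm docker_gwbridge

# Remove the orphaned kernel bridge interface that conflicts with it
ip link delete docker_gwbridge

# Restart the daemon; docker_gwbridge is recreated when the node
# (re)joins the swarm
systemctl restart docker
```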
No change. I’ve even rebooted the hosts just to make sure all the modules and things load properly. Keepalived can get stuck, but I think that’s a separate problem. It feels like there’s something limiting LXC’s ability to set up network interfaces, etc. I also thought it might be SELinux, but I have that disabled.
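Since LXC guests share the host kernel, the modules that swarm overlay networking and keepalived rely on must be loaded on the PVE host itself. A quick check - the module names below are the usual suspects, not an exhaustive list:

```shell
# On the Proxmox host: see which of the relevant modules are already loaded
lsmod | grep -E 'ip_vs|overlay|br_netfilter'

# Load any that are missing (-a loads several modules in one call)
modprobe -a overlay br_netfilter ip_vs
```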
My next guess is to delete the Docker swarm, re-init it, and watch the output closely. This is my first experience with Docker, so I’m still learning where the logs and errors go. docker ps --no-trunc has been handy.
If that doesn’t work, and barring anyone else’s suggestions on running inside LXC, I might try a set of VMs just to confirm my theory. But I’d really like to run in LXC to (1) contain the install for portability and (2) avoid the overhead of a VM. It seems most people running Docker in PVE are either running on the host or in a VM - of the few folks I’ve found talking about LXC, none mention Docker, and mostly they just disable AppArmor.
I’ve not played with LXCs much, although I have a DayJob™ colleague who loves them under Proxmox. Yeah, I’d try running meat-and-potatoes VMs to confirm whether LXC is what’s messing with you. You can turn on debugging for Docker by editing /etc/docker/daemon.json and setting "debug": true.
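For example - note this sketch overwrites daemon.json, so merge the key into any existing config instead:

```shell
# Enable daemon debug logging, then restart to pick up the change
echo '{ "debug": true }' > /etc/docker/daemon.json
systemctl restart docker

# Debug-level messages (including network setup) now appear in the daemon log
journalctl -u docker.service -f
```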