[Archived] Shared Storage (Ceph) Jewel

I just tried to replicate this on one of my Atomic VM hosts… I added a 10G disk (/dev/vdb), and went through adding a mon and an OSD. The OSD died at this point:

2017-09-29 06:42:46 /entrypoint.sh: Regarding parted, device /dev/vdb is inconsistent/broken/weird.
2017-09-29 06:42:46 /entrypoint.sh: It would be too dangerous to destroy it without any notification.
2017-09-29 06:42:46 /entrypoint.sh: Please set OSD_FORCE_ZAP to '1' if you really want to zap this disk.
[root@ds1 ~]#

Which, I guess, is expected: zapping is required. Now, /dev/vdb is the block device as my OS sees it:

[root@ds1 ~]# lsblk | grep vdb
vdb 252:16 0 10G 0 disk
[root@ds1 ~]#

So, this all looks good.
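Re-running the OSD container with OSD_FORCE_ZAP set (as the entrypoint suggests) gets past this. Roughly like this - a sketch only, since the other flags should match whatever your recipe already uses; OSD_DEVICE=/dev/vdb is this particular disk, and setting the zap flag will wipe it:

docker run -d --net=host --privileged=true \
--name ceph-osd \
-v /etc/ceph:/etc/ceph \
-v /var/lib/ceph:/var/lib/ceph \
-v /dev:/dev \
-e OSD_DEVICE=/dev/vdb \
-e OSD_FORCE_ZAP=1 \
ceph/daemon osd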

Can you post the output of docker logs ceph-osd?

2 questions:

#1 - Is /dev/nvme1n1 the entire block device, or just a partition? (It needs to be the entire device)

#2 - Have you tried docker exec -it ceph-osd bash, and confirmed that /dev/nvme1n1 does exist within the container?

#1. It’s the entire device.

#2. It keeps restarting, so I can’t run bash. However, looking at the logs, it clearly can see the device. The problem I see is that the script in the Kraken Docker container doesn’t actually prepare the disk.

Okay, I rebuilt everything on top of Ubuntu Server 16.04, and it seems to be working. One thing that would be very helpful would be an example “ceph status” at the various steps. I had some trouble along the way, and being able to see what the status should look like would have helped. Here’s my current status. I’m curious about the OSDs (I have 5 nodes):

  cluster:
    id:     67f89555-83a3-48e6-8f47-54467435e107
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon: 5 daemons, quorum orange,lime,lemon,plum,fig
    mgr: no daemons active
    mds: cephfs-1/1/1 up  {0=orange=up:creating}, 4 up:standby
    osd: 1 osds: 1 up, 1 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:

Ah, in fact the other OSDs are not starting. Here are the logs. It appears that each node is looking for a different fsid?

mount_activate: Failed to activate
unmount: Unmounting /var/lib/ceph/tmp/mnt.YhQiIO
command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.YhQiIO
Traceback (most recent call last):
  File "/usr/sbin/ceph-disk", line 9, in <module>
    load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5704, in run
    main(sys.argv[1:])
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5655, in main
    args.func(args)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3759, in main_activate
    reactivate=args.reactivate,
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3522, in mount_activate
    (osd_id, cluster) = activate(path, activate_key_template, init)
  File "/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3669, in activate
    ' with fsid %s' % ceph_fsid)
ceph_disk.main.Error: Error: No cluster conf found in /etc/ceph with fsid 5224144f-00e5-4a11-9791-0062dd4c5c34

We’re SOOO close! The contents of /etc/ceph/ should be identical on each node. Are they?
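A quick way to check: the fsid lives in the [global] section of ceph.conf, so compare this on every node (all five should print the same value as the cluster id shown in ceph status):

grep fsid /etc/ceph/ceph.conf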

It’s possible that in one of the earlier iterations I forgot to zap the OSD drive first. So I went and did it again. Seems better :slight_smile: Now the question is about the MDS status:

  cluster:
    id:     67f89555-83a3-48e6-8f47-54467435e107
    health: HEALTH_WARN
            no active mgr
 
  services:
    mon: 5 daemons, quorum orange,lime,lemon,plum,fig
    mgr: no daemons active
    mds: cephfs-1/1/1 up  {0=orange=up:active}, 4 up:standby
    osd: 5 osds: 5 up, 5 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:

That’s better! :slight_smile:

Have you actually installed the MDS on each node, per the recipe? (One should take the primary role, and the rest should go into standby/backup.)

I see that you’re also hitting the new behaviour introduced in Luminous, whereby you need to deploy at least one “mgr” alongside your mons before your cluster will report itself as “healthy” :slight_smile: Could you please send me the output when you do this, so that I can incorporate it into the recipe and make it Luminous-compatible?

D

Yes, I had the MDS running. I didn’t know about the primary/standby status for the MDS. I added the manager:

docker run -d --net=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph:/var/lib/ceph -e CEPH_PUBLIC_NETWORK=192.168.2.0/24 ceph/daemon mgr

Status is now:

  cluster:
    id:     67f89555-83a3-48e6-8f47-54467435e107
    health: HEALTH_WARN
            too many PGs per OSD (307 > max 300)
 
  services:
    mon: 5 daemons, quorum orange,lime,lemon,plum,fig
    mgr: orange(active)
    mds: cephfs-1/1/1 up  {0=orange=up:active}, 4 up:standby
    osd: 5 osds: 5 up, 5 in
 
  data:
    pools:   2 pools, 512 pgs
    objects: 21 objects, 2246 bytes
    usage:   10245 MB used, 4756 GB / 4766 GB avail
    pgs:     512 active+clean
 
  io:
    client:   854 B/s wr, 0 op/s rd, 4 op/s wr
    recovery: 672 B/s, 3 keys/s, 4 objects/s

Aah, right, I misread the MDS status on your output. Yes, a single active MDS is normal. So this looks healthy, except for the warning about PG count being slightly too high. If you don’t have any data yet, you could delete and recreate your pool with a smaller pg/pgp size, or (possibly - I’ve never tried) you could reduce the PG count of the existing pools.

A note from painful experience - I (now) like to set the replica count to three, but allow the cluster to continue to operate at two. This lets me lose an OSD and still have enough redundancy to let me sleep at night until it’s replaced!
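For reference, that translates to something like the following, run once per pool (cephfs_data and cephfs_metadata here are just the default pool names the ceph/daemon MDS creates - check yours with ceph osd lspools):

ceph osd pool set cephfs_data size 3
ceph osd pool set cephfs_data min_size 2
ceph osd pool set cephfs_metadata size 3
ceph osd pool set cephfs_metadata min_size 2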

Is there a simple way to delete the pool? Or am I zapping the drives again and starting over? The Docker-based daemon is convenient in some ways, but painful when it comes to running custom commands.

I haven’t been able to get CephFS to mount yet on Ubuntu, so there is no data.

Deleting pools is dangerously simple, so Luminous introduced a feature - you need to set a flag in ceph.conf on every mon before it’ll let you delete a pool.

See “Protecting your Ceph pools against removal or property changes” (by Widodh) for details.
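In short, the flag is mon_allow_pool_delete, and deleting/recreating then looks roughly like this (mypool is a placeholder pool name; if it’s one of the CephFS pools, you’d also need to remove the filesystem itself first):

# add to /etc/ceph/ceph.conf on every mon, then restart them:
#   mon allow pool delete = true
ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
ceph osd pool create mypool 64 64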

Well, I ended up wiping everything and starting over. Here’s the current status. Two things: first, you have a typo in your mgr command (mgs instead of mgr); second, I ended up using PG=64. Technically the calculation said I should be using 128, but you can increase the PG count and never decrease it, and it depends on the number of pools that you have. Also, I can’t mount the filesystem; I get the error “mount: mount orange:6789:/ on /var/data failed: No such process”.

  cluster:
    id:     e48a2eb3-9c49-402e-ac86-a3ec091f8852
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum orange,lime,lemon,plum,fig
    mgr: orange(active)
    mds: cephfs-1/1/1 up  {0=orange=up:active}, 4 up:standby
    osd: 5 osds: 5 up, 5 in
 
  data:
    pools:   2 pools, 128 pgs
    objects: 21 objects, 2246 bytes
    usage:   10244 MB used, 4756 GB / 4766 GB avail
    pgs:     128 active+clean

More debugging. The “No such process” error was because the Ceph installed by default on Ubuntu was v10, which didn’t have the CephFS bits installed. I updated the client Ceph to Luminous as well:

$ wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
$ sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ xenial main'
$ sudo apt-get update
$ sudo apt-get install ceph-common

Now I get a timeout when trying to mount instead. The ceph-mon process is listening on port 6789, so I don’t understand why it’s failing now.
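(For anyone else checking the same thing - assuming ss and netcat are available - run the first command on the mon host and the second from the client, where orange is the mon used below:)

ss -tlnp | grep 6789
nc -vz orange 6789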

Here’s the mount command I’m using in case it’s something obvious:

sudo mount -t ceph orange:6789:/ /var/data -o name=dockerswarm,secret=AQB8DtFZnghSJhAA6kOacgTPP8nAff1lz5UBKQ==,_netdev,noatime

Mount successful! I found this in my syslog:

Oct  1 14:24:23 orange kernel: [ 5886.955545] libceph: mon2 192.168.2.12:6789 feature set mismatch, my 107b84a842aca < server's 40107b84a842aca, missing 400000000000000
Oct  1 14:24:23 orange kernel: [ 5886.962032] libceph: mon2 192.168.2.12:6789 missing required protocol features

This makes no sense, since I’m running the same version on both the client and in Docker. However, running:

sudo ceph osd crush tunables hammer

fixed the problem, and now I can mount CephFS.

Hurrumph. You’re right, it doesn’t make much sense.

Here’s my theory - you can either mount CephFS using the kernel driver or using FUSE in user-space. Presumably the kernel driver is more efficient, but older. I bet if you used FUSE, you wouldn’t have needed the hammer tunables.
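If anyone wants to test that theory, the user-space mount would look something like this (assuming the ceph-fuse package is available, and that the keyring for the dockerswarm client used above is in /etc/ceph):

sudo apt-get install ceph-fuse
sudo ceph-fuse -m orange:6789 --id dockerswarm /var/data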

So congrats, you have a workable Ceph cluster :slight_smile:

I’d suggest you run through some simulated failures to be sure it’s tweaked the way you want, before you put data on it though!

D

I am new to Ceph, but a colleague was a developer for Ceph and praised it a lot.
I found a good use case for testing with a 3-node Rancher HA environment, with Cattle at the moment.
This step-by-step guide worked great. Running RancherOS with the Ubuntu console, it’s Ubuntu 17.04.

Thanks again for this work.

Hi All,

I am new to Ceph and have been trying to build a cluster using your awesome guide. Now I’ve managed to get it to this point…

  cluster:
    id:     5fcf6c04-c435-4848-a4ca-32e1b15c8d40
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set

  services:
    mon: 9 daemons, quorum ceph-mon01,ceph-mon04,ceph-mon02,ceph-mon05,ceph-mon03,ceph-mon06,ceph-mon07,ceph-mon08,ceph-mon09
    mgr: ceph-mgr01(active), standbys: ceph-mgr02, ceph-mgr03, ceph-mgr04, ceph-mgr05, ceph-mgr06, ceph-mgr07, ceph-mgr08, ceph-mgr09
    mds: cephfs-1/1/1 up  {0=ceph-mds01=up:active}, 2 up:standby
    osd: 3 osds: 3 up, 3 in
         flags noscrub,nodeep-scrub

  data:
    pools:   2 pools, 262 pgs
    objects: 21 objects, 2246 bytes
    usage:   5214 MB used, 4494 GB / 4499 GB avail
    pgs:     262 active+clean

I had to set the PG count to 70, or the MDS wouldn’t start up; I had to keep reducing it until it worked. I did check the calculator, which says it should be 256, but I can’t get it to accept that.

Anyway, after they are up, for some reason it’s unhealthy when it shouldn’t be, IMO, as all nodes are up. I sadly only have 3 OSDs, which is the same as what you have. I also did the tweak for two replicas; this hasn’t helped. On top of that, I finally managed to pull the secret key, but it’s refusing to mount, and it says…

mount: 172.30.1.200:6789:/ is write-protected, mounting read-only
mount: cannot mount 172.30.1.200:6789:/ read-only

Which is so weird.

So I’m kinda at a loss. What have I misunderstood, and can someone point me to how I can debug why it’s unhealthy, for one?

I can add more MDS daemons (there should be 9), but I don’t think it will help.

Thanks in advance.

Kind Regards

To answer my own question: I have this all fixed. The problem was a couple of things. First, I needed CephFS support installed on the machine (this was a Debian 9 machine I was using to test on; I haven’t tested the CoreOS ones yet), which meant the following packages:

ceph-fs-common
ceph-common

I had also mis-copied your command which created the key for dockerswarm, which was causing another error. I used the following in the end:

ceph auth get-or-create client.docker osd 'allow rw' mon 'allow rw' mds 'allow rw' > /etc/ceph/keyring.docker

Then I tested a mount with:

mount -t ceph 172.30.1.200:6789:/ /volumes/ -o name=docker,secretfile=/tmp/secret

You also don’t need the ceph-authtool. You can just use:

ceph auth list

This will display the keys; look for the dockerswarm client (or whatever you called it). Put the key, and only the key, into a text file and use that.
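(Or, to skip the copy and paste, assuming the client is called dockerswarm as in the guide - substitute your own name:)

ceph auth get-key client.dockerswarm > /tmp/secret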

The only extra thing I had to do, because my cluster wasn’t going healthy, was to remove the tweaks you suggested (with ceph osd unset), because they were causing a health warning, and I hate warnings :wink:

Again, thank you for such a great guide; without it, I doubt I would have got it all working.

The last thing I would really like to know more about is how to set the PGs correctly, and what they really are.

As I said earlier, mine’s set to 70, but this is really just a guess, as the calculator says 256, which doesn’t work :frowning:

I currently have 3x OSDs; each one has a single disk/partition which is 1.5TB in size (each is the same). What should the PG count be? I do plan to push this up to 4 nodes shortly.

One last thing which readers might like to know: you can use a mount point instead, if you’re on a VPS and you only have one disk. You need to use the following…

docker run -d --net=host \
--name ceph-mds \
--hostname ceph-mds02 \
--restart always \
-v /var/lib/ceph/:/var/lib/ceph/ \
-v /etc/ceph:/etc/ceph \
-e CEPHFS_CREATE=1 \
-e CEPHFS_DATA_POOL_PG=70 \
-e CEPHFS_METADATA_POOL_PG=70 \
ceph/daemon mds

Replace with your PG etc…

One final note before I sign off: I found that it won’t work with ext4 filesystems. I was getting some strange error which took a while to figure out. If you format the disk as XFS, you won’t have any problems and it will just work. If you don’t do this, then you need to manually create everything, which was a nightmare. So to make it easy, use XFS or Btrfs.
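For anyone hitting the same thing, reformatting the data disk as XFS up front is quick. Be careful - this is destructive, and /dev/sdb is only an example device name, so double-check yours with lsblk first:

sudo wipefs -a /dev/sdb
sudo mkfs.xfs -f /dev/sdb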

Kind Regards,

Simon

Good to hear you got it sorted!

Fair call re the noscrub/nodeep-scrub… that’s a judgement call based on the reliability of the storage available to you, and your available resources. In my case, I run my (relatively underpowered) VPSs on an OpenStack platform which itself runs Ceph, so double-scrubbing was just 100% wasted resources :slight_smile:

So PGs (Placement Groups) are the smallest element of your cluster that Ceph is aware of. For example, if you have 3 OSDs servicing a single pool with 150 PGs and a replica count of 2, you’ll have 2 copies of each PG across the entire cluster, or (at capacity) 100 PGs per OSD. There are probably more advanced semantics that I’m unaware of - I’d suggest you consult http://ceph.com/pgcalc/ for the definitive answer on how many PGs you should have. Note that you can always increase PGs, but you can never decrease PGs (per-pool).
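To make that concrete: the rule of thumb behind pgcalc is roughly (number of OSDs x 100) / replica count PGs in total, rounded to a power of two and divided among your pools according to how much data each will hold. With your 3 OSDs and 2 replicas, that works out to about 150, so on the order of 128 for the big data pool and a small number for the metadata pool. And since PGs can only be increased, you’d raise a pool’s count with something like this (cephfs_data is an assumed pool name - check with ceph osd lspools):

ceph osd pool set cephfs_data pg_num 128
ceph osd pool set cephfs_data pgp_num 128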

Finally, the note about ext4 - weird, thanks for the tip - I’ve always defaulted to XFS, largely out of habit :wink: