
I'm using Ansible to configure a server. The server runs on AWS EC2, and I'm attaching four EBS volumes to it.

When I run my Ansible playbook, it fails about 50% of the time. The failure happens when I mount a path on a newly formatted drive. While investigating, I noticed that one of the four drives appears to have no filesystem and is missing its UUID. Ansible shows no errors in the task that creates the filesystems.

The task:

- name: Create File Systems
  filesystem:
    fstype: ext4
    dev: "/dev/{{ item }}"
  with_items: "{{ ansible_devices }}"
  register: filesystem
  when: item != "nvme0n1"

The `when` condition skips the root volume (nvme0n1).

TASK [Create File Systems] ****************************************************************************************************************************************************************************************************************************************************************************************************
changed: [10.76.22.196] => (item=nvme3n1)
changed: [10.76.22.196] => (item=nvme4n1)
changed: [10.76.22.196] => (item=nvme1n1)
changed: [10.76.22.196] => (item=nvme2n1)
skipping: [10.76.22.196] => (item=nvme0n1)

When it fails and I log in to investigate, this is what I see:

[ec2-user@ip-10-76-22-196 ~]$ lsblk -f
NAME        FSTYPE LABEL UUID                                 MOUNTPOINT
nvme0n1
├─nvme0n1p1
└─nvme0n1p2 xfs          de4def96-ff72-4eb9-ad5e-0847257d1866 /
nvme1n1     ext4         35546ab6-8a1f-401f-97fa-7c9daf9005eb /couchbase/DATA
nvme2n1     ext4         379a603a-2726-437f-ad25-14fd43358e96 /couchbase/INDEX
nvme3n1     ext4         b0ceae1f-e902-44d5-a63f-2ef81bb62f21 /couchbase/LOGS
nvme4n1

Next, I tried creating the filesystem again:

[root@ip-10-76-22-196 ~]# mkfs.ext4 /dev/nvme4n1
mke2fs 1.42.9 (28-Dec-2013)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1638400 inodes, 6553600 blocks
327680 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2155872256
200 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

[root@ip-10-76-22-196 ~]# lsblk -f
NAME        FSTYPE LABEL UUID                                 MOUNTPOINT
nvme0n1
├─nvme0n1p1
└─nvme0n1p2 xfs          de4def96-ff72-4eb9-ad5e-0847257d1866 /
nvme1n1     ext4         35546ab6-8a1f-401f-97fa-7c9daf9005eb /couchbase/DATA
nvme2n1     ext4         379a603a-2726-437f-ad25-14fd43358e96 /couchbase/INDEX
nvme3n1     ext4         b0ceae1f-e902-44d5-a63f-2ef81bb62f21 /couchbase/LOGS
nvme4n1

but no luck =/

I also tried other ways to get this information:

[ec2-user@ip-10-76-22-196 ~]$ ls /dev/disk/by-uuid/
35546ab6-8a1f-401f-97fa-7c9daf9005eb  379a603a-2726-437f-ad25-14fd43358e96  b0ceae1f-e902-44d5-a63f-2ef81bb62f21  de4def96-ff72-4eb9-ad5e-0847257d1866

fsck seems to think it's ext2?

[ec2-user@ip-10-76-22-196 ~]$ fsck -N /dev/nvme4n1
fsck from util-linux 2.23.2
[/sbin/fsck.ext2 (1) -- /dev/nvme4n1] fsck.ext2 /dev/nvme4n1
[ec2-user@ip-10-76-22-196 ~]$ fsck -N /dev/nvme3n1
fsck from util-linux 2.23.2
[/sbin/fsck.ext4 (1) -- /couchbase/LOGS] fsck.ext4 /dev/nvme3n1
[ec2-user@ip-10-76-22-196 ~]$ lsblk -f
NAME        FSTYPE LABEL UUID                                 MOUNTPOINT
nvme0n1
├─nvme0n1p1
└─nvme0n1p2 xfs          de4def96-ff72-4eb9-ad5e-0847257d1866 /
nvme1n1     ext4         35546ab6-8a1f-401f-97fa-7c9daf9005eb /couchbase/DATA
nvme2n1     ext4         379a603a-2726-437f-ad25-14fd43358e96 /couchbase/INDEX
nvme3n1     ext4         b0ceae1f-e902-44d5-a63f-2ef81bb62f21 /couchbase/LOGS
nvme4n1

Eventually, I found this...

[ec2-user@ip-10-76-22-196 ~]$ sudo file -s /dev/nvme*
/dev/nvme0:     ERROR: cannot read (Invalid argument)
/dev/nvme0n1:   x86 boot sector; partition 1: ID=0xee, active, starthead 0, startsector 1, 20971519 sectors, code offset 0x63
/dev/nvme0n1p1: data
/dev/nvme0n1p2: SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)
/dev/nvme1:     ERROR: cannot read (Invalid argument)
/dev/nvme1n1:   Linux rev 1.0 ext4 filesystem data, UUID=35546ab6-8a1f-401f-97fa-7c9daf9005eb (needs journal recovery) (extents) (64bit) (large files) (huge files)
/dev/nvme2:     ERROR: cannot read (Invalid argument)
/dev/nvme2n1:   Linux rev 1.0 ext4 filesystem data, UUID=379a603a-2726-437f-ad25-14fd43358e96 (needs journal recovery) (extents) (64bit) (large files) (huge files)
/dev/nvme3:     ERROR: cannot read (Invalid argument)
/dev/nvme3n1:   Linux rev 1.0 ext4 filesystem data, UUID=b0ceae1f-e902-44d5-a63f-2ef81bb62f21 (needs journal recovery) (extents) (64bit) (large files) (huge files)
/dev/nvme4:     ERROR: cannot read (Invalid argument)
/dev/nvme4n1:   Linux rev 1.0 ext4 filesystem data, UUID=caf9638a-9d10-482e-a554-ae8152cd2fdb (extents) (64bit) (large files) (huge files)

So something is not right: file sees an ext4 filesystem with a UUID on /dev/nvme4n1, but lsblk and /dev/disk/by-uuid do not.

Levi
  • Maybe try to null out the partition block of the device? You can use `dd` or `wipefs`. – eckes Feb 02 '19 at 22:20
  • You should run the play with extra verbosity to find out what is going on. You might also print that variable you registered to see what's in it. – Michael Hampton Feb 02 '19 at 22:21
  • @eckes Indeed, wiping the filesystem and trying again does fix it. I'm more curious about why it's happening and how to make it stop. The fix I'm currently testing is to partition the device and work from the partition (sketched below, after these comments). – Levi Feb 02 '19 at 23:36
  • @MichaelHampton Unfortunately, the issue isn't with Ansible; I believe it's with the system. I have looked at the output of `filesystem` and it doesn't show much, just whether it worked or not. As for -vvvv, it doesn't show anything useful either, just the input and output of mkfs, and nothing interesting there. – Levi Feb 02 '19 at 23:39
  • I am not sure what's going on, but it looks like a mkfs/kernel-refresh problem, not something related to Ansible. Nulling with wipefs will also reload the kernel's view of partitions, so it's most likely best for automated steps (if you are sure you want to destroy anything existing). – eckes Feb 02 '19 at 23:39
  • @eckes Any idea why so many tools would report so many different things? `fsck` says the device has an `ext2` filesystem, `lsblk` says it has nothing, and `file` shows both an ext4 filesystem and the UUID. – Levi Feb 02 '19 at 23:49
  • I guess most tools work on the kernel's view of things (at least for initial classification), and some (like `file`) open the device and read the content. – eckes Feb 02 '19 at 23:50
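
A sketch of the partition-first workaround Levi mentions above (destructive; the device name and the single-partition layout are assumptions for illustration, not from the original thread):

# DESTRUCTIVE: create a GPT label and one partition spanning the disk,
# then put the filesystem on the partition instead of the raw device.
sudo parted --script /dev/nvme4n1 mklabel gpt mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/nvme4n1p1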

1 Answer


If /dev/disk/by-uuid or lsblk does not show the filesystem, it is possible that the partition type was not correctly recognized by the kernel, or that the kernel's view was not updated after mkfs.
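
A quick way to test that theory (a sketch, not part of the original answer; the device name is assumed from the question) is to force the kernel to re-read the device and let udev catch up:

# Ask the kernel to re-read the partition table (BLKRRPART ioctl),
# then wait for udev to finish processing the resulting events.
sudo blockdev --rereadpt /dev/nvme4n1
sudo udevadm settle

If lsblk shows the UUID afterwards, the on-disk data was fine and only the kernel/udev view was stale.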

There are a number of situations where garbage on the disk can cause problems, including stale LVM IDs, software RAID signatures, or a mismatch in BIOS/UEFI partition tables. It is a good idea to null out the beginning of the disk.
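
For example, a minimal (and destructive) sketch with dd, assuming the device from the question:

# DESTRUCTIVE: zero the first MiB of the disk to remove stale signatures.
sudo dd if=/dev/zero of=/dev/nvme4n1 bs=1M count=1 conv=fsync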

If you use wipefs for this (instead of dd), you get the additional benefit that it uses an ioctl to tell the kernel to actually reload its view of the disk partitions.
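
For example (a sketch, assuming the device from the question):

# DESTRUCTIVE: erase all known signatures; wipefs also tells the kernel
# to reload its view of the device. Then recreate the filesystem.
sudo wipefs --all /dev/nvme4n1
sudo mkfs.ext4 /dev/nvme4n1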

I think the filesystem tools, as well as the file command, read directly from the disk and are therefore unaware of kernel state. The filesystem detection code of fsck also does only a rudimentary inspection to find the type if there is no fstab entry for the filesystem. The check binary is the same for ext2 through ext4, so if fsck finds the type in fstab, it starts a command with exactly that type (fsck.ext4); if it does not find the type, it checks the beginning of the device for a filesystem signature, and for any of the ext2 family it starts the fsck.ext2 tool (which then checks the more specific version).
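
One way to see the difference between the two views (a sketch, assuming the device from the question): lsblk reports the cached kernel/udev view, while a low-level blkid probe reads the signatures straight off the device:

lsblk -f /dev/nvme4n1            # cached kernel/udev view
sudo blkid --probe /dev/nvme4n1  # reads the superblock directly from the device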

eckes