
I have an Azure Batch pool where I have mounted three blob storage containers. The mounts work; however, when the nodes boot for the first time they get the following error:

Mount configuration error

Looking in the logs, it seems the nodes have trouble installing the blobfuse package. This error message appears repeatedly:

2020-03-11T09:15:48,654579941+00:00 - INFO: Downloading: https://packages.microsoft.com/keys/microsoft.asc as microsoft.asc
2020-03-11T09:15:48,770319520+00:00 - INFO: Downloading: https://packages.microsoft.com/config/ubuntu/16.04/prod.list as /etc/apt/sources.list.d/microsoft-prod.list
Hit:1 http://azure.archive.ubuntu.com/ubuntu xenial InRelease
Hit:2 http://azure.archive.ubuntu.com/ubuntu xenial-updates InRelease
Hit:3 http://azure.archive.ubuntu.com/ubuntu xenial-backports InRelease
Hit:4 http://security.ubuntu.com/ubuntu xenial-security InRelease
Get:5 https://packages.microsoft.com/ubuntu/16.04/prod xenial InRelease [4,002 B]
Get:6 https://packages.microsoft.com/ubuntu/16.04/prod xenial/main amd64 Packages [124 kB]
Fetched 128 kB in 0s (383 kB/s)
Reading package lists...
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
... (the same pair of lines repeats many more times) ...

2020-03-11T09:16:53,361634408+00:00 - ERROR: Could not install packages (apt): blobfuse

The nodes then go to the Unusable state until I manually reboot them, which "fixes" the problem; after the reboot the node starts working on tasks.
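
For what it's worth, the manual reboot can also be scripted with the .NET SDK; a rough sketch (untested, and assuming an authenticated BatchClient and the pool id):

// Sketch: reboot every node that has gone Unusable.
foreach (ComputeNode node in batchClient.PoolOperations.ListComputeNodes(poolId))
{
    if (node.State == ComputeNodeState.Unusable)
    {
        node.Reboot();  // same effect as the manual reboot in the portal
    }
}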

The tasks are configured to run with elevated privileges:

UserIdentity = new UserIdentity(
    new AutoUserSpecification(
        elevationLevel: ElevationLevel.Admin,
        scope: AutoUserScope.Pool)),
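
For reference, the mounts are configured at the pool level, roughly like this (a sketch with placeholder account, container, and SAS values; constructor signatures may differ slightly between SDK versions):

// Sketch: one of the three blob container mounts on the pool.
pool.MountConfiguration = new List<MountConfiguration>
{
    new MountConfiguration(
        new AzureBlobFileSystemConfiguration(
            accountName: "mystorageaccount",   // placeholder
            containerName: "container1",       // placeholder
            relativeMountPath: "container1",
            sasKey: containerSasToken))        // placeholder
};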

Update 1

I was not able to solve this problem, so I decided to work around it. Resizing or recreating the pool did not help. Instead, I installed blobfuse in the Docker image and now mount the blob storage containers in the task itself. This works just fine.
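
For anyone hitting the same thing, the in-task mount boils down to something like this (a sketch only; container name, paths, and the entry point are placeholders, and the storage credentials are passed via the task's environment settings as AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_ACCESS_KEY):

// Sketch: mount the container from inside the task instead of on the pool.
// blobfuse is baked into the Docker image.
var task = new CloudTask(taskId,
    "/bin/bash -c \"mkdir -p /mnt/data /mnt/blobfusetmp && " +
    "blobfuse /mnt/data --container-name=container1 " +
    "--tmp-path=/mnt/blobfusetmp -o allow_other && " +
    "/app/run.sh\"");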

niknoe
  • Do you need root access to install those packages? – Aravind Mar 16 '20 at 08:47
  • @Aravind Possibly. The task itself runs as admin, but this process runs on the pool itself when new nodes join it. I don't have much control over the process; all I have done is add the mount configurations, and I thought Azure was supposed to handle the rest. – niknoe Mar 16 '20 at 08:58
  • +1: This looks like an installation issue with the blobfuse package. Try resizing your pool down to zero and then scaling back up, or rebooting; the newly joining VMs should trigger a fresh install. This should help. – Tats_innit Mar 16 '20 at 09:54
  • Hiya @NiklasNoem, good approach, but keep in the back of your mind that the blobfuse driver has an existing bug where it hangs after 65 hours; details here: https://github.com/Azure/azure-storage-fuse/issues/329. Also, if Docker is not necessary, you can always do the same via a script at the start-task level for normal Batch nodes. – Tats_innit Mar 21 '20 at 01:36
  • @Tats_innit Thank you for the tip. That's probably longer than any of the tasks will run, and the pool is autoscaling, so it will reconnect. Thanks again for the help. – niknoe Mar 21 '20 at 09:46

2 Answers


Your approach looks good, and I'm glad the reboot fixed it; in this specific case that is the right fix, along with a resize.

Thanks for sharing the logs; this looks like a failure to install blobfuse.

The big giveaway is **ERROR: Could not install packages (apt): blobfuse**: under the hood, the node expects blobfuse to be installed, and it seems some other process is holding a long-running apt install in parallel. The cause of this error is detailed here: E: Could not get lock /var/lib/dpkg/lock-frontend - open.

Two possible solutions:

  • As you mention, a reboot fixed it.
  • Another option is to resize the pool, or better: recreate the pool and then try again.

Why the reboot or resize fixes it: in both cases the VM goes through the join-pool process on the Batch side with fresh memory, which helps unblock the lock situation for blobfuse. At the Batch-node level we could try some sort of back-off mechanism; see the sketch below. I would also keep an eye on blobfuse itself in case something within it caused this: https://github.com/Azure/azure-storage-fuse
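
For example, a start task along these lines could wait for a competing apt/dpkg process to release the lock before the node runs anything else (a sketch only, untested; I have not verified whether Batch's built-in mount step runs before or after the start task):

// Sketch: start task that backs off until the dpkg frontend lock is free.
// Assumes an existing CloudPool object and the Microsoft.Azure.Batch SDK.
pool.StartTask = new StartTask
{
    // fuser exits non-zero once no process holds the lock file
    CommandLine = "/bin/bash -c 'while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do sleep 5; done'",
    UserIdentity = new UserIdentity(
        new AutoUserSpecification(elevationLevel: ElevationLevel.Admin)),
    WaitForSuccess = true  // do not schedule tasks until the wait completes
};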

Hope this helps.

Tats_innit
  • Thank you for your reply. There is an autoscale formula on the pool. I think the issue occurs every time a node joins the pool; I will get back to you when I can confirm this. If that is the case, I must have done something wrong. I can't manually reboot all nodes every time, so hopefully a solution can be found. – niknoe Mar 16 '20 at 11:10
  • Thanks for the ping! Yes please, do check, because this should not happen every time. I tried 5 times and was not able to reproduce it, but it can happen if one of the underlying apt operations takes longer, or if the blobfuse driver is having issues or taking time to download. – Tats_innit Mar 16 '20 at 19:20
  • Unfortunately I can now confirm that this happens every time a node joins the pool. – niknoe Mar 17 '20 at 11:13
  • Thanks for getting back, @NiklasNoem. Have you tried resizing the pool to zero and back again, or, if it's OK, recreating the pool? Essentially, as mentioned above, this is some lock held by a long-running underlying apt process while the blobfuse driver is being installed. – Tats_innit Mar 17 '20 at 20:32
  • The pool is on an autoscale formula which resizes the pool down to 0 and up again whenever new tasks are created. It has been resized automatically, resized manually, and recreated entirely, but the issue persists. Which means it must be something I have done, I think? The only way to get the nodes operational is to reboot them manually after they reach their initial Unusable state. – niknoe Mar 18 '20 at 10:16
  • Cool, @NiklasNoem. Just as an experiment, if your scenario permits, try re-creating your pool; the underlying theory is that recreating will refresh everything, including any lingering VMs (just a long shot). **But** what in particular are you doing differently that could cause this? Do you have any suspects? I think it's some low-level, long-running underlying apt issue. Give recreation a try if you have not already. – Tats_innit Mar 18 '20 at 10:26

I had exactly the same issue, and since I'm creating dynamic pools and tasks, manually stepping in and rebooting was not an option for me.

My workaround was linking the Batch account to the storage account and then specifying the resource files as part of the task. This makes the container's contents visible in the working directory of the task.

Here is my example in Node.js; it should be transferable to your language of choice.

const task = {
    id: taskId,
    commandLine: "",  // command line omitted here
    containerSettings: taskContainerSettings,
    environmentSettings: [
        { name: "USER_ID", value: userId },
        { name: "RUN_ID", value: runId }
    ],
    // autoStorageContainerName works because the Batch account is
    // linked to the storage account
    resourceFiles: [{ autoStorageContainerName: userId }]
};
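
In C#, the language used in the question, the same idea would look roughly like this (a sketch assuming the Microsoft.Azure.Batch SDK's ResourceFile.FromAutoStorageContainer helper):

// Sketch: pull the linked auto-storage container into the task's
// working directory; the container name here is the userId, as above.
var task = new CloudTask(taskId, "")  // command line elided, as in the original
{
    ContainerSettings = taskContainerSettings,
    ResourceFiles = new List<ResourceFile>
    {
        ResourceFile.FromAutoStorageContainer(userId)
    }
};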

druridge