
I am building an OpenStack cluster and am having some issues with what I think may be a quota problem. I can successfully build VMs on every host, but only one VM per host.

I deployed the system using Puppet, and the OpenStack version deployed is Ussuri. The OpenStack Puppet modules are at 17.4, with the exception of puppet-vswitch, which is at 13.4.

Each compute host (hypervisor) has 64 cores and 512 GB of RAM. Even if I spin up a 2-core VM, I can't spin up any more on that hypervisor, and I get the following errors in the logs:

scheduler.log:

"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider

nova-conductor.log:

2021-05-24 15:31:21.770 31421 ERROR nova.conductor.manager [req-18e93e25-5cc2-43b6-a036-312ed064070b 9f72d8a0694146288eb09ac7fee38298 7016985dddfe4048b535ca7ff12a0c68 - default default] Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.

I have checked and re-checked the quotas for this project, and the number of instances is set to 10000, so I'm not sure what I'm missing:

| fixed-ips | 10000 |
| floating_ips | None |
| health_monitors | None |
| injected-file-size | 10240 |
| injected-files | 5 |
| injected-path-size | 255 |
| instances | 10000 |
| key-pairs | 100 |
| project_name | admin |
| properties | 128 |
| ram | 99999999 |

I'm not sure what else I can check, and from the searches I've done, no one else seems to have run into something like this, so I'm hoping it's a simple setting I'm missing.

EDIT 5-26-21: I ran some more tests and I have found an interesting pattern.

If I put a 1-core machine (flavor m1.nano) on a compute host, I can build as many virtual machines as I want, of any flavor I want, until the host physically runs out of resources.

If I create anything larger than a 1-core VM, and that VM is started on a compute host that does not already have a 1-core VM, any other VM built on that host will fail once that first machine has been placed.

Other than telling me it can't allocate VCPUs when it does fail, the logs aren't helping at all.

Edited to add deployment method and openstack version.

Thanks in advance! -Jeff

Jeff_M
  • Have you checked `openstack hypervisor stats show` and looked into each hypervisor (`openstack hypervisor show `)? – eblock May 25 '21 at 06:52
  • yeah, I did actually. `openstack hypervisor stats show` shows me the total numbers of resources from the controller perspective and `openstack hypervisor show ` shows me the compute host resources. Both show the correct numbers for total and used with the one VM on it. Starting another one instantly fails with the two errors above saying the host is out of resources. – Jeff_M May 25 '21 at 12:28
  • Could you add how you deployed openstack and which version? – eblock May 25 '21 at 13:40
  • Sure, I deployed openstack using puppet and the openstack version deployed is Ussuri. With the exception of puppet-vswitch (13.4), all openstack puppet modules are on 17.4. I edited my initial posting with this info as well. Thank you! – Jeff_M May 25 '21 at 14:05
  • Did you somehow turn off `nova_cpu_allocation_ratio`? The default is 16. – eblock May 26 '21 at 05:39
  • Sorry for the delay in response. I checked and the allocation ratio is still the defaults. I ran some more tests this afternoon and edited my response above as it was too long to fit here. Thanks! – Jeff_M May 26 '21 at 23:29
  • This is really interesting and sounds like a bug to me. I would recommend posting this to the [mailing list](http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-discuss) and maybe also filing a [bug report](https://bugs.launchpad.net/). – eblock May 27 '21 at 07:44
  • I'm encountering the exact same thing in a new deployment, did you resolve it in the meantime? – eblock Jan 28 '22 at 14:15
  • Sure did! Check the accepted answer below. – Jeff_M Jan 31 '22 at 14:00

1 Answer


I've recently tracked such an error down to a MariaDB problem (https://jira.mariadb.org/browse/MDEV-25714). In my case, MariaDB is version 10.6.5.

When running placement API in debug mode, its log reveals a message like

"Over capacity for VCPU on resource provider . Needed: 1, Used: 4118, Capacity: 768.0"

but checking the entries in the allocations table of the placement DB shows that "4118" is the sum of all resources allocated on that resource provider, not only those of the VCPU class.

The problem results from errors in the DBMS handling the "outer join" with a subquery while retrieving the currently allocated resources.
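The arithmetic behind that log message can be sketched in a few lines of Python. This is only an illustration, not placement's actual code: the allocation rows are hypothetical (chosen so that they add up to the 4118 from the log), and the inventory values of 48 VCPUs with an allocation ratio of 16 are assumptions that happen to yield the reported capacity of 768.0.

```python
# Hypothetical allocation rows for one resource provider: (resource_class, used).
# The values are made up; they are chosen only so that their total is 4118.
ALLOCATIONS = [
    ("VCPU", 2),
    ("MEMORY_MB", 4096),
    ("DISK_GB", 20),
]


def used_for_class(allocations, resource_class):
    """Correct behaviour: sum usage for the requested resource class only."""
    return sum(used for rc, used in allocations if rc == resource_class)


def used_all_classes(allocations):
    """Faulty behaviour described above: usage summed across every class."""
    return sum(used for _, used in allocations)


def capacity(total, reserved, allocation_ratio):
    """Capacity of an inventory: (total - reserved) * allocation_ratio."""
    return (total - reserved) * allocation_ratio


def fits(needed, used, cap):
    """A request fits if current usage plus the request stays within capacity."""
    return used + needed <= cap


# Assumed inventory: 48 VCPUs, nothing reserved, allocation ratio 16.
vcpu_capacity = capacity(total=48, reserved=0, allocation_ratio=16.0)

print(fits(1, used_for_class(ALLOCATIONS, "VCPU"), vcpu_capacity))  # correct sum
print(fits(1, used_all_classes(ALLOCATIONS), vcpu_capacity))        # inflated sum
```

With the per-class sum the extra VCPU fits easily; with the inflated sum the same request is rejected as over capacity, even though the host is nearly empty.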

You might want to run the test described by "Daniel Howard" in that ticket to verify if your version of MariaDB is affected as well - unless, of course, you experience these problems but are not using MariaDB at all.

Jens M
  • I marked this as the correct answer as this was the problem for me. The miscalculation done in MariaDB is what caused the issue. We upgraded MariaDB from 10.5 to 10.6.4 and the issue was resolved. I also upgraded this cluster from Ussuri to Wallaby with the same DB version and can confirm the issue is still resolved. – Jeff_M Jan 31 '22 at 13:52
  • Cheers, I upgraded our MariaDB from 10.5.13 to 10.6.5 (from MariaDB's own repo, on Ubuntu 20.04) and it seems to have resolved the issue so far. Running OpenStack Xena. However, I was not able to reproduce the `over capacity for on resource provider` error even after enabling debug logging for the placement-api; I could only find `Unable to create allocation for '' on resource provider` (nova-scheduler) and subsequently `nova.exception.NoValidHost` (nova-conductor). – timss Feb 04 '22 at 18:26
  • Never mind, the bug is still very much present on our installation of 10.6.5 as well. It was probably latent to begin with because a different combination of flavors and whatnot was used. – timss Feb 08 '22 at 13:05