Compute manager to use thread pools selectively

This changes the ComputeManager's thread pool usage to go through the
concurrency-mode-aware util functions.

The concurrent live migration pool had a seemingly unlimited option when
configured with the value 0, but in reality GreenThreadPool has a default
worker size of 1000. It is almost never right to have more than one live
migration running concurrently, and with native threading 1000 workers is
just too costly. So we decided to deprecate the value 0 and changed the
implementation of unlimited to mean 5 threads in native threading mode.
We kept the 1000 greenthreads in eventlet mode for backward compatibility.
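
For illustration, the mode-aware selection amounts to something like the
following sketch. This is not nova's implementation: the flag is a stand-in
for nova.utils.concurrency_mode_threading() and the defaults mirror the ones
described above.

    # Hypothetical sketch only; nova.utils.create_executor() is the real
    # entry point.
    import futurist

    NATIVE_THREADING = True  # stand-in for concurrency_mode_threading()

    def create_executor(max_workers=None):
        """Pick a pool type and a sane size for the concurrency mode.

        A value of 0 (historically "unlimited") falls through to the
        per-mode default.
        """
        if NATIVE_THREADING:
            # OS threads are expensive: default to a small pool.
            return futurist.ThreadPoolExecutor(max_workers=max_workers or 5)
        # Greenthreads are cheap: keep the historic 1000-worker default.
        return futurist.GreenThreadPoolExecutor(
            max_workers=max_workers or 1000)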

The _sync_power_states periodic task also spawns a task for each instance to
be synced. As it uses a data structure shared between these tasks and the
caller, a lock is needed to avoid race conditions. Also, the default pool
size for these tasks is 1000 in our configuration, which would use a lot of
memory on a busy host in native threading mode. So we changed the default
value from 1000 to 5.
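
The locking pattern is roughly the following minimal sketch, with
illustrative names and module-level state instead of the manager attributes
used in the actual change:

    import contextlib
    import threading

    _syncs_in_progress: set[str] = set()
    _lock = threading.Lock()

    @contextlib.contextmanager
    def syncs_in_progress():
        # Serialize every check/add/remove on the shared set so the
        # periodic task and the spawned workers cannot race between the
        # membership test and the mutation.
        with _lock:
            yield _syncs_in_progress

    def start_sync(uuid):
        with syncs_in_progress() as syncs:
            if uuid in syncs:
                return False  # a sync is already running for this instance
            syncs.add(uuid)
            return True

    def finish_sync(uuid):
        with syncs_in_progress() as syncs:
            syncs.remove(uuid)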

Change-Id: I9567d5fabdf086b5d0493103d9f6bde4f66af387
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
Author: Balazs Gibizer
Date:   2025-11-03 18:25:27 +01:00
commit 3c23390cc8
parent 0498e2ad76
6 changed files with 119 additions and 22 deletions
+18 -1
@@ -43,7 +43,7 @@ Tunables for the native threading mode
 As native threads are more expensive resources than greenthreads Nova provides
 a set of configuration options to allow fine tuning the deployment based on
 load and resource constraints. The default values are selected to support a
-basic, small deployment without consuming substantially more memory resources,
+basic, small deployment without consuming substantially more memory resources
 than the legacy Eventlet mode. Increasing the size of the below thread pools
 means that the given service will consume more memory but will also allow more
 tasks to be executed concurrently.
@@ -75,6 +75,23 @@ tasks to be executed concurrently.
 This option is relevant to every nova service using ``nova.utils.spawn()``.
+* :oslo.config:option:`sync_power_state_pool_size`: Used by the
+  nova-compute service to sync the power state of each instance on the host
+  between the hypervisor and the DB. Since nova 33.0.0 (2026.1 Gazpacho) the
+  default value of this option has changed from 1000 to 5 to provide a sane
+  default in native threading mode. Increasing this value in native threading
+  mode increases the nova-compute memory consumption on a host that has many
+  instances.
+* :oslo.config:option:`max_concurrent_live_migrations`: Used by the
+  nova-compute service to limit the number of outgoing concurrent live
+  migrations from the host. It is implemented via a thread pool, so
+  increasing the number of concurrent live migrations will increase the
+  nova-compute service memory consumption in native threading mode. It is
+  almost always a bad idea to change this config option from its default
+  value, 1. If more performant live migration is needed then enable
+  :oslo.config:option:`libvirt.live_migration_parallel_connections` instead.

 Seeing the usage of the pools
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+37 -15
@@ -39,7 +39,6 @@ import typing as ty
 from cinderclient import exceptions as cinder_exception
 from cursive import exception as cursive_exception
-import futurist
 from keystoneauth1 import exceptions as keystone_exception
 from openstack import exceptions as sdk_exc
 import os_traits
@@ -667,9 +666,10 @@ class ComputeManager(manager.Manager):
         self.compute_task_api = conductor.ComputeTaskAPI()
         self.query_client = query.SchedulerQueryClient()
         self.instance_events = InstanceEvents()
-        self._sync_power_executor = futurist.GreenThreadPoolExecutor(
+        self._sync_power_executor = nova.utils.create_executor(
             max_workers=CONF.sync_power_state_pool_size)
-        self._syncs_in_progress = {}
+        self._syncs_in_progress: set[str] = set()
+        self._syncs_in_progress_lock = threading.Lock()
         self.send_instance_updates = (
             CONF.filter_scheduler.track_instance_changes)
         if CONF.max_concurrent_builds != 0:
@@ -683,11 +683,27 @@ class ComputeManager(manager.Manager):
         else:
             self._snapshot_semaphore = compute_utils.UnlimitedSemaphore()

         if CONF.max_concurrent_live_migrations > 0:
-            self._live_migration_executor = futurist.GreenThreadPoolExecutor(
+            self._live_migration_executor = nova.utils.create_executor(
                 max_workers=CONF.max_concurrent_live_migrations)
         else:
-            # CONF.max_concurrent_live_migrations is 0 (unlimited)
-            self._live_migration_executor = futurist.GreenThreadPoolExecutor()
+            # Setting CONF.max_concurrent_live_migrations to 0 (unlimited)
+            # is deprecated but still supported, so we need to use sane
+            # default values for each threading mode.
+            LOG.warning("Nova compute deprecated the support of unlimited "
+                        "parallel live migrations, so "
+                        "[DEFAULT]max_concurrent_live_migrations configured "
+                        "with value 0 is deprecated and will not be "
+                        "supported in future releases. Please set an "
+                        "explicit positive value for this config option "
+                        "instead.")
+            if utils.concurrency_mode_threading():
+                self._live_migration_executor = nova.utils.create_executor(
+                    max_workers=5)
+            else:
+                # In eventlet mode we need to keep backward compatibility
+                # and use 1000 greenthreads to emulate unlimited.
+                self._live_migration_executor = nova.utils.create_executor(
+                    max_workers=1000)
         # This is a dict, keyed by instance uuid, to a two-item tuple of
         # migration object and Future for the queued live migration.
         self._waiting_live_migrations = {}
@@ -706,6 +722,11 @@ class ComputeManager(manager.Manager):
         self.rt = resource_tracker.ResourceTracker(
             self.host, self.driver, reportclient=self.reportclient)

+    @contextlib.contextmanager
+    def syncs_in_progress(self) -> ty.Iterator[set[str]]:
+        with self._syncs_in_progress_lock:
+            yield self._syncs_in_progress
+
     def reset(self):
         LOG.info('Reloading compute RPC API')
         compute_rpcapi.reset_globals()
@@ -11031,20 +11052,21 @@ class ComputeManager(manager.Manager):
                 LOG.exception("Periodic sync_power_state task had an "
                               "error while processing an instance.",
                               instance=db_instance)
-            self._syncs_in_progress.pop(db_instance.uuid)
+            with self.syncs_in_progress() as syncs:
+                syncs.remove(db_instance.uuid)

         for db_instance in db_instances:
             # process syncs asynchronously - don't want instance locking to
             # block entire periodic task thread
             uuid = db_instance.uuid
-            if uuid in self._syncs_in_progress:
-                LOG.debug('Sync already in progress for %s', uuid)
-            else:
-                LOG.debug('Triggering sync for uuid %s', uuid)
-                self._syncs_in_progress[uuid] = True
-                nova.utils.spawn_on(
-                    self._sync_power_executor, _sync, db_instance)
+            with self.syncs_in_progress() as syncs:
+                if uuid in syncs:
+                    LOG.debug('Sync already in progress for %s', uuid)
+                else:
+                    LOG.debug('Triggering sync for uuid %s', uuid)
+                    syncs.add(uuid)
+                    nova.utils.spawn_on(
+                        self._sync_power_executor, _sync, db_instance)

     def _query_driver_power_state_and_sync(self, context, db_instance):
         if db_instance.task_state is not None:
+10 -5
@@ -694,7 +694,12 @@ that doing so is safe and stable in your environment.
 Possible values:

-* 0 : treated as unlimited.
+* ``0``: Deprecated since 33.0.0 (2026.1 Gazpacho). This value was previously
+  documented as meaning unlimited but the actual implementation used at most
+  1000 greenthreads. Since this release, the implementation keeps using 1000
+  greenthreads in eventlet mode and uses 5 native threads in threading mode.
+  In a future release, when eventlet support is removed, 0 will also be
+  removed as a valid value.
 * Any positive integer representing maximum number of live migrations
   to run concurrently.
 """),
@@ -732,9 +737,9 @@ Related options:
   checks
 """),
     cfg.IntOpt('sync_power_state_pool_size',
-               default=1000,
+               default=5,
                help="""
-Number of greenthreads available for use to sync power states.
+Number of threads available for use to sync instance power states.

 This option can be used to reduce the number of concurrent requests
 made to the hypervisor or system with real instance power states
@@ -742,8 +747,8 @@ for performance reasons, for example, with Ironic.
 Possible values:

-* Any positive integer representing greenthreads count.
-""")
+* Any positive integer representing the number of threads.
+"""),
 ]

 compute_group_opts = [
+9
@@ -18,6 +18,8 @@
"""Tests for compute service.""" """Tests for compute service."""
import datetime import datetime
import threading
import fixtures as std_fixtures import fixtures as std_fixtures
from itertools import chain from itertools import chain
import operator import operator
@@ -1661,7 +1663,14 @@ class ComputeTestCase(BaseTestCase,
     def setUp(self):
         super(ComputeTestCase, self).setUp()
         self.compute._live_migration_executor = futurist.SynchronousExecutor()
+        # NOTE(gibi): the _sync_power_states periodic task in the
+        # ComputeManager spawns concurrent tasks and uses a lock to
+        # synchronize a shared data structure. As the spawn is made
+        # synchronous, the tasks run on the caller thread. This means
+        # the simple lock causes a deadlock in the unit test. Upgrade that
+        # lock to be reentrant so the test can pass with synchronous spawn.
         self.useFixture(fixtures.SpawnIsSynchronousFixture())
+        self.compute._syncs_in_progress_lock = threading.RLock()
         self.image_api = image_api.API()
         self.default_flavor = objects.Flavor.get_by_name(self.context,
+17 -1
@@ -72,6 +72,7 @@ from nova.tests.unit import fake_network_cache_model
from nova.tests.unit.objects import test_instance_fault from nova.tests.unit.objects import test_instance_fault
from nova.tests.unit.objects import test_instance_info_cache from nova.tests.unit.objects import test_instance_info_cache
from nova.tests.unit.objects import test_instance_numa from nova.tests.unit.objects import test_instance_numa
from nova import utils
from nova.virt.block_device import DriverVolumeBlockDevice as driver_bdm_volume from nova.virt.block_device import DriverVolumeBlockDevice as driver_bdm_volume
from nova.virt import driver as virt_driver from nova.virt import driver as virt_driver
from nova.virt import event as virtevent from nova.virt import event as virtevent
@@ -4288,6 +4289,18 @@ class ComputeManagerUnitTestCase(test.NoDBTestCase,
             power_state.NOSTATE,
             use_slave=True)

+    def test_syncs_in_progress(self):
+        self.assertFalse(self.compute._syncs_in_progress_lock.locked())
+        self.compute._syncs_in_progress.add("fake-uuid")
+        with self.compute.syncs_in_progress() as syncs:
+            self.assertTrue(self.compute._syncs_in_progress_lock.locked())
+            self.assertEqual({"fake-uuid"}, syncs)
+            syncs.remove("fake-uuid")
+        self.assertFalse(self.compute._syncs_in_progress_lock.locked())
+        self.assertEqual(set(), self.compute._syncs_in_progress)
+
     def test_cleanup_running_deleted_instances_virt_driver_not_ready(self):
         """Tests the scenario that the driver raises VirtDriverNotReady
         when listing instances so the task returns early.
@@ -11743,7 +11756,10 @@ class ComputeManagerMigrationTestCase(test.NoDBTestCase,
     def test_max_concurrent_live_semaphore_unlimited(self):
         self.flags(max_concurrent_live_migrations=0)
         mgr = manager.ComputeManager()
-        self.assertEqual(1000, mgr._live_migration_executor._max_workers)
+        if utils.concurrency_mode_threading():
+            self.assertEqual(5, mgr._live_migration_executor._max_workers)
+        else:
+            self.assertEqual(1000, mgr._live_migration_executor._max_workers)

     @mock.patch('nova.objects.InstanceGroup.get_by_instance_uuid', mock.Mock(
         side_effect=exception.InstanceGroupNotFound(group_uuid='')))
@@ -0,0 +1,28 @@
+---
+upgrade:
+  - |
+    The meaning of the 0 value of the config option
+    ``[DEFAULT]max_concurrent_live_migrations`` has been changed. In the
+    past the implementation of "unlimited" used at most 1000 concurrent
+    worker greenthreads. For eventlet mode this behavior is kept, but for
+    the native threading mode it is now reduced to 5 native threads. It is
+    almost always a bad idea to change this config option from its default
+    value, 1. Please read the `concurrency
+    <https://docs.openstack.org/nova/latest/admin/concurrency.html>`__
+    guide for more details.
+  - |
+    The default value of the configuration option
+    ``[DEFAULT]sync_power_state_pool_size`` is changed from 1000 to 5 to
+    have a value that is safe to use in native threading mode. If you are
+    still using the eventlet mode and relying on a higher value then
+    configure that higher value explicitly before the upgrade. Please read
+    the `concurrency
+    <https://docs.openstack.org/nova/latest/admin/concurrency.html>`__
+    guide for more details.
+deprecations:
+  - |
+    The possible 0 value of the configuration option
+    ``[DEFAULT]max_concurrent_live_migrations`` is deprecated and will be
+    removed in a future release. It is almost always a bad idea to change
+    the default value, 1, of this config option. If more performant live
+    migration is needed, use the ``live_migration_parallel_connections``
+    config option instead.
instead.