This adds two tests and updates the cross-cell resize docs to
show that _poll_unconfirmed_resizes can work if the cells are
able to "up-call" to the API DB to confirm the resize. Since
lots of deployments still enable up-calls we don't explicitly
block _poll_unconfirmed_resizes from processing cross-cell
migrations. The other test shows that _poll_unconfirmed_resizes
fails if up-calls are disabled.
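As a rough illustration, the up-call-enabled test can drive the
periodic directly; the helper names below come from Nova's functional
test suite, but the flow is simplified and abbreviated here:

    from nova import context

    # Shrink the confirm window so the finished resize is immediately
    # eligible for auto-confirmation by the periodic.
    self.flags(resize_confirm_window=1)
    # (resize the server cross-cell and let the migration age past
    # the window, then run the periodic on the target-cell compute)
    target_compute = self.computes['host2']
    target_compute.manager._poll_unconfirmed_resizes(
        context.get_admin_context())
    # With up-calls enabled the resize is confirmed; without them the
    # periodic fails and the server stays in VERIFY_RESIZE.
    self._wait_for_state_change(server, 'ACTIVE')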
Part of blueprint cross-cell-resize
Change-Id: I39e8159f3e734a1219e1a44434d6360572620424
This implements the cleanup_instance_network_on_host method
in the neutron API, which deletes the port bindings for the
given instance on the given host, similar to how
setup_networks_on_host works when teardown=True and the
instance.host does not match the host provided to that method.
This allows removing the hacks in the
_confirm_snapshot_based_resize_delete_port_bindings and
_revert_snapshot_based_resize_at_dest methods.
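For illustration, the cleanup in those methods now reduces to a
single network API call (variable names are placeholders):

    # Delete this instance's port bindings on the given host; mirrors
    # what setup_networks_on_host does on teardown when the hosts
    # don't match.
    self.network_api.cleanup_instance_network_on_host(
        ctxt, instance, self.host)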
Change-Id: Iff8194c868580facb1cc81b5567d66d4093c5274
Rather than copy the instance action event from the target
cell DB to the source cell DB when
_finish_snapshot_based_resize_at_dest fails on the dest host,
we can simply use the EventReporter context manager in conductor
with the source cell context and get the same event.
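Roughly, the pattern is the following; the event name, the helper
called inside the block, and the variable names are illustrative,
and the EventReporter signature is abbreviated:

    from nova.compute import utils as compute_utils

    # An exception raised inside the block records a failed action
    # event in the source cell DB, so nothing needs to be copied
    # over from the target cell.
    with compute_utils.EventReporter(
            source_cell_context,
            'compute_finish_snapshot_based_resize_at_dest',
            dest_host, instance.uuid):
        finish_resize_at_dest(target_cell_context)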
Part of blueprint cross-cell-resize
Change-Id: I8053765797f097859bf585459f4a00d31844708e
This tries to strike a balance between giving a useful high-level
flow and not injecting too much complex detail into each diagram.
For the more complicated resize diagram, I have used labels to
try to make clear which conductor task is performing an action.
For the less complicated confirm and revert diagrams, I add a
separator to show where the conductor task is orchestrating the
calls and provide a bit more detail about what each task is doing,
since the calls to the computes are minimal in those cases.
Part of blueprint cross-cell-resize
Change-Id: I27c549901a3359f106ba5d77aa6559397ee12a5d
This gives most of the high-level information. I'm sure there
are more troubleshooting things we can add, but those can come
later as they crop up.
The sequence diagram(s) will come in a separate change.
Part of blueprint cross-cell-resize
Change-Id: I13f07a2d45bf5b8584adc8aa079bae640cb5c470
This changes the nova-multi-cell job to essentially
force cross-cell resize and cold migration. By "force"
I mean there is only one compute in each cell and
resize to the same host is disabled, so the scheduler
has no option but to move the server to the other cell.
This adds a new role to write the nova policy.yaml file
to enable cross-cell resize, and a pre-run playbook so
that the policy file is set up before tempest runs.
Part of blueprint cross-cell-resize
Change-Id: Ia4f3671c40e69674afc7a96b5d9b198dabaa4224
This adds the "compute:servers:resize:cross_cell" policy
rule which is now used in the API to determine if a resize
or cold migrate operation can be performed across cells.
The check in the API is based on:
- The policy check passing for the request.
- The minimum nova-compute service version being high
enough across all cells to perform a cross-cell resize.
If either of those conditions fails, a traditional same-cell
resize will be performed.
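The combined gate is roughly the following sketch; the function
name, constant name, and threshold value are assumptions, not
necessarily the exact Nova code:

    MIN_COMPUTE_CROSS_CELL_RESIZE = 47  # assumed version threshold

    def allow_cross_cell_resize(context, min_service_version):
        """Cross-cell resize only if policy passes *and* every
        cell's computes are new enough; otherwise same-cell."""
        allowed = context.can(
            'compute:servers:resize:cross_cell', fatal=False)
        return (allowed and
                min_service_version >= MIN_COMPUTE_CROSS_CELL_RESIZE)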
A docs stub is added here and will be fleshed out in an
upcoming patch.
Implements blueprint cross-cell-resize
Change-Id: Ie8a0f79a3b16e02b7a34a1b81f547013a3d88996
There was no code in FakeDriver that updated the internal
dict `self.instances` when a FakeInstance was live migrated.
This commit fills this gap. As a result, a couple of versioned
notification samples get updated since we are now properly
tracking a live migrated instance on the destination host in
the running rather than the pending power state.
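The missing bookkeeping looks roughly like this; the exact hook in
FakeDriver may differ:

    from nova.compute import power_state
    from nova.virt import fake

    def post_live_migration_at_destination(self, context, instance,
                                           network_info,
                                           block_migration=False,
                                           block_device_info=None):
        # Start tracking the live migrated instance on the
        # destination host as RUNNING so notifications report the
        # right power state.
        self.instances[instance.uuid] = fake.FakeInstance(
            instance.name, power_state.RUNNING, instance.uuid)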
Closes-Bug: 1426433
Change-Id: I9562e1bcbb18c7d543d5a6b36011fa28c13dfa79
This patch addresses a number of typos and minor
issues raised during review of [1][2]. The changes
are corrections to typos in comments, a correction
to the exception message, an update to the release
note, and the addition of debug logging.
[1] I0322d872bdff68936033a6f5a54e8296a6fb3434
[2] I48bccc4b9adcac3c7a3e42769c11fdeb8f6fd132
Related-Bug: #1804502
Related-Bug: #1763766
Change-Id: I8975e524cd5a9c7dfb065bb2dc8ceb03f1b89e7b
This test creates 2 servers which take up all of the
CUSTOM_PMEM_NAMESPACE_SMALL resources on the sole host in
the test. Then it tries to create a 3rd server which it
expects to fail, and it does. Then it deletes the second
server to free up one CUSTOM_PMEM_NAMESPACE_SMALL resource
on the host and schedules a fourth server which it expects
to work. The problem is that the test doesn't wait for the
second server to be fully deleted, so the test intermittently
fails when the CUSTOM_PMEM_NAMESPACE_SMALL resource hasn't
been freed yet by the time the fourth server goes through
scheduling. This adds a simple wait for the deleted server to
actually be gone before creating the fourth server.
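The fix is essentially one extra call; _wait_until_deleted is the
existing helper in Nova's functional test base classes, and the
variable name is a placeholder:

    self.api.delete_server(server2['id'])
    # Block until the server is really gone and its
    # CUSTOM_PMEM_NAMESPACE_SMALL allocation is released, so
    # scheduling the fourth server is deterministic.
    self._wait_until_deleted(server2)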
Change-Id: Id3389038b5fed1e70dd12c7ed9cef2c0950625cf
Closes-Bug: #1856902
To be able to test cancelling a live migration we need a special virt
driver in the functional tests, which requires a separate test class.
Before that test class is added, this patch pulls some of the common
test methods up to an already existing base class so they can be used
from the new test class in a subsequent patch.
Change-Id: I83937bc8a4d7665c7f435020e284e269b98985c8
blueprint: support-move-ops-with-qos-ports-ussuri
This patch adds extra functional test coverage for live migration with
qos ports. It covers live migration with a target host specified, and
re-schedule success and failure cases.
There are two non test code changes.
1)
In the check_can_live_migrate_destination compute manager method,
CONF.host was used to determine the destination host. This was
replaced with self.host. The change results in identical behavior in a
production environment, but during the functional tests the CONF
object is global across every compute service running in the
environment. As a result, when the above call runs on host2, CONF.host
is already set to host3 because that host was started later in the
test. Fortunately self.host is set on the compute manager during
compute startup, so it is correct in the functional test env too (a
minimal illustration follows below). It is unfortunate that we have to
change non test code to make the test code work, but the global CONF
variable is a hard problem to resolve.
2)
The InstancePCIRequest is saved in the LiveMigrationTask as the task
updates the request for each host it tries to move the instance to. If
none of the destinations supports the migration, then the task's
rollback() restores the InstancePCIRequest to its original value, as
the instance will remain on the source host.
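A minimal, self-contained illustration of the global CONF problem
from 1), using oslo.config directly; this is not Nova code:

    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opt(cfg.StrOpt('host'))

    # Two in-process "compute services" share one global CONF:
    CONF.set_override('host', 'host2')  # host2 starts first
    CONF.set_override('host', 'host3')  # host3 starts later
    print(CONF.host)  # 'host3' -- wrong for code acting as host2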
Change-Id: I351399f02d4d3b3e958a62fe37380577f3056d0d
blueprint: support-move-ops-with-qos-ports-ussuri
The RequestSpec.get_request_group_mapping method's doc wrongly states
that the return value is a dict of UUID values. Actually each value in
the returned dict is a list of UUIDs.
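In other words, the shape of the return value is (example IDs made
up):

    mapping = request_spec.get_request_group_mapping()
    # dict keyed by request group requester_id (e.g. a Neutron port
    # id), where each value is a *list* of resource provider UUIDs:
    # {'port-uuid-1': ['rp-uuid-a'],
    #  'port-uuid-2': ['rp-uuid-b']}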
Change-Id: I75b02757b319ea0f706a9c4167a227f5a043c37a
This patch enhances the live_migration task in the conductor to support
qos ports during live migration. The high level sequence of events is
the following:
* when the request spec is gathered before the scheduler call, the
resource requests are collected from the neutron ports and the request
spec is updated
* after a successful scheduling, the request group to resource provider
mapping is calculated
* the instance pci requests are updated to drive the pci claim on the
target host to allocate VFs from the same PCI PF the bandwidth is
allocated from
* the inactive port binding on the target host is updated to have the RP
UUID in the allocation key according to the resource allocation on the
destination host.
As the port binding is already updated in the conductor, the late check
on the allocation key in the binding profile is turned off for live
migration in the neutronv2 api.
Note that this patch only handles live migration without a target host;
subsequent patches will add support for migration with a target host
and other edge cases like re-schedule.
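As a rough sketch of where the steps above live in the conductor
task; the method names are illustrative, simplified from the real
LiveMigrationTask:

    def _execute(self):
        # 1) port resource requests -> request spec (pre-scheduling)
        self._update_request_spec_with_port_resource_requests()
        # 2) schedule, then compute the request group -> RP mapping
        host = self._find_destination()
        mapping = self.request_spec.get_request_group_mapping()
        # 3) update the instance PCI requests to drive the PCI claim
        # on the target host
        self._update_pci_requests(mapping)
        # 4) put the RP UUIDs into the allocation key of the inactive
        # port binding on the destination host
        self._update_port_bindings(host, mapping)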
blueprint: support-move-ops-with-qos-ports-ussuri
Change-Id: Ibb84ea210795634f02997d4627e0beec5fea36ee