diff --git a/api-guide/source/server_concepts.rst b/api-guide/source/server_concepts.rst
index 6c9303db80..6910ff5334 100644
--- a/api-guide/source/server_concepts.rst
+++ b/api-guide/source/server_concepts.rst
@@ -390,3 +390,136 @@ assigned at creation time.
             "accessIPv6":"::babe:67.23.10.132"
         }
     }
+
+Moving servers
+~~~~~~~~~~~~~~
+
+There are several actions that may result in a server moving from one
+compute host to another, including shelve, resize, migration and
+evacuate. The following use cases demonstrate the intention of the
+actions and the consequences for operational procedures.
+
+**Shelving**
+
+Sometimes a user does not require a server to be active for a while,
+perhaps over a weekend or at certain times of day. This gives
+the cloud operator an opportunity to make better use of resources by
+freeing resources and rebalancing workloads across the infrastructure.
+
+When the user shelves a server the operator can choose to remove it
+from the compute hosts. When it is unshelved it is scheduled to a new
+host according to the operator's policies for distributing workloads
+across the compute hosts, including taking disabled hosts into account.
+This will contribute to increased overall capacity, freeing hosts that
+are earmarked for maintenance and providing contiguous blocks
+of resources on single hosts due to moving out old servers.
+
+Shelving a server is not normally a choice that is available to
+the cloud operator because it affects the availability of the server
+being provided to the user.
+
+**Resize**
+
+Sometimes a user may want to change the flavor of a server, e.g. change
+the quantity of CPUs, disk, memory or any other resource. This is done
+by rebuilding the server with a new flavor. As the server is being
+rebuilt it is normal to reschedule the server to another host
+(although resize to the same host is an option for the operator).
+
+As with shelving, resize provides the cloud operator with an
+opportunity to redistribute workloads across the cloud according
+to the operator's scheduling policy, providing the same benefits as
+above.
+
+Resizing a server is not normally a choice that is available to
+the cloud operator because it changes the nature of the server
+being provided to the user.
+
+**Migration (including cold and live migration)**
+
+Sometimes a cloud operator may need to redistribute workloads for
+operational purposes. For example, the operator may need to remove
+a compute host for maintenance or deploy a kernel security patch that
+requires the host to be rebooted.
+
+The operator has two actions available for deliberately moving
+workloads: cold migration (moving a server that is not active)
+and live migration (moving a server that is active).
+
+Cold migration moves a server from one host to another by copying its
+state, local storage and network configuration to new resources
+allocated on a new host selected by scheduling policies or as
+an explicit decision. The operation is relatively quick as the
+server is not changing its state during the copy process. The user
+does not have access to the server during the operation.
+
+Live migration moves a server from one host to another while it
+is active, so it is constantly changing its state during the action.
+As a result it can take considerably longer than cold migration.
+During the action the server is online and accessible, but only
+a limited set of management actions are available to the user.
+
+The following are two common patterns for employing migrations in
+a cloud:
+
+- **Host maintenance**
+
+  If a compute host is to be removed from the cloud all its servers
+  will need to be moved to other hosts. In this case it is normal for
+  the rest of the cloud to absorb the workload, redistributing
+  the servers by rescheduling them.
+
+  To prepare the host, the operator disables it so it does not receive
+  any further servers. Then each server is migrated to a new
+  host by cold or live migration, depending on the state of the
+  server. When complete, the host is free to be removed.
+
+- **Rolling updates**
+
+  Often it is necessary to perform an update on all compute hosts
+  that requires them to be rebooted. In this case it is not
+  strictly necessary to move inactive servers because they
+  will be available after the reboot. However, active servers would
+  be impacted by the reboot. Live migration will allow them to
+  continue operation.
+
+  In this case a rolling approach can be taken by starting with an
+  empty compute host that has been updated and rebooted. Another host
+  that has not yet been updated is disabled and all its servers are
+  migrated to the new host. When the migrations are complete the
+  new host continues normal operation. The old host will be empty
+  and can be updated and rebooted. It then becomes the new target for
+  another round of migrations.
+
+  This process can be repeated until the whole cloud has been updated,
+  usually using a pool of empty hosts instead of just one.
+
+Migrating a server is not normally a choice that is available to
+the cloud user because the user is not normally aware of compute
+hosts. Management of the cloud and how servers are provisioned
+in it is the sole responsibility of the cloud operator.
+
+**Evacuate**
+
+Sometimes a compute host may fail. This is a rare occurrence, but when
+it happens during normal operation the servers running on the host may
+be lost. In this case the operator may recreate the servers on the
+remaining compute hosts using the evacuate action.
+
+It can be proved that reliable failure detection is impossible in compute
+systems with asynchronous communication, so a host can only be presumed to
+have failed.
+Usually when a host is considered to have failed it should be excluded from
+the cloud and any virtual networking or storage associated with servers on
+the failed host should be isolated from it. These steps are called fencing
+the host. Initiating these actions is outside the scope of Nova.
+
+Once the host has been fenced its servers can be recreated on other
+hosts without the risk of the old incarnations reappearing and trying to
+access shared resources. It is usual to redistribute the servers
+from a failed host by rescheduling them.
+
+Evacuating a server is solely in the domain of the cloud operator because
+it must be performed in coordination with other operational procedures to
+be safe. A user is not normally aware of compute hosts but is adversely
+affected by their failure.