RHQ High Availability - Design and Goals

RHQ 1.1.0 is scheduled to introduce RHQ High Availability (HA). The initial goal of RHQ HA is to support multiple RHQ servers configured for a single database repository. Agent load can then be partitioned amongst the available servers, and agents whose server goes down for any reason will fail over to another server. RHQ HA will therefore introduce scalability and fault tolerance.

The sections below describe features of RHQ High Availability that are currently planned for version 1.1.0. Although accurate at the time of writing, the content and features below are subject to change or omission.

For a quick illustration of what HA is going to accomplish, see this Flash animation: failover.swf

Server Cloud

The foundation of RHQ HA is a cloud of RHQ servers. The cloud is made up of one or more loosely coupled RHQ servers. Cloud members must:
A single-server installation is considered a 1-member HA server cloud and must therefore supply valid name and endpoint information.

The RHQ installer will handle multiple scenarios: single server installation, the addition of a cloud member, and the upgrade/re-installation of an existing cloud member. The server distribution will remain a .zip file in RHQ 1.1.0; the archive will unzip into a new directory.

Load Balancing

HA will provide load balancing by partitioning agents amongst servers in the cloud. There are various factors that will contribute to the partition algorithm:
Affinity

An "Affinity Group" is a tag (just a unique name) created by the administrator. It can then be assigned to any number of RHQ servers and agents. For example, in an environment with several data centers it may be useful to assign each data center an affinity group (e.g. "DC1", "DC2"). Agents and servers co-located in a single data center can then set their affinity groups appropriately. The partition algorithm will then prefer assigning agents to servers within the same affinity group. However, depending on load and availability, affinity is not guaranteed.
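As a minimal sketch of the affinity preference described above (not the RHQ implementation), a server list can be ordered so that servers sharing the agent's affinity group come first and all other servers remain as fallbacks. The `Server` class and the `order_by_affinity` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Server:
    """Hypothetical stand-in for an RHQ server entry."""
    name: str
    affinity_group: Optional[str] = None


def order_by_affinity(agent_group: Optional[str], servers: List[Server]) -> List[Server]:
    """Return servers in the agent's affinity group first, all others after.

    sorted() is stable, so servers keep their relative order within each half.
    Non-affinity servers stay in the list because affinity is a preference,
    not a guarantee.
    """
    if agent_group is None:
        return list(servers)
    # False sorts before True, so matching servers move to the front.
    return sorted(servers, key=lambda s: s.affinity_group != agent_group)


servers = [Server("S1", "DC1"), Server("S2", "DC2"), Server("S3", "DC1")]
print([s.name for s in order_by_affinity("DC1", servers)])
```

For an agent tagged "DC1", the two "DC1" servers are listed ahead of the "DC2" server, which remains available as a failover target.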
Round Robin

The goal is to partition agents across servers while taking into consideration aspects such as affinity. Additionally, the partition algorithm will take failover topology into consideration. Agents will be assigned server lists (see below) in a weighted round-robin fashion. For example, in a 3-Server cloud with no affinity, agents would be assigned server lists similar to:

If S1, S2, A1, A2, A3 all had affinity then the assigned server lists may look similar to:

Compute Power

Servers may vary in the load they can carry. If servers in a cloud vary significantly in computing power, the administrator can alter the assigned compute power to distribute load appropriately. Compute power is relative; by default all servers in the cloud are assigned a compute power of 1. So, for example, in a 2-Server cloud where S1 is twice as powerful as S2 and you want S1 to assume twice the agent load, set ComputePower(S1)=2 and leave ComputePower(S2)=1. Compute powers are set as positive integers.

Algorithm

A few notes on the partition algorithm. Affinity is strong and will always be satisfied if servers with the correct affinity are available. As shown in the example above, affinity servers will always be the first entries in the server list for an agent with defined affinity. Load will be distributed amongst affinity servers while satisfying affinity. But, as also shown in the example, if no affinity servers are available the agent will fail over to an available non-affinity server. So, although it is possible, and potentially desirable, to have load imbalance across all servers due to affinity, agents with affinity should have their load fairly well distributed amongst their affinity servers.

The repartition algorithm attempts to limit connection churn: it will attempt to maintain as many primary server assignments as possible while still balancing load. All known servers and agents participate in the repartition.
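The compute power weighting described above can be illustrated with a small greedy assignment. This is a sketch under stated assumptions, not the RHQ partition algorithm: it ignores affinity and failover topology, assigns only primary servers, and the function name `assign_primaries` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def assign_primaries(agents: List[str], compute_power: Dict[str, int]) -> Dict[str, str]:
    """Assign each agent a primary server in proportion to compute power.

    Each agent goes to the server whose load, after taking the agent and
    divided by its compute power, would be lowest. Ties go to the server
    whose name sorts first.
    """
    for server, power in compute_power.items():
        if not isinstance(power, int) or power < 1:
            raise ValueError(f"compute power for {server} must be a positive integer")

    load = {server: 0 for server in compute_power}
    primary = {}
    for agent in agents:
        target = min(load, key=lambda s: ((load[s] + 1) / compute_power[s], s))
        primary[agent] = target
        load[target] += 1
    return primary


# 2-Server cloud from the text: ComputePower(S1)=2, ComputePower(S2)=1.
agents = [f"A{i}" for i in range(1, 7)]
primaries = assign_primaries(agents, {"S1": 2, "S2": 1})
print(Counter(primaries.values()))
```

With six agents, S1 ends up as primary for four agents and S2 for two, matching the 2:1 compute power ratio.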
Servers in DOWN or MAINTENANCE_MODE are normally expected to be up and operating in NORMAL fashion, so they are included in the repartition on the assumption that their unavailability will be remedied. Long-term downtime should be handled by deleting the server and potentially reinstalling it at a later time.

Server Assignment

Perhaps the best way to understand the proposed behavior for Agent and Server assignment is to look at various use cases for how an RHQ agent determines its server. To do this, a few terms need to be defined:
Registration and Connection

Agents go through a registration/connection phase when initially contacting the configured server. A successful registration will return the most recent server list to the agent. The agent will then set the configured server to the primary server and attempt to connect. If the connection fails, it will set the configured server to the next server in the list and try again, failing over until it succeeds or must start again from the primary. Therefore, the registration server and the connected server may not be the same server.

Agent startup logic:

If the agent:
Then
If the agent:
Then
If the agent:
Then
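The connection step described under Registration and Connection (start at the primary, move to the next server in the list on failure, and start over from the primary when the list is exhausted) can be sketched as follows. This is only an illustration: `connect_from_server_list`, the `try_connect` callable, and the `max_passes` and `retry_delay` parameters are hypothetical and are not part of the RHQ agent.

```python
import time
from typing import Callable, List, Optional


def connect_from_server_list(
    server_list: List[str],
    try_connect: Callable[[str], bool],
    max_passes: int = 3,
    retry_delay: float = 0.0,
) -> Optional[str]:
    """Walk the server list from the primary (index 0) until a connection succeeds.

    Returns the server that accepted the connection, or None if every pass
    over the list failed.
    """
    for attempt in range(max_passes):
        for server in server_list:
            if try_connect(server):
                return server
        # List exhausted: wait, then start again from the primary.
        if attempt < max_passes - 1:
            time.sleep(retry_delay)
    return None


# Simulated cloud in which the primary S1 is unreachable.
reachable = {"S2", "S3"}
connected = connect_from_server_list(["S1", "S2", "S3"], lambda s: s in reachable)
print(connected)
```

Because S1 is unreachable in this simulation, the agent ends up connected to S2, which illustrates why the registration server and the connected server may differ.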
Failover

After successful startup the agent will be connected to a server in its server list, typically the primary server. If the agent loses its connection to its server, it will perform some logic to ensure the connection loss was not just temporary (e.g. a network blip). If reconnection does not succeed, the agent will attempt to fail over to a different server, starting with the head of its server list, until a connection is made or until the server list is exhausted. When a server list is exhausted it will be reprocessed, from the head of the list, after some (configurable) delay.

Upon connection to a new server the agent will scale its workload incrementally. This is to prevent overwhelming a particular server after a large-scale failover. For example, in a 2-Server cloud, if one server goes down all agents will fail over to the remaining server.

Messages from agent to server that were marked for reliable delivery will be sent to the new server once a connection is established.

Cloud Repartition

RHQ HA will, in certain circumstances, repartition the agents amongst the cloud servers. This will result in new server lists being generated for all known agents. It is important to note that the repartition algorithm will seek to limit connection churn, and as such will re-assign the minimal number of agents to new primary servers to accomplish the re-balancing.

A repartition does not push new server lists to connected agents. This prevents large-scale failover in large environments, which could spike a server with connection processing. Instead, agents will intermittently check for updated server lists and reconnect to new primary assignments if necessary. This disperses the connection load.

Redistribution can occur for the following reasons:
Agent Behavior

New agents will be assigned a server list such that load balancing and affinity are satisfied in the same ways as if the agent had been registered during the last cloud repartition.

Connected agents check for updated server lists at (configurable) scheduled intervals. At that time, if an agent is not connected to its primary server, it will attempt to connect to the primary. In this fashion all agents seek to run on their assigned primary server. Having all agents connected to their primary servers gives the best load balancing and affinity satisfaction.

Server Operation Modes

There are four server operation modes: INSTALLED, NORMAL, DOWN, MAINTENANCE. The valid transitions are as follows:
Note that a server started up in NORMAL or MAINTENANCE mode will maintain that mode.

Server Maintenance Mode

An HA Server can be taken out of the cloud for maintenance reasons without actually being shut down. This is done via the HA Administration Console and effectively shuts down all agent communication with the server, although the server remains up and the RHQ GUI remains usable. Agents will treat this as a downed server and will apply reconnect and failover logic as needed. A server taken down in maintenance mode comes up in maintenance mode.

HA Administration Console

The RHQ GUI will offer an HA Administration Console (HAAC), available to RHQ users with management permissions. It will be accessed via the Administration Page in the existing GUI. The Administration Console will offer the following features:
GUI

Note that it doesn't matter which server you connect to in the Server Cloud to use the RHQ GUI; the viewable resources and available options will be identical regardless of which you choose.

RHQ Agent

Commands

The RHQ Agent will have new commands introduced with HA:
Note that the sender status command will now tell you what the configured server currently is (so you can know which server the agent is talking to, or attempting to talk to).

Upgrade

The next version of RHQ plans to have automated agent updates. The RHQ agent will need to run the same version as the RHQ server (this is the "Prime Directive"), and will need to be re-installed with the new version. Ease of installation and agent backward compatibility are high priority goals for future versions.

Future

The following features are currently not in scope for RHQ 1.1.0 but are planned for subsequent releases of RHQ High Availability.

Load Balancing (Future)
Relative Server Power

If the server cloud is made up of servers with unequal compute power, it makes sense to assign more agent load to the servers with more compute power.

Agent Load

RHQ agents can vary significantly in the load they put on a server based on the number of inventoried resources, measurement collection (schedule) frequency, and other factors. HA will base agent assignment not on the number of agents but on relative agent load.

Database Failure Handling (Future)

On database failure all RHQ servers configured for that database will, on a best-effort detection basis, be moved to Maintenance Mode. When the database is restored, servers that are still operating can be reset to Normal operating mode via the GUI HAAC.

Failover (Future)

Initially a server will have no hard limit on how much agent load can be assigned. A potential future feature is the ability to define various limits for server load which, when enforced, will deny agent connection requests.

Redistribution

Redistribution can occur for the following reasons:
RHQ will periodically review the server-agent topology and decide whether redistribution is necessary. If so, RHQ will re-balance agent load across available servers.

RHQ Agent (Future)
Testing

We have documented some of the HA testing we have performed.