High Availability
RHQ supports running multiple RHQ Servers, which provides a High Availability (HA) environment. We call the multiple RHQ Servers the "RHQ HA Server cloud". A multi-server HA environment provides for RHQ Agent failover and distribution of load. HA is integrated into the standard installation process. There is no separate HA installer. The important points to understand are:
Installing or upgrading a single server will result in a single-server environment (a "1-server cloud").
To add more servers to the HA server cloud, run the installer on each target server machine, each time configuring it for the same database and supplying that server's endpoint information and a unique server name.
RHQ Servers can be added to and removed from the cloud at any point in time.
To learn more about whether an HA environment is appropriate for your needs, and how to move to HA from an existing RHQ installation, continue on and read High Availability Configuration.
High Availability Configuration
The goal of HA is to support multiple RHQ Servers configured against a single database repository. RHQ Agent load can then be partitioned amongst the available RHQ Servers. Failover will occur for agents whose server becomes unreachable. A multi-server HA configuration provides fault tolerance and improved scalability. Details of the HA Design can be found at Design-High Availability - Agent Failover.
When You Should Set Up High Availability
In many circumstances, it may be satisfactory to run a single-server configuration. But if your environment satisfies one or more of the following criteria, you may want to consider a multi-server approach:
HA Infrastructure
In reality, every RHQ environment is an "HA configuration". You can consider an environment with a single RHQ Server as a 1-Server HA Cloud because it can still be managed via the HA Administration GUI pages.
For example, the HA Administration Console (HAAC) pages HighAvailability>Servers, HighAvailability>Agents, etc. are all applicable to a single server and accessible from the RHQ GUI. Since a single server is a 1-Server cloud, it easily adapts to an increase in cloud size. In general, RHQ Servers can be added to or removed from the HA Cloud at any time. So, a single-server environment can be turned into a multi-server environment by running the RHQ Server installer on a second machine, configured for the same database, and defining a new server (unique Server Name and public Endpoint).
Database Impact
Although RHQ Servers can be added to the HA Server Cloud with relative ease, it should be done cautiously due to the potential impact on the back-end database. Each RHQ Server limits its concurrent database connections, but there is no restriction on the Cloud itself. This means that adding a second server effectively doubles the potential database connections, even if the number of RHQ Agents remains the same. The increase is linear as servers are added.
Each RHQ Server instance has built-in mechanisms for limiting the load it will put on the database. In the current RHQ release, the out-of-the-box limit is 55 simultaneous connections. Each RHQ Server may use fewer connections (the actual number depends largely on how many agents are connected to it and how much data it needs to process concurrently), but the limit guarantees that no server will ever use more than 55 connections to the database at any given point in time. So, for example, a 2-Server configuration would require that the database be able to handle 110 connections.
The RHQ Administrator should work closely with the Database Administrator to ensure an adequate configuration. In general, a large-scale RHQ configuration requires DBA planning not only to handle connections, but also to provide a database with reasonable data distribution and space allocation. HA impact is just another aspect to take into consideration.
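The connection arithmetic described above can be sketched in a few lines of Python. This is only an illustrative planning aid, not part of RHQ: the function names and the `db_max_connections` parameter are hypothetical, and the only figure taken from the text is the default limit of 55 connections per server.

```python
# Per-server connection cap quoted in the text for the current RHQ release.
DEFAULT_MAX_CONNECTIONS_PER_SERVER = 55


def required_db_connections(server_count,
                            per_server_limit=DEFAULT_MAX_CONNECTIONS_PER_SERVER):
    """Worst-case simultaneous database connections for the whole HA cloud.

    The bound grows linearly with the number of servers and does not
    depend on how many agents are connected.
    """
    if server_count < 1:
        raise ValueError("an HA cloud contains at least one server")
    return server_count * per_server_limit


def fits_database(server_count, db_max_connections,
                  per_server_limit=DEFAULT_MAX_CONNECTIONS_PER_SERVER):
    """True if the database's connection cap (a hypothetical input supplied
    by the DBA) covers the worst case for this many servers."""
    return required_db_connections(server_count, per_server_limit) <= db_max_connections


if __name__ == "__main__":
    for servers in (1, 2, 3):
        print(servers, "server(s):", required_db_connections(servers), "connections")
```

A check such as `fits_database(2, 100)` returns `False`, which is the kind of result that should prompt a conversation with the Database Administrator before the second server is added.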
Note that an HA configuration does not necessarily imply a large number of RHQ Agents. A relatively small RHQ implementation may be in place, with only a few RHQ Agents, but those Agents may need high availability and therefore require failover servers. In that case the backing database will still have a high number of potential connections, but in practice will not reach that limit.
Server and Agent Endpoints
In a multi-server HA configuration, it is important to realize that any agent could potentially try to connect to any server. Thus, it is critical that every RHQ Agent be able to resolve the Endpoint Address set for every RHQ Server in the HA Server cloud. So, when defining the RHQ Server in the installer, it is important that the Endpoint Address be public to the degree that the RHQ Agent population can resolve the RHQ Server's address and reach the RHQ Server via the defined address and port (or secure port, if configured for secure communications). Note that the RHQ Server endpoint information can be updated via the HAAC in the RHQ GUI. Conversely, an RHQ Agent connecting to an RHQ Server must provide an endpoint reachable by all the RHQ Servers in the Cloud in order to allow for the necessary two-way communications.
Failover Lists
Each agent will be assigned a "failover list". A failover list helps the agent determine which server it should communicate with. A failover list has one or more server public endpoints in it.
A failover list is ordered: the first server in the failover list is considered the agent's "primary server". The primary server is the server that the agent should try to communicate with first. If, for some reason, the agent cannot talk to the primary server, the agent will move down the failover list, trying to talk to each server in the order they appear in the failover list. For example, if the first server (the primary server) is down, the agent will attempt to communicate with server #2 in the list. If the agent can't talk to that server, the agent will continue down the list (trying server #3 next) until it successfully communicates with a server.
If the agent exhausts its entire failover list and still cannot communicate with any server, the agent enters a mode where it temporarily stops trying to send messages over the wire and starts to spool the messages to disk so it can retry them later. When the agent discovers that a server has come back online (it usually does this by periodically polling all servers in its failover list until it finds one that it can talk to), the agent will send the messages it previously spooled to disk to that online server and will continue to talk to that server normally.
An agent will try to ensure that it is connected to its primary server. Every hour (the default setting) the agent will check to see if the server it is currently talking to is its primary server. If it is not (for example, if the agent had recently failed over to one of its secondary servers due to its primary server going down), the agent will attempt to re-connect with its primary server. This helps maintain desired affinity and keeps the server/agent HA infrastructure in its most efficient configuration.
All failover lists for all agents are generated by the server.
An agent obtains its failover list when it registers with the server and periodically thereafter (by default, the agent will check every hour to see if it has been assigned a new failover list). A failover list can change when new servers and new agents are added to the environment and when affinity is changed (i.e. when an agent is added to or removed from an affinity group).
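The failover behavior described in this section can be illustrated with a minimal Python sketch. It is a simplified model under stated assumptions, not the RHQ Agent's actual implementation: the `AgentSketch` class, the `pick_server` helper, and the `can_reach` callback are all hypothetical names, an in-memory list stands in for the on-disk spool, and the periodic primary check is shown as a method that a scheduler would call (hourly by default, according to the text).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def pick_server(failover_list: Sequence[str],
                can_reach: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable endpoint in list order, or None if the
    whole failover list has been exhausted."""
    for endpoint in failover_list:
        if can_reach(endpoint):
            return endpoint
    return None


class AgentSketch:
    """Hypothetical model of an agent walking its ordered failover list."""

    def __init__(self, failover_list: Sequence[str],
                 can_reach: Callable[[str], bool]) -> None:
        self.failover_list: List[str] = list(failover_list)  # index 0 is the primary
        self.can_reach = can_reach
        self.current: Optional[str] = None
        self.spooled: List[str] = []                  # stands in for the disk spool
        self.delivered: List[Tuple[str, str]] = []    # (server, message) pairs

    def send(self, message: str) -> None:
        # Fail over only when there is no current server or it is unreachable.
        if self.current is None or not self.can_reach(self.current):
            self.current = pick_server(self.failover_list, self.can_reach)
        if self.current is None:
            # No server reachable: spool the message for a later retry.
            self.spooled.append(message)
            return
        # Flush anything spooled earlier, then send the new message.
        for pending in self.spooled + [message]:
            self.delivered.append((self.current, pending))
        self.spooled.clear()

    def check_primary(self) -> None:
        """Periodic check: switch back to the primary server if it is reachable."""
        if not self.failover_list:
            return
        primary = self.failover_list[0]
        if self.current != primary and self.can_reach(primary):
            self.current = primary
```

In this sketch, `can_reach` would be replaced by a real connectivity probe against each server's public endpoint. Keeping the list order authoritative mirrors the statement that the server, not the agent, generates failover lists.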
You can examine an agent's failover list in one of several ways:
Affinity
By default, agent load is distributed evenly amongst the servers in the cloud. Balance can change in failover situations, but in general agent load will be evenly distributed when all agents and all servers are running. This is fine when it is unimportant which RHQ Agents connect to which RHQ Servers. But there are use cases where it may be desirable to create stronger bonds between specific agents and servers. This is accomplished by defining RHQ HA Affinity Groups. RHQ Agents will prefer connecting to RHQ Servers in the same Affinity Group. Affinity Group assignment is optional, and any given RHQ Agent or RHQ Server can participate in at most one Affinity Group.
Affinity Behavior
Affinity is described in more detail in the HA Design Document, but the following basics should be understood about Affinity Group behavior:
When To Use Affinity
The following scenarios may benefit from Affinity Group assignment.
Physical Efficiency
In general, if it is clear that certain agent-server connections will run more efficiently than others, then defining affinity to prefer those connections makes sense. This could include RHQ Servers and RHQ Agents co-located in the same data center, other geographic groupings, or various network topology scenarios.
Logical Efficiency
Even when certain agents and servers will not run more efficiently by talking to one another, there may be other reasons to group them together. For example, organizational concerns such as administration responsibilities and business units are logical reasons to use affinity grouping.
Warm Backup
It may be the case that certain machines should not be assigned agent load unless specifically needed for failover purposes. In this case you would have all agents assigned affinity to a subset of the available servers, leaving some servers without any associated agents in normal operation.
Moving to an HA installation
After deciding that an RHQ High Availability strategy is appropriate for your needs, you should do two things to prepare for your installation or upgrade:
Note that affinity assignments can be added or removed at any time, but it is useful to consider your initial approach, even if it confirms that affinity assignments are unnecessary. From an installation and upgrade perspective, an HA environment does not require different steps and can in fact be adopted incrementally.
Server Requirements For HA
Each RHQ Server in an HA Server Cloud must:
Note that the first RHQ Server installed will become the initial member of the HA Server Cloud. This means that a single-server installation can also be thought of as a 1-Server HA Cloud and therefore has the same Server requirements. The RHQ GUI HA Administration Console pages are still usable to inspect or manage your environment. Perhaps more importantly, this allows the 2nd, 3rd...Nth RHQ Servers to be added at any time, even while other RHQ Servers are running. Conversely, RHQ Servers can be removed from the Cloud at any time. RHQ Servers communicate solely via the database, and therefore it is not required that their endpoints be visible to each other. No direct server-to-server communication is ever made.
Agent Requirements for HA
Each RHQ Agent in an HA environment must:
To install or upgrade an RHQ Agent in an HA environment, follow the normal agent install/upgrade steps; nothing different is required. Since HA environments typically involve many agents, it may be useful to pre-configure your Agents to avoid having to answer initial setup questions interactively.
Managing an HA installation
Even a 1-Server installation can take advantage of certain HA management capabilities. But after adding a second server, or more, it will be useful to become comfortable with the HA management features available in RHQ. In general, the steps to take when building up and managing an HA Server Cloud are: