Failover
Failover is the process by which a standby server (or forwarding-standby server) replaces a server that provides standard, commit-server, or edge-server services. The standby replaces its target server, that is, the immediately upstream server for replica servers, edge servers, standby servers, proxies, and brokers. See also 'upstream server' and 'central server'.
In this diagram, both a high availability standby server and a disaster recovery standby server replicate the commit server's versioned file content and metadata. The commit server is the central (or innermost) server for both edge server (dev) and edge server (build). The commit server and the high availability standby use a shared network file system (NFS) for versioned file content.
The high availability standby has the mandatory option for metadata consistency.
The disaster recovery standby has the nomandatory option to aid performance.
In the diagram:
- The blue lines indicate file content, which is also called versioned files or archive files (versioned files that users submitted to a depot).
- The red lines indicate journal data as journalcopy makes a copy of the upstream server’s journal on the standby.
- The green lines indicate journal data being applied by the pull -L process, from the journal data that journalcopy has retrieved from the upstream server.
To learn more about p4 pull and p4 journalcopy, see Server startup commands.
Although the standard service is the default service and does not require a server specification, for a standby server to take over for a standard server, the standard server must have a server specification.
High Availability and Disaster Recovery
The Failover feature supports two scenarios, High Availability (HA) and Disaster Recovery (DR).
High Availability (HA)
- The central server can be configured to provide one of the following services: standard or commit-server.
- Typically, the standby server is in the same hardware rack as the central server. However, this is not necessarily the case, and a standby server that takes over for an edge server is typically not in the same hardware rack as the central server.
- Typical use case: scheduled maintenance of the central server. Failover is also possible if the central server hardware fails.
  Additional use case: the standby server for an edge server can take over if the edge server hardware fails.
- Typically, the central server participates in the failover process by:
  - Disabling itself in an orderly fashion.
  - Waiting for the journalcopy of the remaining transactions to the standby server.
  - Allowing the standby server to stop the central server or, in the case of a standby server for an edge server, allowing the standby server to stop the edge server that is its upstream server.
High Availability |
---|
The high availability standby server can become the new central server (commit server). |
p4 failover defaults to the central server participating while the high availability standby is preparing to take over as the new central server. |
The former high availability standby has become the new central server. |
If the target server does not participate
If the central server (or the edge server that is the target server) does not participate in the failover, a check is made to ensure that the standby server to which failover is to occur has the mandatory option set. Without the participation of the central server, failing over to a mandatory standby server is required to ensure that the other replicas remain consistent with the new central server after failover. Consistency is assured because during production operations, metadata must be journalcopy'd by all mandatory standby servers before that metadata is replicated to the other replicas.
The mandatory option and performance
For performance reasons, consider deploying one or more mandatory standby servers that are local to the central server. The journalcopy performance of remote mandatory standby servers can affect production replication to the other replicas. This is because each transaction must be journalcopy'd by all standby servers using the mandatory option before the transaction can be replicated to other replicas.
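The effect of a lagging mandatory standby can be pictured with a small model. The sketch below is not how the server is implemented. It only assumes that each journalcopy position can be written as a (journal number, offset) pair and that replication to other replicas is held at the least advanced mandatory standby. The function name, server names, and sample positions are hypothetical.

```python
def replication_limit(upstream_position, mandatory_positions):
    """Return the newest (journal, offset) that may be replicated to other replicas.

    Simplified model (an assumption, not the server's code): other replicas
    cannot move past the least advanced mandatory standby.
    """
    if not mandatory_positions:
        # With no mandatory standbys, nothing holds replication back.
        return upstream_position
    return min(upstream_position, *mandatory_positions.values())


# Hypothetical positions: a local standby that is current and a remote one that lags.
limit = replication_limit(
    (5, 682),
    {"local-standby": (5, 682), "remote-standby": (5, 400)},
)
print(limit)
```

In this made-up case the limit is the remote standby's position, which is why a slow remote mandatory standby can hold back every other replica.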
Disaster Recovery (DR)
- Typical use case: due to a sudden catastrophe, the central server (and any HA standby servers) are unable to operate.
- For assistance with failing over to a non-mandatory standby server when the central server is inaccessible, contact Perforce Support.
Consistency of the downstream replicas is assured for failing over when:
- The central server participates, in which case:
  - The standby server does not need to be a mandatory standby.
  - The standby server's journalcopy, pull -L, and pull -u threads are an integral part of the failover.
- The central server (or the edge server that is the target server) does not participate and the standby server is a mandatory standby, in which case only the standby server's pull -L thread is an integral part of the failover.
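The two consistency cases above can be restated as a small decision table. The following is a minimal sketch for reasoning about a deployment, not part of any Perforce tool. The names FailoverPlan and plan_failover are hypothetical, and the thread lists are taken directly from the cases described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPlan:
    consistent: bool   # is downstream replica consistency assured?
    threads: tuple     # standby threads that are integral to the failover


def plan_failover(target_participates: bool, standby_mandatory: bool) -> FailoverPlan:
    """Restate the documented consistency rules as a lookup."""
    if target_participates:
        # The standby does not need to be mandatory when the target participates.
        return FailoverPlan(True, ("journalcopy", "pull -L", "pull -u"))
    if standby_mandatory:
        # Without participation, only a mandatory standby assures consistency.
        return FailoverPlan(True, ("pull -L",))
    # Non-participating target and a nomandatory standby: not assured.
    return FailoverPlan(False, ())


print(plan_failover(target_participates=False, standby_mandatory=True))
```

The last branch corresponds to the situation for which the text directs you to contact Perforce Support.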
Disaster Recovery |
---|
The disaster recovery standby server can become the new central server. |
If a catastrophic failure of the data center occurs, both the central server and the high availability standby server might be unavailable. |
The disaster recovery standby server has become the new central server (commit server). |
Prerequisites for a successful failover
- The p4 failover command must be run on a server of Type standby or forwarding-standby. To learn more, see Standby and forwarding-standby.
- The standby (or forwarding-standby) server must be appropriately licensed for its new role following the failover. Fill out the form at Helix Core Duplicate Server Request.
- Make sure that monitoring (p4 monitor) is enabled for the new standby server (former standard or commit-server). Monitoring must be enabled at server startup of the standby prior to running the p4 failover command. This is because the monitor subsystem is used to terminate the journalcopy, pull -L, and pull -u threads during the failover sequence.
- Open the server spec for each standby and forwarding-standby server. In the ReplicatingFrom field, enter the serverID of the server from which the standby server is journalcopy'ing.
- If an edge server is being failed over, the service user of the edge server should be logged into the commit (or standard) server using the file specified by the P4TICKETS variable that is defined for the standby of the edge server. For example, run the following commands on the standby server that will become the new central server:
export P4TICKETS=directory/.p4tickets
p4 -p target:port -u serviceuser login
If you plan to use Failback after failover to restore the original central server, we also recommend that a DNS alias point to the IP address of the standby server.
Failover to a standby or forwarding-standby
Failing over to a dedicated standby is generally faster than failing over to a forwarding-standby. For situations where failover completion is less time-critical, you might want to consider a forwarding-standby. To learn more, see "standby" and "forwarding standby" in p4 server in the P4 CLI Reference.
High availability with the mandatory server specification option
To deploy standby servers with minimal interruption to replication, before you set the standby to mandatory, make sure the journalcopy thread of the new standby server is caught up with the server from which it is journalcopying.
Follow this process:
- Deploy the standby with the default, which is nomandatory
- To monitor the progress of the standby's journalcopy, on the server from which the standby is journalcopying, invoke the p4 servers -J command.
In this example, we have invoked p4 servers -J on a central server named commit-uk, and Line 4 indicates that standby2 has 400, which does not yet match the 682 value on the central server named commit-uk:
commit-uk '2025/03/12 14:44:52' commit-server 5/682 5/682 wadL/1 1
edge '2025/03/12 14:44:52' edge-server 5/458 5/458 waD1/2 1
standby1 '2025/03/12 14:44:52' standby 5/682 1/0 WAd1/4 1
standby2 '2025/03/12 14:44:52' standby 5/400 2/0 WAd1/4 1
standby3 '2025/03/12 14:44:52' standby 5/682 3/0 wAd1/4 1 mandatory
standby4 '2025/03/12 14:44:52' standby 5/682 4/0 wAd1/4 1 mandatory
Later, again on the central server named commit-uk, we invoke the p4 servers -J command. Now Line 4 indicates that standby2 has progressed to 682, which matches the central server named commit-uk. This means that the standby2 journalcopy is current.
commit-uk '2025/03/12 14:44:52' commit-server 5/682 5/682 wadL/1 1
edge '2025/03/12 14:44:52' edge-server 5/458 5/458 waD1/2 1
standby1 '2025/03/12 14:44:52' standby 5/682 1/0 WAd1/4 1
standby2 '2025/03/12 14:44:52' standby 5/682 2/0 WAd1/4 1
standby3 '2025/03/12 14:44:52' standby 5/682 3/0 wAd1/4 1 mandatory
standby4 '2025/03/12 14:44:52' standby 5/682 4/0 wAd1/4 1 mandatory
- Change the server spec for standby2 to specify mandatory.
On the central server, in the server specification for standby2, under Options, mandatory is now appropriate for a standby (or forwarding-standby) server. This option ensures that no replica has metadata that has not been copied to the journalcopy of all mandatory standby (or forwarding-standby) servers.
If the central server were unavailable, standby1, which is not a mandatory standby, could not be used for failover.
If the central server is available, all four of the standbys could be used for failover.
If the server from which failover is to occur is not participating in the failover, the p4 failover command returns an error if it is running on a standby (or forwarding-standby) server that is not properly configured with the mandatory option. Two reasons for non-participation are that the central server named commit-uk is unavailable or that the -i option causes the central server to be ignored.
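The comparison made by eye in the p4 servers -J examples above can also be scripted. The sketch below parses the first sample output shown in this section and checks whether a standby has reached the upstream position. It assumes, based on the description above, that the fourth field is the journal/offset pair being compared and that a trailing mandatory word marks a mandatory standby. The function names are hypothetical, and the text is supplied as a string rather than captured from a live server.

```python
import shlex

SAMPLE = """\
commit-uk '2025/03/12 14:44:52' commit-server 5/682 5/682 wadL/1 1
edge '2025/03/12 14:44:52' edge-server 5/458 5/458 waD1/2 1
standby1 '2025/03/12 14:44:52' standby 5/682 1/0 WAd1/4 1
standby2 '2025/03/12 14:44:52' standby 5/400 2/0 WAd1/4 1
standby3 '2025/03/12 14:44:52' standby 5/682 3/0 wAd1/4 1 mandatory
standby4 '2025/03/12 14:44:52' standby 5/682 4/0 wAd1/4 1 mandatory
"""


def parse_servers_j(text):
    """Map each server ID to its services, position, and mandatory marker."""
    rows = {}
    for line in text.splitlines():
        fields = shlex.split(line)  # keeps the quoted timestamp as one field
        if len(fields) < 4:
            continue
        journal, offset = (int(part) for part in fields[3].split("/"))
        rows[fields[0]] = {
            "services": fields[2],
            "position": (journal, offset),      # assumed: fourth field
            "mandatory": "mandatory" in fields[7:],
        }
    return rows


def caught_up(rows, standby, upstream):
    """True when the standby's position has reached the upstream position."""
    return rows[standby]["position"] >= rows[upstream]["position"]


rows = parse_servers_j(SAMPLE)
print(caught_up(rows, "standby2", "commit-uk"))
```

With the first sample, standby2 is reported as not caught up, which matches the reading of 400 against 682 in the text. A check like this could gate the change to mandatory in the final step of the process.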
Disaster recovery with the nomandatory server specification option
For disaster recovery failover, the remote standby typically has a server specification with the Options field set to the default value, which is nomandatory. This is because the journalcopy performance of a mandatory standby can affect the speed of replication to the replicas of the central server.
After failover
After the failover process is complete:
- The former standby server is now the new central server.
- The standby server’s spec in the new central server’s metadata is left intact, except that it is set to nomandatory. This prevents transactions from being stalled while waiting for the journalcopy of those transactions to a standby server that no longer exists. Also, the spec can be used when the former central server is configured as a new standby server.
- The central server is stopped if the central server participated in the failover.
Potential data loss
If the central server participates
- Any commands that were not completed when failover began might need to be executed again on the new central server.
- There should not be any data loss.
If the target server does not participate
- Standby is mandatory.
- Any commands that were not completed when failover began might need to be executed again on the new central server.
- The following type of transactions will be lost: transactions that were done directly on the central server prior to the failover that had not yet been journalcopy'd to the standby server being used for the failover.
- To minimize data loss, the standby server used for the failover should be the standby server that was the most current with the central server at the time of the failover. Typically, this is the standby that is in the same rack with the central server. The goal is that:
  - The downstream replicas are consistent with the new central server.
  - The downstream replicas will not have data loss relative to the new central server.
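Choosing the most current standby, as recommended above, amounts to picking the largest journalcopy position among the eligible standbys. The following is a minimal sketch under the assumption that positions are known as (journal number, offset) pairs and that, because the target server is not participating, only mandatory standbys are eligible. The function name, server names, and positions are hypothetical.

```python
def most_current_standby(standbys):
    """Return the name of the mandatory standby with the newest position.

    standbys maps a server ID to {"position": (journal, offset), "mandatory": bool}.
    Returns None when no mandatory standby is available.
    """
    eligible = {name: s for name, s in standbys.items() if s["mandatory"]}
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["position"])


# Hypothetical snapshot taken shortly before the central server became unavailable.
snapshot = {
    "rack-standby": {"position": (5, 682), "mandatory": True},
    "remote-standby": {"position": (5, 640), "mandatory": True},
    "dr-standby": {"position": (5, 700), "mandatory": False},
}
print(most_current_standby(snapshot))
```

Here the same-rack standby is selected even though the nomandatory standby shows a later position, because the nomandatory standby is excluded from the candidates.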
Failover process
The Failover feature allows the super user to do the following.
Step | Description |
---|---|
1. | Get a report of whether conditions look good for a successful failover. If the report indicates that the existing central server is still accessible and ignoring that server has been requested with the -i option, this could result in two separate servers, each of which is unaware of the other. This "split-brain" situation can produce inconsistencies that compromise the integrity of your data. |
2. | Initiate the failover process on the standby server. |
3. | Monitor the steps that are reported during the process. If the failover process encounters an error, the process is designed to inform the superuser and to stop the failover process so that corrective action can be taken and a new attempt can occur. If an error is encountered after the standby server has stopped the central server, the standby server will not restart the central server. |
4. | After the completion of a successful failover, |
5. | You have the option of performing a Failback after failover. |
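Steps 1 and 2 differ only in whether the command reports or acts. The sketch below builds the argument list for each case without running anything. The -i option comes from the text above. Treating -y as the option that performs the failover instead of only reporting is an assumption here and should be confirmed in the p4 failover entry of the P4 CLI Reference. The helper name is hypothetical.

```python
def failover_command(execute=False, ignore_target=False):
    """Build an argument list for p4 failover without running it.

    execute=False corresponds to the report in step 1.
    execute=True adds -y (assumed to be the option that performs the failover).
    ignore_target=True adds -i, which carries the split-brain risk described
    in step 1 if the target server is still accessible.
    """
    argv = ["p4", "failover"]
    if execute:
        argv.append("-y")
    if ignore_target:
        argv.append("-i")
    return argv


print(" ".join(failover_command()))              # report only
print(" ".join(failover_command(execute=True)))  # initiate the failover
```

Keeping the report and the execution as separate calls mirrors the order of the steps: review the report first, then initiate.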
Configurables affected
The failover process:
- Makes no changes to the configurables on the original central server.
- Can make changes to the following for the new central server so that the values are appropriate for the new environment:
Configurable | Environment Variable |
---|---|
client.readonly.dir | P4AUDIT |
client.sendq.dir | P4JOURNAL |
journalPrefix | P4LOG |
pull.trigger.dir | P4TICKETS |
server.depot.root | P4TRUST |
server.extensions.dir | P4ROOT |
server.locks.dir | |
statefile | |
Configurables and edge server
When failing over to a standby from an edge (or other replica) server, the updated configurables for the edge server will need to be manually changed on the commit server. This is because the update of the configurables cannot be propagated back to the commit (or upstream) server automatically, given that the edge server might, or might not, be participating in the failover.
If p4 failover cannot be used
The p4 failover command is not supported for the disaster recovery scenario in which the only standby servers remaining use the nomandatory option. At a remote disaster recovery site, it is generally recommended that the standby server use the nomandatory option.
Network latency and bandwidth to the server from which the standby server is journalcopy'ing might delay journalcopy operations to a standby server at a remote disaster recovery site. If the journalcopy operations are delayed and that standby server uses the mandatory option, replication to other replicas could be delayed. This is because each transaction must be journalcopy'd by all standby servers using the mandatory option before the transaction can be replicated to other replicas.
Metadata consistency issue with nomandatory
Using the nomandatory option avoids the potential performance issue mentioned above. However, if the standby server at a remote disaster recovery site uses the nomandatory option, some transactions might have been replicated to the other replicas but not yet journalcopy'd to the standby server at the remote disaster recovery site.
In this case:
- The standby server using the nomandatory option might lack metadata transactions that are present in the other replicas.
- The other replicas must therefore be reseeded back to a consistent point rather than using the p4 failover command.
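The inconsistency described above arises when a replica holds metadata beyond the point the nomandatory standby has journalcopy'd. The sketch below illustrates that condition only. It assumes positions are available as (journal number, offset) pairs, and the function name, server names, and positions are hypothetical. It does not replace the reseeding procedure.

```python
def replicas_ahead_of_standby(standby_position, replica_positions):
    """List replicas holding metadata that the standby has not yet copied."""
    return sorted(
        name
        for name, position in replica_positions.items()
        if position > standby_position
    )


# Hypothetical state at the moment the central server was lost.
ahead = replicas_ahead_of_standby(
    (5, 400),
    {"edge-dev": (5, 458), "edge-build": (5, 390)},
)
print(ahead)
```

Any replica in the returned list contains transactions that the standby lacks, which is the reason the replicas are reseeded to a consistent point in this scenario.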
Reseeding
The following only applies for a true disaster in which none of the remaining standby servers use the mandatory option.
- Use the standby server as a basis for reseeding the other replicas. Reseeding the other replicas is a supported recovery method when the standby server is using the nomandatory option. To learn more, see Replica server metadata recovery.
- After reseeding the replicas from a checkpoint (or dump) taken from the standby server using the nomandatory option, that standby server can become the new central server because the replicas have been reseeded from its metadata. The metadata of the replicas is now consistent with that of the standby server. After changing the server ID to that of the central server, that server will become the new central server and use the central server's configuration when it is started.