Failover

Failover is the process by which a standby server (or forwarding-standby server) replaces a server that provides standard, commit-server, or edge-server services. The standby replaces its target server, that is, the immediately upstream server for replica servers, edge servers, standby servers, proxies, and brokers.

In this diagram, both a high availability standby server and a disaster recovery standby server replicate the commit server's versioned file content and metadata. The commit server is the central (or innermost) server for both edge server (dev) and edge server (build). The commit server and the high availability standby use a shared network file system (NFS) for versioned file content.

The high availability standby has the mandatory option for metadata consistency.

The disaster recovery standby has the nomandatory option to aid performance.

In the diagram:

  • The blue lines indicate file content, also called versioned files or archive files: the files that users submitted to a depot.

  • The red lines indicate journal data as the journalcopy thread copies the upstream server’s journal to the standby.

  • The green lines indicate journal data being applied by the pull -L process, from the journal data that journalcopy has retrieved from the upstream server.

To learn more about p4 pull and p4 journalcopy, see Server startup commands.

Although the standard service is the default service and does not require a server specification, for a standby server to take over for a standard server, the standard server must have a server specification.

High Availability and Disaster Recovery

The Failover feature supports two scenarios, High Availability (HA) and Disaster Recovery (DR).

High Availability (HA)

  • The central server can be configured to provide one of the following services: standard or commit-server.

  • Typically, the standby server is in the same hardware rack as the central server. However, this is not necessarily the case, and a standby server that takes over for an edge server is typically not in the same hardware rack as the central server.

  • Typical use case: scheduled maintenance of the central server, but also possible if the central server hardware fails.
    Additional use case: the standby server for an edge server can take over if the edge server hardware fails.

  • Typically, the central server participates in the failover process by:

    • Disabling itself in an orderly fashion.

    • Waiting for the journalcopy of the remaining transactions to the standby server.

    • Allowing the standby server to stop the central server or, in the case of a standby server for an edge server, allowing the standby server to stop the edge server that is its upstream server.

Figure: High Availability. The high availability standby server can become the new central server (commit server). By default, p4 failover has the central server participate while the high availability standby prepares to take over as the new central server. After failover, the former high availability standby has become the new central server.

If the target server does not participate

If the central server (or the edge server that is the target server) does not participate in the failover, a check is made to ensure that the standby server to which failover is to occur has the mandatory option set. Without the participation of the central server, failing over to a mandatory standby server is required to ensure that the other replicas remain consistent with the new central server after failover. Consistency is assured because during production operations, metadata must be journalcopy'd by all mandatory standby servers before that metadata is replicated to the other replicas.

The mandatory option and performance

For performance reasons, consider deploying one or more mandatory standby servers that are local to the central server. The journalcopy performance of remote mandatory standby servers can affect production replication to the other replicas. This is because each transaction must be journalcopy'd by all standby servers using the mandatory option before the transaction can be replicated to other replicas.

Disaster Recovery (DR)

  • Typical use case: due to a sudden catastrophe, the central server (and any HA standby servers) are unable to operate.
  • For assistance with failing over to a non-mandatory standby server when the central server is inaccessible, contact Perforce Support.

Consistency of the downstream replicas is assured for failing over when:

  • The central server participates, in which case:
    • The standby server does not need to be a mandatory standby.
    • The standby server's journalcopy, pull -L, and pull -u threads are an integral part of the failover.
  • The central server (or the edge server that is the target server) does not participate and the standby server is a mandatory standby, in which case only the standby server's pull -L thread is an integral part of the failover.
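
The cases above amount to a small decision table. The following shell sketch encodes it for illustration only. The function name failover_consistency and its yes/no and mandatory/nomandatory arguments are hypothetical and are not part of the p4 command line.

```shell
#!/bin/sh
# Hypothetical helper: summarizes the consistency rules described above.
#   $1 = yes|no                 does the target server participate?
#   $2 = mandatory|nomandatory  Options value in the standby's server spec
failover_consistency() {
    case "$1/$2" in
        yes/mandatory|yes/nomandatory)
            echo "assured: journalcopy, pull -L, and pull -u threads take part"
            ;;
        no/mandatory)
            echo "assured: only the pull -L thread takes part"
            ;;
        no/nomandatory)
            echo "not assured: a mandatory standby is required"
            return 1
            ;;
        *)
            echo "usage: failover_consistency yes|no mandatory|nomandatory" >&2
            return 2
            ;;
    esac
}

failover_consistency yes nomandatory
failover_consistency no mandatory
```

The non-participating, nomandatory combination returns a non-zero status, mirroring the case in which a mandatory standby is required.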

Figure: Disaster Recovery. The disaster recovery standby server can become the new central server. If a catastrophic failure of the data center occurs, both the central server and the high availability standby server might be unavailable. After failover, the disaster recovery standby server has become the new central server (commit server).

Prerequisites for a successful failover

  • The p4 failover command must be run on a server of Type standby or forwarding-standby. To learn more, see Standby and forwarding-standby.

  • The standby (or forwarding-standby) server must be appropriately licensed for its new role following the failover. Fill out the form at Helix Core Duplicate Server Request.
  • Make sure that monitoring (p4 monitor) is enabled for the new standby server (former standard or commit-server).
    • Monitoring must be enabled at server startup of the standby prior to running the p4 failover command. This is because the monitor subsystem is used to terminate the journalcopy, pull -L, and pull -u threads during the failover sequence.

  • Open the server spec for each standby and forwarding-standby server. In the ReplicatingFrom field, enter the serverID of the server from which the standby server is journalcopy'ing.

  • If an edge server is being failed over, the service user of the edge server should be logged in to the commit (or standard) server using the file specified by the P4TICKETS variable that is defined for the standby of the edge server. For example, run the following commands on the standby server that will become the new central server:

    export P4TICKETS=directory/.p4tickets
    p4 -p target:port -u serviceuser login
  • We recommend that a DNS alias point to the IP address of the central server. This allows the same DNS alias to point to the new central server (former standby server) after failover.
  • If you plan to use Failback after failover to restore the original central server, we also recommend that a DNS alias point to the IP address of the standby server.
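
Some of these prerequisites can be checked against the server spec before a failover is needed. The sketch below is one possible approach, not a supported tool. It assumes that the spec form printed by p4 server -o uses single-line "Field: value" entries named Services and ReplicatingFrom. The helper names spec_field and check_standby_spec are hypothetical, and the sample spec text is illustrative rather than captured from a real server.

```shell
#!/bin/sh
# Print the value of a single-line field from a spec form read on stdin.
# Usage: p4 server -o SERVER_ID | spec_field ReplicatingFrom
spec_field() {
    awk -v key="$1:" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }'
}

# Check the spec-related prerequisites on a spec form read on stdin.
check_standby_spec() {
    spec=$(cat)
    services=$(printf '%s\n' "$spec" | spec_field Services)
    from=$(printf '%s\n' "$spec" | spec_field ReplicatingFrom)
    case "$services" in
        standby|forwarding-standby) ;;
        *) echo "Services is '$services', expected standby or forwarding-standby"
           return 1 ;;
    esac
    if [ -z "$from" ]; then
        echo "ReplicatingFrom is empty"
        return 1
    fi
    echo "ok: $services replicating from $from"
}

# Illustrative spec text, not output captured from a real server.
sample_spec="ServerID: standby2
Type: server
Services: standby
Options: nomandatory
ReplicatingFrom: commit-uk"

printf '%s\n' "$sample_spec" | check_standby_spec
```

Against a live server, the same check would read from p4 server -o standby2 instead of the sample text.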

To help you decide whether a failover operation is necessary, consider monitoring the target server by setting up Triggering on heartbeat (server responsiveness).

Failover to a standby or forwarding-standby

Failing over to a dedicated standby is generally faster than failing over to a forwarding-standby. For situations where failover completion is less time-critical, you might want to consider a forwarding-standby. To learn more, see "standby" and "forwarding standby" in p4 server in the P4 CLI Reference.

High availability with the mandatory server specification option

To deploy standby servers with minimal interruption to replication, before you set the standby to mandatory, make sure the journalcopy thread of the new standby server is caught up with the server from which it is journalcopying.

A high availability standby within an existing installation should not be initially deployed as mandatory.

Follow this process:

  1. Deploy the standby with the default option, which is nomandatory.
  2. To monitor the progress of the standby's journalcopy, on the server from which the standby is journalcopying, invoke the p4 servers -J command.

    In this example, we have invoked p4 servers -J on a central server named commit-uk, and Line 4 indicates that standby2 has 400, which does not yet match the 682 value on the central server named commit-uk:

    commit-uk '2025/03/12 14:44:52' commit-server 5/682 5/682 wadL/1 1
    edge '2025/03/12 14:44:52' edge-server 5/458 5/458 waD1/2 1
    standby1 '2025/03/12 14:44:52' standby 5/682 1/0 WAd1/4 1
    standby2 '2025/03/12 14:44:52' standby 5/400 2/0 WAd1/4 1
    standby3 '2025/03/12 14:44:52' standby 5/682 3/0 wAd1/4 1 mandatory
    standby4 '2025/03/12 14:44:52' standby 5/682 4/0 wAd1/4 1 mandatory

    Later, again on the central server named commit-uk, we invoke the p4 servers -J command. Now Line 4 indicates that standby2 has progressed to 682, which matches the central server named commit-uk. This means that the standby2 journalcopy is current.

    commit-uk '2025/03/12 14:44:52' commit-server 5/682 5/682 wadL/1 1
    edge '2025/03/12 14:44:52' edge-server 5/458 5/458 waD1/2 1
    standby1 '2025/03/12 14:44:52' standby 5/682 1/0 WAd1/4 1
    standby2 '2025/03/12 14:44:52' standby 5/682 2/0 WAd1/4 1
    standby3 '2025/03/12 14:44:52' standby 5/682 3/0 wAd1/4 1 mandatory
    standby4 '2025/03/12 14:44:52' standby 5/682 4/0 wAd1/4 1 mandatory
  3. Change the server spec for standby2 to specify mandatory.

    On the central server, in the server specification for standby2, under Options, mandatory is now appropriate for a standby (or forwarding-standby) server. This option ensures that no replica has metadata that has not been copied to the journalcopy of all mandatory standby (or forwarding-standby) servers.

    If the central server were unavailable, standby1, which is not a mandatory standby, could not be used for failover.

    If the central server is available, all four of the standbys could be used for failover.
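
The comparison in step 2 can be scripted. The following is a minimal sketch that assumes, based on the sample output above, that after whitespace splitting the first field of each p4 servers -J line is the server ID and the fifth field is the position that journalcopy has persisted. The function name journalcopy_caught_up is hypothetical.

```shell
#!/bin/sh
# Usage: p4 servers -J | journalcopy_caught_up REFERENCE_ID STANDBY_ID
# Exit status: 0 = positions match, 1 = standby is behind, 2 = ID not found.
journalcopy_caught_up() {
    awk -v ref="$1" -v sb="$2" '
        $1 == ref { refpos = $5 }
        $1 == sb  { sbpos  = $5 }
        END {
            if (refpos == "" || sbpos == "") exit 2
            if (refpos != sbpos) exit 1
            exit 0
        }'
}

# Two lines taken from the first sample output above.
sample="commit-uk '2025/03/12 14:44:52' commit-server 5/682 5/682 wadL/1 1
standby2 '2025/03/12 14:44:52' standby 5/400 2/0 WAd1/4 1"

if printf '%s\n' "$sample" | journalcopy_caught_up commit-uk standby2; then
    echo "standby2 is caught up"
else
    echo "standby2 is still behind commit-uk"
fi
```

With the first sample, the script reports that standby2 is still behind. Only after the positions match would it be appropriate to change the spec to mandatory.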

If the server from which failover is to occur is not participating in the failover, the p4 failover command returns an error if it is running on a standby (or forwarding-standby) server that is not configured with the mandatory option. Two possible reasons for non-participation are that the central server (commit-uk in this example) is unavailable, or that the -i option causes the central server to be ignored.

Disaster recovery with the nomandatory server specification option

For disaster recovery failover, the remote standby typically has a server specification with the Options field set to the default value, which is nomandatory. This is because the journalcopy performance of a mandatory standby can affect the speed of replication to the replicas of the central server.

After failover

After the failover process is complete:

  • The former standby server is now the new central server.

  • The standby server’s spec in the new central server’s metadata is left intact, except that it is set to nomandatory. This prevents transactions from being stalled while waiting for the journalcopy of those transactions to a standby server that no longer exists. Also, the spec can be used when the former central server is configured as a new standby server.

  • The central server is stopped if the central server participated in the failover.

Potential data loss

If the central server participates

  • Any commands that were not completed when failover began might need to be executed again on the new central server.

  • There should not be any data loss.

If the target server does not participate

  • The standby server must be a mandatory standby.
  • Any commands that were not completed when failover began might need to be executed again on the new central server.

  • Transactions that were performed directly on the central server before the failover, but had not yet been journalcopy'd to the standby server used for the failover, will be lost.
  • To minimize data loss, the standby server used for the failover should be the standby server that was the most current with the central server at the time of the failover. Typically, this is the standby that is in the same rack with the central server. The goal is that:

    • The downstream replicas are consistent with the new central server.

    • The downstream replicas will not have data loss relative to the new central server.
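
One way to identify the most current mandatory standby is to compare positions in p4 servers -J output that was captured while the central server was still reachable. The sketch below assumes the field layout of the sample output shown earlier on this page: server ID in field 1, service type in field 4, a journal-number/offset pair in field 5, and a trailing "mandatory" marker. The function name most_current_mandatory_standby is hypothetical.

```shell
#!/bin/sh
# Usage: most_current_mandatory_standby < saved-servers-J-output
# Prints the ID of the mandatory standby with the highest position.
# On a tie, the first one listed wins. Exits 1 if none is found.
most_current_mandatory_standby() {
    awk '
        ($4 == "standby" || $4 == "forwarding-standby") && $NF == "mandatory" {
            split($5, pos, "/")
            j = pos[1] + 0
            s = pos[2] + 0
            if (best == "" || j > bj || (j == bj && s > bs)) {
                best = $1; bj = j; bs = s
            }
        }
        END {
            if (best == "") exit 1
            print best
        }'
}

# Lines taken from the sample p4 servers -J output earlier on this page.
sample="commit-uk '2025/03/12 14:44:52' commit-server 5/682 5/682 wadL/1 1
standby2 '2025/03/12 14:44:52' standby 5/400 2/0 WAd1/4 1
standby3 '2025/03/12 14:44:52' standby 5/682 3/0 wAd1/4 1 mandatory
standby4 '2025/03/12 14:44:52' standby 5/682 4/0 wAd1/4 1 mandatory"

printf '%s\n' "$sample" | most_current_mandatory_standby
```

A saved snapshot can be stale, so treat the result as a hint about which standby to examine first rather than as proof of currency.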

Failover process

The Failover feature allows the super user to do the following.

  1. Get a report of whether conditions look good for a successful failover.

    If the report indicates that the existing central server is still accessible and you have requested that it be ignored with the -i option, failing over could result in two separate servers, each of which is unaware of the other. This "split-brain" situation can produce inconsistencies that compromise the integrity of your data.

  2. Initiate the failover process on the standby server.

    1. This automatically stops the standby (or forwarding-standby) server that will become the new central server.

    2. During the failover process:

      1. The central server does not process any new commands, and end users get the "failoverMessage" (see the p4 failover command).

      2. A verification process ensures that recent file content was correctly replicated to the new central server. See the p4 failover command for the -v option.

      3. The standby's state file is renamed to statepullL so that it is not later erroneously used as the (stale) state of some replica. If the failover fails, statepullL is renamed back to state so that the standby can resume operation following the failed failover attempt.

      4. The P4ROOT directory gets a new file named statefailover. This file is the last consistency point journalcopy'd by the standby immediately prior to the failover. This file is deleted by the new central server when it is no longer needed.

    For example, run:

      p4 failover

    Make sure the preview looks OK. If so, run:

      p4 failover -y

  3. Monitor the steps that are reported during the process. If the failover process encounters an error, the process is designed to inform the superuser and to stop the failover process so that corrective action can be taken and a new attempt can occur. If an error is encountered after the standby server has stopped the central server, the standby server will not restart the central server.

  4. After the completion of a successful failover:

    1. Verify that the former standby (or forwarding-standby) has been restarted as the new central server by issuing the p4 info command and checking the ServerID to ensure that it is the same ServerID that the previous central server used.

    2. Site-specific changes might be needed to use the new central server. It might be necessary to make DNS changes so that users and replicas can connect to the new central server. For example:

      If you have a DNS alias set up for the central server:

      Update the IP address of that DNS alias to point to the IP address of its new location. This will enable users to connect to the central server using the same P4PORT.

      If you do not have a DNS alias set up for the central server:

      Update the P4TARGET environment variable and server specifications to use the correct host. The port number should remain the same, but the host name must be changed to use its new location.
      If you plan to use Failback after failover, the DNS alias for the standby must be changed to point to the IP address of the central server before failover.

        • Change your P4PORT to point to the new central server host and the same port number so that you can connect to the new central server.

        • On the new central server, change the P4TARGET for each replica or edge server by issuing the p4 configure show allservers command and then issuing the p4 configure set "replica-name#P4TARGET=new-target-server:port-number" command.

        • Update each replica's own P4TARGET by issuing the p4d -r $P4ROOT "-cset replica-name#P4TARGET=new-target-server:port-number" command.

        • Update your server specifications with the proper hostname and port number by issuing the p4 server servername command.

        • Inform your users if they need to update their P4PORT to connect to the new host location of the central server. The port number should remain the same as before, and your users can now issue new commands.

  5. You have the option of performing a Failback after failover.
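
The preview-then-confirm sequence in step 2 can be wrapped in a small script so that p4 failover -y is never run without an operator first reading the preview. This is a sketch under assumptions: the function name failover_with_preview is hypothetical, and it assumes that a non-zero exit status from the preview means the failover should not proceed.

```shell
#!/bin/sh
# Usage: failover_with_preview   (run on the standby, as a super user)
failover_with_preview() {
    # Without -y, p4 failover only reports what it would do.
    p4 failover || { echo "preview failed; not failing over" >&2; return 1; }

    printf 'Proceed with failover? Type yes to continue: '
    read -r answer
    if [ "$answer" != "yes" ]; then
        echo "aborted by operator"
        return 1
    fi

    # The operator has reviewed the preview; perform the failover.
    p4 failover -y
}
```

The script only defines the function. Call failover_with_preview from an interactive shell on the standby server when you are ready to review the preview.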

Configurables affected

The failover process:

  • Makes no changes to the configurables on the original central server.
  • Can make changes to the following for the new central server so that the values are appropriate for the new environment:

    • Configurables: client.readonly.dir, client.sendq.dir, journalPrefix, pull.trigger.dir, server.depot.root, server.extensions.dir, server.locks.dir, statefile

    • Environment variables: P4AUDIT, P4JOURNAL, P4LOG, P4TICKETS, P4TRUST, P4ROOT

Configurables and edge server

When failing over to a standby from an edge (or other replica) server, the updated configurables for the edge server will need to be manually changed on the commit server. This is because the update of the configurables cannot be propagated back to the commit (or upstream) server automatically, given that the edge server might, or might not, be participating in the failover.
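
The manual update on the commit server can be done with p4 configure set, using the serverID#configurable=value form. The sketch below applies several settings for one server ID. The function name set_server_configurables is hypothetical, and the server ID and path in the usage comment are placeholders that you would replace with values for your site.

```shell
#!/bin/sh
# Usage (on the commit server, as a super user):
#   set_server_configurables SERVER_ID name=value [name=value ...]
set_server_configurables() {
    server_id=$1
    shift
    for setting in "$@"; do
        # p4 configure set takes the form serverID#configurable=value
        p4 configure set "${server_id}#${setting}" || return 1
    done
}

# Example with placeholder values (not run here):
#   set_server_configurables edge-dev journalPrefix=/path/to/journals/edge-dev
```

The loop stops at the first setting that fails, so a partially applied set of changes is easy to spot and retry.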

If p4 failover cannot be used

The p4 failover command is not supported for the disaster recovery scenario in which the only standby servers remaining use the nomandatory option. At a remote disaster recovery site, it is generally recommended that the standby server use the nomandatory option.

Network latency and bandwidth to the server from which the standby server is journalcopy'ing might delay journalcopy operations to a standby server at a remote disaster recovery site. If the journalcopy operations are delayed and that standby server uses the mandatory option, replication to other replicas could be delayed. This is because each transaction must be journalcopy'd by all standby servers using the mandatory option before the transaction can be replicated to other replicas.

Metadata consistency issue with nomandatory

Using the nomandatory option avoids the potential performance issue mentioned above. However, if the standby server at a remote disaster recovery site uses the nomandatory option, some transactions might have been replicated to the other replicas but not yet journalcopy'd to the standby server at the remote disaster recovery site.

In this case:

  • The standby server using the nomandatory option might lack metadata transactions that are present in the other replicas.

  • The other replicas must therefore be reseeded back to a consistent point rather than using the p4 failover command.

Reseeding

The following applies only to a true disaster in which none of the remaining standby servers use the mandatory option.

  1. Use the standby server as a basis for reseeding the other replicas. Reseeding the other replicas is a supported recovery method when the standby server is using the nomandatory option. To learn more, see Replica server metadata recovery.

  2. After reseeding the replicas from a checkpoint (or dump) taken from the standby server using the nomandatory option, that standby server can become the new central server because the replicas have been reseeded from its metadata. The metadata of the replicas is now consistent with that of the standby server. After changing the server ID to that of the central server, that server will become the new central server and use the central server's configuration when it is started.