IPLM metrics details

Learn more about the metrics exported by the Perforce IPLM products.

IPLM Server metrics

To see the IPLM Server metrics, run the following command on a machine running IPLM Server:

$ curl localhost:2002/metrics 2>/dev/null | grep piserver

Alternatively, in a web browser, navigate to the Prometheus interface on the machine at http://<IP-address>:9090 and in the query input at the top enter:

{job="piserver"}

Note:

The above ports and job names are defined in the machine's /etc/mdx/mdx-metrics-prometheus.yml Prometheus configuration file in its scrape_configs section.
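
For illustration, a minimal sketch of such a scrape job (the job name and port are taken from the commands above; the actual contents of your file may differ):

scrape_configs:
  - job_name: piserver
    static_configs:
      - targets: ['localhost:2002']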

PiServer metrics

Each metric below is listed with its name, type, and description.

mdx_piserver_2xx_responses_total

counter

Gives the total number of HTTP 2xx (success) responses.

mdx_piserver_4xx_responses_total

counter

Gives the total number of HTTP 4xx (client error) responses.

mdx_piserver_5xx_responses_total

counter

Gives the total number of HTTP 5xx (server error) responses.

mdx_piserver_event_config_mode

gauge

Gives the event configuration mode, which is set from the pi-admin settings edit template. The metric takes the following values (a query sketch follows the list):

  • 0 = none (events disabled)
  • 1 = write_only (only events that result in a Neo4j write operation are enabled)
  • 2 = all (all Perforce IPLM events are enabled)
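
For example, a PromQL expression such as the following (a minimal sketch) returns a result only when events are disabled, and could serve as the basis of an alert:

mdx_piserver_event_config_mode == 0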

mdx_piserver_events_published_total

counter

Gives the total number of events published.

mdx_piserver_info

counter

Gives information about PiServer, for example:

mdx_piserver_info{version="2.36.0",release="0",} 1.0

mdx_piserver_neo4j_latency_max_seconds

gauge

Gives the Neo4j maximum latency in seconds (the longest time Neo4j has taken to handle any single request), for example:

mdx_piserver_neo4j_latency_max_seconds 153.442200388

mdx_piserver_neo4j_latency_min_seconds

gauge

Gives the Neo4j minimum latency in seconds (the shortest time Neo4j has taken to handle any single request), for example:

mdx_piserver_neo4j_latency_min_seconds 0.00230883

mdx_piserver_neo4j_latency_seconds

summary

Gives Neo4j's latency in seconds (the time Neo4j has taken to handle requests). This summary is exposed as additional metrics giving the number of times the metric has been set and the sum of all the values it has been set to, for example:

mdx_piserver_neo4j_latency_seconds_count 226878.0
mdx_piserver_neo4j_latency_seconds_sum 3348.1690111860003
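
As an illustrative sketch, the _sum and _count series can be combined in PromQL to compute the average Neo4j latency over, say, the last five minutes:

rate(mdx_piserver_neo4j_latency_seconds_sum[5m]) / rate(mdx_piserver_neo4j_latency_seconds_count[5m])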

mdx_piserver_subscribed_requests_total

counter

Gives the total number of event subscription requests, that is, the number of times the PiServer subscribed endpoint has been called to get the list of subscribed users.

Metrics guidance

Informational metrics

Most of the metrics give static or dynamic insight into IPLM Server and its use. The mdx_piserver_info metric confirms that the expected version of IPLM Server is running, and the mdx_piserver_event_config_mode metric confirms which events will be output from IPLM Server. The mdx_piserver_xxx_responses_total metrics report the distribution of HTTP responses and can help detect unexpected patterns. For example, any positive number of 5xx responses likely indicates a problem in IPLM Server itself and may warrant reporting an issue, while a large number of 4xx responses may indicate an erratic script sending invalid requests to the Public API.
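
For example, a PromQL expression along these lines (a sketch; tune the window for your environment) detects recent 5xx responses:

increase(mdx_piserver_5xx_responses_total[5m]) > 0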

Performance metrics

A few metrics provide insight into the latency of IPLM Server and can help assess bottlenecks. For example, a consistently large value for mdx_piserver_neo4j_latency_seconds likely degrades the user experience and may justify investigating the cause of the slowness. In that context, note that most of the response time of IPLM Server is actually the response time of the Neo4j server(s), so that is where the analysis should focus.

IPLM Cache metrics

Because IPLM Cache is a multi-process application implemented in Python, and the Prometheus Python Client Library is set up to instrument single-process applications (with the exception of Gunicorn multi-process WSGI applications), IPLM Cache uses the Prometheus StatsD Exporter to coordinate metrics among all of its processes and to export them to the Prometheus server. The IPLM Cache Watchdog Monitor exports its metrics separately.

It is important to start the mdx-metrics service before starting the IPLM Cache service so that the StatsD exporter can properly gather IPLM Cache metrics.

To see the IPLM Cache metrics, run the following command on a machine running IPLM Cache:

$ curl localhost:9102/metrics 2>/dev/null | grep mdx_picache

Alternatively, in a web browser, navigate to the Prometheus interface on the machine at http://<IP-address>:9090 and in the query input at the top enter:

{job="picache-statsd"}

The above ports and job names are defined in the machine's /etc/mdx/mdx-metrics-prometheus.yml Prometheus configuration file in its scrape_configs section.

Description of IPLM Cache metrics

Each metric below is listed with its name, type, and description.

mdx_picache_front_api_request_handling_time_ms

summary

Gives the time, in milliseconds, the IPLM Cache front-end took to handle a client HTTP request. As described in Prometheus's HISTOGRAMS AND SUMMARIES documentation, summaries are broken into count, sum, and quantile metrics. For IPLM Cache, these translate to the following metrics (with example values):

mdx_picache_front_api_request_handling_time_ms{quantile="0.5"} 0.030681999999999997
mdx_picache_front_api_request_handling_time_ms{quantile="0.9"} 0.066292
mdx_picache_front_api_request_handling_time_ms{quantile="0.99"} 0.092524
mdx_picache_front_api_request_handling_time_ms_sum 1.5169719999999998
mdx_picache_front_api_request_handling_time_ms_count 46

These metrics will not be available until the IPLM Cache front-end handles its first request.

mdx_picache_ipv_data_consistency_failures_total

gauge

Gives the number of IPLM Cache IPV data consistency check failures.

mdx_picache_ipv_versions_in_cache_total__<IPV-reference>

gauge

Gives the number of versions in the cache for a specific IPV. <IPV-reference> is formatted as library_IP_line. An example metric is:

mdx_picache_ipv_versions_in_cache_total__tutorial_acells_tsmc18_TRUNK

mdx_picache_ipvs_in_cache_total

gauge

Gives the total number of IPVs in the cache.

mdx_picache_job_execution_time_ms

summary

Gives the time, in milliseconds, an IPLM Cache back-end worker took to execute a job. As with mdx_picache_front_api_request_handling_time_ms, summaries are broken into count, sum, and quantile metrics. For IPLM Cache, these translate to the following metrics (with example values):

mdx_picache_job_execution_time_ms{quantile="0.5"} 2.012019
mdx_picache_job_execution_time_ms{quantile="0.9"} 4.607835
mdx_picache_job_execution_time_ms{quantile="0.99"} 9.540736
mdx_picache_job_execution_time_ms_sum 99.77386099999998
mdx_picache_job_execution_time_ms_count 39

These metrics will not be available until the IPLM Cache back-end handles its first job.

mdx_picache_job_spent_in_queue_time_ms

summary

Gives the time, in milliseconds, a job spent in a Redis queue. As above, this metric translates to the metrics (with example values):

mdx_picache_job_spent_in_queue_time_ms{quantile="0.5"} 8.706202
mdx_picache_job_spent_in_queue_time_ms{quantile="0.9"} 15.51039
mdx_picache_job_spent_in_queue_time_ms{quantile="0.99"} 18.33295
mdx_picache_job_spent_in_queue_time_ms_sum 323.80503799999997
mdx_picache_job_spent_in_queue_time_ms_count 39

These metrics will not be available until the IPLM Cache back-end handles its first job.

mdx_picache_jobs_dequeued_total

gauge

Gives the number of jobs dequeued from Redis queues by the backend workers.

mdx_picache_jobs_enqueued_total

gauge

Gives the number of jobs enqueued in Redis queues by the IPLM Cache front-end.

mdx_picache_jobs_successfully_handled_total

gauge

Gives the number of jobs successfully handled by the backend workers.

mdx_picache_jobs_total__<IPV-reference>

gauge

Gives the number of jobs enqueued in a Redis queue for a specific IPV. <IPV-reference> is formatted as library_IP_version_line. An example metric is:

mdx_picache_jobs_total__tutorial_acells_tsmc18_1_TRUNK

mdx_picache_major_ver

gauge

Gives the IPLM Cache major version number. This would be the number 1 in version 1.2.3.

mdx_picache_minor_ver

gauge

Gives the IPLM Cache minor version number. This would be the number 2 in version 1.2.3.

mdx_picache_patch_ver

gauge

Gives the IPLM Cache patch version number. This would be the number 3 in version 1.2.3.

mdx_picache_start_time

gauge

Gives the time, in seconds since the epoch, at which IPLM Cache was started. For example, this is used in the Perforce IPLM Cache Grafana dashboard to compute IPLM Cache's Uptime value with the expression:

time() - mdx_picache_start_time{job="picache-statsd"}

mdx_picache_up

gauge

Indicates if IPLM Cache is running (1) or not (0). On startup IPLM Cache sets this metric to 1, and on shutdown it sets this to 0.

Metrics guidance

Configuration

The mdx_picache_up, mdx_picache_major_ver, mdx_picache_minor_ver, and mdx_picache_patch_ver metrics can be analyzed when troubleshooting IPLM Cache issues, for example, to confirm that IPLM Cache is up and that you are running the expected version. The mdx_picache_start_time metric can be monitored to see how long IPLM Cache has been running. The IPLM Cache Watchdog metrics (below) can be monitored for Redis information.
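
For example, a PromQL expression such as the following (a minimal sketch, using the job name shown earlier on this page) returns a result when IPLM Cache is down:

mdx_picache_up{job="picache-statsd"} == 0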

Timing

The mdx_picache_front_api_request_handling_time_ms, mdx_picache_job_execution_time_ms, and mdx_picache_job_spent_in_queue_time_ms metrics deal with IPLM Cache timing performance and can be monitored to see how they vary over time. Values that increase over time suggest a heavily loaded system: either a performance bottleneck that needs to be addressed, or a normal condition that could be alleviated with additional IPLM Cache instances.
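
As an illustrative sketch, the average time jobs spend in the queue over the last five minutes can be derived in PromQL from the summary's _sum and _count series:

rate(mdx_picache_job_spent_in_queue_time_ms_sum[5m]) / rate(mdx_picache_job_spent_in_queue_time_ms_count[5m])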

IPV information

The mdx_picache_ipv_versions_in_cache_total__<IPV-reference> and mdx_picache_ipvs_in_cache_total metrics can be monitored for information on the IPVs in the cache.

IPV information is maintained by IPLM Cache when MongoDB is used (that is, when the mongod-host setting is configured). For more information on the IPVs, use the picache-ipv-admin.sh tool, specifically with the --list, --all, and -v/--verbose options:

$ picache-ipv-admin.sh --help
usage: picache-ipv-admin.sh [-h] [-v] [-c CONF] [--project PROJECT] [--all]
                            (--list | --remove | --expire | --no-expire | --clear-consistency-check)
                            [-t TIMEOUT]
                            [ipv [ipv ...]]

IPLM Cache administration tool for managing IPVs in cache

positional arguments:
  ipv                   one or more IPV names

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         verbose output
  -c CONF, --conf CONF  IPLM Cache config file, default: /etc/mdx/picache.conf
  --project PROJECT     IPV project
  --all                 select all IPVs recorded in cache
  --list                list IPV in the cache
  --remove              remove IPV from cache
  --expire              mark IPV subject to auto-removal
  --no-expire           mark IPV to not be subject to auto-removal
  --clear-consistency-check
                        clear IPV data consistency check in progress indication
  -t TIMEOUT, --timeout TIMEOUT 
                        number of seconds to wait for a remove job to finish, default: 10

The picache-monitor.sh tool can also be used to get IPV information with the -i/--ipvs and -v/--verbose options:

$ picache-monitor.sh --help
usage: picache-monitor.sh [-h] [-v] [-c CONF] [-p PERIOD] [-i] [-q] [-l]
                          [--hb] [-a] [-n] [-j]

IPLM Cache administration tool for monitoring IPLM Cache.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         verbose output
  -c CONF, --conf CONF  IPLM Cache config file, default: /etc/mdx/picache.conf
  -p PERIOD, --period PERIOD
                        causes repeated output with this period in seconds between outputs (0 = continuous output)
  -i, --ipvs            show IPVs in cache
  -q, --queues          show queue details
  -l, --locks           show lock information
  --hb                  show heartbeat information
  -a, --all             show all information
  -n, --no_summary      don't output Summary section (overridden by --all)
  -j, --json            output in JSON format (used for automation purposes)

The picache-ipv-usage.sh tool can also be used to get full IPV information (including file list information).

Log information for IPVs can be obtained by using the picache-query-mongo-log.sh tool, specifically with the -i/--ip argument:

$ picache-query-mongo-log.sh --help
usage: picache-query-mongo-log.sh [-h] [-c CONF] [-e [PATTERN [PATTERN ...]]]
                                  [-f [FUNCTION [FUNCTION ...]]]
                                  [-i [IP [IP ...]]] [-j [JOBID [JOBID ...]]]
                                  [--level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                                  [-L [LOGNAME [LOGNAME ...]]]
                                  [-m [MODULE [MODULE ...]]]
                                  [-n [NODE [NODE ...]]] [-p [PID [PID ...]]]
                                  [-t MINTIME] [-T MAXTIME] [-v] [--count]
                                  [--exc]

IPLM Cache administration tool for extracting logs from MongoDB.

optional arguments:
  -h, --help            show this help message and exit
  -c CONF, --conf CONF  IPLM Cache  config file, default: /etc/mdx/picache.conf
  -e [PATTERN [PATTERN ...]], --regexp [PATTERN [PATTERN ...]]
                        Regular expressions to search log records for
  -f [FUNCTION [FUNCTION ...]], --function [FUNCTION [FUNCTION ...]]
                        Functions log records are for
  -i [IP [IP ...]], --ip [IP [IP ...]]
                        IPs log records are for 
  -j [JOBID [JOBID ...]], --jobid [JOBID [JOBID ...]]
                        Job IDs log records are for
  --level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level (and above) log records have
  -L [LOGNAME [LOGNAME ...]], --logname [LOGNAME [LOGNAME ...]]
                        Logger names log records are for
  -m [MODULE [MODULE ...]], --module [MODULE [MODULE ...]]
                        Modules log records are for
  -n [NODE [NODE ...]], --node [NODE [NODE ...]]
                        Nodes (hostnames) log records are from
  -p [PID [PID ...]], --pid [PID [PID ...]]
                        Process IDs log records are for
  -t MINTIME, --mintime MINTIME
                        Gets log records newer than or equal to this datetime
  -T MAXTIME, --maxtime MAXTIME
                        Gets log records older than or equal to this datetime
  -v, --verbose         Verbose output
  --count               Prints the number of log records and exits
  --exc                 Gets log records only showing exceptions

Use an egrep expression for a logical OR search of the logs, for example:
 -e 'job_build|job_update'
will get log records containing 'job_build' or 'job_update'.
Use multiple expressions for a logical AND search of the logs, for example:
 -e 'enter' 'job_build'
will get log records containing both 'enter' and 'job_build'.
MINTIME and MAXTIME are formatted as '2018-07-02T14:59:13.365'
(fields 'T' and after are optional).

Job information

The mdx_picache_jobs_dequeued_total, mdx_picache_jobs_enqueued_total, mdx_picache_jobs_successfully_handled_total, and mdx_picache_jobs_total__<IPV-reference> metrics can be monitored to see IPLM Cache's execution of jobs.

The difference between the mdx_picache_jobs_enqueued_total and mdx_picache_jobs_dequeued_total metrics gives the number of jobs still awaiting execution, while the mdx_picache_jobs_successfully_handled_total metric gives the number of jobs that have been successfully executed. In a busy environment, increasing values in all of these metrics indicates IPLM Cache is executing jobs.
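
For example, the backlog of jobs awaiting execution can be expressed in PromQL as:

mdx_picache_jobs_enqueued_total - mdx_picache_jobs_dequeued_total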

Additional job information can be obtained by using the picache-monitor.sh tool with the -q/--queues and -v/--verbose arguments to see what kind of jobs are in which queues.

Data consistency check failures

The mdx_picache_ipv_data_consistency_failures_total metric can be monitored to see if any IPVs are failing data consistency checking. IPV data consistency checking is enabled in /etc/mdx/picache.conf when MongoDB is used (the mongod-host setting is configured) and the cache-check-period-hrs setting is non-zero. To see the reason for a data consistency check failure, run the picache-query-mongo-log.sh tool as:

$ picache-query-mongo-log.sh -e "job_check return message"

Or for a specific IP as:

$ picache-query-mongo-log.sh -e "job_check return message" --ip <IP-name>

Examples:

$ picache-query-mongo-log.sh -e "job_check return message" --ip tutorial
$ picache-query-mongo-log.sh -e "job_check return message" --ip tutorial.verif_config
$ picache-query-mongo-log.sh -e "job_check return message" --ip tutorial.verif_config@2.TRUNK

IPLM Cache Watchdog metrics

To see the IPLM Cache Watchdog metrics, run the following command on a machine running the IPLM Cache Watchdog:

$ curl localhost:2005/metrics 2>/dev/null | grep mdx_wdog

Alternatively, in a web browser, navigate to the Prometheus interface on the machine at http://<IP-address>:9090 and in the query input at the top enter:

{job="picache-wdog"}

The above ports and job names are defined in the machine's /etc/mdx/mdx-metrics-prometheus.yml Prometheus configuration file in its scrape_configs section.

IPLM Cache Watchdog metrics description

Each metric below is listed with its name, type, and description.

mdx_wdog_clients_num_bad

gauge

Gives the number of Watchdog clients that are unhealthy.

mdx_wdog_clients_num_healthy

gauge

Gives the number of Watchdog clients that are healthy.

mdx_wdog_clients_num_indefinite

gauge

Gives the number of Watchdog clients that have an indefinite ping period.

mdx_wdog_clients_total

gauge

Gives the number of Watchdog clients.

mdx_wdog_config_info

gauge

Gives Watchdog configuration information, for example:

mdx_wdog_config_info{config_file_path="conf/picache.conf",default_ping_period_sec="20",log_file_path="/home/bob/develop/picache/log/picache-wdog.log",log_level="WARNING",node_name="suse-1",profiling="False"} 1.0

mdx_wdog_config_redis_info

gauge

Gives Watchdog configuration information for Redis, for example:

mdx_wdog_config_redis_info{redis_sentinel_configuration="[('10.211.55.18', 26379), ('10.211.55.19', 26379), ('10.211.55.20', 26379)]",redis_sentinel_service_name="mymaster-suse"} 1.0

mdx_wdog_get_status_msgs_total

counter

Gives the number of GET STATUS messages processed. These messages originate from IPLM Cache's mdx_wdog_stat.pyc script, which is used in deployments where the Linux Watchdog Daemon periodically gets IPLM Cache's status via the script and can be configured to reboot the machine to recover from an unexpected bad health condition. See IPLM Cache administration overview for more information.

The Prometheus Python Client Library also creates the metric mdx_wdog_get_status_msgs_created giving the time the metric was created.

mdx_wdog_heartbeat_config_info

gauge

Gives Watchdog configuration information for heartbeats, for example:

mdx_wdog_heartbeat_config_info{hb_down_msg="down",hb_healthy_msg="good",hb_redis_base_key="mdx:picache:heartbeat:",hb_unhealthy_msg="bad",nom_dur_between_hbs_sec="10"} 1.0

mdx_wdog_msg_handling_time

summary

Gives Watchdog message handling timing information. This summary is exposed as additional metrics giving the number of times the metric has been set, the sum of all the values it has been set to, and the time the metric was created, for example:

mdx_wdog_msg_handling_time_count 1563.0
mdx_wdog_msg_handling_time_sum 0.23833489418029785
mdx_wdog_msg_handling_time_created 1.605906044264304e+09

mdx_wdog_picache_version

gauge

Gives the IPLM Cache Watchdog version, for example 1.6.2.

mdx_wdog_redis_connected

gauge

Indicates if Redis is connected (1) or not (0).

mdx_wdog_redis_instance_info

gauge

Gives the Redis instance the Watchdog is connected to, for example:

mdx_wdog_redis_instance_info{redis_master_host="10.211.55.18",redis_master_port="6379"} 1.0

Metrics guidance

Configuration

The mdx_wdog_picache_version, mdx_wdog_config_redis_info, mdx_wdog_heartbeat_config_info, mdx_wdog_config_info, and mdx_wdog_redis_instance_info metrics can be analyzed when troubleshooting IPLM Cache Watchdog issues, for example, to confirm that you are running the expected version of IPLM Cache (which includes the IPLM Cache Watchdog) with the expected configuration options. The configuration options are all in the IPLM Cache configuration file, /etc/mdx/picache.conf. The Redis configuration options are important when troubleshooting any Watchdog connection problem to Redis, as indicated by the mdx_wdog_redis_connected and mdx_wdog_redis_instance_info metrics.

Linux Watchdog daemon

If you have configured the Linux Watchdog Daemon to interface to the IPLM Cache Watchdog, an increasing value for the mdx_wdog_get_status_msgs_total metric indicates the daemon is successfully interfacing to the IPLM Cache Watchdog.
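
For example, a PromQL expression such as the following (a sketch; size the window to comfortably exceed the daemon's polling period) fires when no GET STATUS messages have arrived recently:

increase(mdx_wdog_get_status_msgs_total[10m]) == 0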

Unhealthy Watchdog clients

If the mdx_wdog_clients_num_bad metric is non-zero and stays that way, that many Watchdog clients have stopped pinging the Watchdog. If the Linux Watchdog Daemon is interfacing to the IPLM Cache Watchdog, the daemon may reboot the machine to correct the problem. If the Linux Watchdog Daemon is not used, you can correct the problem manually by restarting the IPLM Cache service, or you can ignore it by unregistering the affected clients with the picache-wdog-unregister.sh tool:

$ picache-wdog-unregister.sh --help
usage: picache-wdog-unregister.sh [-h] [-v] [-c CONF]
                                  entity_name [entity_name ...]

IPLM Cache administration tool for unregistering clients from the IPLM Cache Watchdog Monitor.

positional arguments:
  entity_name           one or more entity names to unregister

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         verbose output
  -c CONF, --conf CONF  IPLM Cache config file, default: /etc/mdx/picache.conf

The entity_name argument can be found in the Watchdog log file, in the log messages showing which entities have stopped pinging.

Watchdog monitor heartbeat status

IPLM Cache Watchdog heartbeat information can be obtained from the IPLM Cache Monitor tool, picache-monitor.sh, using the --hb and -v options (see above for picache-monitor.sh use).

Third-party metrics

Perforce IPLM uses the following third-party components, each of which exports its own metrics: Neo4j, the Prometheus JMX Exporter, the Prometheus Node Exporter, the Prometheus StatsD Exporter, Grafana, and Prometheus itself. The sections below describe where to find information about each.

Neo4j metrics

The Neo4j Enterprise edition supports metrics (the free Community edition does not). Perforce currently supports Neo4j version 3.5.3. Refer to Neo4j's Metrics documentation for a description of the metrics it exposes.

JMX metrics

From Monitoring Java applications with the Prometheus JMX exporter and Grafana:

The Prometheus JMX exporter connects to Java’s native metric collection system, Java Management Extensions (JMX), and converts the metrics into a format that Prometheus can understand.

The Prometheus JMX Exporter webpage does not describe the individual metrics, but the HELP text of the numerous metrics it exposes can be viewed with the curl command, for example:
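
For illustration, a sketch of such a command (the port placeholder is an assumption; substitute the JMX exporter port configured in the scrape_configs section of /etc/mdx/mdx-metrics-prometheus.yml):

$ curl localhost:<JMX-exporter-port>/metrics 2>/dev/null | grep '^# HELP'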

Node exporter metrics

Information on the metrics provided by the Prometheus Node Exporter can be found on the exporter's webpage. Again, no detailed description of each metric is given, but the metrics' HELP text can be viewed with the curl command, for example:
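
For illustration, a sketch assuming the Node Exporter's default port of 9100 (your deployment may differ):

$ curl localhost:9100/metrics 2>/dev/null | grep '^# HELP'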

StatsD metrics

Information on the StatsD Exporter metrics likewise has to be extracted from their HELP text, for example:
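
For illustration, a sketch using port 9102, the StatsD Exporter port shown earlier on this page:

$ curl localhost:9102/metrics 2>/dev/null | grep '^# HELP'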

Grafana metrics

Information on Grafana's metrics can be found on Grafana's Internal Grafana metrics webpage. Detailed information about each metric is available from its HELP text, for example:
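
For illustration, a sketch assuming Grafana's default HTTP port of 3000 (your deployment may differ):

$ curl localhost:3000/metrics 2>/dev/null | grep '^# HELP'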

Prometheus metrics

Prometheus exposes its own metrics as well; their descriptions can likewise be extracted from the HELP text, for example:
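
For illustration, a sketch using port 9090, the Prometheus port shown earlier on this page:

$ curl localhost:9090/metrics 2>/dev/null | grep '^# HELP'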