OMPTUNE Procedure

Optimizes the parallel performance of PV-WAVE on the host. OMPTUNE finds the optimal number of threads for each array operation in a representative subset of all such operations. The results are saved to a file that can be loaded automatically when PV-WAVE starts on a platform like the one on which OMPTUNE was run. On that platform, PV-WAVE then automatically sets the number of threads to the optimal value for any array operation, regardless of size and data type.

Important: The OMPTUNE file format changed as of PV-WAVE 10.0. Tuning files generated with versions prior to 10.0 are ignored, and a warning message is issued; they must be regenerated for use in version 10.0 and later.

OMPTUNE, nmb, output_file_name

Input Parameters

nmb—The total number of megabytes of the highest-level data-cache on the host. For example, on a machine with three levels of cache and two processors with 20MB L3-cache each, set nmb to 40.

output_file_name—A scalar string that is the name of the file to be created as a result of executing OMPTUNE.
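
As an illustration of the calling sequence, the following sketch assumes the host described above (two processors with 20MB of L3 cache each). The output file name myhost.omp is hypothetical.

```pvwave
; nmb = 2 processors * 20MB of L3 cache each = 40
; 'myhost.omp' is a hypothetical file name chosen for this example
OMPTUNE, 40, 'myhost.omp'
```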

Keywords

Cache—A 5-element array of cache sizes on the host machine. The default is [ 64, 32768, 262144, 8388608, 67108864 ], representing line size, and L1, L2, L3, and L4 cache size, in bytes. The values in Cache are saved in the output file along with the rest of the performance data, and accurate values are necessary for the optimal performance of some operations. For machines without an L4 cache, the last value in Cache can be omitted, i.e., a 4-element array can be input. See the !CACHE system variable for more information.
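
A minimal sketch of passing a 4-element Cache array, assuming a single-processor host with no L4 cache whose line size and L1, L2, and L3 sizes happen to equal the first four default values. The file name is hypothetical, and the L suffix is used here only to make the literals long integers.

```pvwave
; Assumed host: 64-byte lines, 32KB L1, 256KB L2, 8MB L3, no L4
; nmb = 8 because the highest-level cache (L3) totals 8MB on this assumed host
; 'myhost.omp' is a hypothetical file name
OMPTUNE, 8, 'myhost.omp', Cache=[64L, 32768L, 262144L, 8388608L]
```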

Monotonic—An integer value equal to 0 (false) or 1 (true), which indicates whether or not the optimal number of threads is a monotonically increasing function of operation size. The default is 1 (true). Although this assumption is not strictly true for inexpensive operations, it represents a decent approximation on most platforms.

Nmax—(Input) Maximum number of threads to include in the tests. Nmax defaults to the number of processors on the machine. On some platforms the tuning data can take hours to generate. To avoid wasted computation and longer tuning times, use a lower value of Nmax if not all processors are available to OMPTUNE, or if not all processors will ever be available to a long-running PV-WAVE application. For example, if only half of the processors on a 16-core machine will ever be available to a long-running PV-WAVE application, run OMPTUNE with Nmax=8 at a time when at least 8 of the processors are available (lightweight processes from the OS or from a LAN may be ignored).

Note:

If hyper-threading is enabled on the host, it is recommended that Nmax be set to !NPROCS/2. PV-WAVE then generally uses the physical cores and makes little use of the virtual cores. This is usually better for general applications, which are dominated by inexpensive (e.g., logical, relational, or arithmetic) array operations. Higher values of Nmax may be better for applications dominated by expensive (e.g., transcendental) array operations.
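
The recommendation in this note might be applied as in the following sketch. The nmb value and the file name are hypothetical placeholders for a particular host.

```pvwave
; Hyper-threading enabled: limit the tests to half of the reported processors
; 40 and 'myhost.omp' are hypothetical values for nmb and the output file
OMPTUNE, 40, 'myhost.omp', Nmax=!NPROCS/2
```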

Nrep—Tuning is accomplished by optimizing individual operations from a representative subset of all possible PV-WAVE array operations. Because hardware response is generally pseudorandom in nature, each operation is tested multiple times. Nrep is an odd positive integer that sets the number of repetitions of the test for each operation. Tuning time increases linearly with Nrep, and the default value of 5 should generally yield good results.

Verbose—(Input) If set, results are printed to the console and to a file as they are being generated. The default is 1 (set).

Fast—If set, optimum values are restricted to 1 or Nmax, effectively using step-functions to approximate relationships between operation size and optimal number of threads. On many platforms this is a decent approximation and also turns out to be the only generally viable option for automatic thread control in array-based code, where the optimal number of threads can vary at high frequency. This high-frequency switching has overhead, and unless the host has very many cores, this overhead generally outweighs the benefit of more accurate representations. Therefore, Fast defaults to 1.
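
A sketch of overriding the keyword defaults, for example on a host with very many cores. It assumes that passing Fast=0 clears the keyword, which the "if set" wording suggests but does not state outright. The nmb value and the file name are hypothetical.

```pvwave
; Assumption: Fast=0 disables the step-function approximation
; Nrep=7 is an odd value above the default of 5, so expect a longer tuning time
; 40 and 'myhost.omp' are hypothetical values for nmb and the output file
OMPTUNE, 40, 'myhost.omp', Fast=0, Nrep=7
```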

Discussion

Because each array operation has been optimized for your machine, OMPTUNE virtually eliminates the need to consider, examine, or change any of the OpenMP run-time settings. For example, OMPTUNE may find that on your machine the COS function runs fastest on 2 processors for an array of 1000 elements, on 6 processors for an array of 10,000 elements, and on all of the machine's processors for arrays that exceed 100,000 elements. This is an example of how PV‑WAVE automatically reduces the number of threads used for operations where the overhead of additional threads would degrade performance.

A required parameter to OMPTUNE is a scalar string that is the name of the file to be created as a result of executing OMPTUNE. Typically you choose a filename that is meaningful for your machine, e.g., OMPTUNE, nmb, 'myhostname.omp'. At a site that comprises multiple computing platforms, consider adopting a naming convention in which the tuning filenames identify the platform and hostname, such as:

OMPTUNE, nmb, '%WAVE_DIR%\data\omp.win32.server3'

Note:

If you have a number of identical machines at your site, you do not need to run OMPTUNE on every machine. Run OMPTUNE on one machine, and then all identical machines can use the same tuning results file.

This 'tuning results file' is read with the command:

SET_OMP, File='mytunefile'

It is recommended that you add this line to a PV-WAVE Startup file so that PV-WAVE reads the tuning file each time you begin a PV‑WAVE session. For more information about using a PV-WAVE Startup file, refer to the PV‑WAVE Programmer’s Guide Appendix B: Modifying Your Environment. Any time you want to reduce the number of threads available to PV-WAVE to a value n that is lower than the tuning Nmax, reload the file with the command:

SET_OMP, File='mytunefile', Nthreads=n

Tuned performance remains in effect, but any tuned thread count larger than n is capped at n.
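
As a concrete sketch, assume the tuning file was generated with Nmax=8 and that a later session should use no more than 4 threads. The file name mytunefile is the placeholder used above, and 4 is an arbitrary example value for n.

```pvwave
; Startup file: load the tuning data at the beginning of every session
SET_OMP, File='mytunefile'

; Later in the session: reload with a cap of 4 threads (lower than the assumed Nmax=8)
SET_OMP, File='mytunefile', Nthreads=4
```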

Note:

For more information about the implications and significance of OMPTUNE, refer to Chapter 16: OpenMP in the PV‑WAVE Programmer’s Guide, which describes the PV-WAVE implementation for OpenMP.

See Also

SET_OMP

For more information about OpenMP, see the PV‑WAVE Programmer’s Guide.