



Passing Glucas options to the compiler.

To make Glucas as fast and reliable as possible, some flags and macro definitions must be supplied at compile time. Some flags are specific to the compiler; others are specific to Glucas or the YEAFFT library.

We cannot give advice on compiler-specific flags. Read your compiler's documentation and select the flags that make the binary as fast as possible.

Some options are set through the macro definition facility that most compilers provide. To use them, include

-DYOUR_OPTION1[=value1] [-DYOUR_OPTION2[=value2]] ...

with the compiler flags when invoking the compiler.

For Metrowerks CodeWarrior a similar functionality can be achieved by editing the parameters in the file macos-codewarrior-prefix.h and including it in the Prefix File setting of the C/C++ Compiler Settings.
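As a minimal sketch of how such definitions reach the code, the fragment below gives each macro a fallback with #ifndef, so that a -D flag on the command line takes precedence. The fallback values 3 and 2048 are the defaults quoted in this manual; the helper names configured_aval, configured_mem_threshold and print_build_options are hypothetical and are not part of Glucas or YEAFFT.

```c
#include <assert.h>
#include <stdio.h>

/* Fallbacks used only when no -D flag (or prefix file) supplied a value.
   The values 3 and 2048 are the defaults quoted in this manual. */
#ifndef Y_AVAL
#define Y_AVAL 3
#endif

#ifndef Y_MEM_THRESHOLD
#define Y_MEM_THRESHOLD 2048
#endif

/* Hypothetical helpers: they only report what the preprocessor saw. */
static int configured_aval(void)
{
    return Y_AVAL;
}

static long configured_mem_threshold(void)
{
    return Y_MEM_THRESHOLD;
}

/* Print the effective settings, for example from a diagnostic routine. */
static void print_build_options(void)
{
    /* The manual lists 3, 4 and 5 as the values of Y_AVAL. */
    assert(configured_aval() >= 3 && configured_aval() <= 5);
    printf("Y_AVAL=%d Y_MEM_THRESHOLD=%ld\n",
           configured_aval(), configured_mem_threshold());
}
```

Compiled without any -D flag, the helpers report the defaults. Adding, for example, -DY_AVAL=4 -DY_MEM_THRESHOLD=8192 to the compiler flags would make them report 4 and 8192 instead.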

YEAFFT options


Y_AVAL=value
It selects the type and size of the radices used in the FFTs. (See Glucas internals.) The default value is 3; 4 and 5 can also be defined. We recommend starting with the default and then comparing the performance of the other choices.
Y_AVAL=3 YEAFFT uses radices 4, 5, 6, 7, 8 and 9 in the FFT passes.
It is the default and the best option on most systems.

Y_AVAL=4 It uses radices 4, 5, 6, 7, 8 and 9 in the first FFT pass and
8, 16 in the other passes.

Y_AVAL=5 It can also use a radix-32 reduction in the middle passes.
Only a few processors gain some speed from it.


Y_MANY_REGISTERS
When it is defined in addition to Y_AVAL > 3, a first-pass radix reduction from 10 to 16 can be used. (See Glucas internals.) Because these routines use many local variables, it is an advantage only when the processor has a lot of registers (32 FPU registers or more). In fact, we have not yet seen a machine that gains from this feature.


Y_MEM_THRESHOLD=value
To avoid cache misses where possible, the FFT passes differ depending on the pad between data. (See Glucas internals.) This parameter defines the threshold from pass 1 to 2. The default is 2048, a good choice for most systems, although some, such as the Alpha ev67, run faster with 8192. The value should be a power of two.
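The power-of-two requirement can be checked before a long build or run. The following is only a sketch: the #error guard and the helpers is_power_of_two and checked_threshold are hypothetical additions, not code taken from YEAFFT.

```c
#include <assert.h>

#ifndef Y_MEM_THRESHOLD
#define Y_MEM_THRESHOLD 2048   /* default quoted in this manual */
#endif

/* Compile-time guard: for n > 0, n & (n - 1) is zero only when n is a
   power of two. */
#if (Y_MEM_THRESHOLD) <= 0 || ((Y_MEM_THRESHOLD) & ((Y_MEM_THRESHOLD) - 1)) != 0
#error "Y_MEM_THRESHOLD should be a power of two"
#endif

/* Run-time version of the same test (hypothetical helper). */
static int is_power_of_two(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Returns n unchanged, aborting in debug builds when n is not a power
   of two (hypothetical helper). */
static unsigned long checked_threshold(unsigned long n)
{
    assert(is_power_of_two(n));
    return n;
}
```

With this guard in place, a build using -DY_MEM_THRESHOLD=3000 would stop at the #error line, while 2048 or 8192 would compile normally.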


Y_TARGET=value
The YEAFFT library makes intensive use of C preprocessor macros. Most of the FFT work is done with small macros defined in the file ygeneric.h, which is written with a generic processor in mind. This generic processor is the default and is selected by defining Y_TARGET=0. Certainly, many things could be written better for a specific processor. If you are brave enough, do it; you can even write a collection of assembler macros. To do so, you have to change some lines in the file mccomp.h and write your own my_proc.h file. This is an advanced feature, and we recommend leaving it alone.

Prefetch hints were introduced in release 2.8a. They increase performance a lot in some cases. To use this feature, you have to define a Y_TARGET other than the generic one. The following is the list of target values as of release 2.8b. Options 16, 17, 41 and 51 are still experimental and are not recommended.

0
Generic. No prefetch used other than the GCC v3.1 builtin. All pure C code. Generic C compiler.
1
Pentium, Pentium MMX, Pentium II. No prefetch. A lot of assembler lines. GNU/gcc compiler or compatible _asm_ extensions.
11
Pentium 3. Prefetch used. A lot of assembler code. GNU/gcc compiler or compatible _asm_ extensions.
12
AMD Athlon. Prefetch used. A lot of assembler code. GNU/gcc compiler or compatible _asm_ extensions.
16
Pentium 3. Prefetch used. Only two lines of assembler code. GNU/gcc compiler or compatible _asm_ extensions. Not recommended.
17
AMD Athlon. Prefetch used. Only two lines of assembler code. GNU/gcc compiler or compatible _asm_ extensions. Not recommended.
21
PowerPC 601. Prefetch used. Only two lines of assembler code. GNU/gcc compiler or compatible _asm_ extensions or Metrowerks Codewarrior intrinsics.
23
PowerPC 604e, 7xx, 74xx. Prefetch used. Only two lines of assembler code. GNU/gcc compiler or compatible _asm_ extensions or Metrowerks Codewarrior intrinsics.
31
Alpha ev56, ev6, ev67, ev68. Prefetch used. Only two lines of assembler code. Compaq-C compiler with asm calls.
32
Alpha ev56, ev6, ev67. Prefetch used. Only two lines of assembler code. GNU/gcc compiler or compatible _asm_ extensions.
41
Ultrasparc-II. No prefetch. Only two lines of assembler code. GNU/gcc compiler or compatible _asm_ extensions. Not recommended.
51
Intel Itanium IA-64. No prefetch. Only two lines of assembler code. GNU/gcc compiler or compatible _asm_ extensions. Not recommended; use the Y_ITANIUM option instead for much better performance.
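The list above can be summarised as a small lookup, which may be handy in a build script or a sanity test. This is only a sketch that restates the table; the function names are hypothetical and nothing here is taken from mccomp.h.

```c
#include <assert.h>

/* Returns 1 when the list says "Prefetch used" for this Y_TARGET value,
   0 when it says "No prefetch", and -1 for values not in the list.
   Target 0 is counted as "no prefetch" although the list notes that the
   GCC v3.1 builtin may still be used. */
static int target_uses_prefetch(int target)
{
    switch (target) {
    case 11: case 12: case 16: case 17:
    case 21: case 23: case 31: case 32:
        return 1;
    case 0: case 1: case 41: case 51:
        return 0;
    default:
        return -1;
    }
}

/* Returns 1 for the targets flagged as experimental (16, 17, 41, 51). */
static int target_is_experimental(int target)
{
    return target == 16 || target == 17 || target == 41 || target == 51;
}

/* Hypothetical check: the target is listed and not marked experimental. */
static int target_is_recommended(int target)
{
    assert(target >= 0);
    return target_uses_prefetch(target) >= 0
        && !target_is_experimental(target);
}
```

For example, target_is_recommended(23) is true for a PowerPC 7xx build, while target_is_recommended(16) is false because option 16 is still experimental.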

Y_PREFETCH_EXPENSIVE
When prefetch is available through a Y_TARGET other than generic, some routines can be unrolled to avoid unnecessary calls to prefetch hints. This can be useful when prefetch hints are expensive in performance terms. It is still an experimental feature.


Y_LONG_MACROS
YEAFFT code is built mostly from small macros that do elementary FFT work. Sometimes it is more convenient to use big macros, which make it easier to tune operations with long latencies.


Y_VECTORIZE2
Do not mistake this for a multithreading option. For the radix-4 reduction it is possible to unroll inner loops with a register pressure similar to that of radix-8. It can help a bit on some processors.


Y_ITANIUM
Since release 2.8c this option activates special code for IA64 processors; it has no effect in earlier releases. The code is plain C with no assembler lines, but it carries a big performance penalty on processors other than Intel IA64. It is strongly recommended on IA64 machines, where it can double the performance.


_PTHREADS=value
This option enables the use of POSIX threads. It is set automatically when the binary is built with the configure script using --enable-pthread=n. Otherwise, to use n threads, add -D_PTHREADS=n to the compiler command line. At the moment n has to be a power of two. With configure, the option --enable-pthread alone is equivalent to --enable-pthread=2. This option is not recommended for single-processor machines.


_OPENMP
This option enables the use of OpenMP directives. To set the number of threads, also include the option Y_NUM_THREADS. Warning: OpenMP multiprocessing has to be enabled with the proper compiler flag (usually -omp). The compiler then defines the _OPENMP macro itself, so YOU DON'T HAVE TO DEFINE IT EXPLICITLY. This option is not recommended for single-processor machines.


_SUNMP
This option enables Sun MP C directives; see the documentation of the SunWSpro C compiler. Warning: you have to define PARALLEL=n in your shell environment, define Y_NUM_THREADS, and pass the specific compiler flag -xexplicitpar. This option is not recommended for single-processor machines.


Y_NUM_THREADS=value
This option sets EXPLICITLY how many threads are used by the OpenMP or SunMP multithreaded options. A value is required with SunMP but not with OpenMP, where the best choice can be computed at run time. In any case, to avoid a number of threads that is not a power of two, it is better to use this option whenever you use either _OPENMP or _SUNMP.
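One way to picture the interaction between Y_NUM_THREADS and the OpenMP run time is the sketch below. It assumes that, when no explicit value is given, the run-time thread count is rounded down to a power of two; this rounding and the helper names floor_power_of_two and choose_num_threads are illustrative assumptions, not the selection logic Glucas actually uses. Only the standard OpenMP call omp_get_max_threads is used.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>   /* present only when the compiler's OpenMP flag is on */
#endif

/* Largest power of two not exceeding n, for n >= 1 (hypothetical helper). */
static int floor_power_of_two(int n)
{
    int p = 1;
    assert(n >= 1);
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Pick a thread count: Y_NUM_THREADS when it was given at compile time,
   otherwise what the OpenMP run time offers (rounded down to a power of
   two), otherwise 1 for a single-threaded build. */
static int choose_num_threads(void)
{
#if defined(Y_NUM_THREADS)
    return Y_NUM_THREADS;
#elif defined(_OPENMP)
    return floor_power_of_two(omp_get_max_threads());
#else
    return 1;
#endif
}
```

On a machine where the OpenMP run time reports 6 available threads, this sketch would settle on 4 unless -DY_NUM_THREADS=n says otherwise.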

GLUCAS options

The following macros manage some aspects of the Lucas-Lehmer tests, not the FFT routines.

Y_SECURE
When defined, Glucas makes a round-off check every iteration. (See Glucas internals.) It costs about 5% in performance, but in some cases it is worthwhile. We recommend it on systems with unreliable hardware or software (low-end PCs, overclocked systems, etc.). In 2.9.0, when it is not defined, the round-off error is checked during the first 131072 iterations after start. If the error in any of these iterations exceeds 0.40, Glucas restarts and tries to adjust its accuracy. Once this initial phase has passed, Glucas keeps checking the round-off error, but only every 64 iterations and with a higher threshold (0.45). A further error restarts the test from the last save file with increased accuracy.

You can also manage the roundoff check at run time by editing the ini file (See Configure files.)
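The checking schedule described for release 2.9.0 can be sketched as follows. The constants 131072, 64, 0.40 and 0.45 come from the text above; everything else is an assumption made for illustration: the function names are hypothetical, iterations are counted from 0 at the start of the run, and the error is taken as the distance of a single value from its nearest integer, whereas a real test would look at the largest error over all elements.

```c
#include <assert.h>

#define INITIAL_CHECK_ITERS 131072UL  /* length of the initial phase */
#define LATER_CHECK_STRIDE  64UL      /* afterwards: every 64 iterations */

/* Distance from x to the nearest integer. Assumes |x| fits in long long. */
static double roundoff_error(double x)
{
    double nearest, diff;
    assert(x > -9.0e18 && x < 9.0e18);
    nearest = (double)(long long)(x >= 0.0 ? x + 0.5 : x - 0.5);
    diff = x - nearest;
    return diff < 0.0 ? -diff : diff;
}

/* Is a check due at iteration iter? secure != 0 models a build with
   Y_SECURE defined, which checks every iteration. */
static int roundoff_check_due(unsigned long iter, int secure)
{
    if (secure || iter < INITIAL_CHECK_ITERS)
        return 1;
    return iter % LATER_CHECK_STRIDE == 0;
}

/* Threshold applied when Y_SECURE is not defined. */
static double roundoff_threshold(unsigned long iter)
{
    return iter < INITIAL_CHECK_ITERS ? 0.40 : 0.45;
}

/* Returns 1 when the value x seen at iteration iter would count as a
   round-off failure under the schedule above. */
static int roundoff_too_large(double x, unsigned long iter)
{
    return roundoff_error(x) > roundoff_threshold(iter);
}
```

Under these assumptions a value such as 7.42 would be rejected during the initial phase, where the limit is 0.40, but accepted later, where the limit is 0.45.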


Y_KILL_BRANCHES
The carry and normalization phase of the Discrete Weighted Transform (See Glucas internals.) has some unpredictable branches, which can slow down some processors. Glucas has alternative code that avoids all these branches at the cost of some extra floating-point and integer instructions. Some processors gain with this option activated.


Y_VECTORIZE
If this option is used, it is possible to avoid some dependency stalls in the carry and normalization phase of the Discrete Weighted Transform.