Writing efficient SIMD code - Factor Documentation

Writing efficient SIMD code

Since SIMD vectors are heap-allocated objects, it is important to write code in a style which is conducive to the compiler being able to inline generic dispatch and eliminate allocation.

If the inputs to a math.vectors word are statically known to be SIMD vectors, the call is converted into an SIMD primitive, and the output is then also known to be an SIMD vector (or scalar, depending on the operation); this information propagates forward within a single word (together with any inlined words and macro expansions). Any intermediate values which are not stored into collections, or returned from the word, are furthermore unboxed.

To check if optimizations are being performed, pass a quotation to the optimizer-report. and optimized. words in the compiler.tree.debugger vocabulary, and look for calls to Low-level SIMD primitives as opposed to high-level Vector operations.

For example, in the following, no SIMD operations are used at all, because the compiler's propagation pass does not consider dynamic variable usage:

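    USING: compiler.tree.debugger math.vectors math.vectors.simd
    namespaces ;
    SYMBOLS: x y ;

    ! The vectors pass through dynamic variables, so their types
    ! are not known at the call to v+.
    [
        float-4{ 1.5 2.0 3.7 0.4 } x set
        float-4{ 1.5 2.0 3.7 0.4 } y set
        x get y get v+
    ] optimizer-report.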
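For contrast, here is a minimal sketch of the same addition with the two vectors kept on the data stack instead of being routed through dynamic variables. It assumes that the propagation pass can see the class of a literal value, in which case the report would be expected to show a SIMD primitive rather than the generic v+ word; the exact output depends on the Factor version and the target CPU.

```factor
USING: compiler.tree.debugger math.vectors math.vectors.simd ;

! Same inputs as above, but kept on the data stack, so their
! float-4 class is visible to the compiler at the call to v+.
[
    float-4{ 1.5 2.0 3.7 0.4 }
    float-4{ 1.5 2.0 3.7 0.4 }
    v+
] optimizer-report.
```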

The following word benefits from SIMD optimization, because it begins with an unsafe declaration:

    USING: compiler.tree.debugger kernel kernel.private
    math.vectors math.vectors.simd ;
    IN: simd-demo

    : interpolate ( v a b -- w )
        ! Unsafe: asserts the input types without checking them.
        { float-4 float-4 float-4 } declare
        [ v* ] [ [ 1.0 ] dip n-v v* ] bi-curry* bi v+ ;

    \ interpolate optimizer-report.

Note that using declare is not recommended. Safer ways of getting type information for the input parameters to a word include defining methods on a generic word (the value being dispatched upon has a statically known type in the method body), as well as using Compiler specialization hints and inline declarations.
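As an illustration of the first alternative, the sketch below defines a method on a generic word. The names scale-half and the 0.5 factor are hypothetical and only serve the example. Inside the method body the dispatched value is known to be a float-4, so the call to v*n would be expected to compile to a SIMD primitive without any declare.

```factor
USING: compiler.tree.debugger math.vectors math.vectors.simd ;
IN: simd-demo

! Hypothetical generic word, used only for illustration.
GENERIC: scale-half ( v -- w )

! The method is only reached for float-4 inputs, so the type of
! the value on the stack is statically known in this body.
M: float-4 scale-half 0.5 v*n ;

M\ float-4 scale-half optimized.
```

Unlike declare, a call with an input of the wrong type fails at dispatch with a no-method error instead of running a primitive on a mistyped value.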

Here is a better version of the interpolate word above that uses hints:

    USING: compiler.tree.debugger hints kernel math.vectors
    math.vectors.simd ;
    IN: simd-demo

    : interpolate ( v a b -- w )
        [ v* ] [ [ 1.0 ] dip n-v v* ] bi-curry* bi v+ ;

    ! Request a specialized code path for float-4 inputs.
    HINTS: interpolate float-4 float-4 float-4 ;

    \ interpolate optimizer-report.

This time, the optimizer report lists calls to both SIMD primitives and high-level vector words, because hints cause two code paths to be generated. The optimized. word can be used to make sure that the fast code path consists entirely of calls to primitives.

If the interpolate word were to be used in several places with different types of vectors, it would be best to declare it inline.
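A minimal sketch of the inline variant follows. The name interpolate-inline is hypothetical, chosen so that it does not clash with the definitions above. Because the body is inlined into each caller, type information from the call site propagates into it; the quotation passed to optimized. stands in for such a call site, with all three inputs known to be float-4 vectors.

```factor
USING: compiler.tree.debugger kernel math.vectors math.vectors.simd ;
IN: simd-demo

! Hypothetical name; same body as interpolate, declared inline.
: interpolate-inline ( v a b -- w )
    [ v* ] [ [ 1.0 ] dip n-v v* ] bi-curry* bi v+ ; inline

! A call site where every input is statically a float-4.
[
    float-4{ 0.5 0.5 0.5 0.5 }
    float-4{ 2.0 2.0 2.0 2.0 }
    float-4{ 0.0 0.0 0.0 0.0 }
    interpolate-inline
] optimized.
```

The same word could be called from another site with a different vector type, such as double-2, and each caller would be optimized for its own input types.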

In the interpolate word, there is still a call to the <tuple-boa> primitive, because the return value at the end is being boxed on the heap. In the next example, no memory allocation occurs at all because the SIMD vectors are stored inside a struct class (see Struct classes); also note the use of inlining:

    USING: accessors alien.c-types classes.struct
    compiler.tree.debugger kernel math math.vectors
    math.vectors.simd ;
    IN: simd-demo

    STRUCT: actor
        { id int }
        { position float-4 }
        { velocity float-4 }
        { acceleration float-4 } ;

    GENERIC: advance ( dt object -- )

    ! velocity += acceleration * dt
    : update-velocity ( dt actor -- )
        [ acceleration>> n*v ] [ velocity>> v+ ] [ ] tri
        velocity<< ; inline

    ! position += velocity * dt
    : update-position ( dt actor -- )
        [ velocity>> n*v ] [ position>> v+ ] [ ] tri
        position<< ; inline

    M: actor advance ( dt actor -- )
        [ >float ] dip
        [ update-velocity ] [ update-position ] 2bi ;

    M\ actor advance optimized.

The compiler.cfg.debugger vocabulary gives a lower-level picture of the generated code, including register assignments and other low-level details. To look at low-level optimizer output, call regs. on a word or quotation:

    USE: compiler.cfg.debugger
    M\ actor advance regs.

Examples of high-performance algorithms that use SIMD primitives can be found in the following vocabularies:

•	benchmark.nbody-simd
•	benchmark.raytracer-simd
•	random.sfmt