Optimization:

  I did some crude speed tests: I timed the test suite with  1e6
data points. "-O3 --unroll-loops" and "-O2" and 
" -march=athlon64 -O2 -pipe -fomit-frame-pointer " all took 33s.
"-Os" took 36s.  This means very little. Eg. perhaps most of the time
was spent in the very simple loops that compute a gaussian function
 1e6 times.

#
 -lf77blas -lcblas -latlas -

The following was useful and may be again

#define FPOINT (void (*)(double *, double *, int, int, void *))