## Monday, July 12, 2010

### benchmarking pitfalls

In this post i'm going to list some of the pitfalls which can happen when you are trying to optimize your code and test the timings of specific code snippets.

As an example, consider the case of testing the performance of custom array class implemented in pysph at http://code.google.com/p/pysph/source/browse/source/pysph/base/carray.pyx?repo=kunalp-alternate , which is a simple substitute for 1D numpy arrays.

Now in orfer to test the performace i wrote a simple benchmark code:

cpdef dict loopget(ns=Ns):
cdef double t, t1, t2, num
cdef int N, i
cdef dict ret = {}
empty = numpy.empty
zeros = numpy.zeros

cdef DoubleArray carr
cdef numpy.ndarray[ndim=1, dtype=numpy.float64_t] narr

for N in ns:
carr = DoubleArray(N)
t = time()
for i in range(N):
num = carr[i]
t = time()-t
narr = zeros(N)
t1 = time()
for i in range(N):
num = narr[i]
t1 = time()-t1
t2 = time()
for i in range(N):
num = carr.data[i]
t2 = time()-t2
ret['carr loopget %d'%N] = t/N
ret['carrd loopget %d'%N] = t2/N
ret['narr loopget %d'%N] = t1/N
return ret

This snippet simply times the retrieval of a value from a custom DoubleArray class (using its data attribute which is a c array) versus a numpy buffer, both should ideally be at c speed, but the numpy buffer is not unless you disable cython array bounds check.
Now if you run it you would be surprized to see the timings:

carr loopget 100                             9.05990600586e-08
carr loopget 1000                            3.38554382324e-08
carr loopget 10000                           3.42130661011e-08
carr loopget 100000                          3.51309776306e-08
carrd loopget 100                            9.53674316406e-09
carrd loopget 1000                           9.53674316406e-10
carrd loopget 10000                          9.53674316406e-11
carrd loopget 100000                         9.53674316406e-12
narr loopget 100                             9.53674316406e-09
narr loopget 1000                            1.90734863281e-09
narr loopget 10000                           1.09672546387e-09
narr loopget 100000                          1.01089477539e-09

Strangely, getting the value in the c data array is extremely fast, and in fact takes the same time independent of the size of the array.
My first thought was that the access time in C was extremely small as compared to the time it took to call the python's time function. However my readings about gcc and compiler optimizations came to my mind.
Trick: Note that the assignment which is tested in the code snippet does not affect any other part of the code, and the variable num is never even read again. Hence the compiler optimizes it away (this technique is called dead code removal). Thus in the case of C array access the assignments do not occur at all. This does not happen for the other two parts because in calling python functions the compiler can never be sure what is done of the variables, and hence cannot reliably determine whether the assignment has any side-effect or not, so that the assignment is not removed while compilation.

Keeping this fact in mind, let us try to modify the test code so that this specific optimization does not take place. Consider our new test code:

cpdef dict loopget(ns=Ns):
cdef double t, t1, t2, num
cdef int N, i
cdef dict ret = {}
empty = numpy.empty
zeros = numpy.zeros
cdef dict d = {}

cdef DoubleArray carr
cdef numpy.ndarray[ndim=1, dtype=numpy.float64_t] narr

for N in ns:
carr = DoubleArray(N)
t = time()
for i in range(N):
num = carr[i]
t = time()-t
d[num] = num
narr = zeros(N)
t1 = time()
for i in range(N):
num = narr[i]
t1 = time()-t1
d[num] = num
t2 = time()
for i in range(N):
num = carr.data[i]
t2 = time()-t2
d[num] = num
ret['carr loopget %d'%N] = t/N
ret['carrd loopget %d'%N] = t2/N
ret['narr loopget %d'%N] = t1/N
return ret

The purpose of these added statements is to make sure that the assignment to num is not useless and that the compiler does not optimize it away. Since the new statements occur outside the time() calls it shouldn't affect our tests.
Let us now check the new timings:

carr loopget 100                             1.19209289551e-07
carr loopget 1000                            4.19616699219e-08
carr loopget 10000                           4.52041625977e-08
carr loopget 100000                          4.62603569031e-08
carrd loopget 100                            9.53674316406e-09
carrd loopget 1000                           3.09944152832e-09
carrd loopget 10000                          1.31130218506e-09
carrd loopget 100000                         1.07049942017e-09
narr loopget 100                             2.14576721191e-08
narr loopget 1000                            2.86102294922e-09
narr loopget 10000                           1.69277191162e-09
narr loopget 100000                          2.29835510254e-09

As you can see now the times are more reasonable.
Conclusion: Timing testing code is not a trivial thing to do :)