Tuesday, July 13, 2010

Nautilus media tag editor extension by yours truly

Over this weekend, I was looking for something to do (not that I don't have tasks piled up), and I remembered there was no easy way to edit the tags of audio files. I know there are some excellent programs out there, such as EasyTAG and Ex Falso. However, they require you to run a separate program to edit the tags.
There is also a nautilus extension provided by totem that displays the metadata in the file properties dialog, but it cannot edit it. I thought it might be a good idea to make a metadata editor extension for nautilus, so that's what I came up with.


Installation:
The extension requires that Ex Falso be installed on your computer (since it directly uses the Ex Falso metadata editor). On Fedora, it's a simple command:

$ yum install quodlibet

The extension itself is a single Python file, which you need to put into the directory ~/.nautilus/python-extensions/ (create the directory if it does not exist). The file is available at: https://sites.google.com/site/pankaj86/files/media_tag_editor.py.
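In short, the install is just (assuming you have already downloaded media_tag_editor.py):

```shell
# Create the directory that nautilus-python scans for extensions
# (harmless if it already exists), then drop the file inside it.
EXT_DIR="$HOME/.nautilus/python-extensions"
mkdir -p "$EXT_DIR"
echo "Copy media_tag_editor.py into: $EXT_DIR"
```

Restart nautilus afterwards so it picks up the new extension.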

Now the mandatory screenshot:


Monday, July 12, 2010

benchmarking pitfalls

In this post I'm going to list some of the pitfalls you can run into when trying to optimize your code and time specific code snippets.

As an example, consider testing the performance of the custom array class implemented in pysph at http://code.google.com/p/pysph/source/browse/source/pysph/base/carray.pyx?repo=kunalp-alternate , which is a simple substitute for 1D numpy arrays.

Now, in order to test the performance, I wrote a simple benchmark:

cpdef dict loopget(ns=Ns):
    # Assumes: from time import time; DoubleArray from carray.pyx;
    # Ns is a list of array sizes to test.
    cdef double t, t1, t2, num
    cdef int N, i
    cdef dict ret = {}
    empty = numpy.empty
    zeros = numpy.zeros

    cdef DoubleArray carr
    cdef numpy.ndarray[ndim=1, dtype=numpy.float64_t] narr

    for N in ns:
        carr = DoubleArray(N)
        t = time()                  # time DoubleArray __getitem__
        for i in range(N):
            num = carr[i]
        t = time()-t
        narr = zeros(N)
        t1 = time()                 # time numpy buffer access
        for i in range(N):
            num = narr[i]
        t1 = time()-t1
        t2 = time()                 # time raw C array access via .data
        for i in range(N):
            num = carr.data[i]
        t2 = time()-t2
        ret['carr loopget %d'%N] = t/N
        ret['carrd loopget %d'%N] = t2/N
        ret['narr loopget %d'%N] = t1/N
    return ret

This snippet times the retrieval of a value from the custom DoubleArray class (via its data attribute, which is a plain C array) versus a numpy buffer. Both should ideally run at C speed, but the numpy buffer access does not unless you disable Cython's array bounds checking (the boundscheck directive).
Now, if you run it, you would be surprised to see the timings:

carr loopget 100                             9.05990600586e-08
carr loopget 1000                            3.38554382324e-08
carr loopget 10000                           3.42130661011e-08
carr loopget 100000                          3.51309776306e-08
carrd loopget 100                            9.53674316406e-09
carrd loopget 1000                           9.53674316406e-10
carrd loopget 10000                          9.53674316406e-11
carrd loopget 100000                         9.53674316406e-12
narr loopget 100                             9.53674316406e-09
narr loopget 1000                            1.90734863281e-09
narr loopget 10000                           1.09672546387e-09
narr loopget 100000                          1.01089477539e-09

Strangely, getting the value from the C data array is extremely fast, and in fact the loop takes the same total time independent of the size of the array.
My first thought was that the access time in C was extremely small compared to the time it takes to call Python's time() function. But then my readings about gcc and compiler optimizations came to mind.
Trick: Note that the assignment being timed does not affect any other part of the code, and the variable num is never read again. Hence the compiler optimizes it away (this technique is called dead code elimination). Thus, in the case of C array access, the assignments never occur at all. This does not happen for the other two loops, because they call into Python: the compiler can never be sure what is done with the variables there, and hence cannot reliably determine whether the assignment has side effects, so the assignment is not removed during compilation.
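The general defence, in any language, is to give the benchmarked work a visible side effect. Here is a minimal pure-Python sketch of the pattern (CPython itself does not do dead code elimination, but the same discipline carries over to compiled Cython/C code; the bench helper is mine, not from pysph):

```python
from time import perf_counter

def bench(fn, n):
    """Time fn(i) over n calls, accumulating the results into a sink
    so the work stays 'live' and a compiler could not discard it."""
    sink = 0.0
    start = perf_counter()
    for i in range(n):
        sink += fn(i)
    elapsed = perf_counter() - start
    # Returning the sink makes the loop's result observable, which is
    # exactly what defeats dead code elimination.
    return elapsed / n, sink

per_call, total = bench(float, 1000)
```

Anything that makes the result observable works: returning it, storing it in a container, or printing it after the timed region.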

Keeping this fact in mind, let us try to modify the test code so that this specific optimization does not take place. Consider our new test code:

cpdef dict loopget(ns=Ns):
    cdef double t, t1, t2, num
    cdef int N, i
    cdef dict ret = {}
    empty = numpy.empty
    zeros = numpy.zeros
    cdef dict d = {}

    cdef DoubleArray carr
    cdef numpy.ndarray[ndim=1, dtype=numpy.float64_t] narr

    for N in ns:
        carr = DoubleArray(N)
        t = time()
        for i in range(N):
            num = carr[i]
        t = time()-t
        d[num] = num    # consume num so the loop above is not dead code
        narr = zeros(N)
        t1 = time()
        for i in range(N):
            num = narr[i]
        t1 = time()-t1
        d[num] = num    # consume num so the loop above is not dead code
        t2 = time()
        for i in range(N):
            num = carr.data[i]
        t2 = time()-t2
        d[num] = num    # consume num so the loop above is not dead code
        ret['carr loopget %d'%N] = t/N
        ret['carrd loopget %d'%N] = t2/N
        ret['narr loopget %d'%N] = t1/N
    return ret

The purpose of the added d[num] = num statements is to make sure that the assignment to num is not useless, so the compiler cannot optimize it away. Since the new statements occur outside the time() calls, they shouldn't affect our measurements.
Let us now check the new timings:

carr loopget 100                             1.19209289551e-07
carr loopget 1000                            4.19616699219e-08
carr loopget 10000                           4.52041625977e-08
carr loopget 100000                          4.62603569031e-08
carrd loopget 100                            9.53674316406e-09
carrd loopget 1000                           3.09944152832e-09
carrd loopget 10000                          1.31130218506e-09
carrd loopget 100000                         1.07049942017e-09
narr loopget 100                             2.14576721191e-08
narr loopget 1000                            2.86102294922e-09
narr loopget 10000                           1.69277191162e-09
narr loopget 100000                          2.29835510254e-09

As you can see, the times are now much more reasonable.
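For one-off measurements in pure Python, the standard library's timeit module already guards against several of these pitfalls (it disables garbage collection during runs and repeats the statement many times; taking the minimum of several repeats filters out scheduler noise):

```python
import timeit

# Time a single numpy element access, averaged over many iterations.
times = timeit.repeat(stmt="x = arr[50]",
                      setup="import numpy; arr = numpy.zeros(100)",
                      number=100000, repeat=3)
per_call = min(times) / 100000
```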
Conclusion: timing code is not a trivial thing to do :)

Friday, July 9, 2010

nearest particle search

Nearest neighbour particle search (NNPS) is a common requirement of (meshfree) particle methods such as SPH. The requirement is to locate all particles within a fixed distance (the kernel support) of a specified particle, and the trick is to avoid a brute-force distance comparison of every particle against every other particle (O(N^2)). There are many techniques available for this. One of the simplest, for a fixed kernel support across all particles, is to bin the particles into cells and then search only the neighbouring bins. Such a technique is implemented here: http://code.google.com/p/pysph/source/browse/source/pysph/base/nnps.pyx?repo=kunalp-alternate.
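A minimal pure-Python 2D sketch of the binning idea (the names bin_particles and neighbours are mine for illustration, not pysph's API; the real code is Cython and handles 3D):

```python
from collections import defaultdict
from math import floor, hypot

def bin_particles(points, cell):
    """Hash each particle index into an (ix, iy) cell of side `cell`."""
    bins = defaultdict(list)
    for i, (x, y) in enumerate(points):
        bins[(floor(x / cell), floor(y / cell))].append(i)
    return bins

def neighbours(points, bins, cell, q, h):
    """Indices of all particles within distance h of point q.
    Correct as long as cell >= h: every candidate must lie in the
    query point's cell or one of the 8 surrounding cells."""
    cx, cy = floor(q[0] / cell), floor(q[1] / cell)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for i in bins.get((cx + dx, cy + dy), ()):
                if hypot(points[i][0] - q[0], points[i][1] - q[1]) <= h:
                    found.append(i)
    return sorted(found)
```

Instead of testing all N particles per query, each query now only touches the particles in 9 cells (27 in 3D), which is what turns the O(N^2) pairwise search into roughly O(N) for uniformly distributed particles.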
Here I'm gonna present some timings for the NNPS. Note that the timings are old and also include some constant extra time from other operations (calls to the numpy rand() function, which I've since converted to the C rand() function).

Here are the timing results (click on the image to view the raw data sheet):



As you can see, the bin size should be at least three times the kernel support to get good performance.


Wednesday, July 7, 2010

aero nebula cluster

For those who do not know, I'm currently in the last year of my DD program in Aerospace engineering, and my project is the implementation of a solid mechanics code using SPH (smoothed particle hydrodynamics), integrated into the pysph project.
pysph is basically an SPH implementation framework written in Python/Cython. (Now you know the reason for all those optimization posts :) ).
Most CFD codes need to be run in parallel on clusters to reduce the time required. So I just looked at the specs of the nebula cluster (on which I have a login) in the aero department. It's really wonderful. The specs are:
20 nodes (15 working), each node with 12 six-core AMD Opteron 2427 processors at 2.2 GHz clock speed and 12 GB RAM; in all, 180 six-core processors.
This is sure gonna make parallelizing much more fun and interesting.
PS: I just read, and saw quite a few videos from Google about, their patented MapReduce technique. It would be interesting to implement SPH in MapReduce and let it run in the "cloud", the buzzword of today.