Sunday, May 30, 2010

research made simple with zotero

If you do study or research of any kind (academic or not) that involves reading things up and keeping track of them, you are in for a great productivity boost. This will help whether you are reading books, news, articles, Wikipedia, journals or anything of that sort. The tool I'm talking about is Zotero.
With Zotero you can save proper bibliographic references for much of the material you see on the internet, then manage, search and cite them in various formats. It's really difficult to describe all the wonderful things Zotero can do for your research, so it is well worth watching the screencast: http://www.zotero.org/support/screencast_tutorials/zotero_tour

Some features you'll find helpful:
Collect:

Single-click saving of references. For example, a single click on any ScienceDirect article, if you have a subscription (as my college does), saves all information about the article, including the PDF if it is available (with a well-chosen name instead of fulltext.pdf).
To enable saving PDFs, go to Zotero Preferences -> General tab and select 'automatically attach associated pdfs and other files when saving'.
In the Search tab of the preferences, you may also want to enable indexing of PDFs.
On pages that reference many articles (Wikipedia references, 'cited by' lists in Scopus, etc.), clicking lets you easily select which of the references to save.

Manage:

You can search all your saved articles and add notes, tags, etc.
You can group all articles in collections based on topic
You can create saved searches based on various criteria

Cite:

To cite one or more articles, select them, right-click, choose 'create bibliography from selected articles', pick a style from the many available (including all popular journals), and you are done.
If you use BibTeX to manage the bibliography for your article, select the articles, right-click, choose 'export selected items' and select the BibTeX format.
Zotero plugins are available for OpenOffice and MS Office too, so you can easily insert references into your articles without the pain of collecting and formatting them.

If you work in a team, syncing is a really wonderful feature. Create a simple login on the Zotero server (you can also use OpenID).
In Zotero Preferences -> Sync tab, enter your Zotero login details and enable syncing of 'my library' and group libraries.
All synced items (including attached PDFs) are available anywhere on the internet, even without installing the Zotero add-on. You just need to log in to the Zotero site to see your collection. This is very useful if your college has access to some journals but you are away at a conference and need to check an article. 100 MB of space is available for free, and you can buy more.
'My library' is your personal collection. Group libraries are shared collections, which you can share with the people you are working with.

So what are you waiting for? Install it now. If you have not installed it yet, watch the screencast http://www.zotero.org/support/screencast_tutorials/zotero_tour now.

Saturday, May 29, 2010

numpy array performance / divide and conquer considered harmful

This is again a post about Python code speed. The data and inferences are more than a few months old but still valid.
Here's a spreadsheet showing the timings of array math operations (+, -, *, /) on numpy arrays and Python lists:
https://spreadsheets.google.com/ccc?key=0AomYDYyBBNkkdHAtMkdHMF9TZ29lMmZQV3UwYkxWNFE&hl=en

The operations I considered for comparison were:

x+0.1
x-0.1
x*0.1
x/0.1
x*(1/0.1)
x+y
x-y
x*y
x/y
[p+yp[j] for j,p in enumerate(xp)]
[xp[j]+yp[j] for j in xrange(i)]

where x and y are numpy arrays, xp and yp are python lists, all of size N which is varied for the comparison.
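The script that produced the timings is not part of this post, so here is only a minimal sketch of how such a comparison could be set up. It is written for Python 3 (so range replaces xrange), and the helper names time_op and run, the chosen sizes and the repeat count are all assumptions made for illustration.

```python
import timeit

import numpy as np

# A subset of the operations listed above; n plays the role of the size N.
STATEMENTS = [
    "x + y",
    "x / y",
    "[p + yp[j] for j, p in enumerate(xp)]",
    "[xp[j] + yp[j] for j in range(n)]",
]


def time_op(stmt, n, repeats=200):
    """Return the mean time per evaluation of stmt, in microseconds."""
    x = np.linspace(1.0, 2.0, n)
    y = np.linspace(2.0, 3.0, n)
    # numpy arrays and equivalent Python lists, exposed to the timed statement
    namespace = {"x": x, "y": y, "xp": x.tolist(), "yp": y.tolist(), "n": n}
    total = timeit.timeit(stmt, globals=namespace, number=repeats)
    return 1e6 * total / repeats


def run(sizes=(1, 10, 100, 1000)):
    """Time every statement for every size; keys are (statement, size)."""
    return {(stmt, n): time_op(stmt, n) for n in sizes for stmt in STATEMENTS}


if __name__ == "__main__":
    for (stmt, n), t in run().items():
        print(f"N={n:5d}  {t:10.3f} us  {stmt}")
```

The absolute numbers depend on the machine and on the numpy version, so only the trend with N is worth comparing against the spreadsheet.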
The raw timings data is available here:

https://spreadsheets.google.com/pub?key=0AomYDYyBBNkkdHAtMkdHMF9TZ29lMmZQV3UwYkxWNFE&hl=en&output=html

See the timings plot yourself

Conclusion:

Use numpy arrays when N > 10.
Avoid division as much as you can to improve the speed of your numerical code.
Instead of x/0.1, write x*(1/0.1). This alone gives a large speedup as N increases.
x/0.1 and x/y take almost the same time.
+, - and * take almost the same time; / takes much longer, and its cost grows as N increases.
Once again, do not divide.
The same holds in Cython code. Avoid division there too, even if you are working with plain doubles instead of numpy arrays (buffers). Rewrite expressions to minimize the use of the division operator.
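To check the division advice on your own machine, something like the following sketch could be used. The function name compare_division and the default size and repeat count are assumptions made for illustration, and the size of the gap (if any) will depend on hardware and numpy version.

```python
import timeit

import numpy as np


def compare_division(n=100_000, repeats=200):
    """Time x/0.1 against x*(1/0.1) and check that the results agree."""
    x = np.linspace(1.0, 2.0, n)
    inv = 1 / 0.1  # reciprocal computed once, outside the timed expression
    t_div = timeit.timeit(lambda: x / 0.1, number=repeats)
    t_mul = timeit.timeit(lambda: x * inv, number=repeats)
    # The two forms can differ in the last bits, so compare with a tolerance.
    same = np.allclose(x / 0.1, x * inv)
    return t_div, t_mul, same


if __name__ == "__main__":
    t_div, t_mul, same = compare_division()
    print("x/0.1     :", t_div)
    print("x*(1/0.1) :", t_mul)
    print("results agree within tolerance:", same)
```

Note that multiplying by a precomputed reciprocal is not guaranteed to be bit-identical to dividing, which is why the check uses np.allclose rather than exact equality.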

Wednesday, May 26, 2010

cython timings test

The TASK : To optimize cython functions

Detailed: functions which depend on a once initialized attribute value

This comes in handy in many cases. For example, to write a Laplacian function of a scalar field in a spherical/axisymmetric coordinate system, you would need three independent cases for 1, 2 and 3 dimensions for performance reasons, unless you write all functions as general 3D functions.

The test CODE : test_kernel.pyx

cdef class Kernel:
    cdef int dim
    cdef double (*func)(Kernel,double)
    def __init__(self, dim=1):
        self.dim = dim
        if dim == 1:
            self.func = self.func1
        elif dim == 2:
            self.func = self.func2

    cdef double func1(self, double x):
        return 1+x

    cdef double func2(self, double x):
        return 2+x

    cdef double c_func(self, double x):
        '''this is only to make function signature compatible with func1 and func2'''
        return self.func(self, x)

    def p_func(self, double x):
        return self.func(self, x)

    cpdef double py_func(self, double x):
        return self.func(self, x)

    cpdef double py_c_func(self, double x):
        return self.c_func(x)

    def py_func1(self, x):
        return self.func1(x)

    def py_func2(self, x):
        return self.func2(x)

    cdef double func_common(self, double x):
        cdef int dim = self.dim
        if dim == 1:
            return 10+x
        elif dim == 2:
            return 20+x

    def py_func_c_common(self, x):
        return self.func_common(x)

    cpdef double py_func_common(self, double x):
        cdef int dim = self.dim
        if dim == 1:
            return 10+x
        elif dim == 2:
            return 20+x

Compilation command:
   cython -a test_kernel.pyx;
   gcc <optimization-flag> -shared -fPIC test_kernel.c -lpython2.6 -I /usr/include/python2.6/ -o test_kernel.so
where optimization flag is either empty or "-O2" or "-O3"
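If you want to rebuild the module once per optimization flag, the gcc line can be assembled programmatically. This is only a sketch: build_command is a made-up helper, it takes the include directory from the running interpreter instead of the hard-coded /usr/include/python2.6/, and it leaves out the -lpython2.6 link flag from the command above, which you may need to add back on your platform.

```python
import sysconfig


def build_command(opt_flag="", source="test_kernel.c", output="test_kernel.so"):
    """Return the gcc argument list for one optimization flag ('' for none)."""
    include_dir = sysconfig.get_paths()["include"]  # headers of this interpreter
    parts = ["gcc"]
    if opt_flag:
        parts.append(opt_flag)
    parts += ["-shared", "-fPIC", source, "-I", include_dir, "-o", output]
    return parts


if __name__ == "__main__":
    # Run "cython -a test_kernel.pyx" first so that test_kernel.c exists.
    for flag in ("", "-O2", "-O3"):
        print(" ".join(build_command(flag)))
```

Each list could be handed to subprocess.run to perform the actual build before running the timing script.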

Cython optimization
Tip 1:
Type (cdef) as many variables as you can. You also need to type the locals in each function. Try to use C data types wherever possible.
Tip 2:
use:
   cython -a file.pyx
command to generate an HTML file that shows which lines cause expensive Python functions to be called. Clicking on a line shows the corresponding generated C code, with expensive calls highlighted in shades of red. Try to eliminate as many such calls as you can.

The TEST :

time_kernel.py

import timeit

def time(s):
    '''returns time in microseconds'''
    # total seconds for 1e6 runs, converted to microseconds per call
    t = 1e6*timeit.timeit(s,'import test_kernel;k1=test_kernel.Kernel(1);k2=test_kernel.Kernel(2);',number=1000000)/1000000.
    print s, t
    return t

time('k1.p_func(0)')
time('k1.py_func(0)')
time('k1.py_func1(0)')
time('k1.py_c_func(0)')
time('k1.py_func_c_common(0)')
time('k1.py_func_common(0)')

time('k2.p_func(0)')
time('k2.py_func(0)')
time('k2.py_func2(0)')
time('k2.py_c_func(0)')
time('k2.py_func_c_common(0)')
time('k2.py_func_common(0)')
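The helper above reports a single measurement per statement. A variant that repeats the measurement and reports a mean and standard deviation could look like the sketch below. It is written for Python 3, the name time_stats and its default arguments are assumptions, and timeit.repeat and the statistics module come from the standard library.

```python
import statistics
import timeit


def time_stats(stmt, setup="pass", number=100000, repeat=5):
    """Return (mean, stdev) of the per-call time in microseconds."""
    # timeit.repeat returns one total time (in seconds) per repetition
    totals = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    per_call = [1e6 * total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


if __name__ == "__main__":
    # Stand-in statement; with the compiled module available you would pass
    # setup="import test_kernel; k1 = test_kernel.Kernel(1)"
    # and stmt="k1.p_func(0)" instead.
    mean, sd = time_stats("abs(-1.0)")
    print("abs(-1.0): %.4f +/- %.4f us" % (mean, sd))
```

With a spread attached to each value, it becomes possible to judge whether differences of a few nanoseconds between two functions are larger than the run-to-run noise.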

Timings :

	function	None (μs)	-O2 (μs)	-O3 (μs)	mean over flags (μs)	(k1+k2)/2 (μs)	penalty (ns)
1	k1.p_func(0)	0.20178	0.18321	0.18035	0.18845	0.19368	0.0000
2	k1.py_func(0)	0.23224	0.18599	0.18393	0.20072	0.19541	1.7345
3	k1.py_func1(0)	0.21477	0.18991	0.19252	0.19907	0.19802	4.3456
4	k1.py_c_func(0)	0.23395	0.19196	0.19243	0.20611	0.19761	3.9311
5	k1.py_func_c_common(0)	0.19566	0.18458	0.19062	0.19029	0.19767	3.9960
6	k1.py_func_common(0)	0.21981	0.18707	0.18984	0.19891	0.19510	1.4237
7	k2.p_func(0)	0.20448	0.18388	0.18194	0.19010
8	k2.py_func(0)	0.21798	0.18859	0.18437	0.19698
9	k2.py_func2(0)	0.20413	0.18124	0.18194	0.18910
10	k2.py_c_func(0)	0.23114	0.19166	0.19238	0.20506
11	k2.py_func_c_common(0)	0.19860	0.18783	0.18745	0.19129
12	k2.py_func_common(0)	0.21609	0.18747	0.18640	0.19666
	Average	0.21560	0.18681	0.18703	0.19648

Result :

The best is to write separate C function and a python accessor function.

task	function	penalty cost (ns)
C function + python accessor : base case	p_func
cpdef instead of def	py_func	1.7345
calling a cdef class method instead of a function pointer attribute	py_func1,py_func2	4.3456
one extra c function call	py_c_func	3.9311
(def + cdef) instead of (cpdef)	py_func_c_common-py_func_common	2.5723
One C comparison vs one C function call	py_func_common	1.4237
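The penalty values can be reproduced from the (k1+k2)/2 column of the timings table, taken as given: each penalty is the difference from the p_func base case, converted from microseconds to nanoseconds. The dictionary and function names in this sketch are made up for illustration, and because the table values are rounded to five decimals the results match the listed penalties only to about 0.01 ns.

```python
# (k1+k2)/2 column of the timings table, in microseconds
MEAN_US = {
    "p_func": 0.19368,
    "py_func": 0.19541,
    "py_func1,py_func2": 0.19802,
    "py_c_func": 0.19761,
    "py_func_c_common": 0.19767,
    "py_func_common": 0.19510,
}
BASE = "p_func"  # C function + Python accessor, the base case


def penalty_ns(name):
    """Extra cost of a variant over the base case, in nanoseconds."""
    return 1000.0 * (MEAN_US[name] - MEAN_US[BASE])


if __name__ == "__main__":
    for name in MEAN_US:
        print("%-20s %.2f ns" % (name, penalty_ns(name)))
    # (def + cdef) instead of (cpdef) is the difference of two penalties
    diff = penalty_ns("py_func_c_common") - penalty_ns("py_func_common")
    print("%-20s %.2f ns" % ("def+cdef vs cpdef", diff))
```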

Conclusion :

As can be clearly seen, the results are clearly inconclusive :)
This was a small test carried out on my laptop with no controlled environment. Also, though the results seemed close to repeatable, many trials should be conducted and each value should come with a standard deviation to check the repeatability. However, one clear conclusion is: do not forget to add optimization flags. Setuptools already does that for you.
Also, using a function pointer is not so bad after all. It would become more advantageous with a larger number of comparisons.
Cython provides great speedups (who didn't know that :) ). The pure Python version of py_func_common took 0.408 μs for dim=1 and 0.518 μs for dim=2.
These results are purely from the Python point of view. The effect of cdef/cpdef should also be considered in C/Cython code that calls these functions.
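The pure Python version used for the comparison is not shown in this post. A plausible reconstruction, assuming it simply mirrors py_func_common from test_kernel.pyx without any type declarations, is sketched below. The class name PyKernel is made up, and the code is written so that it also runs on Python 3.

```python
import timeit


class PyKernel(object):
    """Untyped Python counterpart of Kernel.py_func_common (assumed form)."""

    def __init__(self, dim=1):
        self.dim = dim

    def py_func_common(self, x):
        dim = self.dim
        if dim == 1:
            return 10 + x
        elif dim == 2:
            return 20 + x


def time_call(kernel, number=100000):
    """Return the mean time of kernel.py_func_common(0) in microseconds."""
    total = timeit.timeit(lambda: kernel.py_func_common(0), number=number)
    return 1e6 * total / number


if __name__ == "__main__":
    for dim in (1, 2):
        print("dim=%d: %.3f us" % (dim, time_call(PyKernel(dim))))
```

The lambda wrapper adds a small call overhead of its own, so the numbers are not directly comparable with the string-based timeit calls used in time_kernel.py.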

CAVEAT:

I am no optimization expert. I have done this out of sheer boredom :)
If anyone wants to verify, you are welcome.
Any information content is purely coincidental.

novice blogger
