Saturday, November 9, 2013

Python3 Adventures

At my $dayjob, I work with Python, but mostly Python 2.7. As you might be aware, Python 3 has been out in the wild for a very long time, yet many projects have not embraced it as much as the Python core devs would like. Our company has been one of the many holdouts, partly because there are few compelling business reasons to move and not many advantages to moving (which is slowly changing with each new Python 3.x minor release), and partly because there is much higher-priority work to be done.

However, thanks to some members of the community (notably burnpanck, who has Python 3 compatible, though not 100%, forks of parts of our core ETS stack, starting with traits, pyface and traitsui), this is slowly changing.
Based on his work, I have attempted to make Chaco work on Python 3, and have largely been successful. The most significant changes were in the extension modules of Enable, as expected, since the setup.py uses 2to3 to migrate the Python code. Enable's Kiva renderer, which uses AGG underneath, has a significant amount of extension code. Luckily, most of it is written in SWIG, which automatically takes care of Python 2 and Python 3 compatibility, though in one place I had to modify the SWIG interface file for the bytes<->str<->unicode madness. Chaco also has an extension module for contour plots which uses the direct Python<->C API, which I could make compatible with both 2 and 3 with some #ifdefs.

In short, migrating from 2 to 3 is not really difficult for most programs. 2to3 makes it even less effort; you just have to figure out the things which 2to3 cannot handle, which is easy if you have tests.
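For example, letting 2to3 run automatically at build time only takes a flag in setup.py. Here is a minimal sketch using the setuptools use_2to3 option (not Enable's actual setup.py; the package name is made up):

# setup.py -- minimal sketch; with use_2to3=True, setuptools runs 2to3 over
# the sources at build time when installing under Python 3 (it is a no-op
# on Python 2).
from setuptools import setup, find_packages

setup(
    name="mypackage",          # hypothetical package name
    version="0.1",
    packages=find_packages(),
    use_2to3=True,
)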

The proof:


Current status: Enable and Chaco seem to work very well under Python 3. All Enable tests pass, though a few Chaco tests fail. Also, the Chaco toolbar overlay seems to segfault. All other examples in the Chaco demo run very well.
The story is not as good for traits and traitsui, though. The traits `implements` and `adapts` class advisors do not work in Python 3. They are already deprecated in traits, so you should use the corresponding replacements from the new `traits.adaptation` package, even on Python 2. The same work also needs to be done for traitsui, which uses the old mechanism in quite a lot of places.
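To give an idea of what that migration involves, here is a rough sketch assuming the traits 4.x adaptation API (the `provides` decorator and `traits.adaptation.api.register_factory`); it is not code from the branches below, so check the traits documentation for the exact names in your version:

from traits.api import Adapter, HasTraits, Interface, Str, provides
from traits.adaptation.api import adapt, register_factory


class IName(Interface):
    name = Str


# Old style: class Person(HasTraits): implements(IName)
@provides(IName)
class Person(HasTraits):
    name = Str


class Document(HasTraits):
    title = Str


# Old style: class DocumentToIName(Adapter): adapts(Document, IName)
class DocumentToIName(Adapter):
    name = Str

    def _name_default(self):
        return self.adaptee.title


# The adapts() class advisor is replaced by an explicit registration.
register_factory(DocumentToIName, Document, IName)

print(adapt(Document(title="Report"), IName).name)   # -> "Report"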

Python3 branches of ETS components:
Traits: https://github.com/pankajp/traits/tree/python3 (derived from burnpanck)
Pyface: https://github.com/pankajp/pyface/tree/python3 (derived from burnpanck)
TraitsUI: https://github.com/pankajp/traitsui/tree/python3 (derived from burnpanck)
Enable: https://github.com/pankajp/enable/tree/python3
Chaco: https://github.com/pankajp/chaco/tree/python3

PS: If you are as excited about this as I am, please help the cause by using the python3 branches in your projects (they also work with Python 2.7), and submit issues and pull requests for both Python 2 and Python 3 compatibility. Thanks.

PPS: If you are using a newer gcc version (tested with gcc 4.8.2) and experience segfaults in Enable/Chaco image plots (Enable's `integration_tests/kiva/agg/image_test_case.py` segfaults for you), then you need to disable all compiler optimizations (export CFLAGS='-O0'; export CXXFLAGS='-O0') while building Enable, because apparently gcc's optimizer eliminates some necessary code.
As an alternative, you can use the clang compiler (export CC='clang'; export CXX='clang++'), which doesn't exhibit this problem at any optimization level. I have yet to figure out whether it is a gcc bug or a bug in Enable.

Sunday, October 20, 2013

cgroups: Maintain sanity of your development system while you screw it

I am sure many of us developers run into an unresponsive machine while doing our usual work (developing/debugging), due to a rogue or buggy process consuming all of the system's RAM and swap. It is usually not very common, but it happens fairly often for me. For example, if in an IPython shell you start allocating arrays in a loop and add an extra zero to the size, or simply keep references to the arrays so they are never garbage collected, it is easy to bog down the system if you are not careful.

For me, the most common instance of this has been using gdb. I usually install all the relevant debuginfo packages available in the Fedora repo to enhance my debugging experience. One downside of this is that gdb consumes a lot of memory, and coupled with Python pretty-printing and the `thread apply all bt full` command, gdb can go haywire, especially when the stack/heap is corrupted after a segfault. Previously, I would have no recourse but to hard-reboot my machine; I once timed my system to be unresponsive for more than half an hour.

The Cause:

Linux is generally good at time-slicing and not letting rogue processes DoS the system, except for the memory/swap part. The swap system in Linux is still not very intelligent about which portions of memory to swap out when under pressure; it seems to swap out the data and the executable code of processes equally aggressively.
Here's what happens when you start a process which consumes a lot of memory (more than the total amount of RAM on your system), forcing the system to swap:
  • First, the data of inactive processes is swapped out; you can see your machine's swap usage increasing, but it doesn't seem to affect the system's responsiveness yet.
  • Next, the executable code of processes is swapped out too, and this is when things go out of control.
When the latter happens, your machine is screwed, since your terminal process, bash, the window manager, gnome-shell and the X server have all likely had their executable code swapped out and cannot respond to you any time soon, and there are only two ways to recover: 1) hope the rogue process quickly dies, which is not very likely on my system with 8GB of RAM plus swap (it takes a looong time to allocate that much memory while your executable code is itself swapped out); 2) hard reboot.

Not any more: I present to you a mechanism to maintain the sanity of the system by limiting the amount of RAM your development processes can consume.

A Workaround:

Welcome to cgroups (Control Groups), a feature of the Linux kernel for managing the resources of processes. You can read more about cgroups in the Fedora guide at http://docs.fedoraproject.org/en-US/Fedora/17/html/Resource_Management_Guide/index.html .

Here, I will describe a simple way to restrict the cumulative memory (RAM) consumption of all processes started in a bash terminal, so that essential system processes have sufficient RAM available to maintain system responsiveness.

The various tools we will use are provided by the libcgroup-tools package, so install it first using yum, or install the equivalent package on your distro.

The two important services I will use are cgconfig.service and cgred.service, but before enabling and starting them we will configure them. Their config files are located at /etc/cgconfig.conf and /etc/cgrules.conf respectively.

Here's what we will configure cgroups for:
1) Create a bash_memory group for the memory subsystem and limit the total RAM consumption to slightly less than the total available RAM, which on my system is 8GB, so I have set the limit to 6GB. This is done via the /etc/cgconfig.conf file. Add the following content to it:

group bash_memory {
        memory {
                memory.soft_limit_in_bytes="5583457480";
                memory.limit_in_bytes="6442450944";
        }
}

The first line in the memory subsystem states that in case of memory pressure, when the system is actively thrashing, the memory of all processes in the bash_memory cgroup will be reclaimed (by discarding caches and swapping out dirty pages) to reduce their physical memory usage to about 5.2 GB.
The second line states that the total physical memory consumption of all processes in the bash_memory cgroup will never exceed 6GB; beyond that, their memory will be swapped out instead.
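In case you are wondering where those numbers come from, they are just GiB values expressed in bytes (the soft limit above is roughly 5.2 GiB):

# Quick arithmetic behind the values in cgconfig.conf above.
print(6 * 1024 ** 3)      # 6442450944  -> memory.limit_in_bytes
print(5.2 * 1024 ** 3)    # ~5583457485 -> approximately memory.soft_limit_in_bytes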

You can use similar mechanisms (using subsystems other than memory) to limit various resources, like CPU usage, disk IO, network bandwidth, etc.

2) Now all we need to do is add processes to the bash_memory cgroup. We will do this via the cgred service by adding a rule to the /etc/cgrules.conf file. Add the following line to the file:

*:bash          memory          bash_memory/

The first column says that the rule applies to the bash process of all users, the second says that the memory controller is being set, and the third says that the bash_memory group is applied for the memory controller. Now the cgred.service will take care of automatically applying the bash_memory group to all bash processes whenever they are started. Due to the inheritance of cgroups, all subprocesses started by bash (and their subprocesses too) will belong to the bash_memory cgroup, thus limiting their cumulative RAM consumption. You can verify this with the small sketch below.
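Here is a quick sanity check, a sketch which assumes the cgroup v1 layout with the memory controller mounted at /sys/fs/cgroup/memory and the bash_memory group from above; run it from a shell started after cgred is active:

from __future__ import print_function


def current_cgroups(pid="self"):
    """Return a {subsystem: group_path} mapping from /proc/<pid>/cgroup."""
    groups = {}
    with open("/proc/%s/cgroup" % pid) as f:
        for line in f:
            _, subsystems, path = line.strip().split(":", 2)
            for subsystem in subsystems.split(","):
                groups[subsystem] = path
    return groups


def memory_limit(group="bash_memory"):
    """Read the hard limit configured for the group, in bytes."""
    with open("/sys/fs/cgroup/memory/%s/memory.limit_in_bytes" % group) as f:
        return int(f.read())


print(current_cgroups().get("memory"))        # expect: /bash_memory
print(memory_limit() / (1024.0 ** 3), "GiB")  # expect: ~6.0 GiB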

You can add more lines to limit specific users or processes.

3) Now all that remains is to start the services and reap the benefits.

sudo systemctl enable cgconfig.service
sudo systemctl start cgconfig.service
sudo systemctl enable cgred.service
sudo systemctl start cgred.service

The Linux init daemon systemd uses cgroups to control various services, and you can set specific limits on all services on a systemd-based system. Also, the lxc project (lightweight Linux containers) uses cgroups and the related namespace functionality to sandbox Linux containers. This feature of Linux is being used to great benefit by various projects.

Please comment if you find this useful or have some better suggestions :)

PS: It seems the gdb memory hog bug which started all this adventure has been fixed: https://bugzilla.redhat.com/show_bug.cgi?id=1013453

Sunday, August 25, 2013

Two Years @ Enthought: A Review

Last week, I completed two full years working for Enthought. It has been a great time, working with some of the best minds and most wonderful people I have known. Two years is a long time to be at the same job, I hear. None of my flatmates is at the same place they were when I joined, and the same goes for two of my best friends. In fact, some friends keep asking me where I am, as if it's like changing clothes (hey, what color are you wearing today?). To come clean on this, let me clarify: I am loving my work here and do not see any change to the status quo in the foreseeable future. I'm well into my comfort zone.

I still vividly remember when we started off, right after finishing college, when Enthought India was still an idea. For a few weeks, we worked out of my graduate lab at IIT itself. Prabhu and I visited a bunch of places in search of office space, and Eric had asked us to get a good one. We chose VBC after a few other visits, simply because it would let us start immediately in a pre-furnished office, and it was a good office, no doubt. We started work there even before the agreement was officially signed, with Bill's visit. Puneeth also joined us physically on 5th September. I remembered him as the hard-core Free Software guy who bought an expensive "Adorable GNU" at an auction by Richard Stallman when he visited IIT Bombay. For a few months, we worked on a contract with Enthought USA, until Enthought India was officially founded in December. Since then we have seen significant growth in Enthought India, with more good company in the office.

In the last two years, we have hit several milestones (and probably missed more), and it has been a great journey of learning, doing and sharing. In summary:

The Good:

  • Working with some of the best people I have ever known, both intellectually and personally.
  • The lack of "Corporate Culture". I have only heard horror stories of "corporate culture" from friends, TV series and movies. Though secretly, I wish I could experience it once (for maybe a week or so), just for the feeling, and also so that I could empathize with my friends.
  • Meaningful work which is used by many scientists and engineers across the world, which we can be proud (and ashamed) of, and take responsibility for.
  • Culture of collaboration and working together, instead of delegating and taking credit.

The Bad:

  • Bugs: However you try, bugs are inevitable. They are the nightmare of every developer. The problem is not just the code you wrote, but also the libraries which you use. The former cause embarrassment, and the latter frustration.
  • Conflicting demands by users: Though you wish to support all users, it is simply not possible. You have to decide among conflicting ideas based on your insights, the work required for implementation, and sometimes who is paying more.
  • Time Management: There is always more to do in life than the time you have to do it, and prioritization is a hard problem :)

The Ugly:

  • Mumbai City: One of my friends from Delhi says he can identify Mumbai by its omnipresent stink. Mumbai is hyper-expensive for whatever it is that it offers (I can't think of getting a home in Mumbai without robbing a bank, or joining the government, i.e. robbing the people). Add to that the pollution of the city, which reduces average life expectancy by a minimum of 10 years, and the lack of any open spaces (we have to commute 13 km to the beach to play ultimate, and even that is unplayable during the monsoons because of the stink of the garbage and the injury risk). I still can't say I love Mumbai, despite staying here most of my life. What I'd prefer is staying by the woods, on a hill beside a lake or a river.
  • Once I had a funny dream about Prabhu taking a transfer from IIT Bombay to the newly set up IIT Gandhinagar (which is mentored by IIT Bombay) to set up an Aerospace Department there, and about us (Enthought India) moving the office to the SEZ in GIFT, Gujarat for the tax sops and the generally better administration, infrastructure and quality of life :)

Tuesday, July 2, 2013

QPlainTextEdit with setMaximumBlockCount performance regression

Recently, while working with the ipython qtconsole, I realized that it is easy to freeze the qtconsole, or any app embedding it, simply by doing a "for i in xrange(10000000): print i" loop; then all arguments about the safety of a separate kernel process etc. are voided. Since this is not something I liked, I set about fixing it in the true spirit of open-source development (scratching my own itch). In this post I'll describe some Qt issues, benchmarks, the approaches I took, and how you too can deal with a similar problem of overwhelming text input to your text buffer.
Note: The ipython pull request https://github.com/ipython/ipython/pull/3409 is part of this experiment.

The Problem:

The qtconsole widget is a QPlainTextEdit with the maximumBlockCount set to 500 by default, which is a very small buffer size by any standard. Despite this, when too much text is written to the QPlainTextEdit, it takes too much time drawing the text and the application appears to be frozen. This is because the Qt event loop is blocked: too much time is consumed drawing the QPlainTextEdit, while the incessant stream of new text arrives at a faster rate than the QPlainTextEdit can render it.

The Solution Approaches:

My first thought was to use a timer to throttle the stream rendering rate and append text to the widget only every 100 ms. That wouldn't cause any perceptible usability loss, but it would make the program more responsive. Another essential idea was to send only the last maximumBlockCount lines of text to the QPlainTextEdit. It turns out that QPlainTextEdit has very bad rendering performance when text is clipped by limiting maximumBlockCount, contrary to its single major use as a way to improve performance and memory usage. A sketch of the throttling idea follows.
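This is only a sketch of the throttling approach (not the actual ipython code from the pull request above), using PyQt4 class names: buffer the incoming text and flush it to the widget at most once every 100 ms.

from PyQt4.QtCore import QTimer
from PyQt4.QtGui import QPlainTextEdit


class ThrottledConsole(QPlainTextEdit):
    """Hypothetical widget illustrating timer-based output throttling."""

    def __init__(self, interval_ms=100, max_blocks=500, parent=None):
        super(ThrottledConsole, self).__init__(parent)
        self.setMaximumBlockCount(max_blocks)
        self._pending = []
        self._timer = QTimer(self)
        self._timer.setInterval(interval_ms)
        self._timer.timeout.connect(self._flush)
        self._timer.start()

    def append_stream(self, text):
        # Called for every chunk of stream output; just buffer it.
        self._pending.append(text)

    def _flush(self):
        if not self._pending:
            return
        text = "".join(self._pending)
        self._pending = []
        # Keep only the last maximumBlockCount lines; older ones would be
        # discarded by the widget anyway.
        lines = text.splitlines()[-self.maximumBlockCount():]
        self.appendPlainText("\n".join(lines))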

An initial look into the ipython code made it clear that the test code was delivering about 8000 lines per stream receive, which I was glad to clip to maximumBlockCount, coalescing multiple text streams into a single appendPlainText call every 100 ms. The qtconsole seemed very responsive, terminating the test print loop without any perceptible delay. All was well and I could go to sleep peacefully, except for a small glitch which I realized soon enough: due to a bug, my timer loop wasn't really doing anything. Every stream from the ipython kernel was being written to the widget. How, then, was the widget so responsive, an attentive reader might ask. This post attempts to provide an answer to that very question.

The following are plots of the time taken to append plain text to a QPlainTextEdit, using the code linked here. The x axis is the number of lines appended per call, and the different lines are for different maximumBlockCount values of the QPlainTextEdit:
  • Appending to an already full buffer
  • Clearing and then appending text
  • Appending text to an empty widget

In all the above cases, the final result is the same, because the number of lines appended is equal to maximumBlockCount, so that all previous content is gone.
As you can see for yourself, simply appending text to a full buffer is *very* expensive, so much so that it is almost an order of magnitude slower than clearing the text and then appending, for ipython's default case of maximumBlockCount = 500. All appends are fast until a line overflows the maximumBlockCount, from which point onwards it becomes very expensive to append any more content. I intend to modify the ipython pull request in view of this seemingly bizarre result and attempt to improve the performance further. Hopefully, this will obviate the need for a timer-based stream output loop and the related complexity. Ideally, someone should fix this right at the Qt level, but I do not yet feel confident enough to do it. Until that happens, this simple workaround should be good enough.
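For reference, the micro-benchmark is of this general shape (a sketch along the lines of the script linked above, not the exact code; it uses PyQt4 and times a single appendPlainText call):

from __future__ import print_function
import time

from PyQt4.QtGui import QApplication, QPlainTextEdit

app = QApplication([])


def time_append(max_blocks, n_lines, prefill=True):
    """Time a single appendPlainText() call of n_lines lines."""
    widget = QPlainTextEdit()
    widget.setMaximumBlockCount(max_blocks)
    if prefill:
        # Fill the buffer first, so the append has to discard old blocks.
        widget.appendPlainText("\n".join("x" * 80 for _ in range(max_blocks)))
    text = "\n".join("line %d" % i for i in range(n_lines))
    start = time.time()
    widget.appendPlainText(text)
    return time.time() - start


for max_blocks in (500, 1000, 5000):
    print(max_blocks, time_append(max_blocks, n_lines=max_blocks))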


PS: Comments, feedback and further insights welcome