Best python questions in October 2010

Why is printing to stdout so slow? Can it be sped up?

26 votes

I've always been amazed/frustrated with how long it takes to simply output to the terminal with a print statement. After some recent painfully slow logging I decided to look into it and was quite surprised to find that almost all the time spent is waiting for the terminal to process the results.

Can writing to stdout be sped up somehow?

I wrote a script ('print_timer.py' at the bottom of this question) to compare timing when writing 100k lines to stdout, to file, and with stdout redirected to /dev/null. Here is the timing result:

$ python print_timer.py
this is a test
this is a test
<snipped 99997 lines>
this is a test
-----
timing summary (100k lines each)
-----
print                         :11.950 s
write to file (+ fsync)       : 0.122 s
print with stdout = /dev/null : 0.050 s

Wow. To make sure python isn't doing something behind the scenes like recognizing that I reassigned stdout to /dev/null or something, I did the redirection outside the script...

$ python print_timer.py > /dev/null
-----
timing summary (100k lines each)
-----
print                         : 0.053 s
write to file (+fsync)        : 0.108 s
print with stdout = /dev/null : 0.045 s

So it isn't a python trick, it is just the terminal. I always knew dumping output to /dev/null sped things up, but never figured it was that significant!

It amazes me how slow the tty is. How can it be that writing to physical disk is WAY faster than writing to the "screen" (presumably an all-RAM op), and is effectively as fast as simply dumping to the garbage with /dev/null?

This link talks about how the terminal will block I/O so it can "parse [the input], update its frame buffer, communicate with the X server in order to scroll the window and so on"... but I don't fully get it. What can be taking so long?

I expect there is no way out (short of a faster tty implementation?) but figure I'd ask anyway.


UPDATE: after reading some comments I wondered how much impact my screen size actually has on the print time, and it does have some significance. The really slow numbers above are with my Gnome terminal blown up to 1920x1200. If I reduce it very small I get...

-----
timing summary (100k lines each)
-----
print                         : 2.920 s
write to file (+fsync)        : 0.121 s
print with stdout = /dev/null : 0.048 s

That is certainly better (~4x), but doesn't change my question. It only adds to my question as I don't understand why the terminal screen rendering should slow down an application writing to stdout. Why does my program need to wait for screen rendering to continue?

Are all terminal/tty apps not created equal? I have yet to experiment. It really seems to me like a terminal should be able to buffer all incoming data, parse/render it invisibly, and only render the most recent chunk that is visible in the current screen configuration at a sensible frame rate. So if I can write+fsync to disk in ~0.1 seconds, a terminal should be able to complete the same operation in something of that order (with maybe a few screen updates while it did it).

I'm still kind of hoping there is a tty setting that can be changed from the application side to make this behaviour better for the programmer. If this is strictly a terminal application issue, then maybe this doesn't even belong on StackOverflow?

What am I missing?


Here is the python program used to generate the timing:

import time, sys, tty
import os

lineCount = 100000
line = "this is a test"
summary = ""

cmd = "print"
startTime_s = time.time()
for x in range(lineCount):
    print line
t = time.time() - startTime_s
summary += "%-30s:%6.3f s\n" % (cmd, t)

#Add a newline to match line outputs above...
line += "\n"

cmd = "write to file (+fsync)"
fp = file("out.txt", "w")
startTime_s = time.time()
for x in range(lineCount):
    fp.write(line)
os.fsync(fp.fileno())
t = time.time() - startTime_s
summary += "%-30s:%6.3f s\n" % (cmd, t)

cmd = "print with stdout = /dev/null"
sys.stdout = file(os.devnull, "w")
startTime_s = time.time()
for x in range(lineCount):
    sys.stdout.write(line)  # write through the redirected stdout, not the file from the previous test
t = time.time() - startTime_s
summary += "%-30s:%6.3f s\n" % (cmd, t)

print >> sys.stderr, "-----"
print >> sys.stderr, "timing summary (100k lines each)"
print >> sys.stderr, "-----"
print >> sys.stderr, summary

Thanks for all the comments! I've ended up answering it myself with your help. It feels dirty answering your own question, though.

Question 1: Why is printing to stdout slow?

Answer: Printing to stdout is not inherently slow. It is the terminal you work with that is slow. And it has pretty much zero to do with I/O buffering on the application side (eg: python file buffering). See below.

Question 2: Can it be sped up?

Answer: Yes it can, but seemingly not from the program side (the side doing the 'printing' to stdout). To speed it up, use a different, faster terminal emulator.

Explanation...

I tried a self-described 'lightweight' terminal program called wterm and got significantly better results. Below is the output of my test script (at the bottom of the question) when running in wterm at 1920x1200 on the same system where the basic print option took 12s using gnome-terminal:

-----
timing summary (100k lines each)
-----
print                         : 0.261 s
write to file (+fsync)        : 0.110 s
print with stdout = /dev/null : 0.050 s

0.26s is MUCH better than 12s! I don't know whether wterm is more intelligent about how it renders to screen along the lines of how I was suggesting (render the 'visible' tail at a reasonable frame rate), or whether it just "does less" than gnome-terminal. For the purposes of my question I've got the answer, though. gnome-terminal is slow.

So - If you have a long running script that you feel is slow and it spews massive amounts of text to stdout... try a different terminal and see if it is any better!

Note that I pretty much randomly pulled wterm from the ubuntu/debian repositories. This link might be the same terminal, but I'm not sure. I did not test any other terminal emulators.


Update: Because I had to scratch the itch, I tested a whole pile of other terminal emulators with the same script and full screen (1920x1200). My manually collected stats are here:

wterm           0.3s
aterm           0.3s
rxvt            0.3s
mrxvt           0.4s
konsole         0.6s
yakuake         0.7s
lxterminal        7s
xterm             9s
gnome-terminal   12s
xfce4-terminal   12s
vala-terminal    18s
xvt              48s

The recorded times are manually collected, but they were pretty consistent. I recorded the best(ish) value. YMMV, obviously.

As a bonus, it was an interesting tour of some of the various terminal emulators available out there! I'm amazed my first 'alternate' test turned out to be the best of the bunch.

Why do people say that Java is more scalable than python?

16 votes

I've seen this argument in a few places, and recently I saw it again in a reddit post. This is by no means a flame against either of these two languages. I am just puzzled why Python has this bad reputation of not being scalable.
I'm a Python guy and now I'm getting started with Java, and I just want to understand what makes Java so scalable and whether the Python setup that I have in mind is a good way to scale large Python apps.

Now back to my idea of scaling a Python app. Let's say you code it using Django. Django runs its apps in fastcgi mode. So what if you have a front Nginx server and behind it as many other servers as needed, each running your Django app in fastcgi mode? The front Nginx server will then load balance between your backend Django fastcgi servers. Django also supports multiple databases, so you could write to one master DB and then read from many slaves, again for load balancing. Throw a memcached server into this mix and there you go, you have scalability. Don't you?
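The "write to one master, read from many slaves" part can be sketched as a Django database router. This is only an illustration: the class name and the database aliases ('default', 'replica1', 'replica2') are hypothetical and would have to match whatever is configured in settings.DATABASES, with the router registered in the DATABASE_ROUTERS setting. The router itself is a plain class, so the sketch runs without Django installed.

```python
import random


class PrimaryReplicaRouter(object):
    """Hypothetical router: writes go to one database, reads are spread out."""

    # Assumed aliases; they must exist in settings.DATABASES.
    write_alias = "default"
    read_aliases = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        # Pick any read database at random for simple load balancing.
        return random.choice(self.read_aliases)

    def db_for_write(self, model, **hints):
        # All writes go to the single master database.
        return self.write_alias

    def allow_relation(self, obj1, obj2, **hints):
        # All databases hold the same data, so relations are always fine.
        return True
```

Random choice is the simplest balancing policy; replication lag between the master and the read databases is not handled here.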

Is this a viable setup? What does Java do better? How do you scale a Java app?

Scalability is a very overloaded term these days. The comments probably refer to in-process vertical scalability.

Python has a global interpreter lock (GIL) that severely limits its ability to scale up to many threads. It releases the lock when calling native code (reacquiring it when the native code returns), but this still requires careful design when trying to write scalable software in Python.

What does raise in Python raise?

15 votes

Consider the following code:

try:
    raise Exception("a")
except:
    try:
        raise Exception("b")
    finally:
        raise

This will raise Exception: a. I expected it to raise Exception: b (need I explain why?). Why does the final raise raise the original exception rather than (what I thought) was the last exception raised?

On python2.6

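I guess you are expecting the bare raise in the finally block to re-raise exception "b". But a bare raise re-raises the exception that is currently being handled, and at that point "b" has not been caught by any except clause. The exception being handled is still "a", from the outer except block.

If you add an except block to the inner try block, then the raise in the finally block will raise exception B.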

try:
  raise Exception("a")
except:
  try:
    raise Exception("b")
  except:
    pass
  finally:
    raise

Output:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    raise Exception("b")
Exception: b

Another variation that explains what's happening here:

try:
  raise Exception("a")
except:
  try:
    raise Exception("b")
  except:
    raise

Output:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    raise Exception("b")
Exception: b

If you see here, replacing the finally block with except does raise the exception B.

Python: Which encoding is used for processing sys.argv?

14 votes

What encoding are the elements of sys.argv in, in Python? Are they encoded with the sys.getdefaultencoding() encoding?

sys.getdefaultencoding(): Return the name of the current default string encoding used by the Unicode implementation.

PS: As pointed out in some of the answers, sys.stdin.encoding would indeed be a better guess. I would love to see a definitive answer to this question, though, with pointers to solid sources!

PPS: As Wim pointed out, Python 3 solves this issue by putting str objects in sys.argv (if I understand correctly). The question remains open for Python 2.x, though. Under Unix, the LC_CTYPE environment variable seems to be the correct thing to check, no? What should be done with Windows (so that sys.argv elements are correctly interpreted whatever the console)?
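For the Unix case, a minimal sketch of the LC_CTYPE idea could look like the following. It assumes the arguments arrive as byte strings (as in Python 2) and that they were typed in the locale's preferred encoding; decode_args is a hypothetical helper name, and none of this addresses the Windows problem.

```python
import locale


def decode_args(raw_args, encoding=None):
    """Decode byte-string arguments to text (hypothetical helper).

    Assumption: the bytes are in the locale's preferred encoding, which on
    Unix is derived from LC_CTYPE once the locale has been set up.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    # "replace" avoids an exception if the guess about the encoding is wrong.
    return [arg.decode(encoding, "replace") for arg in raw_args]
```

Passing the encoding explicitly, for example decode_args(args, "utf-8"), sidesteps the guess when the caller knows better.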

"What should be done with Windows (so that sys.argv elements are correctly interpreted whatever the console)?"

For Python 2.x, see this comment on issue2128.

(Note that no encoding is correct for the original sys.argv, because some characters may have been mangled in ways that there is not enough information to undo; for example, if the ANSI codepage cannot represent Greek alpha then it will be mangled to 'a'.)

How to use virtualenv with Google App Engine SDK on Mac OS X 10.6

12 votes

I am pulling my hair out trying to figure this out because I had it working until last week and somehow it broke.

When I set up a virtualenv for a Google App Engine app and start the app with dev_appserver.py, I get errors importing the standard library (like "ImportError: No module named base64").

Here's what I'm doing:

(Using the system Python)

virtualenv --python=python2.5 --no-site-packages ~/.virtualenv/foobar

Then I add a gae.pth file to ~/.virtualenv/foobar/lib/python2.5/site-packages/ containing the Google App Engine libraries:

/usr/local/google_appengine
/usr/local/google_appengine/lib/antlr3
/usr/local/google_appengine/lib/cacerts
/usr/local/google_appengine/lib/django
/usr/local/google_appengine/lib/fancy_urllib
/usr/local/google_appengine/lib/ipaddr
/usr/local/google_appengine/lib/webob
/usr/local/google_appengine/lib/yaml/lib

(That's based on this answer.)

Then I source my "foobar" virtualenv and try to start my app with dev_appserver.py.

The server starts but the first request errors out with the aforementioned "ImportError: No module named base64". If I visit the admin console I get "ImportError: No module named cgi".

If I start up python, I can load these modules.

>>> import base64
>>> base64.__file__
'/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/base64.py'

It seems that the SDK's sandboxing is preventing these libraries from getting loaded. But like I said, I had this working until last week...something changed or I inadvertently broke my virtualenv and I can't figure out how I got it working in the first place.

Software versions:

Google App Engine SDK 1.3.7
Mac OS X Snow Leopard 10.6.4
virtualenv 1.5.1

Update: In response to Alan Franzoni's questions:

I am using the system Python that came with Mac OS X. I installed virtualenv via easy_install. I upgraded to virtualenv 1.5.1 today to try to fix the problem.

If I run python /usr/local/bin/dev_appserver.py with the virtualenv python, the problem persists. If I deactivate the virtualenv and run that command with the system python2.5, it works. (Also, I can use the GoogleAppEngineLauncher to start my app.)

Here is a full stack trace (this one uses the Kay framework, but the problem is the same with webapp):

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3206, in _HandleRequest
    self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3149, in _Dispatch
    base_env_dict=env_dict)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 525, in Dispatch
    base_env_dict=base_env_dict)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2402, in Dispatch
    self._module_dict)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2312, in ExecuteCGI
    reset_modules = exec_script(handler_path, cgi_path, hook)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2208, in ExecuteOrImportScript
    exec module_code in script_module.__dict__
  File "/Users/look/myapp/kay/main.py", line 17, in <module>
    kay.setup()
  File "/Users/look/myapp/kay/__init__.py", line 122, in setup
    from google.appengine.ext import db
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 1287, in Decorate
    return func(self, *args, **kwargs)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 1937, in load_module
    return self.FindAndLoadModule(submodule, fullname, search_path)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 1287, in Decorate
    return func(self, *args, **kwargs)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 1839, in FindAndLoadModule
    description)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 1287, in Decorate
    return func(self, *args, **kwargs)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 1790, in LoadModuleRestricted
    description)
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/db/__init__.py", line 81, in <module>
    import base64
ImportError: No module named base64

This is issue 4339 in the GAE SDK. It's confirmed, and there are two slightly different patches available in the bug entry that make it work.

What happens is that dev_appserver.py sets up a restricted Python environment by disallowing access to any non-system-Python modules, and it does that by calculating the system Python folder from the location of the os module. In a virtualenv, os.py gets symlinked into the virtualenv, but its compiled bytecode is written straight into the virtualenv, and this is the path that dev_appserver uses, effectively blocking access to any module from the system Python library that is not linked by virtualenv, which is most of them. The solution is to "bless" both paths.
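The idea of "blessing both paths" can be illustrated with a short sketch. This is not the actual patch from the bug entry; stdlib_dirs is a hypothetical helper that only shows how the directory seen through the os module can differ from the directory the symlink really points to.

```python
import os


def stdlib_dirs():
    """Return the directories that may hold the standard library.

    'seen' is where the os module appears to live (inside a virtualenv this
    can be the virtualenv's lib directory); 'real' follows symlinks back to
    the system Python. Both are returned so both can be allowed.
    """
    seen = os.path.dirname(os.path.abspath(os.__file__))
    real = os.path.dirname(os.path.realpath(os.__file__))
    return sorted(set([seen, real]))
```

Outside a virtualenv the two directories are usually the same, so the result has a single entry.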

Python's list comprehension vs .NET LINQ

12 votes

The following simple LINQ code

string[] words = { "hello", "wonderful", "linq", "beautiful", "world" };

// Get only short words
var shortWords =
  from word in words
  where word.Length <= 5
  select word;

// Print each word out
shortWords.Dump();

can be translated into python using list comprehension as follows.

words = ["hello", "wonderful", "linq", "beautiful", "world"]
shortWords = [x for x in words if len(x) <=5]
print shortWords

  • Is LINQ just another idea to implement list comprehension?
  • What are examples of things that LINQ can do but list comprehensions can't?

(Warning: Mammoth answer ahead. The part up to the first horizontal line makes a good tl;dr section, I suppose)

I'm not sure if I qualify as a Python guru... but I have a solid grasp on iteration in Python, so let's try :)

First off: Afaik, LINQ queries are executed lazily - if that's the case, generator expressions are a closer Python concept (either way, list-, dict- and set comprehensions are conceptually just generator expressions fed to the list/dict/set constructor!).

Also, there is a conceptual difference: LINQ is for, as the name says, querying data structures. List-/dict-/set comprehensions are one possible application of this (e.g. filtering and projecting the items of a list). So they are in fact less general (as we will see, many things built into LINQ are not built into them). Likewise, generator expressions are a way to formulate a one-time forward iterator in-place (I like to think of it as lambda for generator functions, only without an ugly, long keyword ;) ) and not a way to describe a complex query. They overlap, yes, but they are not identical. If you want all the power of LINQ in Python, you will have to write a fully-fledged generator. Or combine the numerous powerful generators built-in and in itertools.


Now, Python counterparts for the LINQ capabilities Jon Skeet named:

  • Projections: (x.foo for ...)

  • Filtering: (... if x.bar > 5)

  • Joins (x join y on x.foo equals y.bar)

The closest thing would be ((x_item, next(y_item for y_item in y if x_item.foo == y_item.bar)) for x_item in x), I suppose.

Note that this will not iterate over the whole y for each x_item, it will only get the first match.
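To make the difference concrete, here is a small sketch with made-up record types (Customer and Order are hypothetical, as are both function names). One function pairs every x with every matching y, the other keeps only the first match. Note that next() without a default raises StopIteration when nothing matches, so the sketch passes None as a default.

```python
from collections import namedtuple

# Hypothetical record types, only for illustration.
Customer = namedtuple("Customer", "foo name")
Order = namedtuple("Order", "bar item")


def inner_join(xs, ys):
    """Pair every x with every y where x.foo == y.bar."""
    ys = list(ys)  # ys is scanned once per x, so it must be re-iterable
    return [(x, y) for x in xs for y in ys if x.foo == y.bar]


def first_match_join(xs, ys):
    """Pair every x with its first matching y, or None if there is none."""
    ys = list(ys)
    return [(x, next((y for y in ys if x.foo == y.bar), None)) for x in xs]
```

Both versions scan y once per item of x. For large inputs, indexing y by bar in a dict first would avoid the repeated scans.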

  • Group joins (x join y on x.foo equals y.bar into g)

This is harder. Python doesn't have anonymous types, though they are trivial to do yourself if you don't mind messing with __dict__:

class Anonymous(object):
    def __init__(self, **kwargs):
        self.__dict__ = kwargs

Then, we could do (Anonymous(x=x, y=y) for ...) to get objects that have x and y members with the respective values. The right thing is usually feeding the results to the constructor of an appropriate class, say, XY.

  • Grouping (group x.foo by x.bar)

Now it gets hairy... there is no built-in way, afaik. But we can define it ourselves if we need it:

from collections import defaultdict

def group_by(iterable, group_func):
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups

Example:

>>> from operator import attrgetter
>>> group_by((x.foo for x in ...), attrgetter('bar'))
defaultdict(<class 'list'>, {some_value_of_bar: [x.foo of all x where x.bar == some_value_of_bar], some_other_value_of_bar: [...], ...})

This requires whatever we group by to be hashable, though. It's possible to avoid this, and I'll make a stab if there is public demand. But for now, I'm being lazy :)

We can also just return an iterable of groups without the values we grouped by, by calling .values() on the result (of course we can feed that to list to get something we can index and iterate several times). But who knows if we won't need the group values...

  • Ordering (orderby x.foo ascending, y.bar descending)

Sorting needs special syntax? The built-in sorted works for iterables, too: sorted(x % 2 for x in range(10)) or sorted((x for x in xs), key=attrgetter('foo')). Note that a generator expression needs its own parentheses when it is not the only argument. Sorting is ascending by default; the keyword argument reverse gives descending order.

Alas, afaik sorting by multiple attributes is not that easy, especially when mixing ascending and descending. Hmm... topic for a recipe?
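One possible recipe relies on the fact that Python's sort is stable: sort once per key, starting with the least significant key, and give each pass its own direction. The helper name order_by and the Row type below are hypothetical.

```python
from collections import namedtuple
from operator import attrgetter

# Hypothetical record type, only for illustration.
Row = namedtuple("Row", "foo bar")


def order_by(rows, *keys):
    """Sort by several attributes with mixed directions.

    keys are (attribute_name, descending) pairs, most significant first.
    Because list.sort is stable, sorting by the least significant key first
    and the most significant key last gives the combined ordering.
    """
    result = list(rows)
    for name, descending in reversed(keys):
        result.sort(key=attrgetter(name), reverse=descending)
    return result
```

For example, order_by(rows, ('foo', False), ('bar', True)) corresponds to orderby foo ascending, bar descending.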

  • Intermediate variables (let tmp = x.foo)

No, not possible in comprehensions or generator expressions - they are, as the name says, supposed to be expressions (and usually only span one or two lines). It's perfectly possible in a generator function, though:

(x * 2 for x in iterable)

rewritten as generator with intermediate variable:

def doubles(iterable):
    for x in iterable:
        times2 = x * 2
        yield times2

  • Flattening: (c for s in ("aa","bb") for c in s)


Note that although LINQ to Objects deals with delegates, other query providers (e.g. LINQ to SQL) can deal in expression trees which describe the query instead of just presenting executable delegates. This allows the query to be translated into SQL (or other query languages) - again, I don't know whether Python supports that sort of thing or not. It's a significant part of LINQ though.

Python definitely does no such thing. List comprehensions correspond one-to-one to accumulating a plain list in a (possibly nested) for-loop, and generator expressions correspond one-to-one to a generator. Given the parser and ast modules, it would be possible in theory to write a library for converting a comprehension into e.g. an SQL query. But nobody cares to.

Differences Between Python and C++ Constructors

12 votes

I've been learning more about Python recently, and as I was going through the excellent Dive into Python the author noted here that the __init__ method is not technically a constructor, even though it generally functions like one.

I have two questions:

  1. What are the differences between how C++ constructs an object and how Python "constructs" an object?

  2. What makes a constructor a constructor, and how does the __init__ method fail to meet these criteria?

The distinction that the author draws is that, as far as the Python language is concerned, you have a valid object of the specified type before you even enter __init__. Therefore it's not a "constructor", since in C++ and theoretically, a constructor turns an invalid, pre-constructed object into a "proper" completed object of the type.

Basically __new__ in Python is defined to return "the new object instance", whereas C++ new operators just return some memory, which is not yet an instance of any class.

However, __init__ in Python is probably where you first establish some important class invariants (what attributes it has, just for starters). So as far as the users of your class are concerned, it might as well be a constructor. It's just that the Python runtime doesn't care about any of those invariants. If you like, it has very low standards for what constitutes a constructed object.
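A small sketch of that point: by the time __init__ runs, __new__ has already produced a genuine instance of the class, just one without any attributes. The Point class and the seen_before_init flag are made up for this illustration.

```python
class Point(object):
    """Hypothetical class that records what the object looks like before __init__."""

    def __new__(cls, *args, **kwargs):
        instance = super(Point, cls).__new__(cls)
        # Already a real Point as far as Python is concerned,
        # but none of the invariants set up by __init__ hold yet.
        cls.seen_before_init = isinstance(instance, cls) and not hasattr(instance, "x")
        return instance

    def __init__(self, x, y):
        # This is where the class invariants are first established.
        self.x = x
        self.y = y
```

After p = Point(1, 2), Point.seen_before_init is True, which shows that a valid but uninitialized Point existed before __init__ was entered.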

I think the author makes a fair point, and it's certainly an interesting remark on the way that Python creates objects. It's quite a fine distinction, though and I doubt that calling __init__ a constructor will ever result in broken code.

Also, I note that the Python documentation refers to __init__ as a constructor (http://docs.python.org/release/2.5.2/ref/customization.html)

As a special constraint on constructors, no value may be returned

... so if there are any practical problems with thinking of __init__ as a constructor, then Python is in trouble!

The ways that Python and C++ construct objects have some similarities. Both call a function with a relatively simple responsibility (__new__ for an object instance vs some version of operator new for raw memory), then both call a function which has the opportunity to do more work to initialize the object into a useful state (__init__ vs a constructor).

Practical differences include:

  • in C++, no-arg constructors for base classes are called automatically in the appropriate order if necessary, whereas for __init__ in Python, you have to explicitly init your base in your own __init__. Even in C++, you have to specify the base class constructor if it has arguments.

  • in C++, you have a whole mechanism for what happens when a constructor throws an exception, in terms of calling destructors for sub-objects that have already been constructed. In Python I think the runtime (at most) calls __del__.

Then there's also the difference that __new__ doesn't just allocate memory, it has to return an actual object instance. Then again, raw memory isn't really a concept that applies to Python code.
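The first practical difference is easy to demonstrate. In this sketch (all class names are hypothetical), one subclass forgets to call the base __init__ and silently ends up without the base attribute, while the other calls it explicitly.

```python
class Base(object):
    def __init__(self):
        self.base_ready = True


class Forgetful(Base):
    def __init__(self):
        # Base.__init__ is NOT called automatically, unlike a C++ base constructor.
        self.own = 1


class Careful(Base):
    def __init__(self):
        # The base class has to be initialized explicitly.
        super(Careful, self).__init__()
        self.own = 1
```

Forgetful() is still a valid object and an instance of Base, it just lacks base_ready; Python does not complain until something tries to read that attribute.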

First programming language to be taught - C or Python?

11 votes

I know that there is a long debate regarding this matter. I also understand that this is not strictly a programming question. But I am asking here as this platform contains a wide range of experts from different realms.

When we were admitted to a Computer Science and Engineering (CSE) course at university, we were first taught C. The course was actually on structured programming, but we used C as the language. In the next semester we were taught C++ and Java for OOP. Recently I have heard that the department is going to introduce Python as the first language. I strongly oppose the idea for the following reasons:

  1. Python is a very high-level language. In the first course the students should become familiar with the basics of programming concepts like data types, pointers, passing by value or by reference, etc. You can write lots of things in Python without understanding these in detail.

  2. Python has a wide range of built-in data structures and libraries. In a first language students should become familiar with basic algorithms like sorting or searching. I know there is a sorting library in C too, but that is not as widely used as Python's sorting methods.

  3. Python is OOP. How can you teach someone OOP when (s)he does not have the basic knowledge of structured programming? If Python is the first language, then they might not distinguish OOP from non-OOP concepts.

  4. Memory is crucial. If you allocate, then you need to release the memory. These concepts are not necessary with a language with garbage collector.

So what is your opinion? What do you prefer as the first teaching language?

Please don't start a flamewar or something similar. Whatever you suggest, please explain why you think so. And also please keep in mind that the course is at university level. It's not for kids, so trying to make things simple is not much help.

And also I know that Python is a great language. I am personally a fan of it. But the question is whether Python should be first teaching language instead of C.

Thanks in advance.

EDIT :

  1. When I asked this, I was not aware of programmers.stackexchange.com. It can be moved there if that is better.

  2. The question contains my opinion. That does not mean I don't wanna hear others. In fact that is exactly what I want. Please don't get me wrong. I am not designing the curriculum. So my opinion has no effect on it. The thing is I think this and this, and I want to hear what others think.

  3. I am well aware that this is not a question in that sense. My first para tells that.

Bottom-up learning is often considered the "better" way to learn. Start from first principles and make your way up to more advanced ideas. The problem with this approach is that much of what we humans learn in life doesn't follow that model at all. Children, in fact, are the fastest learners and they do so by pattern-matching, extrapolation, interpolation, etc., all of which would be thoroughly frowned upon by anyone promoting the classical bottom up system. And yet somehow they run rings around adults in learning, say, the language in a new country, not by reading the bottom-up text books faster than the grown-ups, but by talking to other kids.

I don't know whether Python is the best language to use, but I do believe that any language that gets people writing code and solving interesting problems quickly can't be too bad a choice.

Remove empty strings from a list of strings

10 votes

I want to remove all empty strings from a list of strings in python.

My idea looks like this:

while '' in str_list:
    str_list.remove('')

Is there any more pythonic way to do this?

I would use filter:

str_list = filter(None, str_list) # fastest
str_list = filter(bool, str_list) # as fast as filter(None, ...)
str_list = filter(len, str_list)  # a bit slower
str_list = filter(lambda item: item, str_list) # slower than a list comprehension

Tests:

>>> timeit('filter(None, str_list)', 'str_list=["a"]*1000', number=100000)
2.4797441959381104
>>> timeit('filter(bool, str_list)', 'str_list=["a"]*1000', number=100000)
2.4788150787353516
>>> timeit('filter(len, str_list)', 'str_list=["a"]*1000', number=100000)
5.2126238346099854
>>> timeit('[x for x in str_list if x]', 'str_list=["a"]*1000', number=100000)
13.354584932327271
>>> timeit('filter(lambda item: item, str_list)', 'str_list=["a"]*1000', number=100000)
17.427681922912598
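The timings above are from Python 2, where filter returns a list. In Python 3 filter returns a lazy iterator, so a sketch of the same two approaches there needs an explicit list() call. The function names are made up for this example.

```python
def drop_empty(strings):
    """Keep only the non-empty strings, using a list comprehension."""
    return [s for s in strings if s]


def drop_empty_filter(strings):
    """Same result with filter; list() is needed because Python 3's filter is lazy."""
    # filter(None, ...) drops every falsy item, which for strings means ''.
    return list(filter(None, strings))
```

Both return a new list and leave the input untouched, unlike the while/remove loop in the question, which modifies str_list in place.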

Converting a python numeric expression to LaTeX.

10 votes

I need to convert strings with valid python syntax such as:

'1+2**(x+y)'

and get the equivalent LaTeX:

$1+2^{x+y}$

I have tried sympy's latex function, but it processes an actual expression rather than the string form of it:

>>> latex(1+2**(x+y))
'$1 + 2^{x + y}$'
>>> latex('1+2**(x+y)')
'$1+2**(x+y)$'

but to even do this, it requires x and y to be declared as type "symbols".

I want something more straightforward, preferably doable with the parser from the compiler module.

>>> compiler.parse('1+2**(x+y)')
Module(None, Stmt([Discard(Add((Const(1), Power((Const(2), Add((Name('x'), Name('y'))))))))]))

Last but not least, the why: I need to generate those LaTeX snippets so that I can show them in a webpage with mathjax.

Here's a rather long but still incomplete method that doesn't involve sympy in any way. It's enough to cover the example of (-b-sqrt(b**2-4*a*c))/(2*a), which gets translated to \frac{- b - \sqrt{b^{2} - 4 \; a \; c}}{2 \; a}.

It basically creates the AST and walks it, producing the LaTeX math that corresponds to the AST nodes. What's there should give enough of an idea of how to extend it in the places where it's lacking.
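As a minimal illustration of the same idea, here is a much smaller sketch that handles only names, constants, + and **, which is just enough for the '1+2**(x+y)' example from the question. It assumes Python 3.8 or later (for ast.Constant), and the function names to_latex and python_to_latex are made up for this example.

```python
import ast


def to_latex(node):
    """Translate a tiny subset of Python expression nodes to LaTeX."""
    if isinstance(node, ast.Expression):
        return to_latex(node.body)
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.BinOp):
        left = to_latex(node.left)
        right = to_latex(node.right)
        if isinstance(node.op, ast.Pow):
            # A compound base needs parentheses; the exponent is grouped by {}.
            if isinstance(node.left, ast.BinOp):
                left = r"\left(%s\right)" % left
            return "%s^{%s}" % (left, right)
        if isinstance(node.op, ast.Add):
            return "%s+%s" % (left, right)
    # Everything else is deliberately left out of this sketch.
    raise NotImplementedError(type(node).__name__)


def python_to_latex(expression):
    """Parse a Python expression string and wrap the LaTeX in $...$."""
    return "$%s$" % to_latex(ast.parse(expression, mode="eval"))
```

The listing that follows generalises this with a visitor class and a precedence value per node type, so that parentheses are only added where they are needed.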


import ast

class LatexVisitor(ast.NodeVisitor):

    def prec(self, n):
        return getattr(self, 'prec_'+n.__class__.__name__, getattr(self, 'generic_prec'))(n)

    def visit_Call(self, n):
        func = self.visit(n.func)
        args = ', '.join(map(self.visit, n.args))
        if func == 'sqrt':
            return r'\sqrt{%s}' % args
        else:
            return r'\operatorname{%s}\left(%s\right)' % (func, args)

    def prec_Call(self, n):
        return 1000

    def visit_Name(self, n):
        return n.id

    def prec_Name(self, n):
        return 1000

    def visit_UnaryOp(self, n):
        if self.prec(n.op) > self.prec(n.operand):
            return r'%s \left(%s\right)' % (self.visit(n.op), self.visit(n.operand))
        else:
            return r'%s %s' % (self.visit(n.op), self.visit(n.operand))

    def prec_UnaryOp(self, n):
        return self.prec(n.op)

    def visit_BinOp(self, n):
        if self.prec(n.op) > self.prec(n.left):
            left = r'\left(%s\right)' % self.visit(n.left)
        else:
            left = self.visit(n.left)
        if self.prec(n.op) > self.prec(n.right):
            right = r'\left(%s\right)' % self.visit(n.right)
        else:
            right = self.visit(n.right)
        if isinstance(n.op, ast.Div):
            return r'\frac{%s}{%s}' % (self.visit(n.left), self.visit(n.right))
        elif isinstance(n.op, ast.FloorDiv):
            return r'\left\lfloor\frac{%s}{%s}\right\rfloor' % (self.visit(n.left), self.visit(n.right))
        elif isinstance(n.op, ast.Pow):
            return r'%s^{%s}' % (left, self.visit(n.right))
        else:
            return r'%s %s %s' % (left, self.visit(n.op), right)

    def prec_BinOp(self, n):
        return self.prec(n.op)

    def visit_Sub(self, n):
        return '-'

    def prec_Sub(self, n):
        return 300

    def visit_Add(self, n):
        return '+'

    def prec_Add(self, n):
        return 300

    def visit_Mult(self, n):
        return '\\;'

    def prec_Mult(self, n):
        return 400

    def visit_Mod(self, n):
        return '\\bmod'

    def prec_Mod(self, n):
        return 500

    def prec_Pow(self, n):
        return 700

    def prec_Div(self, n):
        return 400

    def prec_FloorDiv(self, n):
        return 400

    def visit_LShift(self, n):
        return '\\operatorname{shiftLeft}'

    def visit_RShift(self, n):
        return '\\operatorname{shiftRight}'

    def visit_BitOr(self, n):
        return '\\operatorname{or}'

    def visit_BitXor(self, n):
        return '\\operatorname{xor}'

    def visit_BitAnd(self, n):
        return '\\operatorname{and}'

    def visit_Invert(self, n):
        return '\\operatorname{invert}'

    def prec_Invert(self, n):
        return 800

    def visit_Not(self, n):
        return '\\neg'

    def prec_Not(self, n):
        return 800

    def visit_UAdd(self, n):
        return '+'

    def prec_UAdd(self, n):
        return 800

    def visit_USub(self, n):
        return '-'

    def prec_USub(self, n):
        return 800

    def visit_Num(self, n):
        return str(n.n)

    def prec_Num(self, n):
        return 1000

    def generic_visit(self, n):
        if isinstance(n, ast.AST):
            # The original format string was lost; this plain "Name(fields)" fallback is an assumption.
            return '%s(%s)' % (n.__class__.__name__, ', '.join(map(self.visit, [getattr(n, f) for f in n._fields])))
        else:
            return str(n)

    def generic_prec(self, n):
        return 0

def py2tex(expr):
    pt = ast.parse(expr)
    return LatexVisitor().visit(pt.body[0].value)
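For readers on a current Python 3, the same idea can be sketched in a much smaller form. This is only an illustration, not the answer's code: TinyLatex and tiny_py2tex are made-up names, it assumes number literals parse as ast.Constant (true on recent Python 3), and it handles only names, numbers, + and **. It also ignores operator precedence, which is exactly what the prec_* methods in the longer version take care of.

```python
import ast


class TinyLatex(ast.NodeVisitor):
    """Hypothetical minimal visitor: names, numeric constants, + and ** only."""

    def visit_Name(self, node):
        return node.id

    def visit_Constant(self, node):
        return str(node.value)

    def visit_BinOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        if isinstance(node.op, ast.Pow):
            # Braces group the exponent, so no parentheses are needed there.
            return "%s^{%s}" % (left, right)
        if isinstance(node.op, ast.Add):
            return "%s+%s" % (left, right)
        raise NotImplementedError(type(node.op).__name__)


def tiny_py2tex(expr):
    # mode="eval" parses a single expression; .body is its root node.
    tree = ast.parse(expr, mode="eval")
    return "$%s$" % TinyLatex().visit(tree.body)


print(tiny_py2tex("1+2**(x+y)"))  # $1+2^{x+y}$
```

Anything outside those four node types raises NotImplementedError or falls through to the default generic_visit, so this is a starting point rather than a replacement for the class above.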

In Python, why do we need readlines() when we can iterate over the file handle itself?

10 votes

In Python, after

fh = open('file.txt')

one may do the following to iterate over lines:

for l in fh:
    pass

Then why do we have fh.readlines()?

I would imagine that it's from before files were iterators and is maintained for backwards compatibility. Even for a one-liner, it's fairly redundant [1], as list(fh) will do the same thing in a more intuitive way. That also gives you the freedom to do set(fh), tuple(fh), etc.

[1] See gnibbler's answer.
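A quick way to convince yourself of the equivalence is the sketch below. It uses io.StringIO as a stand-in for a real file opened with open('file.txt'), which is an assumption made only so the snippet runs without touching the disk.

```python
import io

# io.StringIO stands in for a real file handle here (assumption for the demo).
text = "first\nsecond\nthird\n"

lines_from_readlines = io.StringIO(text).readlines()
lines_from_list = list(io.StringIO(text))
lines_from_loop = [line for line in io.StringIO(text)]

# All three approaches yield the same list, newlines included.
print(lines_from_readlines == lines_from_list == lines_from_loop)  # True

# Iterating also feeds other containers directly.
unique_lines = set(io.StringIO(text))
print(len(unique_lines))  # 3
```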

How is CPython's set() implemented?

10 votes

I've seen people say that set objects in python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure does it use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I'll go with the closest answer to my original question. Thanks all for the info!

According to this thread:

Indeed, CPython's sets are implemented as something like dictionaries with dummy values (the keys being the members of the set), with some optimization(s) that exploit this lack of values

So basically a set uses a hashtable as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hashtable is an O(1) operation on average.

If you are so inclined, you can even browse the CPython source code for set which, according to Achim Domma is mostly a cut-and-paste from the dict implementation.
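To make the hashtable idea concrete, here is a toy sketch. It is not how CPython does it: the real set is written in C and uses open addressing with resizing, while this hypothetical ToyHashSet uses a fixed number of chained buckets. It only illustrates why a membership test does not have to scan every element, and why members must be hashable.

```python
class ToyHashSet:
    """Toy chained hash table (illustration only, not CPython's layout)."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, item):
        # hash() selects one bucket, so only that bucket is searched.
        return self._buckets[hash(item) % len(self._buckets)]

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:
            bucket.append(item)

    def __contains__(self, item):
        return item in self._bucket(item)


s = ToyHashSet()
for word in ("spam", "eggs", "ham"):
    s.add(word)

print("eggs" in s)   # True
print("toast" in s)  # False

# One implication of hashing: unhashable objects cannot be members.
try:
    set([[1, 2]])
except TypeError as exc:
    print("TypeError:", exc)
```

Because the bucket count never grows here, lookups slow down as more items are added; a real implementation resizes the table to keep the average cost constant.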

How to implement "autoincrement" on Google AppEngine

10 votes

I have to label things in a strictly monotonically increasing fashion, be it invoice numbers, shipping label numbers or the like.

  1. A number MUST NOT BE used twice
  2. Every number SHOULD BE used exactly when all smaller numbers have been used (no holes).

Fancy way of saying: I need to count 1, 2, 3, 4 ... The number space I have available is typically 100,000 numbers, and I need perhaps 1,000 a day.

I know this is a hard problem in distributed systems, and often we are much better off with GUIDs. But in this case, for legal reasons, I need "traditional numbering".

Can this be implemented on Google AppEngine (preferably in Python)?

If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.
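The shape of that approach can be sketched without the App Engine SDK. In the snippet below a threading.Lock stands in for the datastore transaction and an in-memory attribute stands in for the single counter entity; both are assumptions for illustration, and InvoiceCounter is a made-up name. On App Engine the read-increment-write step would run inside a datastore transaction instead.

```python
import threading


class InvoiceCounter:
    """Single shared counter; the lock is a stand-in for a transaction."""

    def __init__(self, start=0):
        self._last = start
        self._lock = threading.Lock()

    def next_number(self):
        # Read, increment and store happen as one atomic step,
        # so no number is handed out twice and none is skipped.
        with self._lock:
            self._last += 1
            return self._last


counter = InvoiceCounter()
issued = []


def worker():
    for _ in range(100):
        issued.append(counter.next_number())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 400 numbers were issued: 1..400 with no duplicates and no holes.
print(sorted(issued) == list(range(1, 401)))  # True
```

Every caller contends for the same counter, which is why throughput is capped at a few numbers per second on the real datastore.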

Vim: Use shorter textwidth in comments and docstrings

9 votes

From the mighty PEP 8:

[P]lease limit all lines to a maximum of 79 characters. For flowing long blocks of text (docstrings or comments), limiting the length to 72 characters is recommended.

When editing Python code in Vim, I set my textwidth to 79, and Vim automatically wraps long lines of Python code for me when I hit the character limit.

But in comments and docstrings, I need to wrap text at 72 characters instead. Is there any way to make Vim automatically set textwidth to 72 when I'm in a comment or docstring, and set it back when I'm done?

So, I've never done any Vim scripting before, but based on this question about doing something similar in C and this tip for checking if you're currently in a comment, I've hacked together a solution.

By default, this uses the PEP 8 suggested widths of 79 characters for normal lines and 72 characters for comments, but you can override them by setting (with let) the g:python_normal_text_width or g:python_comment_text_width variables, respectively. (Personally, I wrap normal lines at 78 characters.)

Drop this baby in your .vimrc and you should be good to go. I may package this up as a plugin later.

function! GetPythonTextWidth()
    if !exists('g:python_normal_text_width')
        let normal_text_width = 79
    else
        let normal_text_width = g:python_normal_text_width
    endif

    if !exists('g:python_comment_text_width')
        let comment_text_width = 72
    else
        let comment_text_width = g:python_comment_text_width
    endif

    let cur_syntax = synIDattr(synIDtrans(synID(line("."), col("."), 0)), "name")
    if cur_syntax == "Comment"
        return comment_text_width
    elseif cur_syntax == "String"
        " Check to see if we're in a docstring
        let lnum = line(".")
        while lnum >= 1 && (synIDattr(synIDtrans(synID(lnum, col([lnum, "$"]) - 1, 0)), "name") == "String" || match(getline(lnum), '\v^\s*$') > -1)
            if match(getline(lnum), "\\('''\\|\"\"\"\\)") > -1
                " Assume that any longstring is a docstring
                return comment_text_width
            endif
            let lnum -= 1
        endwhile
    endif

    return normal_text_width
endfunction

augroup pep8
    au!
    autocmd CursorMoved,CursorMovedI * :if &ft == 'python' | :exe 'setlocal textwidth='.GetPythonTextWidth() | :endif
augroup END

Checking if an ISBN number is correct

9 votes

I'm given some ISBN numbers e.g. 3-528-03851 (not valid) , 3-528-16419-0 (valid). I'm supposed to write a program which tests if the ISBN number is valid.

Here's my code:

def check(isbn):
    check_digit = int(isbn[-1])
    match = re.search(r'(\d)-(\d{3})-(\d{5})', isbn[:-1])

    if match:
        digits = match.group(1) + match.group(2) + match.group(3)
        result = 0

        for i, digit in enumerate(digits):
          result += (i + 1) * int(digit)

        return True if (result % 11) == check_digit else False

    return False

I've used a regular expression to check a) if the format is valid and b) extract the digits in the ISBN string. While it seems to work, being a Python beginner I'm eager to know how I could improve my code. Suggestions?

First, try to avoid code like this:

if Action():
    lots of code
    return True
return False

Flip it around, so the bulk of code isn't nested. This gives us:

def check(isbn):
    check_digit = int(isbn[-1])
    match = re.search(r'(\d)-(\d{3})-(\d{5})', isbn[:-1])

    if not match:
        return False

    digits = match.group(1) + match.group(2) + match.group(3)
    result = 0

    for i, digit in enumerate(digits):
      result += (i + 1) * int(digit)

    return True if (result % 11) == check_digit else False

There are some bugs in the code:

  • If the check digit isn't an integer, this will raise ValueError instead of returning False: "0-123-12345-Q".
  • If the check digit is 10 ("X"), this will raise ValueError instead of returning True.
  • This assumes that the ISBN is always grouped as "1-123-12345-1". That's not the case; ISBNs are grouped arbitrarily. For example, the grouping "12-12345-12-1" is valid. See http://www.isbn.org/standards/home/isbn/international/html/usm4.htm.
  • This assumes the ISBN is grouped by hyphens. Spaces are also valid.
  • It doesn't check that there are no extra characters; '0-123-4567819' returns True, ignoring the extra 1 at the end.

So, let's simplify this. First, remove all spaces and hyphens, and make sure the regex matches the whole string by anchoring it with '^...$'. That makes sure it rejects strings which are too long.

def check(isbn):
    isbn = isbn.replace("-", "").replace(" ", "")
    check_digit = int(isbn[-1])
    match = re.search(r'^(\d{9})$', isbn[:-1])
    if not match:
        return False

    digits = match.group(1)

    result = 0
    for i, digit in enumerate(digits):
      result += (i + 1) * int(digit)

    return True if (result % 11) == check_digit else False

Next, let's fix the "X" check digit problem. Match the check digit in the regex as well, so the entire string is validated by the regex, then convert the check digit correctly.

def check(isbn):
    isbn = isbn.replace("-", "").replace(" ", "").upper()
    match = re.search(r'^(\d{9})(\d|X)$', isbn)
    if not match:
        return False

    digits = match.group(1)
    check_digit = 10 if match.group(2) == 'X' else int(match.group(2))

    result = 0
    for i, digit in enumerate(digits):
      result += (i + 1) * int(digit)

    return True if (result % 11) == check_digit else False

Finally, using a generator expression and sum is a more idiomatic way of doing the final calculation in Python, and the final conditional can be simplified.

import re

def check(isbn):
    isbn = isbn.replace("-", "").replace(" ", "").upper()
    match = re.search(r'^(\d{9})(\d|X)$', isbn)
    if not match:
        return False

    digits = match.group(1)
    check_digit = 10 if match.group(2) == 'X' else int(match.group(2))

    result = sum((i + 1) * int(digit) for i, digit in enumerate(digits))
    return (result % 11) == check_digit
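As a sanity check on the arithmetic, here is the weighted sum worked out by hand for 3-528-16419-0, the valid example from the question. The variable names are only for illustration.

```python
# Worked check for 3-528-16419-0 (the valid example from the question).
digits = "352816419"   # first nine digits, separators removed
check_digit = 0        # tenth character

# Each digit is weighted by its 1-based position.
terms = [(i + 1) * int(d) for i, d in enumerate(digits)]
total = sum(terms)

print(terms)        # [3, 10, 6, 32, 5, 36, 28, 8, 81]
print(total)        # 209
print(total % 11)   # 0, which equals the check digit
```

The other example, 3-528-03851, has only nine digits once the hyphens are removed, so the regex rejects it before any arithmetic happens.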

Developing with Django+Celery without running `celeryd`?

8 votes

In development, it's a bit of a hassle to run the celeryd as well as the Django development server. Is it possible to, for example, ask celery to run tasks synchronously during development? Or something similar?

Yes, you can do this by setting CELERY_ALWAYS_EAGER = True in your settings.
http://ask.github.com/celery/configuration.html#celery-always-eager
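In a Django settings file that might look like the fragment below. The second setting is not part of the answer; it is a companion option from the same generation of Celery, added here on the assumption that you want task errors to surface in the development server. Newer Celery releases use different setting names, so check the documentation for your version.

```python
# settings.py (development only)

# Run tasks locally and synchronously instead of sending them to a worker.
CELERY_ALWAYS_EAGER = True

# Assumption: re-raise exceptions from eagerly executed tasks so they
# show up in the dev server instead of being stored on the result.
CELERY_EAGER_PROPAGATES_EXCEPTIONS = True
```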

How do you get the exact path to "My Documents"?

8 votes

In C++ it's not too hard to get the full pathname to the folder that the shell calls "My Documents" in Windows XP and Windows 7 and "Documents" in Vista; see http://stackoverflow.com/questions/2414828/get-path-to-my-documents

Is there a simple way to do this in Python?

You could use the ctypes module to get the "My Documents" directory:

import ctypes

dll = ctypes.windll.shell32
buf = ctypes.create_unicode_buffer(300)  # longer than MAX_PATH (260)
# 0x0005 is CSIDL_PERSONAL, the "My Documents" folder
dll.SHGetSpecialFolderPathW(None, buf, 0x0005, False)
print(buf.value)

Source: http://bugs.python.org/issue1763#msg62242

How to access a data structure from a currently running Python process on Linux?

7 votes

I have a long-running Python process that is generating more data than I planned for. My results are stored in a list that will be serialized (pickled) and written to disk when the program completes -- if it gets that far. But at this rate, it's more likely that the list will exhaust all 1+ GB free RAM and the process will crash, losing all my results in the process.

I plan to modify my script to write results to disk periodically, but I'd like to save the results of the currently-running process if possible. Is there some way I can grab an in-memory data structure from a running process and write it to disk?

I found code.interact(), but since I don't have this hook in my code already, it doesn't seem useful to me (http://stackoverflow.com/questions/1637198/method-to-peek-at-a-python-program-running-right-now).

I'm running Python 2.5 on Fedora 8. Any thoughts?

Thanks a lot.

Shahin

There is not much you can do for a running program. The only thing I can think of is to attach the gdb debugger, stop the process and examine the memory. Alternatively, make sure that your system is set up to save core dumps, then kill the process with kill -SEGV <pid>. You should then be able to open the core dump with gdb and examine it at your leisure.

There are some gdb macros that will let you examine python data structures and execute python code from within gdb, but for these to work you need to have compiled python with debug symbols enabled and I doubt that is your case. Creating a core dump first then recompiling python with symbols will NOT work, since all the addresses will have changed from the values in the dump.

Here are some links for introspecting python from gdb:

http://wiki.python.org/moin/DebuggingWithGdb

http://chrismiles.livejournal.com/20226.html

or google for 'python gdb'

N.B. To make Linux create core dumps, use the ulimit command.

ulimit -a will show you what the current limits are set to.

ulimit -c unlimited will enable core dumps of any size.

elegant way to match two wildcarded strings

7 votes

I'm OCRing some text from two different sources. They can each make mistakes in different places, where they won't recognize a letter/group of letters. If they don't recognize something, it's replaced with a ?. For example, if the word is Roflcopter, one source might return Ro?copter, while another, Roflcop?er. I'd like a function that returns whether two matches might be equivalent, allowing for multiple ?s. Example:

match("Ro?copter", "Roflcop?er") --> True
match("Ro?copter", "Roflcopter") --> True
match("Roflcopter", "Roflcop?er") --> True
match("Ro?co?er", "Roflcop?er") --> True

So far I can match one OCR with a perfect one by using regular expressions:

>>> def match(tn1, tn2):
    tn1re = tn1.replace("?", ".{0,4}")
    tn2re = tn2.replace("?", ".{0,4}")

    return bool(re.match(tn1re, tn2) or re.match(tn2re, tn1))

>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True

But this doesn't work when they both have ?s in different places:

>>> match("R??lcopter", "Roflcop?er")
False

Thanks to Hamish Grubijan for this idea. Every ? in my ocr'd names can be anywhere from 0 to 3 letters. What I do is expand each string to a list of possible expansions:

>>> list(expQuestions("?flcopt?"))
['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@', '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@', '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@', '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']

then I expand both and use his matching function, which I called matchats:

def matchOCR(l, r):
    for expl in expQuestions(l):
        for expr in expQuestions(r):
            if matchats(expl, expr):
                return True
    return False

Works as desired:

>>> matchOCR("Ro?co?er", "?flcopt?")
True
>>> matchOCR("Ro?co?er", "?flcopt?z")
False
>>> matchOCR("Ro?co?er", "?flc?pt?")
True
>>> matchOCR("Ro?co?e?", "?flc?pt?")
True


The matching function:

def matchats(l, r):
    """Match two strings with @ representing exactly 1 char"""
    if len(l) != len(r): return False
    for i, c1 in enumerate(l):
        c2 = r[i]
        if c1 == "@" or c2 == "@": continue
        if c1 != c2: return False
    return True

and the expanding function, where cartesian_product does just that:

def expQuestions(s):
    """For OCR w/ a questionmark in them, expand questions with
    @s for all possibilities"""
    numqs = s.count("?")

    blah = list(s)
    for expqs in cartesian_product([(0,1,2,3)]*numqs):
        newblah = blah[:]
        qi = 0
        for i,c in enumerate(newblah):
            if newblah[i] == '?':
                newblah[i] = '@'*expqs[qi]
                qi += 1
        yield "".join(newblah)
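Since cartesian_product isn't shown, here is a compact, self-contained restatement of the same approach for Python 3. It assumes itertools.product from the standard library in place of that helper, and the names exp_questions and match_ocr are just renamed stand-ins for the functions above.

```python
from itertools import product


def matchats(l, r):
    """Match two strings where '@' stands for exactly one unknown character."""
    if len(l) != len(r):
        return False
    return all(a == b or "@" in (a, b) for a, b in zip(l, r))


def exp_questions(s, max_len=3):
    """Yield every expansion of s with each '?' replaced by 0..max_len '@'s."""
    parts = s.split("?")
    # One count per '?'; product enumerates all combinations of counts.
    for counts in product(range(max_len + 1), repeat=len(parts) - 1):
        pieces = [parts[0]]
        for n, part in zip(counts, parts[1:]):
            pieces.append("@" * n + part)
        yield "".join(pieces)


def match_ocr(l, r):
    """True if some expansion of l matches some expansion of r."""
    return any(matchats(a, b)
               for a in exp_questions(l)
               for b in exp_questions(r))


print(match_ocr("Ro?co?er", "?flcopt?"))     # True
print(match_ocr("Ro?co?er", "?flcopt?z"))    # False
print(match_ocr("R??lcopter", "Roflcop?er")) # True
```

The number of expansions grows as 4 to the power of the number of ?s in each string, so this is fine for a handful of wildcards but gets slow quickly beyond that.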

How to distinguish different types of NaN float in Python

6 votes

I'm writing Python 2.6 code that interfaces with NI TestStand 4.2 via COM in Windows. I want to make a "NAN" value for a variable, but if I pass it float('nan'), TestStand displays it as IND.

Apparently TestStand distinguishes between floating point "IND" and "NAN" values. According to TestStand help:

  • IND corresponds to Signaling NaN in Visual C++, while
  • NAN corresponds to QuietNaN

That implies that Python's float('nan') is effectively a Signaling NaN when passed through COM. However, from what I've read about Signaling NaN, it seems that Signaling NaN is a bit "exotic" and Quiet NaN is your "regular" NaN. So I have my doubts that Python would be passing a Signaling NaN through COM. How could I find out if a Python float('nan') is passed through COM as a Signaling NaN or Quiet NaN, or maybe Indeterminate?

Is there any way to make a Signaling NaN versus QuietNaN or Indeterminate in Python, when interfacing with other languages? (Using ctypes perhaps?) I assume this would be a platform-specific solution, and I'd accept that in this case.

Update: In the TestStand sequence editor, I tried making two variables, one set to NAN and the other set to IND. Then I saved it to a file. Then I opened the file and read each variable using Python. In both cases, Python reads them as a nan float.

I dug a bit for you, and I think you might be able to use the struct module in combination with the information at Kevin's Summary Charts. They explain the exact bit patterns used for the various kinds of IEEE 754 floating point numbers.

The only thing you will probably have to be careful about, if I read the topics on this IND-eterminate value correctly, is that the value tends to trigger some kind of floating point interrupt when assigned directly in C code, causing it to be turned into a plain NaN. That in turn meant those people were advised to do this kind of thing in ASM rather than C, since C abstracts that stuff away. Since this is not my field, and I am not sure to what extent this kind of value would mess with Python, I figured I'd mention it so you can at least keep an eye out for any such weird behaviour. (See the accepted answer for this question.)

>>> import struct

>>> struct.pack(">d", float('nan')).encode("hex_codec")
'fff8000000000000'

>>> import scipy
>>> struct.pack(">d", scipy.nan).encode("hex_codec")
'7ff8000000000000'

Referring to Kevin's Summary Charts, that shows that float('nan') is actually technically the Indeterminate value, while scipy.nan is a Quiet NaN.

Let's try making a Signaling NaN, and then verify it.

>>> try_signaling_nan = struct.unpack(">d", "\x7f\xf0\x00\x00\x00\x00\x00\x01")[0]
>>> struct.pack(">d", try_signaling_nan).encode("hex_codec")
'7ff8000000000001'

No, the Signaling NaN gets converted to a Quiet NaN.

Now let's try making a Quiet NaN directly, and then verify it.

>>> try_quiet_nan = struct.unpack(">d", "\x7f\xf8\x00\x00\x00\x00\x00\x00")[0]
>>> struct.pack(">d", try_quiet_nan).encode("hex_codec")
'7ff8000000000000'

So that's how to make a proper Quiet NaN using struct.unpack()--at least, on a Windows platform.
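The session above is Python 2. On Python 3 the same inspection can be sketched as below. The helper names are made up for illustration, and the quiet/signaling test assumes the common convention (used on x86 and ARM) that the top fraction bit is set for a Quiet NaN. The bits you get for float('nan') may differ between platforms and Python versions, so no output is claimed for that line.

```python
import struct


def float_bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as an int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def classify_nan_bits(bits):
    """Classify a 64-bit pattern; returns None if it is not a NaN."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent != 0x7FF or fraction == 0:
        return None  # finite number or infinity
    # Assumption: top fraction bit set means quiet (x86/ARM convention).
    quiet_bit = (fraction >> 51) & 1
    return "quiet" if quiet_bit else "signaling"


print(classify_nan_bits(0x7FF8000000000000))  # quiet
print(classify_nan_bits(0xFFF8000000000000))  # quiet (sign bit set)
print(classify_nan_bits(0x7FF0000000000001))  # signaling

# Bit pattern of this interpreter's float('nan'); platform-dependent.
print("%016x" % float_bits(float("nan")))
```

The second pattern, a Quiet NaN with the sign bit set and an otherwise empty fraction, is the one the charts call Indeterminate.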