Best python questions in May 2012

str performance in python

68 votes

While profiling a piece of python code (python 2.6 up to 3.2), I discovered that the str method to convert an object (in my case an integer) to a string is almost an order of magnitude slower than using string formatting.

Here is the benchmark

>>> from timeit import Timer
>>> Timer('str(100000)').timeit()
0.3145311339386332
>>> Timer('"%s"%100000').timeit()
0.03803517023435887

Does anyone know why this is the case? Am I missing something?

'%s' % 100000 is evaluated by the compiler and is equivalent to a constant at run-time.

>>> import dis
>>> dis.dis(lambda: str(100000))
  8           0 LOAD_GLOBAL              0 (str)
              3 LOAD_CONST               1 (100000)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE        
>>> dis.dis(lambda: '%s' % 100000)
  9           0 LOAD_CONST               3 ('100000')
              3 RETURN_VALUE        

% with a run-time expression is not (significantly) faster than str:

>>> Timer('str(x)', 'x=100').timeit()
0.25641703605651855
>>> Timer('"%s" % x', 'x=100').timeit()
0.2169809341430664

Note that str is still slightly slower. As @DietrichEpp said, this is because str involves a global name lookup and a function call, while % compiles to a single bytecode instruction:

>>> dis.dis(lambda x: str(x))
  9           0 LOAD_GLOBAL              0 (str)
              3 LOAD_FAST                0 (x)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE        
>>> dis.dis(lambda x: '%s' % x)
 10           0 LOAD_CONST               1 ('%s')
              3 LOAD_FAST                0 (x)
              6 BINARY_MODULO       
              7 RETURN_VALUE        

Of course the above is true for the system I tested on (CPython 2.7); other implementations may differ.
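
To check this on your own interpreter, here is a minimal sketch written for Python 3 (an assumption; the answer above used CPython 2.7). The helper name opnames is hypothetical. Whether a constant expression such as '%s' % 100000 is folded at compile time varies between CPython versions, so the sketch passes the value through a variable to force run-time work, and it makes no claim about absolute timings:

```python
import dis
from timeit import timeit


def opnames(source):
    """Return the bytecode instruction names for an expression (hypothetical helper)."""
    code = compile(source, "<example>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]


# With a variable operand, '%' has to run at run time: expect a BINARY_* opcode.
percent_ops = opnames("'%s' % x")
# str(x) instead needs a name lookup for `str` followed by a call.
str_ops = opnames("str(x)")

# Time both with the operand supplied via setup, so nothing can be pre-computed.
t_str = timeit("str(x)", setup="x = 100", number=100000)
t_percent = timeit("'%s' % x", setup="x = 100", number=100000)

if __name__ == "__main__":
    print(percent_ops)
    print(str_ops)
    print("str(x): %.4fs   '%%s' %% x: %.4fs" % (t_str, t_percent))
```

Comparing the two opcode lists is usually more informative than the timings, which depend heavily on the machine and interpreter version.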

why is xrange able to go back to beginning in Python?

22 votes

I came across this code in the question "Most pythonic way of counting matching elements in something iterable":

r = xrange(1, 10)
print sum(1 for v in r if v % 2 == 0) # 4
print sum(1 for v in r if v % 3 == 0) # 3

r is iterated once, and then it's iterated again. I thought that once an iterator is consumed, it is exhausted and cannot be iterated again.

Generator expressions can be iterated only once:

r = (7 * i for i in xrange(1, 10))
print sum(1 for v in r if v % 2 == 0) # 4
print sum(1 for v in r if v % 3 == 0) # 0

enumerate(L) too:

r = enumerate(mylist)

and file object too:

f = open(myfilename, 'r')

Why does xrange behave differently?

Because the xrange object produced by calling xrange() is an iterable, not an iterator: its __iter__ returns a new, independent iterator (a separate rangeiterator object) each time it's iterated.

>>> x = xrange(3)
>>> type(x)
<type 'xrange'>
>>> i = x.__iter__()
>>> type(i)
<type 'rangeiterator'>
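
The same distinction can be sketched in Python 3, where range plays the role of xrange (an assumption of this example; the question itself is about Python 2). A range hands out a fresh iterator on every pass, while a generator is its own iterator and is used up after one pass:

```python
r = range(1, 10)                            # an iterable, not an iterator
evens = sum(1 for v in r if v % 2 == 0)
threes = sum(1 for v in r if v % 3 == 0)    # works again: a fresh iterator

g = (7 * i for i in range(1, 10))           # a generator is its own iterator
g_evens = sum(1 for v in g if v % 2 == 0)
g_threes = sum(1 for v in g if v % 3 == 0)  # g is already exhausted here

fresh_each_time = iter(r) is not iter(r)    # two distinct iterator objects
generator_is_own_iterator = iter(g) is g

if __name__ == "__main__":
    print(evens, threes, g_evens, g_threes)
    print(fresh_each_time, generator_is_own_iterator)
```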

Why Numpy treats a+=b and a=a+b differently

19 votes

Is the following numpy behavior intentional or is it a bug?

from numpy import *

a = arange(5)
a = a+2.3
print 'a = ', a
# Output: a = 2.3, 3.3, 4.3, 5.3, 6.3 

a = arange(5)
a += 2.3
print 'a = ', a
# Output: a = 2, 3, 4, 5, 6

Python version: 2.7.2, Numpy version: 1.6.1

That's intentional.

The += operator preserves the type of the array. In other words, an array of integers remains an array of integers.

This enables NumPy to perform the += operation using existing array storage. On the other hand, a=a+b creates a brand new array for the sum, and rebinds a to point to this new array; this increases the amount of storage used for the operation.

To quote the documentation:

Warning: In place operations will perform the calculation using the precision decided by the data type of the two operands, but will silently downcast the result (if necessary) so it can fit back into the array. Therefore, for mixed precision calculations, A {op}= B can be different than A = A {op} B. For example, suppose a = ones((3,3)). Then, a += 3j is different than a = a + 3j: while they both perform the same computation, a += 3 casts the result to fit back in a, whereas a = a + 3j re-binds the name a to the result.

Finally, if you're wondering why a was an integer array in the first place, consider the following:

In [3]: np.arange(5).dtype
Out[3]: dtype('int64')

In [4]: np.arange(5.0).dtype
Out[4]: dtype('float64')
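
The mechanism behind the difference is that Python dispatches a += b to __iadd__ and a = a + b to __add__, so a type is free to implement them differently. Below is a toy sketch of that dispatch, written for Python 3. IntBox and FloatBox are hypothetical stand-ins invented for this example; this is not how NumPy is implemented.

```python
class FloatBox:
    """Toy stand-in for a float array holding one value (hypothetical)."""

    def __init__(self, value):
        self.value = float(value)


class IntBox:
    """Toy stand-in for an integer array holding one value (hypothetical)."""

    def __init__(self, value):
        self.value = int(value)

    def __add__(self, other):
        # a + b: build a brand new object whose type fits the result.
        result = self.value + other
        return IntBox(result) if isinstance(result, int) else FloatBox(result)

    def __iadd__(self, other):
        # a += b: compute, then cast back into the existing integer storage.
        self.value = int(self.value + other)
        return self


a = IntBox(2)
alias = a
a += 2.3                      # in place: same object, result truncated to int
in_place = (a is alias, a.value)

b = IntBox(2)
old_b = b
b = b + 2.3                   # rebinding: a new object of a wider type
rebound = (b is old_b, type(b).__name__)
```

After the in-place addition, alias still refers to the modified object, whereas old_b is left untouched by the rebinding.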

Why does "[] == False" evaluate to False when "if not []" succeeds?

18 votes

I'm asking this because I know that the pythonic way to check whether a list is empty or not is the following:

my_list = []
if not my_list:
    print "computer says no"
else:
    # my_list isn't empty
    print "computer says yes"

This prints "computer says no". This leads me to identify [] with False truth-values; however, if I try to compare [] and False "directly", I obtain the following:

>>> my_list == False
False
>>> my_list is False
False
>>> [] == False
False

etc...

What's going on here? I feel like I'm missing something really obvious.

The if statement evaluates everything in a Boolean context; it is as if there were an implicit call to the bool() built-in function.

Here is how you would actually check how things will be evaluated by an if statement:

>>> bool([])
False
>>> bool([]) == False
True

See the documentation on Truth Value Testing: empty lists are considered false, but this doesn't mean they are equal to False.

PEP 285 also has some excellent information on why it was implemented this way, see the very last bullet in the Resolved Issues section for the part that deals with x == True and x == False specifically.

The most convincing aspect to me is that == is generally transitive, so a == b and b == c implies a == c. So if it were the way you expected and [] == False were true and '' == False were true, one might assume that [] == '' should be true (even though it obviously should not be).
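
A short sketch of the difference between "is falsy" and "equals False", written for Python 3 (the variable names are only illustrative):

```python
empty_values = [[], "", (), {}, 0, None]

# Truth-value testing: every one of these is falsy ...
all_falsy = all(not value for value in empty_values)

# ... but only 0 compares equal to False, because bool is a subclass of int.
equal_to_false = [value for value in empty_values if value == False]  # noqa: E712

# What `if not x:` effectively checks is bool(x), not x == False.
bool_of_empty_list = bool([])
list_equals_false = ([] == False)  # noqa: E712
```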

Which database model should I use for dynamic modification of entities/properties during runtime?

18 votes

I am thinking about creating an open source data management web application for various types of data.

A privileged user must be able to

  • add new entity types (for example a 'user' or a 'family')
  • add new properties to entity types (for example 'gender' to 'user')
  • remove/modify entities and properties

These will be common tasks for the privileged user. He will do this through the web interface of the application. In the end, all data must be searchable and sortable by all types of users of the application. Two questions trouble me:

a) How should the data be stored in the database? Should I dynamically add/remove database tables and/or columns during runtime?

I am no database expert. I am stuck on the idea that, in terms of relational databases, the application has to be able to dynamically add/remove tables (entities) and/or columns (properties) at runtime, and I don't like this idea. Likewise, I am wondering whether such dynamic data should be handled in a NoSQL database.

Anyway, I believe that this kind of problem has an intelligent canonical solution, which I just did not find and think of so far. What is the best approach for this kind of dynamic data management?

b) How to implement this in Python using an ORM or NoSQL?

If you recommend using a relational database model, then I would like to use SQLAlchemy. However, I don't see how to dynamically create tables/columns with an ORM at runtime. This is one of the reasons why I hope that there is a much better approach than creating tables and columns during runtime. Is the recommended database model efficiently implementable with SQLAlchemy?

If you recommend using a NoSQL database, which one? I like using Redis -- can you imagine an efficient implementation based on Redis?

Thanks for your suggestions!

Edit in response to some comments:

The idea is that all instances ("rows") of a certain entity ("table") share the same set of properties/attributes ("columns"). However, it will be perfectly valid if certain instances have an empty value for certain properties/attributes.

Basically, users will search the data through a simple form on a website. They query for e.g. all instances of an entity E with property P having a value V higher than T. The result can be sorted by the value of any property.

The datasets won't become too large. Hence, I think even the stupidest approach would still lead to a working system. However, I am an enthusiast and I'd like to apply modern and appropriate technology as well as I'd like to be aware of theoretical bottlenecks. I want to use this project in order to gather experience in designing a "Pythonic", state-of-the-art, scalable, and reliable web application.

I see that the first comments tend to recommend a NoSQL approach. Although I really like Redis, it looks like it would be stupid not to take advantage of the Document/Collection model of Mongo/Couch. I've been looking into mongodb and mongoengine for Python. Am I heading in the right direction by doing so?

Edit 2 in response to some answers/comments:

From most of your answers, I conclude that the dynamic creation/deletion of tables and columns in the relational picture is not the way to go. This already is valuable information. Also, one opinion is that the whole idea of the dynamic modification of entities and properties could be bad design.

As exactly this dynamic nature should be the main purpose/feature of the application, I don't give up on this. From the theoretical point of view, I accept that performing operations on a dynamic data model must necessarily be slower than performing operations on a static data model. This is totally fine.

Expressed in an abstract way, the application needs to manage

  1. the data layout, i.e. a "dynamic list" of valid entity types and a "dynamic list" of properties for each valid entity type
  2. the data itself

I am looking for an intelligent and efficient way to implement this. From your answers, it looks like NoSQL is the way to go here, which is another important conclusion.

So, if you conceptualize your entities as "documents," then this whole problem maps onto a NoSQL solution pretty well. As commented, you'll need to have some kind of model layer that sits on top of your document store and performs tasks like validation, and perhaps enforces (or encourages) some kind of schema, because there's no implicit backend requirement that entities in the same collection (the counterpart of a table) share a schema.

Allowing privileged users to change your schema concept (as opposed to just adding fields to individual documents - that's easy to support) will pose a little bit of a challenge - you'll have to handle migrating the existing data to match the new schema automatically.

Reading your edits, Mongo supports the kind of searching/ordering you're looking for, and will give you the support for "empty cells" (documents lacking a particular key) that you need.

If I were you (and I happen to be working on a similar, but simpler, product at the moment), I'd stick with Mongo and look into a lightweight web framework like Flask to provide the front-end. You'll be on your own to provide the model, but you won't be fighting against a framework's implicit modeling choices.
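
To make the "model layer on top of a document store" idea concrete, here is a rough sketch in plain Python 3. Everything in it is an assumption made for illustration: EntityType and its methods are hypothetical names, and a list of dicts stands in for a real collection, so no Mongo API is shown.

```python
class EntityType:
    """Hypothetical in-memory model of one entity type and its properties."""

    def __init__(self, name):
        self.name = name
        self.properties = {}   # property name -> expected Python type
        self.documents = []    # stand-in for a collection in a document store

    def add_property(self, prop, expected_type):
        # Existing documents simply lack the new key (an "empty cell").
        self.properties[prop] = expected_type

    def insert(self, document):
        # Validate against the current, run-time defined schema.
        for key, value in document.items():
            if key not in self.properties:
                raise KeyError("unknown property: %r" % key)
            if not isinstance(value, self.properties[key]):
                raise TypeError("bad type for %r" % key)
        self.documents.append(dict(document))

    def query(self, prop, threshold, sort_by):
        """Documents where prop > threshold, sorted by sort_by (missing values last)."""
        hits = [d for d in self.documents if prop in d and d[prop] > threshold]
        return sorted(hits, key=lambda d: (sort_by not in d, d.get(sort_by)))


user = EntityType("user")
user.add_property("name", str)
user.add_property("age", int)
user.insert({"name": "alice", "age": 31})
user.insert({"name": "bob", "age": 17})
user.add_property("gender", str)            # schema change at run time
user.insert({"name": "carol", "age": 45, "gender": "f"})
adults = user.query("age", 18, sort_by="name")
```

Removing or renaming a property is the harder part, since existing documents would then need to be migrated; the sketch deliberately leaves that out.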

Python dictionary that defaults to key?

17 votes

Is there a way to get a defaultdict to return the key by default? Or some data structure with equivalent behavior? I.e., after initializing dictionary d,

>>> d['a'] = 1
>>> d['a']
1
>>> d['b']
'b'
>>> d['c']
'c'

I've only seen default dictionaries take functions that don't take parameters, so I'm not sure if there's a solution other than creating a new kind of dictionary.

I'd override the __missing__ method of dict:

>>> class MyDefaultDict(dict):
...     def __missing__(self, key):
...         self[key] = key
...         return key
...
>>> d = MyDefaultDict()
>>> d['joe']
'joe'
>>> d
{'joe': 'joe'}
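
If you would rather not have missing keys stored as a side effect of a lookup, a variant of the same idea is to return the key without assigning it. A minimal sketch for Python 3 (KeyDefaultDict is a name made up for this example):

```python
class KeyDefaultDict(dict):
    """Return the key itself for missing keys, without storing it."""

    def __missing__(self, key):
        return key


d = KeyDefaultDict(a=1)
values = (d["a"], d["b"], d["c"])
stored = dict(d)              # only the explicitly set item is present
via_get = d.get("b")          # dict.get() does not call __missing__
```

Note that only subscript access (d[key]) goes through __missing__; d.get(key) and the in operator behave as for a plain dict.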

How should I format a long url in a python comment and still be PEP8 compliant

15 votes

In a block comment, I want to reference a URL that is over 80 characters long.

What is the preferred convention for displaying this URL?

I know bit.ly is an option, but the URL itself is descriptive. Shortening it and then having a nested comment describing the shortened URL seems like a crappy solution.

Don't break the url:

# A Foolish Consistency is the Hobgoblin of Little Minds [1]
# [1]: http://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds

Python 3.x rounding behavior

15 votes

I was just re-reading What’s New In Python 3.0 and it states:

The round() function rounding strategy and return type have changed. Exact halfway cases are now rounded to the nearest even result instead of away from zero. (For example, round(2.5) now returns 2 rather than 3.)

and the documentation for round:

For the built-in types supporting round(), values are rounded to the closest multiple of 10 to the power minus n; if two multiples are equally close, rounding is done toward the even choice

So, under v2.7.3:

In [85]: round(2.5)
Out[85]: 3.0

In [86]: round(3.5)
Out[86]: 4.0

as I'd have expected. However, now under v3.2.3:

In [32]: round(2.5)
Out[32]: 2

In [33]: round(3.5)
Out[33]: 4

This seems counter-intuitive and contrary to what I understand about rounding (and bound to trip up people). English isn't my native language but until I read this I thought I knew what rounding meant :-/ I am sure at the time v3 was introduced there must have been some discussion of this, but I was unable to find a good reason in my search.

  1. Does anyone have insight into why this was changed to this?
  2. Are there any other mainstream programming languages (e.g., C, C++, Java, Perl, ..) that do this sort of (to me inconsistent) rounding?

What am I missing here?

UPDATE: @Li-aungYip's comment re "Banker's rounding" gave me the right search term/keywords to search for and I found this SO question: Why does .NET use banker's rounding as default?, so I will be reading that carefully.

Python 3.0's way is considered the standard rounding method these days, though some language implementations aren't on the bus yet.

The simple "always round 0.5 up" technique results in a slight bias toward the higher number. With large numbers of calculations, this can be significant. The Python 3.0 approach eliminates this issue.

Your puzzlement may derive from a misconception that there is only one method of rounding. IEEE 754, the international standard for floating-point math, defines five different rounding methods (the one used by Python 3.0 is the default). And there are others.

You're not alone, however; this behavior is not as widely known as it ought to be. AppleScript was, if I remember correctly, an early adopter of this rounding method. The round command in AppleScript actually does offer several options, but round-toward-even is the default as it is in IEEE 754. Apparently the engineer who implemented the round command got so fed up with all the requests to "make it work like I learned in school" that he implemented just that: round 2.5 rounding as taught in school is a valid AppleScript command. :-)
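
The bias mentioned above can be made visible with the decimal module, which supports both strategies explicitly. This is a small sketch for Python 3; the variable and function names are only illustrative:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# The exact halfway cases 0.5, 1.5, ..., 9.5
halves = [Decimal(k) + Decimal("0.5") for k in range(10)]
exact_total = sum(halves)


def rounded_total(mode):
    """Sum of the halfway cases after rounding each one to an integer."""
    return sum(h.quantize(Decimal("1"), rounding=mode) for h in halves)


half_up_total = rounded_total(ROUND_HALF_UP)       # "as taught in school"
half_even_total = rounded_total(ROUND_HALF_EVEN)   # ties go to the even neighbour
builtin = [round(2.5), round(3.5)]                 # Python 3's round()

if __name__ == "__main__":
    print(exact_total, half_up_total, half_even_total, builtin)
```

Rounding every tie upward drifts away from the exact sum, while rounding ties to even keeps the total unchanged for this set of values.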

Count all elements in list of arbitrary nested list without recursion

14 votes

I have just learned about recursion in Python and have completed assignments, one of which was to count all the elements within a list of arbitrarily nested lists. I have searched this site and the answers found all seem to use recursive calls. Since it has been taught that anything which could be expressed recursively could be expressed iteratively, and iteration is preferred in Python, how would this be accomplished without recursion or imported modules in Python 2.6 (as a learning exercise)? (A nested list itself would be counted as an element, as would its contents.) For example:

>>> def element_count(p):
...     count = 0
...     for entry in p:
...         count += 1
...         if isinstance(entry, list):            
...             count += element_count(entry)
...     return count
>>> print element_count([1, [], 3]) 
3 
>>> print element_count([1, [1, 2, [3, 4]]])
7
>>> print element_count([[[[[[[[1, 2, 3]]]]]]]])
10

How would this be written using iteration?

Here is one way to do it:

def element_count(p):
  q = p[:]
  count = 0
  while q:
    entry = q.pop()
    if isinstance(entry, list):
      q += entry
    count += 1
  return count

print element_count([1, [], 3]) 
print element_count([1, [1, 2, [3, 4]]])
print element_count([[[[[[[[1, 2, 3]]]]]]]])

The code maintains a worklist of things still to be looked at (strictly speaking a stack, since pop() takes items from the end). Whenever the loop encounters a sub-list, it adds its contents to the worklist.
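
For comparison, here is a sketch of the same idea for Python 3 that uses collections.deque as a true first-in, first-out queue (the question asked for no imports, so treat this as a variation rather than a replacement). The order in which elements are visited differs from the version above, but the count is the same:

```python
from collections import deque


def element_count(p):
    """Count every element, including nested lists themselves, without recursion."""
    pending = deque(p)               # shallow copy; the caller's list is untouched
    count = 0
    while pending:
        entry = pending.popleft()    # FIFO: visits the structure level by level
        count += 1
        if isinstance(entry, list):
            pending.extend(entry)    # schedule the sub-list's contents
    return count


if __name__ == "__main__":
    print(element_count([1, [], 3]))
    print(element_count([1, [1, 2, [3, 4]]]))
    print(element_count([[[[[[[[1, 2, 3]]]]]]]]))
```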

Python: Why does the int class not have rich comparison operators like `__lt__()`?

14 votes

Mostly curious.

I've noticed (at least in py 2.6 and 2.7) that a float has all the familiar rich comparison functions: __lt__(), __gt__, __eq__, etc.

>>> (5.0).__gt__(4.5)
True

but an int does not

>>> (5).__gt__(4)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'int' object has no attribute '__gt__'

Which is odd to me, because the operator itself works fine

>>> 5 > 4
True

Even strings support the comparison functions

>>> "hat".__gt__("ace")
True

but all the int has is __cmp__()

Seems strange to me, and so I was wondering why this came to be.

Just tested and it works as expected in python 3, so I am assuming some legacy reasons. Still would like to hear a proper explanation though ;)

If we look at PEP 207 (Rich Comparisons), there is this interesting sentence right at the end:

The inlining already present which deals with integer comparisons would still apply, resulting in no performance cost for the most common cases.

So it seems that in 2.x there is an optimisation for integer comparison. If we take a look at the source code we can find this:

case COMPARE_OP:
    w = POP();
    v = TOP();
    if (PyInt_CheckExact(w) && PyInt_CheckExact(v)) {
        /* INLINE: cmp(int, int) */
        register long a, b;
        register int res;
        a = PyInt_AS_LONG(v);
        b = PyInt_AS_LONG(w);
        switch (oparg) {
        case PyCmp_LT: res = a <  b; break;
        case PyCmp_LE: res = a <= b; break;
        case PyCmp_EQ: res = a == b; break;
        case PyCmp_NE: res = a != b; break;
        case PyCmp_GT: res = a >  b; break;
        case PyCmp_GE: res = a >= b; break;
        case PyCmp_IS: res = v == w; break;
        case PyCmp_IS_NOT: res = v != w; break;
        default: goto slow_compare;
        }
        x = res ? Py_True : Py_False;
        Py_INCREF(x);
    }
    else {
      slow_compare:
        x = cmp_outcome(oparg, v, w);
    }

So it seems that in 2.x there was an existing performance optimisation - by allowing the C code to compare integers directly - which would not have been preserved if the rich comparison operators had been implemented.

In Python 3, __cmp__ is no longer supported, so the rich comparison operators must be there. This does not cause a performance hit as far as I can tell. For example, compare:

Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit("2 < 1")
0.06980299949645996

to:

Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit("2 < 1")
0.06682920455932617

So it seems that similar optimisations are there but my guess is the judgement call was that putting them all in the 2.x branch would have been too great a change when backwards compatibility was a consideration.

In 2.x if you want something like the rich comparison methods you can get at them via the operator module:

>>> import operator
>>> operator.gt(2,1)
True

Why does using None fix Python's mutable default argument issue?

13 votes

I'm at the point in learning Python where I'm dealing with the Mutable Default Argument problem.

def bad_append(new_item, a_list=[]):
    a_list.append(new_item)
    return a_list

def good_append(new_item, a_list=None):
    if a_list is None:
        a_list = []
    a_list.append(new_item)
    return a_list

I understand that a_list is initialized only when the def statement is first encountered, and that's why subsequent calls of bad_append use the same list object.

What I don't understand is why good_append works any different. It looks like a_list would still be initialized only once; therefore, the if statement would only be true on the first invocation of the function, meaning a_list would only get reset to [] on the first invocation, meaning it would still accumulate all past new_item values and still be buggy.

Why isn't it? What concept am I missing? How does a_list get wiped clean every time good_append runs?

The default value of a_list (or any other default value, for that matter) is stored on the function object once it has been initialized, and thus can be modified like any other object:

>>> def f(x=[]): return x
...
>>> f.func_defaults
([],)
>>> f.func_defaults[0] is f()
True

So the value in func_defaults is the very same object that is visible inside the function (and returned in my example in order to access it from outside).

In other words, what happens when calling f() is an implicit x = f.func_defaults[0]. If that object is subsequently modified, the modification persists.

In contrast, an assignment inside the function always gets a new []. Any modification lasts only until the last reference to that [] has gone; on the next function call, a new [] is created.

Put differently, [] does not evaluate to the same object every time it is executed; rather, in the case of a default argument, it is executed only once (when the def statement runs) and the resulting object is then preserved.
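
Here is the same experiment as a Python 3 sketch, where the attribute is spelled __defaults__ instead of func_defaults. The two functions are the ones from the question; the result variables are only for illustration:

```python
def bad_append(new_item, a_list=[]):       # default list created once, at def time
    a_list.append(new_item)
    return a_list


def good_append(new_item, a_list=None):    # the default is the immutable None
    if a_list is None:
        a_list = []                        # runs on every call that omits a_list
    a_list.append(new_item)
    return a_list


bad_first = list(bad_append(1))            # copies, so later calls don't alter them
bad_second = list(bad_append(2))
good_first = good_append(1)
good_second = good_append(2)

stored_bad_default = bad_append.__defaults__[0]   # the shared, mutated list
stored_good_default = good_append.__defaults__[0] # still None
```

The default of good_append is also evaluated only once, but since None is never mutated, each call that omits a_list rebinds the local name to a brand new list.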

Python class that extends int doesn't entirely behave like an int

13 votes

I'm seeing some weird behavior when trying to convert a string to a class I wrote that extends int. Here's a simple program that demonstrates my problem:

class MyInt(int):
    pass

toInt = '123456789123456789123456789'

print "\nConverting to int..."
print type(int(toInt))

print "\nConverting to MyInt..."
print type(MyInt(toInt))

Since MyInt is empty, I expected that it would behave exactly like an int. Instead, here's the output I got from the program above:

Converting to int...
<type 'long'>

Converting to MyInt...
Traceback (most recent call last):
  File "int.py", line 9, in <module>
    print type(MyInt(toInt))
OverflowError: long int too large to convert to int

The string can't convert to a MyInt! What about the way I wrote MyInt causes it to behave differently than its base class? In this case, there seems to be some kind of maximum on MyInt; are there other properties that get implicitly imposed like this when a built-in class is extended in Python? And, finally, is there a way to change MyInt so that it doesn't have this maximum anymore?

The secret is all in the __new__() method:

>>> class MyInt(int): pass
>>> MyInt.__new__ == int.__new__
True
>>> MyInt.__new__(MyInt, '123456789101234567890')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long
>>> MyInt.__new__(int, '123456789101234567890')
123456789101234567890L

Basically, when you instantiate a class, the very first thing that happens (before __init__(self, *args)) is that __new__(cls, *args) is called. It is passed the class object as its first argument. The __new__ method for int (which is inherited by MyInt) only performs the conversion to long if the class it is passed is int. I assume this is to avoid messing up subclasses, since converting a MyInt to a long would remove all the special functionality you added.

You should use long as your base class if you want integers bigger than int can handle.
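
For what it's worth, the limitation is specific to Python 2. In Python 3 there is a single arbitrary-precision int type, so the subclass from the question accepts the large value directly. A small sketch for Python 3:

```python
class MyInt(int):
    pass


big = MyInt("123456789123456789123456789")
kind = type(big).__name__
same_value = (big == int("123456789123456789123456789"))
arithmetic_type = type(big + 1).__name__   # arithmetic returns a plain int
```

Note the last line: unless you override the arithmetic methods, results of operations fall back to the base int type.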

Can I iterate over a class in Python?

13 votes

I have a class that keeps track of its instances in a class variable, something like this:

class Foo:
    by_id = {}

    def __init__(self, id):
        self.id = id
        self.by_id[id] = self

What I'd like to be able to do is iterate over the existing instances of the class. I can do this with:

for foo in Foo.by_id.values():
    foo.do_something()

but it would look neater like this:

for foo in Foo:
    foo.do_something()

is this possible? I tried defining a classmethod __iter__, but that didn't work.

If you want to iterate over the class, you have to define a metaclass which supports iteration.

x.py:

class it(type):
    def __iter__(self):
        # Wanna iterate over a class? Then ask that class for iterator.
        return self.classiter()

class Foo:
    __metaclass__ = it # We need that meta class...
    by_id = {} # Store the stuff here...

    def __init__(self, id): # new instance of class
        self.id = id # do we need that?
        self.by_id[id] = self # register instance

    @classmethod
    def classiter(cls): # iterate over class by giving all instances which have been instantiated
        return iter(cls.by_id.values())

if __name__ == '__main__':
    a = Foo(123)
    print list(Foo)
    del a
    print list(Foo)

As you can see at the end, deleting the name a has no effect on the object itself, because it remains referenced from the by_id dict. You can cope with that using weakrefs when you

import weakref

and then do

by_id = weakref.WeakValueDictionary()

This way, each value is kept only as long as there is a "strong" reference to it, such as a in this case. After del a, only weak references point to the object, so it can be garbage-collected.

Due to the warning concerning WeakValueDictionary()s, I suggest using the following:

[...]
    self.by_id[id] = weakref.ref(self)
[...]
@classmethod
def classiter(cls):
    # return all class instances which are still alive according to their weakref pointing to them
    return (i for i in (i() for i in cls.by_id.values()) if i is not None)

Looks a bit complicated, but makes sure that you get the objects and not a weakref object.
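
Putting the pieces together, here is a sketch of the weakref variant for Python 3, where the metaclass is given as a keyword in the class statement instead of via __metaclass__. The metaclass name IterableRegistry is made up for this example, and dead entries are merely skipped, not removed from by_id:

```python
import gc
import weakref


class IterableRegistry(type):
    def __iter__(cls):
        # Dereference each weakref and skip instances that no longer exist.
        return (obj for obj in (ref() for ref in list(cls.by_id.values()))
                if obj is not None)


class Foo(metaclass=IterableRegistry):    # Python 3 spelling of __metaclass__
    by_id = {}

    def __init__(self, id):
        self.id = id
        self.by_id[id] = weakref.ref(self)   # register without keeping it alive


a = Foo(123)
b = Foo(456)
ids_before = sorted(foo.id for foo in Foo)
del a
gc.collect()          # CPython usually frees the object immediately anyway
ids_after = sorted(foo.id for foo in Foo)
```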

Can Python generate a random number that excludes a set of numbers, without using recursion?

11 votes

I looked over docs.python.org, and I may have misunderstood, but I didn't see that there was a way to do this without calling a recursive function. What I'd like to do is generate a random value which excludes values in the middle.

In other words, let's imagine I wanted X to be a random number that's not in range(a - b, a + b). Can I do this on the first pass, or do I have to keep generating a number, checking whether it is in range(), and repeating?

As for why I don't wish to write a recursive function, (a) it 'feels like' I should not have to (b) the set of numbers I'm doing this for could actually end up being quite large, and I hear stack overflows are bad, and I might just be being overly cautious in doing this but like I said (c) see a. I'm sure that there's a nice, pythonic, non-recursive way to do it.

Use random.choice(). In this example, a is your lower bound, the range between b and c is skipped and d is your upper bound.

import random
numbers = range(a,b) + range(c,d)
r = random.choice(numbers)
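
The snippet above relies on Python 2, where range() returns a list; in Python 3, range objects cannot be concatenated with +. If the ranges are large, you can also avoid building the list at all by drawing a single index and mapping it onto the two allowed stretches. A sketch for Python 3 (random_excluding is a hypothetical helper name, and it assumes a <= b <= c <= d):

```python
import random


def random_excluding(a, b, c, d):
    """Pick uniformly from [a, b) or [c, d), skipping [b, c)."""
    low_count = b - a
    total = low_count + (d - c)
    if total <= 0:
        raise ValueError("no values to choose from")
    index = random.randrange(total)        # a single draw, no retry loop
    if index < low_count:
        return a + index                   # falls in the lower stretch
    return c + (index - low_count)         # falls in the upper stretch


samples = [random_excluding(0, 40, 60, 100) for _ in range(1000)]
```

This uses constant memory regardless of how wide the ranges are, and it never needs to retry.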

Converting a string representation of a list into an actual list object

11 votes

I have a string that looks identical to a list, let's say:

fruits = "['apple', 'orange', 'banana']"

What would be the way to convert that to a list object?

>>> fruits = "['apple', 'orange', 'banana']"
>>> import ast
>>> fruits = ast.literal_eval(fruits)
>>> fruits
['apple', 'orange', 'banana']
>>> fruits[1]
'orange'

As pointed out in the comments, ast.literal_eval is safe. From the docs:

Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.

This can be used for safely evaluating strings containing Python expressions from untrusted sources without the need to parse the values oneself.
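
To illustrate the difference from eval(), here is a small sketch for Python 3 that wraps ast.literal_eval and shows that a string containing a function call is rejected with ValueError. The helper name parse_list is made up for this example:

```python
import ast


def parse_list(text):
    """Parse a string holding a Python list literal; reject anything else."""
    value = ast.literal_eval(text)
    if not isinstance(value, list):
        raise ValueError("expected a list literal")
    return value


fruits = parse_list("['apple', 'orange', 'banana']")

try:
    parse_list("__import__('os').getcwd()")
except ValueError:
    rejected = True       # a function call is not a literal
else:
    rejected = False
```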

Django-Pinax : How do you use a pinax app apart from what you get with a pinax base project?

9 votes

I am trying to understand Pinax and plan to use it in my next project.

I have started with a pinax basic project, and now I have something to go with runserver.

Now, I understand that I can customize the initial setup that I got from pinax and customize the profiles, themes, etc as per my requirements.

But is that all that pinax provides ?

I am very confused here. For example, I want to use the pinax phileo app in my project, so how does pinax help me do that?

My Effort :

  • I searched and found that I have to install it with pip install phileo
  • Then, add it to INSTALLED_APPS and use it as required.

But what did pinax do in this ?

Pinax has phileo featured on its website, but why ? Since I could have used it just like any other app on my non-pinax django project.

So, my question in a nutshell is :

What does pinax provide after a base project and default templates that come with pinax ?

Right, now it feels like pinax just provides a base project with some apps already working with some default templates. [ That's it ? ]

Then, what about other apps featured on pinax's website that do not come with base projects ?

Please, help clear up the confusion !

Update My question is somewhat - What is the significance of pinax-ecosystem when we already have them listed somewhere like djangopackages.com ?

Pinax is just django with a blend of other django plugins. You have to enable them and set them up individually. To use each individual app within pinax, you have to read that specific app's documentation and set it up appropriately (list of apps and repos which likely contain documentation here: http://pinaxproject.com/ecosystem/)

Some people like pinax, but I find that it's more of a hassle than a solution. In the end, pinax doesn't work out of the box. You have to customize everything, but at the same time you lock yourself into using a bundle you don't need. I suggest instead starting a project and installing the packages you need individually, and even finding more here: http://djangopackages.com/. This applies especially to a big project, because if you bundle and set up everything on your own, you will know the ins and outs of it all.

How do I make Python's negative lookbehind less greedy?

7 votes

I've read all related posts and scoured the internet but this is really beating me.

I have some text containing a date.
I would like to capture the date, but not if it's preceded by a certain phrase.

A straightforward solution is to add a negative lookbehind to my RegEx.

Here are some examples (using findall).
I only want to capture the date if it isn't preceded by the phrase "as of".

19-2-11
something something 15-4-11
such and such as of 29-5-11

Here is my regular expression:

(?<!as of )(\d{1,2}-\d{1,2}-\d{2})

Expected results:

['19-2-11']
['15-4-11']
[]

Actual results:

['19-2-11']
['15-4-11']
['9-5-11']

Notice that's 9 not 29. If I change \d{1,2} to something solid like \d{2} on the first pattern:

bad regex for testing: (?<!as of )(\d{2}-\d{1,2}-\d{2})

Then I get my expected results. Of course this is no good because I'd like to match 2-digit days as well as single-digit days.

Apparently my negative lookbehind is quite greedy, more so than my date capture, so it's stealing a digit from it and failing. I've tried every means of correcting the greed I can think of, but I just don't know how to fix this.

I'd like my date capture to match with the utmost greed, and then my negative lookbehind be applied. Is this possible? My problem seemed like a good use of negative lookbehinds and not overly complicated. I'm sure I could accomplish it another way if I must but I'd like to learn how to do this.

How do I make Python's negative lookbehind less greedy?

The cause is not a greedy lookbehind. This happens because the regex engine tries to match the pattern at every position it can.

It advances through the phrase such and such as of 29-5-11, successfully matching (?<!as of ) at each early position but failing to match \d{1,2}.

Then the engine finds itself at the position such and such as of !29-5-11 (marked with !). Here it fails to match (?<!as of ).

So it advances to the next position: such and such as of 2!9-5-11. There it successfully matches (?<!as of ) and then \d{1,2}.
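The position-by-position behaviour described here can be checked directly. The snippet below is a small sketch using the pattern and sample text from the question; the variable names are only illustrative.

```python
import re

# Pattern and sample text taken from the question.
pattern = re.compile(r'(?<!as of )(\d{1,2}-\d{1,2}-\d{2})')
text = 'such and such as of 29-5-11'

match = pattern.search(text)

# The lookbehind rejects the position right before "29", so the engine
# moves one character forward and matches there instead.
print(match.group(1))        # 9-5-11
print(match.start())         # 21
print(text[:match.start()])  # such and such as of 2
```

The match starts one character after the position the lookbehind rejected, which is why the leading 2 is lost.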

How to avoid it?

The general solution is to formulate the pattern as clearly as possible.

In this case I would require the date to be preceded by whitespace or by the beginning of the string.

(?<!as of)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})

The solution of Mark Byers is also very good.

I think it's very important to understand why the regex engine behaves this way and gives unwanted results.

By the way, the solution I gave above doesn't work if there are 2 or more spaces. It fails because the pattern matches starting at this position: such and such as of ! 29-5-11.

What can be done to avoid it?

Unfortunately, lookbehind in Python's regex engine doesn't support the quantifiers + or *.

I think the simplest solution is to make sure there is no whitespace immediately before (?:^|\s+). That way (?:^|\s+) has to consume all the spaces that follow the non-space text, and if that text is as of, the attempt fails and the engine restarts the search at the next position in the text.

re.search(r'(?<!as of)(?<!\s)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})','such and such as of  29-5-11')  # returns None: the date after "as of" is rejected
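As a quick check, the final pattern can be run over the three samples from the question plus the two-space variant. This is only a sketch; the names DATE, samples and results are illustrative.

```python
import re

# Final pattern from the answer: no "as of" and no whitespace may
# immediately precede the (start-of-string | whitespace run) + date.
DATE = re.compile(r'(?<!as of)(?<!\s)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})')

samples = [
    '19-2-11',
    'something something 15-4-11',
    'such and such as of 29-5-11',
    'such and such as of  29-5-11',  # two spaces after "as of"
]

results = [DATE.findall(s) for s in samples]
for sample, found in zip(samples, results):
    print(repr(sample), found)
```

The first two samples yield their dates, and both "as of" samples yield an empty list.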

Can the Django ORM store an unsigned 64-bit integer (aka ulong64 or uint64) in a reliably backend-agnostic manner?

7 votes

All the docs I've seen imply that you might be able to do that, but there isn't anything official w/r/t ulong64/uint64 fields. There are a few off-the-shelf options that look quite promising in this arena:

  • BigIntegerField ... almost, but signed,
  • PositiveIntegerField ... suspiciously 32-bit-looking,
  • DecimalField ... a fixed-point number represented with a python decimal type, according to the docs -- which presumably turns into an analogously pedantic and slow database field when socked away, à la the DECIMAL or NUMERIC PostgreSQL types.

... all of which look like they might store a number like that. Except NONE OF THEM WILL COMMIT, much like every single rom-com character portrayed by Hugh Grant.

My primary criterion is that it works with Django's supported backends, without any if postgresql (...) elif mysql (...) type of special-case nonsense. After that, there is the need for speed -- this is for a model field in a visual-database application that will index image-derived data (e.g. perceptual hashes and extracted keypoint features), allowing ordering and grouping by the content of those images.

So: is there a good Django extension or app that furnishes some kind of PositiveBigIntegerField that will suit my purposes?

And, barring that: If there is a simple and reliable way to use Django's stock ORM to store unsigned 64-bit ints, I'd like to know it. Look, I'm no binary whiz; I have to do two's complement on paper -- so if this method of yours involves some bit-shifting trickery, don't hesitate to explain what it is, even if it strikes you as obvious. Thanks in advance.

Although I did not test it, you may wish to just subclass BigIntegerField. The original BigIntegerField looks like this (source here):

class BigIntegerField(IntegerField):
    empty_strings_allowed = False
    description = _("Big (8 byte) integer")
    MAX_BIGINT = 9223372036854775807

    def get_internal_type(self):
        return "BigIntegerField"

    def formfield(self, **kwargs):
        defaults = {'min_value': -BigIntegerField.MAX_BIGINT - 1,
                    'max_value': BigIntegerField.MAX_BIGINT}
        defaults.update(kwargs)
        return super(BigIntegerField, self).formfield(**defaults)

A derived PositiveBigIntegerField might look like this:

class PositiveBigIntegerField(BigIntegerField):
    empty_strings_allowed = False
    description = _("Big (8 byte) positive integer")

    def db_type(self, connection):
        """
        Returns MySQL-specific column data type. Make additional checks
        to support other backends.
        """
        return 'bigint UNSIGNED'

    def formfield(self, **kwargs):
        # Unsigned 64-bit range: 0 .. 2**64 - 1 (MAX_BIGINT is 2**63 - 1).
        defaults = {'min_value': 0,
                    'max_value': BigIntegerField.MAX_BIGINT * 2 + 1}
        defaults.update(kwargs)
        return super(PositiveBigIntegerField, self).formfield(**defaults)

You should test it thoroughly before using it, though. If you do, please share the results :)

EDIT:

I missed one thing: the internal database representation. It is based on the value returned by get_internal_type(); the definition of the column type is stored e.g. here in the case of the MySQL backend, and determined here. It looks like overriding db_type() will give you control over how the field is represented in the database. However, you will need to find a way to return a DBMS-specific value in db_type() by checking the connection argument.
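A rough sketch of what "checking the connection argument" could look like follows. It is kept free of Django imports so it runs on its own. The helper name unsigned_bigint_db_type and the lookup table are hypothetical, and only the MySQL entry comes from the answer. The PostgreSQL entry is an assumption (PostgreSQL has no native unsigned 64-bit type, so an exact 20-digit numeric is used instead), and other backends are deliberately left unsupported.

```python
# Hypothetical lookup table: backend vendor name -> column type able to
# hold 0 .. 2**64 - 1. Only the MySQL entry is taken from the answer.
UNSIGNED_BIGINT_TYPES = {
    'mysql': 'bigint UNSIGNED',
    'postgresql': 'numeric(20, 0)',  # assumption: exact decimal, slower than bigint
}


def unsigned_bigint_db_type(vendor):
    """Return a column type for the given vendor name, or fail loudly."""
    try:
        return UNSIGNED_BIGINT_TYPES[vendor]
    except KeyError:
        raise NotImplementedError(
            'no unsigned 64-bit column type known for %r' % (vendor,))


print(unsigned_bigint_db_type('mysql'))
print(unsigned_bigint_db_type('postgresql'))
```

Inside the field method that returns the column type, this could be called with connection.vendor, the short backend name that Django's database wrappers expose. Raising for unknown backends keeps an unsupported database from silently getting a column that cannot hold the full range.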

Unicode, regular expressions and PyPy

6 votes

I wrote a program to add (limited) unicode support to Python regexes, and while it's working fine on CPython 2.5.2 it's not working on PyPy (1.8.0, implementing Python 2.7.2; originally 1.5.0-alpha0, implementing Python 2.7.1), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my uses of u" and ur". The full source is here, and the relevant bits are:

# -*- coding:utf-8 -*-
import re

# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
    ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
    ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
    ...
    ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
    ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
    ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}

def hack_regexp(regexp_string):
    for (k,v) in unicode_categories.items():
        regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
    return regexp_string

def regex(regexp_string,flags=0):
    """Shortcut for re.compile that also translates and adds the UNICODE flag

    Example usage:
        >>> from unicode_hack import regex
        >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
        >>> print result.group(0)
        áÇñ
        >>> 
    """
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(on PyPy there is no match in the "Example usage", so result is None)
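For anyone trying this outside Python 2: ur'' literals are a syntax error in Python 3, so the substitution technique above can be sketched with plain raw strings instead. This is a minimal sketch, not the author's full module. It assumes Python 3 and keeps only the two category tables (Pi and Pf) that appear untruncated above.

```python
import re

# Only the two categories shown in full in the post; the real table is larger.
unicode_categories = {
    'Pi': '[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    'Pf': '[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
}


def hack_regexp(regexp_string):
    """Expand each \\p{Xx} token into the matching character class."""
    for key, char_class in unicode_categories.items():
        regexp_string = regexp_string.replace(r'\p{%s}' % key, char_class)
    return regexp_string


def regex(regexp_string, flags=0):
    """re.compile wrapper that expands \\p{Xx} first and adds re.UNICODE."""
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)


# An opening quote (Pi), some word characters, then a closing quote (Pf).
result = regex(r'^\p{Pi}\w+\p{Pf}$').match('\u00ababc\u00bb')
print(result is not None)  # True
```

Writing the test string with escapes rather than literal characters keeps the example independent of the source file and console encodings.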

Reiterating, the program works fine (on CPython): the Unicode data seems correct, the replace works as intended, the usage example runs ok (both via doctest and directly typing it in the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.

Any ideas of what PyPy does "different" that is breaking my code? Many things came to mind (unrecognized coding header, different encodings in the command line, different interpretations of r and u) but as far as my tests go, both CPython and PyPy seem to behave identically, so I'm clueless about what to try next.

Seems PyPy has some encoding problems, both when reading a source file (unrecognized coding header, maybe) and when inputting/outputting in the command line. I replaced my example code with the following:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0) == u'áÇñ'
True
>>>

And it kept working on CPython and failing on PyPy. Replacing "áÇñ" with its escaped characters - u'\xe1\xc7\xf1' - OTOH did the trick:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'\xe1\xc7\xf1123')
>>> print result.group(0) == u'\xe1\xc7\xf1'
True
>>>

That worked fine on both. I believe the problem is restricted to these two scenarios (source loading and command line), since trying to open a UTF-8 file using codecs.open works fine. When I try to input the string "áÇñ" in the command line, or when I load the source code of "unicode_hack.py" using codecs, I get the same result on CPython:

>>> u'áÇñ'
u'\xe1\xc7\xf1'
>>> import codecs
>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

but different results on PyPy:

>>>> u'áÇñ'
u'\xa0\u20ac\xa4'
>>>> import codecs
>>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'
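One way to reproduce the PyPy console result by hand is sketched below. It rests on an assumption the post does not confirm: that the Windows console handed over bytes in OEM code page 850 and that they were then decoded as Windows-1252. Under that assumption the three characters come out exactly as in the PyPy session.

```python
# The intended text, written with escapes so the file encoding is irrelevant.
text = u'\xe1\xc7\xf1'  # áÇñ

# Assumption: the console delivers the characters as code page 850 bytes...
console_bytes = text.encode('cp850')

# ...and they are then decoded with the wrong codec (Windows-1252).
misread = console_bytes.decode('cp1252')

print(console_bytes == b'\xa0\x80\xa4')  # True
print(misread == u'\xa0\u20ac\xa4')      # True
```

If this is what happens, it would point to a console-encoding mismatch rather than a problem in the regex code itself.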

Update: Issue1139 submitted to the PyPy bug tracking system, let's see how that turns out...

How does pgBouncer help to speed up Django

5 votes

I have some management commands that are based on gevent. Since my management command makes thousands of requests, I can turn all socket calls into non-blocking calls using gevent. This really speeds up my application as I can make requests simultaneously.

Currently the bottleneck in my application seems to be Postgres. It seems that this is because the Psycopg library that Django uses to connect to Postgres is written in C and does not support asynchronous connections.

I've also read that using pgBouncer can speed up Postgres by 2X. This sounds great, but it would be great if someone could explain how pgBouncer works and how it helps.

Thanks

PgBouncer reduces the latency in establishing connections by serving as a proxy which maintains a connection pool. This may help speed up your application if you're opening many short-lived connections to Postgres. If you only have a small number of connections, you won't see much of a win.
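The effect of pooling can be illustrated with a toy model. This is not PgBouncer's implementation and does not touch a real database: FakeConnection, Pool and the two run_* functions are hypothetical names, and the only thing measured is how many times a connection is opened.

```python
class FakeConnection(object):
    """Stand-in for a database connection; counts how many get opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1


class Pool(object):
    """Opens a fixed number of connections once and hands them out for reuse."""

    def __init__(self, size):
        self._idle = [FakeConnection() for _ in range(size)]

    def acquire(self):
        return self._idle.pop()

    def release(self, conn):
        self._idle.append(conn)


def run_without_pool(requests):
    """Open a fresh connection for every request."""
    FakeConnection.opened = 0
    for _ in range(requests):
        FakeConnection()
    return FakeConnection.opened


def run_with_pool(requests, size=2):
    """Serve every request from a small pool of already-open connections."""
    FakeConnection.opened = 0
    pool = Pool(size)
    for _ in range(requests):
        conn = pool.acquire()
        pool.release(conn)
    return FakeConnection.opened


print(run_without_pool(1000))  # 1000
print(run_with_pool(1000))     # 2
```

With many short-lived requests the pooled version opens far fewer connections, which is where the saving comes from. With only a handful of long-lived connections the two counts are close, matching the caveat in the answer.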