Best python questions in February 2012

Why does (1 in [1,0] == True) evaluate to False?

81 votes

When I was looking at answers to this question, I found I didn't understand my own answer.

I don't really understand how this is being parsed. Why does the second example return False?

>>> 1 in [1,0]             #This is expected
True
>>> 1 in [1,0] == True     #This is strange
False
>>> (1 in [1,0]) == True   #This is what I wanted it to be
True
>>> 1 in ([1,0] == True)   #But its not just a precedence issue!
                           #It did not raise an exception on the second example.

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    1 in ([1,0] == True)
TypeError: argument of type 'bool' is not iterable

Thanks for any help. I think I must be missing something really obvious.

Python actually applies comparison operator chaining here. The expression is translated to

(1 in [1, 0]) and ([1, 0] == True)

which is obviously False.

This also happens for expressions like

a < b < c

which translate to

(a < b) and (b < c)

(without evaluating b twice).

See the Python language documentation for further details.

Why is reading lines from stdin much slower in C++ than Python?

55 votes

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I'm not an expert Pythonista, please tell me if I'm doing something wrong or if I'm misunderstanding something. Yes, I tried compiling without the optimization flag with identical results. (Also, why do I get an extra line in the C++ code? Like I said, rusty.)

C++ code:

#include <iostream>
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;                                                                   

    while(!cin.eof()) {
        getline(cin, input_line);
        line_count++;
    };
    sec = (int) time(NULL) - start;
    cerr << "Saw " << line_count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = line_count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//Compiled with:
//g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python
import time
import sys

count = 0
start = time.time()

for line in  sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
if delta_sec >= 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0:n} lines in {1:n} seconds. LPS: {2:n}".format(count, delta_sec,
       lines_per_sec))

Here are my results:

$ cat test_lines | ./readline_test_cpp 
Saw 5570001 lines in 9 seconds.  Crunch speed: 618889

$cat test_lines | ./readline_test.py 
Read 5570000 lines in 1 seconds. LPS: 5570000

Thanks in advance!

Edit: I should note that I tried this both under OS-X (10.6.8) and Linux 2.6.32 (RHEL 6.2). The former is a macbook pro, the latter is a very beefy server, not that this is too pertinent.

Edit2: For the skeptics. (Yes, the comma-separated count output in Python is thanks to a locale line not shown above, but which doesn't impact speed, I swear :-) ...)

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:Saw 5570001 lines in 9 seconds.  Crunch speed: 618889
Python:Read 5,570,000 lines in 1 seconds. LPS: 5,570,000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:Saw 5570001 lines in 9 seconds.  Crunch speed: 618889
Python:Read 5,570,000 lines in 1 seconds. LPS: 5,570,000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:Saw 5570001 lines in 9 seconds.  Crunch speed: 618889
Python:Read 5,570,000 lines in 1 seconds. LPS: 5,570,000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:Saw 5570001 lines in 9 seconds.  Crunch speed: 618889
Python:Read 5,570,000 lines in 1 seconds. LPS: 5,570,000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:Saw 5570001 lines in 10 seconds.  Crunch speed: 557000
Python:Read 5,570,000 lines in 1 seconds. LPS: 5,570,000

Edit 3 (Solution?):

Okay, I tried J.N.'s suggestion of trying having python store the line read: but it made no difference to python's speed.

I also tried J.N.'s suggestion of using scanf into a char array instead of getline into a std::string. Bingo! This resulted in equivalent performance for both python and c++. (3,333,333 LPS with my input data, which by the way are just short lines of three fields each usually about 20 chars wide, though sometimes more).

Code:

char input_a[512];
char input_b[32];
char input_c[512];
while(scanf("%s %s %s\n", input_a, input_b, input_c) != EOF) {             
    line_count++;
};

Speed:

$ cat test_lines | ./readline_test_cpp2 
Saw 10000000 lines in 3 seconds.  Crunch speed: 3333333
$ cat test_lines | ./readline_test2.py 
Read 10000000 lines in 3 seconds. LPS: 3333333

(Yes, I ran it several times.) So, I guess I will now use scanf instead of getline. But, I'm still curious if people think this performance hit from std::string/getline is typical and reasonable.

Edit 4 (was: Final Edit / Solution):

Adding: cin.sync_with_stdio(false);

Immediately above my original while loop above results in code that runs faster than Python.

New performance comparison (this is on my 2011 Macbook Pro), using the original code, the original with the sync disabled, and the original python, respectively, on a file with 20M lines of text. Yes, I ran it several times to eliminate disk caching confound.

$ /usr/bin/time cat test_lines_double | ./readline_test_cpp
       33.30 real         0.04 user         0.74 sys
Saw 20000001 lines in 33 seconds.  Crunch speed: 606060
$ /usr/bin/time cat test_lines_double | ./readline_test_cpp1b
        3.79 real         0.01 user         0.50 sys
Saw 20000000 lines in 4 seconds.  Crunch speed: 5000000
$ /usr/bin/time cat test_lines_double | ./readline_test.py 
        6.88 real         0.01 user         0.38 sys
Read 20000000 lines in 6 seconds. LPS: 3333333

Thanks to @Vaughn Cato for his answer! Any elaboration people can make or good references people can point to as to why this sync happens, what it means, when it's useful, and when it's okay to disable would be greatly appreciated by posterity. :-)

Edit 5 / Better Solution:

As suggested by Gandalf The Gray below, gets is even faster than scanf or the unsynchronized cin approach. I also learned that scanf and gets are both UNSAFE and should NOT BE USED due to potential of buffer overflow. So, I wrote this iteration using fgets, the safer alternative to gets. Here are the pertinent lines for my fellow noobs:

char input_line[MAX_LINE];
char *result;

//<snip>

while((result = fgets(input_line, MAX_LINE, stdin )) != NULL)    
    line_count++;
if (ferror(stdin))
    perror("Error reading stdin.");

Now, here are the results using an even larger file (100M lines; ~3.4GB) on a fast server with very fast disk, comparing the python, the unsynced cin, and the fgets approaches, as well as comparing with the wc utility. [The scanf version segfaulted and I don't feel like troubleshooting it.]:

$ /usr/bin/time cat temp_big_file | readline_test.py 
0.03user 2.04system 0:28.06elapsed 7%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Read 100000000 lines in 28 seconds. LPS: 3571428

$ /usr/bin/time cat temp_big_file | readline_test_unsync_cin 
0.03user 1.64system 0:08.10elapsed 20%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
Saw 100000000 lines in 8 seconds.  Crunch speed: 12500000

$ /usr/bin/time cat temp_big_file | readline_test_fgets 
0.00user 0.93system 0:07.01elapsed 13%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Saw 100000000 lines in 7 seconds.  Crunch speed: 14285714

$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU (0avgtext+0avgdata 2464maxresident)k
0inputs+0outputs (0major+182minor)pagefaults 0swaps
100000000

Recap (lines per second):
python:         3,571,428 
cin (no sync): 12,500,000
fgets:         14,285,714
wc:            54,644,808

As you can see, fgets is better but still pretty far from wc performance; I'm pretty sure this is due to the fact that wc examines each character without any memory copying. I suspect that, at this point, other parts of the code will become the bottleneck, so I don't think optimizing to that level would even be worthwhile, even if possible (since, after all, I actually need to store the read lines in memory).

Also note that a small tradeoff with using a char * buffer and fgets vs unsynced cin to string is that the latter can read lines of any length, while the former requires limiting input to some finite number. In practice, this is probably a non-issue for reading most line-based input files, as the buffer can be set to a very large value that would not be exceeded by valid input.

This has been educational. Thanks to all for your comments and suggestions.

Edit 6:

As suggested by J.F. Sebastian in the comments below, the GNU wc utility uses plain C read() (within the safe-read.c wrapper) to read chunks (of 16k bytes) at a time and count new lines. Here's a python equivalent based on J.F.'s code (just showing the relevant snippet that replaces the python for loop:

BUFFER_SIZE = 16384 
count = sum(chunk.count('\n') for chunk in iter(partial(sys.stdin.read, BUFFER_SIZE), ''))

The performance of this version is quite fast (though still a bit slower than the raw c wc utility, of course:

$ /usr/bin/time cat temp_big_file | readline_test3.py 
0.01user 1.16system 0:04.74elapsed 24%CPU (0avgtext+0avgdata 2448maxresident)k
0inputs+0outputs (0major+181minor)pagefaults 0swaps
Read 100000000 lines in 4.7275 seconds. LPS: 21152829

Again, it's a bit silly for me to compare C++ fgets/cin and the first python code on the one hand to wc -l and this last python snippet on the other, as the latter two don't actually store the read lines but merely count newlines. Still, it's interesting to explore all the different implementations and think about the performance implications. Thanks again!

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

cin.sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.

Is there a way to write this nicer?

28 votes

I need to write these four IFs in Python. Notice what it does is changing between 4 possible states in a loop 1,0 -> 0,1 -> -1,0 -> 0,-1 -> back to first.

if [dx, dy] == [1,0]:
    dx, dy = 0, 1
if [dx, dy] == 0, 1:
    dx, dy = -1, 0
if [dx, dy] == [-1, 0]
    dx, dy = 0, -1
if [dx, dy] == [0, -1]:
    dx, dy = 1, 0

Can anyone suggest me a better/nicer way to write this?

Thanks.

dx, dy = -dy, dx

When in doubt, apply maths. ;)

Why doesn't Python evaluate constant number arithmetic before compiling to bytecode?

26 votes

In the following code, why doesn't Python compile f2 to the same bytecode as f1?

Is there a reason not to?

>>> def f1(x):
    x*100

>>> dis.dis(f1)
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (100)
              6 BINARY_MULTIPLY
              7 POP_TOP
              8 LOAD_CONST               0 (None)
             11 RETURN_VALUE
>>> def f2(x):
        x*10*10

>>> dis.dis(f2)
  2           0 LOAD_FAST                0 (x)
              3 LOAD_CONST               1 (10)
              6 BINARY_MULTIPLY
              7 LOAD_CONST               1 (10)
             10 BINARY_MULTIPLY
             11 POP_TOP
             12 LOAD_CONST               0 (None)
             15 RETURN_VALUE

This is because x could have a __mul__ method with side-effects. x * 10 * 10 calls __mul__ twice, while x * 100 only calls it once:

>>> class Foo(object):
...     def __init__ (self):
...             self.val = 5
...     def __mul__ (self, other):
...             print "Called __mul__: %s" % (other)
...             self.val = self.val * other
...             return self
... 
>>> a = Foo()
>>> a * 10 * 10
Called __mul__: 10
Called __mul__: 10
<__main__.Foo object at 0x1017c4990>

Automatically folding the constants and only calling __mul__ once could change behavior.

You can get the optimization you want by reordering the operation such that the constants are multiplied first (or, as mentioned in the comments, using parentheses to group them such that they are merely operated on together, regardless of position), thus making explicit your desire for the folding to happen:

>>> def f1(x):
...     return 10 * 10 * x
... 
>>> dis.dis(f1)
  2           0 LOAD_CONST               2 (100)
              3 LOAD_FAST                0 (x)
              6 BINARY_MULTIPLY     
              7 RETURN_VALUE 

Why is splitting a string slower in C++ than Python?

17 votes

I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more like yesterday's lesson).

Python Code:

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C++ Code:

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find first non-delimiter
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next non-delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple tokens as well as handle numerous tokens (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimeter character (that one was posted by several StackOverflow users in answers to string splitting questions).

I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"

Results:

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?

Edit 1 / Partial Solution?:

I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

The performance of python is now about the same as the split1 C++ implementation.

/usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)

Thanks to all for your help.

Final Edit/Solution:

Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.

One of the lessons here from this issue and yesterday's stdin line reading issue (linked above) are that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.

Thanks again to all for your suggestions!

As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++ std::string is a mutable value type, and is copied at the smallest opportunity.

If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).

The C++ std::string class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

class StringRef
{
private:
    char const*     begin_;
    int             size_;

public:
    int size() const { return size_; }
    char const* begin() const { return begin_; }
    char const* end() const { return begin_ + size_; }

    StringRef( char const* const begin, int const size )
        : begin_( begin )
        , size_( size )
    {}
};

vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
    vector<StringRef>   result;

    enum State { inSpace, inToken };

    State state = inSpace;
    char const*     pTokenBegin = 0;    // Init to satisfy compiler.
    for( auto it = str.begin(); it != str.end(); ++it )
    {
        State const newState = (*it == delimiter? inSpace : inToken);
        if( newState != state )
        {
            switch( newState )
            {
            case inSpace:
                result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                break;
            case inToken:
                pTokenBegin = &*it;
            }
        }
        state = newState;
    }
    if( state == inToken )
    {
        result.push_back( StringRef( pTokenBegin, &*str.end() - pTokenBegin ) );
    }
    return result;
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        //spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        //split2(spline, input_line);

        vector<StringRef> const v = split3( input_line );
        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.

Regexp finding longest common prefix of two strings

16 votes

Is there a regexp which would find longest common prefix of two strings? And if this is not solvable by one regexp, what would be the most elegant piece of code ore oneliner using regexp (perl, ruby, python, anything).

PS: I can do this easily programatically, I am asking rather for curiosity, because it seems to me that this could be solveable by regexp.

PPS: Extra bonus for O(n) solution using regexps. Come on, it should exist!

If there's some character that neither string contains —, say, \0 — you could write

"$first\0$second" =~ m/^(.*).*\0\1/s;

and the longest common prefix would be saved as $1.


Edited to add: This is obviously very inefficient. I think that if efficiency is a concern, then this simply isn't the approach we should be using; but we can at least improve it by changing .* to [^\0]* to prevent useless greediness that will just have to be backtracked again, and wrapping the second [^\0]* in (?>…) to prevent backtracking that can't help. This:

"$first\0$second" =~ m/^([^\0]*)(?>[^\0]*)\0\1/s;

This will yield the same result, but much more efficiently. (But still not nearly as efficiently as a straightforward non–regex-based approach. If the strings both have length n, I'd expect its worst case to take at least O(n2) time, whereas the straightforward non–regex-based approach would take O(n) time in its worst case.)

Python: simple list merging based on intersections

15 votes

Consider there are some lists of integers as:

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

The question is to merge lists having at least one common element. So the results only for the given part will be as follows:

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

What is the most efficient way to do this on large data (elements are just numbers)? Is tree structure something to think about? I do the job now by converting lists to sets and iterating for intersections, but it is slow! Furthermore I have a feeling that is so-elementary! In addition, the implementation lacks something (unknown) because some lists remain unmerged sometime! Having said that, if you were proposing self-implementation please be generous and provide a simple sample code [apparently Python is my favoriate :)] or pesudo-code.
Update 1: Here is the code I was using:

#--------------------------------------
lsts = [[0,1,3],
        [1,0,3,4,5,10,11],
        [2,8],
        [3,1,0,16]];
#--------------------------------------

The function is (buggy!!):

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i+1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
            else: j += 1                        #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------

The result is:

#--------------------------------------
>>> merge(lsts)
>>> [0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------

Update 2: To my experience the code given by Niklas Baumstark below showed to be a bit faster for the simple cases. Not tested the method given by "Hooked" yet, since it is completely different approach (by the way it seems interesting). The testing procedure for all of these could be really hard or impossible to be ensured of the results. The real data set I will use is so large and complex, so it is impossible to trace any error just by repeating. That is I need to be 100% satisfied of the reliability of the method before pushing it in its place within a large code as a module. Simply for now Niklas's method is faster and the answer for simple sets is correct of course.
However how can I be sure that it works well for real large data set? Since I will not be able to trace the errors visually!

Update 3: Note that reliability of the method is much more important than speed for this problem. I will be hopefully able to translate the Python code to Fortran for the maximum performance finally.

Update 4:
There are many interesting points in this post and generously given answers, constructive comments. I would recommend reading all thoroughly. Please accept my appreciation for the development of the question, amazing answers and constructive comments and discussion.

My attempt:

def merge(lsts):
  sets = [set(lst) for lst in lsts if lst]
  merged = 1
  while merged:
    merged = 0
    results = []
    while sets:
      common, rest = sets[0], sets[1:]
      sets = []
      for x in rest:
        if x.isdisjoint(common):
          sets.append(x)
        else:
          merged = 1
          common |= x
      results.append(common)
    sets = results
  return sets

lst = [[65, 17, 5, 30, 79, 56, 48, 62],
       [6, 97, 32, 93, 55, 14, 70, 32],
       [75, 37, 83, 34, 9, 19, 14, 64],
       [43, 71],
       [],
       [89, 49, 1, 30, 28, 3, 63],
       [35, 21, 68, 94, 57, 94, 9, 3],
       [16],
       [29, 9, 97, 43],
       [17, 63, 24]]
print merge(lst)

Benchmark:

import random

# adapt parameters to your own usage scenario   
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5

if False:  # change to true to generate the test data file (takes a while)
  with open("/tmp/test.txt", "w") as f:
    lists = []
    classes = [range(class_size*i, class_size*(i+1)) for i in range(class_count)]
    for c in classes:
      # distribute each class across ~300 lists
      for i in xrange(list_count_per_class):
        lst = []
        if random.random() < large_list_probability:
          size = random.choice(large_list_sizes)
        else:
          size = random.choice(small_list_sizes)
        nums = set(c)
        for j in xrange(size):
          x = random.choice(list(nums))
          lst.append(x)
          nums.remove(x)
        random.shuffle(lst)
        lists.append(lst)
    random.shuffle(lists)
    for lst in lists:
      f.write(" ".join(str(x) for x in lst) + "\n")

setup = """
# Niklas'
def merge_niklas(lsts):
  sets = [set(lst) for lst in lsts if lst]
  merged = 1
  while merged:
    merged = 0
    results = []
    while sets:
      common, rest = sets[0], sets[1:]
      sets = []
      for x in rest:
        if x.isdisjoint(common):
          sets.append(x)
        else:
          merged = 1
          common |= x
      results.append(common)
    sets = results
  return sets

# Rik's
def merge_rik(data):
  sets = (set(e) for e in data if e)
  results = [next(sets)]
  for e_set in sets:
    to_update = []
    for i,res in enumerate(results):
      if not e_set.isdisjoint(res):
        to_update.insert(0,i)

    if not to_update:
      results.append(e_set)
    else:
      last = results[to_update.pop(-1)]
      for i in to_update:
        last |= results[i]
        del results[i]
      last |= e_set
  return results

# katrielalex's
def pairs(lst):
  i = iter(lst)
  first = prev = item = i.next()
  for item in i:
    yield prev, item
    prev = item
  yield item, first

import networkx
def merge_katrielalex(lsts):
  g = networkx.Graph()
  for lst in lsts:
    for edge in pairs(lst):
      g.add_edge(*edge)
  return networkx.connected_components(g)

# agf's (optimized)
from collections import deque
def merge_agf_optimized(lists):
  sets = deque(set(lst) for lst in lists if lst)
  results = []
  disjoint = 0
  current = sets.pop()
  while True:
    merged = False
    newsets = deque()
    for _ in xrange(disjoint, len(sets)):
      this = sets.pop()
      if not current.isdisjoint(this):
        current.update(this)
        merged = True
        disjoint = 0
      else:
        newsets.append(this)
        disjoint += 1
    if sets:
      newsets.extendleft(sets)
    if not merged:
      results.append(current)
      try:
        current = newsets.pop()
      except IndexError:
        break
      disjoint = 0
    sets = newsets
  return results

# agf's (simple)
def merge_agf_simple(lists):
  newsets, sets = [set(lst) for lst in lists if lst], []
  while len(sets) != len(newsets):
    sets, newsets = newsets, []
    for aset in sets:
      for eachset in newsets:
        if not aset.isdisjoint(eachset):
          eachset.update(aset)
          break
      else:
        newsets.append(aset)
  return newsets

# alexis'
def merge_alexis(data):
  bins = range(len(data))  # Initialize each bin[n] == n
  nums = dict()

  data = [set(m) for m in data ]  # Convert to sets
  for r, row in enumerate(data):
    for num in row:
      if num not in nums:
        # New number: tag it with a pointer to this row's bin
        nums[num] = r
        continue
      else:
        dest = locatebin(bins, nums[num])
        if dest == r:
          continue # already in the same bin

        if dest > r:
          dest, r = r, dest   # always merge into the smallest bin

        data[dest].update(data[r])
        data[r] = None
        # Update our indices to reflect the move
        bins[r] = dest
        r = dest

  # Filter out the empty bins
  have = [ m for m in data if m ]
  return have


def locatebin(bins, n):
  while bins[n] != n:
    n = bins[n]
  return n

lsts = []
size = 0
num = 0
max = 0
for line in open("/tmp/test.txt", "r"):
  lst = [int(x) for x in line.split()]
  size += len(lst)
  if len(lst) > max: max = len(lst)
  num += 1
  lsts.append(lst)
"""

setup += """
print "%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size/num, max)
""".format(class_count=class_count)

import timeit
print "niklas"
print timeit.timeit("merge_niklas(lsts)", setup=setup, number=3)
print "rik"
print timeit.timeit("merge_rik(lsts)", setup=setup, number=3)
print "katrielalex"
print timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3)
print "agf (1)"
print timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3)
print "agf (2)"
print timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3)
print "alexis"
print timeit.timeit("merge_alexis(lsts)", setup=setup, number=3)

These timings are obviously dependent on the specific parameters to the benchmark, like number of classes, number of lists, list size, etc. Adapt those parameters to your need to get more helpful results.

Below are some example outputs on my machine for different parameters. They show that all the algorithms have their strength and weaknesses, depending on the kind of input they get:

=====================
# many disjoint classes, large lists
class_count = 50
class_size = 1000
list_count_per_class = 100
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
=====================

niklas
5000 lists, 50 equally distributed classes, average size 298, max size 999
4.80084705353
rik
5000 lists, 50 equally distributed classes, average size 298, max size 999
9.49251699448
katrielalex
5000 lists, 50 equally distributed classes, average size 298, max size 999
21.5317108631
agf (1)
5000 lists, 50 equally distributed classes, average size 298, max size 999
8.61671280861
agf (2)
5000 lists, 50 equally distributed classes, average size 298, max size 999
5.18117713928
=> alexis
=> 5000 lists, 50 equally distributed classes, average size 298, max size 999
=> 3.73504281044

===================
# less number of classes, large lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.5
===================

niklas
4500 lists, 15 equally distributed classes, average size 296, max size 999
1.79993700981
rik
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.58237695694
katrielalex
4500 lists, 15 equally distributed classes, average size 296, max size 999
19.5465381145
agf (1)
4500 lists, 15 equally distributed classes, average size 296, max size 999
2.75445604324
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 296, max size 999
=> 1.77850699425
alexis
4500 lists, 15 equally distributed classes, average size 296, max size 999
3.23530197144

===================
# less number of classes, smaller lists
class_count = 15
class_size = 1000
list_count_per_class = 300
large_list_sizes = list(range(100, 1000))
small_list_sizes = list(range(0, 100))
large_list_probability = 0.1
===================

niklas
4500 lists, 15 equally distributed classes, average size 95, max size 997
0.773697137833
rik
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.0523750782
katrielalex
4500 lists, 15 equally distributed classes, average size 95, max size 997
6.04466891289
agf (1)
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.20285701752
=> agf (2)
=> 4500 lists, 15 equally distributed classes, average size 95, max size 997
=> 0.714507102966
alexis
4500 lists, 15 equally distributed classes, average size 95, max size 997
1.1286110878

Why does Python have an __ne__ operator method instead of just __eq__?

15 votes

The answer here gives a handwaving reference to cases where you'd want __ne__ to return something other than just the logical inverse of __eq__, but I can't imagine any such case. Any examples?

SQLAlchemy is a great example. For the uninitiated, SQLAlchemy is a ORM and uses Python expression to generate SQL statements. In a expression such as

meta.Session.query(model.Theme).filter(model.Theme.id == model.Vote.post_id)

the model.Theme.id == model.VoteWarn.post_id does not return a boolean, but a object that eventually produces a SQL query like WHERE theme.id = vote.post_id. The inverse would produce something like WHERE theme.id <> vote.post_id so both methods need to be defined.

In python is there something like updated that is to update what sorted is to sort?

12 votes

In python if I do the following:

>>> list = [ 3, 2, 1]
>>> sorted_list = k.sort()

Then sorted_list is None and list is sorted:

>>> sorted_list = k.sort()
>>> print list, sorted_list
[1, 2, 3] None

However, if I do the following:

>>> list = [ 3, 2, 1]
>>> sorted_list = sorted(list)

Then list remains unsorted and sorted_list contains a copy of the sorted list:

>>> print list, sorted_list
[3, 2, 1] [1, 2, 3]

I am wondering if there is an equivalent for the update function for dictionaries.

That way I could do something like this:

def foo(a, b, extra={}):
    bar = { 'first': a, 'second': b }
    special_function(**updated(bar, extra))
    normal_function(**bar)

rather than having to do something like this:

def foo(a, b, extra={}):
    bar = { 'first': a, 'second': b }
    special_bar = bar.copy()
    special_bar.update(extra) # [1]
    special_function(**special_bar)
    normal_function(**bar)

[1] Yes I realize I could simply replace these two lines with extra.update(bar) but let's assume I want to retain extra as is for later on in the function.

I realize I could implement this myself thusly:

def updated(old_dict, extra={}):
    new_dict = old_dict.copy()
    new_dict.update(extra)
    return new_dict

Or the following highly unreadable in-place statement:

    special_function(**(dict(bar.items()+extra.items())))

But I was hoping there was something built in that I could already use.

You can simply use the built-in dict():

updated_dict = dict(old_dict, **extra_dict)

pexpect can't pass input over 1024 chars?

11 votes

I'm currently passing some input to a process with pexpect with the following code:

p = pexpect.spawn('cat', timeout=5.0 )
p.maxread = 5000
p.setecho(False) # prevent the process from echoing stdin back to us
INPUT_LEN = 1024
p.sendline('a'*INPUT_LEN)
print p.readline() # pexpect.TIMEOUT: Timeout exceeded in read_nonblocking().

When INPUT_LEN < 1024, everything works fine, but for >= 1024 characters, the process does not receive the full input, causing raising a "pexpect.TIMEOUT" error on p.readline().

I've tried splitting my input into pieces smaller than 1024 characters, but this has the same issue:

p = pexpect.spawn('cat', timeout=5.0 )
p.maxread = 5000
p.setecho(False)
INPUT_LEN = 1024
p.send('a'*1000)
p.sendline('a'*(INPUT_LEN-1000))
print p.readline() # pexpect.TIMEOUT: Timeout exceeded in read_nonblocking().

Does anyone know how to make pexpect work with inputs over 1024 characters? I tried looking at the source, but it just seems to be calling os.write(...).

(As a side note, I've noticed the same truncation error occurs when I run "cat" from a shell and attempt to paste in >=1024 characters with "Cmd+V". However, everything works fine if I run "pbpaste | cat" .)

Thanks!

Update: The call to "os.write()" returns 1025, indicating a successful write, but os.read() returns "\x07" (the single character BEL), and then hangs on the next call, resulting in the timeout.

Dividing the os.write() call into two sub-1024 byte write()s, separated by a call to os.fsync(), does not change anything.

Your problem seems to be MacOS related, take a look at MacOSX 10.6.7 cuts off stdin at 1024 chars.

It basically says that 1024 is your tty buffer limit.

I'm not an expert on Mac OS, but maybe others can give you more informations about this.

Getting confused with lambda and list comprehension

10 votes

Read a question on stack overflow sometime back with the following syntax

In [1]: [lambda: x for x in range(5)][0]()
Out[1]: 4
In [2]: [lambda: x for x in range(5)][2]()
Out[2]: 4

But i am having a hard time to understand why exactly the output of this comes as 4, my understanding is it always gives the last value of the list as output,

In [4]: [lambda: x for x in [1,5,7,3]][0]()
Out[4]: 3

but still not convinced how does this syntax ends up with the last value.

Would be very glad if i can get a proper explanation for this syntax

This isn't really about either list comprehensions or lambdas. It's about the scoping rules in Python. Let's rewrite the list comprehension into an equivalent loop:

funcs = []
for x in range(5):
    def f(): return x
    funcs.append(f)
funcs[0]() # returns 4

Here, we can see that we successively construct functions and store them in a list. When one of these functions is called, all that happens is the value of x is looked up and returned. But x is a variable whose value changes, so the final value of 4 is what is always returned. You could even change the value of x after the loop, e.g.,

x = 32 
funcs[2]() # returns 32

To get the behavior you expected, Python would need to scope the for contents as a block; it doesn't. Usually, this isn't a problem, and is easy enough to work around.

Python: how to build a dict from plain list of keys and values

10 votes

I have a list of values like:

["a", 1, "b", 2, "c", 3]

and I would like to build such a dict from it:

{"a": 1, "b": 2, "c": 3}

What is the natural way to do it in Python?

>>> x = ["a", 1, "b", 2, "c", 3]
>>> i = iter(x)
>>> dict(zip(i, i))
{'a': 1, 'c': 3, 'b': 2}

Convert RGBA PNG to RGB with PIL

9 votes

I'm using PIL to convert a transparent PNG image uploaded with Django to a JPG file. The output looks broken.

Source file

transparent source file

Code

Image.open(object.logo.path).save('/tmp/output.jpg', 'JPEG')

or

Image.open(object.logo.path).convert('RGB').save('/tmp/output.png')

Result

Both ways, the resulting image looks like this:

resulting file

Is there a way to fix this? I'd like to have white background where the transparent background used to be.


Solution

Thanks to the great answers, I've come up with the following function collection:

import Image
import numpy as np


def alpha_to_color(image, color=(255, 255, 255)):
    """Set all fully transparent pixels of an RGBA image to the specified color.
    This is a very simple solution that might leave over some ugly edges, due
    to semi-transparent areas. You should use alpha_composite_with color instead.

    Source: http://stackoverflow.com/a/9166671/284318

    Keyword Arguments:
    image -- PIL RGBA Image object
    color -- Tuple r, g, b (default 255, 255, 255)

    """ 
    x = np.array(image)
    r, g, b, a = np.rollaxis(x, axis=-1)
    r[a == 0] = color[0]
    g[a == 0] = color[1]
    b[a == 0] = color[2] 
    x = np.dstack([r, g, b, a])
    return Image.fromarray(x, 'RGBA')


def alpha_composite(front, back):
    """Alpha composite two RGBA images.

    Source: http://stackoverflow.com/a/9166671/284318

    Keyword Arguments:
    front -- PIL RGBA Image object
    back -- PIL RGBA Image object

    """
    front = np.asarray(front)
    back = np.asarray(back)
    result = np.empty(front.shape, dtype='float')
    alpha = np.index_exp[:, :, 3:]
    rgb = np.index_exp[:, :, :3]
    falpha = front[alpha] / 255.0
    balpha = back[alpha] / 255.0
    result[alpha] = falpha + balpha * (1 - falpha)
    old_setting = np.seterr(invalid='ignore')
    result[rgb] = (front[rgb] * falpha + back[rgb] * balpha * (1 - falpha)) / result[alpha]
    np.seterr(**old_setting)
    result[alpha] *= 255
    np.clip(result, 0, 255)
    # astype('uint8') maps np.nan and np.inf to 0
    result = result.astype('uint8')
    result = Image.fromarray(result, 'RGBA')
    return result


def alpha_composite_with_color(image, color=(255, 255, 255)):
    """Alpha composite an RGBA image with a single color image of the
    specified color and the same size as the original image.

    Keyword Arguments:
    image -- PIL RGBA Image object
    color -- Tuple r, g, b (default 255, 255, 255)

    """
    back = Image.new('RGBA', size=image.size, color=color + (255,))
    return alpha_composite(image, back)


def pure_pil_alpha_to_color_v1(image, color=(255, 255, 255)):
    """Alpha composite an RGBA Image with a specified color.

    NOTE: This version is much slower than the
    alpha_composite_with_color solution. Use it only if
    numpy is not available.

    Source: http://stackoverflow.com/a/9168169/284318

    Keyword Arguments:
    image -- PIL RGBA Image object
    color -- Tuple r, g, b (default 255, 255, 255)

    """ 
    def blend_value(back, front, a):
        return (front * a + back * (255 - a)) / 255

    def blend_rgba(back, front):
        result = [blend_value(back[i], front[i], front[3]) for i in (0, 1, 2)]
        return tuple(result + [255])

    im = image.copy()  # don't edit the reference directly
    p = im.load()  # load pixel array
    for y in range(im.size[1]):
        for x in range(im.size[0]):
            p[x, y] = blend_rgba(color + (255,), p[x, y])

    return im

def pure_pil_alpha_to_color_v2(image, color=(255, 255, 255)):
    """Alpha composite an RGBA Image with a specified color.

    Simpler, faster version than the solutions above.

    Source: http://stackoverflow.com/a/9459208/284318

    Keyword Arguments:
    image -- PIL RGBA Image object
    color -- Tuple r, g, b (default 255, 255, 255)

    """
    image.load()  # needed for split()
    background = Image.new('RGB', image.size, color)
    background.paste(image, mask=image.split()[3])  # 3 is the alpha channel
    return background

Performance

The simple non-compositing alpha_to_color function is the fastest solution, but leaves behind ugly borders because it does not handle semi transparent areas.

Both the pure PIL and the numpy compositing solutions give great results, but alpha_composite_with_color is much faster (8.93 msec) than pure_pil_alpha_to_color (79.6 msec). If numpy is available on your system, that's the way to go. (Update: The new pure PIL version is the fastest of all mentioned solutions.)

$ python -m timeit "import Image; from apps.front import utils; i = Image.open(u'logo.png'); i2 = utils.alpha_to_color(i)"
10 loops, best of 3: 4.67 msec per loop
$ python -m timeit "import Image; from apps.front import utils; i = Image.open(u'logo.png'); i2 = utils.alpha_composite_with_color(i)"
10 loops, best of 3: 8.93 msec per loop
$ python -m timeit "import Image; from apps.front import utils; i = Image.open(u'logo.png'); i2 = utils.pure_pil_alpha_to_color(i)"
10 loops, best of 3: 79.6 msec per loop
$ python -m timeit "import Image; from apps.front import utils; i = Image.open(u'logo.png'); i2 = utils.pure_pil_alpha_to_color_v2(i)"
10 loops, best of 3: 1.1 msec per loop

The transparent parts mostly have RGBA value (0,0,0,0). Since the JPG has no transparency, the jpeg value is set to (0,0,0), which is black.

Around the circular icon, there are pixels with nonzero RGB values where A = 0. So they look transparent in the PNG, but funny-colored in the JPG.

You can set all pixels where A == 0 to have R = G = B = 255 using numpy like this:

import Image
import numpy as np

FNAME = 'logo.png'
img = Image.open(FNAME).convert('RGBA')
x = np.array(img)
r, g, b, a = np.rollaxis(x, axis = -1)
r[a == 0] = 255
g[a == 0] = 255
b[a == 0] = 255
x = np.dstack([r, g, b, a])
img = Image.fromarray(x, 'RGBA')
img.save('/tmp/out.jpg')

enter image description here


Note that the logo also has some semi-transparent pixels used to smooth the edges around the words and icon. Saving to jpeg ignores the semi-transparency, making the resultant jpeg look quite jagged.

A better quality result could be made using imagemagick's convert command:

convert logo.png -background white -flatten /tmp/out.jpg

enter image description here


To make a nicer quality blend using numpy, you could use alpha compositing:

import Image
import numpy as np

def alpha_composite(src, dst):
    '''
    Return the alpha composite of src and dst.

    Parameters:
    src -- PIL RGBA Image object
    dst -- PIL RGBA Image object

    The algorithm comes from http://en.wikipedia.org/wiki/Alpha_compositing
    '''
    # http://stackoverflow.com/a/3375291/190597
    # http://stackoverflow.com/a/9166671/190597
    src = np.asarray(src)
    dst = np.asarray(dst)
    out = np.empty(src.shape, dtype = 'float')
    alpha = np.index_exp[:, :, 3:]
    rgb = np.index_exp[:, :, :3]
    src_a = src[alpha]/255.0
    dst_a = dst[alpha]/255.0
    out[alpha] = src_a+dst_a*(1-src_a)
    old_setting = np.seterr(invalid = 'ignore')
    out[rgb] = (src[rgb]*src_a + dst[rgb]*dst_a*(1-src_a))/out[alpha]
    np.seterr(**old_setting)    
    out[alpha] *= 255
    np.clip(out,0,255)
    # astype('uint8') maps np.nan (and np.inf) to 0
    out = out.astype('uint8')
    out = Image.fromarray(out, 'RGBA')
    return out            

FNAME = 'logo.png'
img = Image.open(FNAME).convert('RGBA')
white = Image.new('RGBA', size = img.size, color = (255, 255, 255, 255))
img = alpha_composite(img, white)
img.save('/tmp/out.jpg')

enter image description here

Speed of Python Extensions in C vs. C

9 votes

Python extension modules written in C are faster than the equivalent programs written in pure Python. How do these extension modules compare (speed wise) to programs written in pure C? Are programs written in pure C even faster than the equivalent Python extension module?

How do these extension modules compare (speed wise) to programs written in pure C?

They are slightly slower due to the translation between Python data structures -> C types. Disregarding this translation the actual C code runs at exactly the same speed as a regular C function would.

Are programs written in pure C even faster than the equivalent Python extension module?

C programs (written entirely in C) can be faster than Python programs using the C extension modules. If the C program and the extension module are written with the same level of complexity, coder skill, algorithmic complexity, etc., the C program will win every time. However, if you're not a C guru and you're competing with a highly optimized Python C extension Python could be faster.

Find words and combinations of words that can be spoken the quickest

9 votes

I'm a big fan of discovering sentences that can be rapped very quickly. For example, "gotta read a little bit of Wikipedia" or "don't wanna wind up in the gutter with a bottle of malt." (George Watsky)

I wanted to write a program in Python that would enable me to find words (or combinations of words) that can be articulated such that it sounds very fast when spoken.

I initially thought that words that had a high syllable to letter ratio would be the best, but upon writing a Python program to do find those words, I retrieved only very simple words that didn't really sound fast.

So I'm at a loss at what actually makes words sound fast. Is it the morpheme to letter ratio? Is it the number of alternating vowel-consonant pairs?

How would you guys go about devising a python program to resolve this problem?

This is just a stab in the dark as I'm not a linguist (although, I have written a voice synthesizer), the metric that be useful here is the number of phonemes that make up each word, since the phonemes themselves are going to be the same approximate duration regardless of use. There's an International Phonetic Alphabet chart for english dialects, as well as a nice phonology of English.

A good open-source phonetic dictionary is available from the cmudict project which has about 130k words

Here's a really quick stab at a look up program:

#!/usr/bin/python

import re

words={}

for line in open("cmudict.0.7a",'ro').readlines():
    split_idx = line.find(' ')
    words[line[0:split_idx]] = line[split_idx+1:-1]

user_input = raw_input("Words: ")

print
for word in user_input.split(' '):
    try:
        print "%25s %s" % (word, words[word.upper()])
    except:
        print "%25s %s" % (word, 'unable to find phonems for word')

When run..

Words: I support hip hop from the underground up

                    I  AY1
              support  S AH0 P AO1 R T
                  hip  HH IH1 P
                  hop  HH AA1 P
                 from  F R AH1 M
                  the  DH AH0
          underground  AH1 N D ER0 G R AW2 N D
                   up  AH1 P

If you want to get super fancy pants about this, there's always the Python Natural Language Toolkit which may have some useful tidbits for you.

Additionally, some real world use.. although to be fair, I fixed 'stylin' to 'styling'.. But left 'tellin' to reveal the deficiency of unknown words.. You could probably try a lookup for words ending with in' by subbing the g in for the apostrophe and then drop the NG phoneme from the lookup..

                  Yes  Y EH1 S
                  the  DH AH0
               rhythm  R IH1 DH AH0 M
                  the  DH AH0
                rebel  R EH1 B AH0 L
              Without  W IH0 TH AW1 T
                    a  AH0
                pause  P AO1 Z
                  I'm  AY1 M
             lowering  L OW1 ER0 IH0 NG
                   my  M AY1
                level  L EH1 V AH0 L
                  The  DH AH0
                 hard  HH AA1 R D
               rhymer  R AY1 M ER0
                where  W EH1 R
                  you  Y UW1
                never  N EH1 V ER0
                 been  B IH1 N
                  I'm  AY1 M
                   in  IH0 N
                  You  Y UW1
                 want  W AA1 N T
              styling  S T AY1 L IH0 NG
                  you  Y UW1
                 know  N OW1
                 it's  IH1 T S
                 time  T AY1 M
                again  AH0 G EH1 N
                    D  D IY1
                  the  DH AH0
                enemy  EH1 N AH0 M IY0
               tellin unable to find phonems for word
                  you  Y UW1
                   to  T UW1
                 hear  HH IY1 R
                   it  IH1 T
                 They  DH EY1
              praised  P R EY1 Z D
              etc...

If this is something you plan on putting some time into, I'd be interested in helping. I think putting 'Worlds first rapping IDE' on my resume would be hilarious. And if one exists already, world's first Python based rapping IDE. :p

Foolproof cross-platform process kill daemon

8 votes

I have some python automation, which spawns telnet sessions that I log with the linux script command; there are two script process IDs (a parent and child) for each logging session.

I need to solve a problem where if the python automation script dies, the script sessions never close on their own; for some reason this is much harder than it should be.

So far, I have implemented watchdog.py (see bottom of the question), which daemonizes itself, and polls the python automation script's PID in a loop. When it sees the python automation PID disappear from the server's process table, it attempts to kill the script sessions.

My problem is:

  • script sessions always spawn two separate processes, one of the script sessions is the parent of the other script session.
  • watchdog.py will not kill the child script sessions, if I start script sessions from the automation script (see AUTOMATION EXAMPLE, below)

AUTOMATION EXAMPLE (reproduce_bug.py)

import pexpect as px
from subprocess import Popen
import code
import time
import sys
import os

def read_pid_and_telnet(_child, addr):
    time.sleep(0.1) # Give the OS time to write the PIDFILE
    # Read the PID in the PIDFILE
    fh = open('PIDFILE', 'r')
    pid = int(''.join(fh.readlines()))
    fh.close()
    time.sleep(0.1)
    # Clean up the PIDFILE
    os.remove('PIDFILE')
    _child.expect(['#', '\$'], timeout=3)
    _child.sendline('telnet %s' % addr)
    return str(pid)

pidlist = list()
child1 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE1.txt'""")
pidlist.append(read_pid_and_telnet(child1, '10.1.1.1'))

child2 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE2.txt'""")
pidlist.append(read_pid_and_telnet(child2, '10.1.1.2'))

cmd = "python watchdog.py -o %s -k %s" % (os.getpid(), ','.join(pidlist))
Popen(cmd.split(' '))
print "I started the watchdog with:\n   %s" % cmd

time.sleep(0.5)
raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"

Now the example of what happens when I run the automation above... note that PID 30017 spawns 30018 and PID 30020 spawns 30021. All the aforementioned PIDs are script sessions.

[mpenning@Hotcoffee Network]$ python reproduce_bug.py 
I started the watchdog with:
   python watchdog.py -o 30016 -k 30017,30020
Traceback (most recent call last):
  File "reproduce_bug.py", line 35, in <module>
    raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"
RuntimeError: Simulated script crash.  Note that script child sessions are hung
[mpenning@Hotcoffee Network]$

After I run the automation above, all the child script sessions are still running.

[mpenning@Hotcoffee Models]$ ps auxw | grep script
mpenning 30018  0.0  0.0  15832   508 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE1.txt
mpenning 30021  0.0  0.0  15832   516 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE2.txt
mpenning 30050  0.0  0.0   7548   880 pts/8    S+   12:08   0:00 grep script
[mpenning@Hotcoffee Models]$

I am running the automation under Python 2.6.6, on a Debian Squeeze linux system (uname -a: Linux Hotcoffee 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64 GNU/Linux).

QUESTION:

It seems that the daemon doesn't survive the spawning-process crash. How can I fix watchdog.py to close all the script sessions if the automation dies (as shown in the example above)?

A watchdog.py log that illustrates the problem (sadly, the PIDs do not coincide with the original question)...

[mpenning@Hotcoffee ~]$ cat watchdog.log 
2012-02-22,15:17:20.356313 Start watchdog.watch_process
2012-02-22,15:17:20.356541     observe pid = 31339
2012-02-22,15:17:20.356643     kill pids = 31352,31356
2012-02-22,15:17:20.356730     seconds = 2
[mpenning@Hotcoffee ~]$

Resolution

The problem essentially was a race condition. When I tried to kill the "parent" script processes, they had already died coincidental with the automation event...

To solve the problem... first, the watchdog daemon needed to identify the entire list of children to be killed before polling the observed PID (my original script attempted to identify children after the observed PID crashed). Next, I had to modify my watchdog daemon to allow for the possibility that some script processes might die with the observed PID.


watchdog.py:

#!/usr/bin/python
"""
Implement a cross-platform watchdog daemon, which observes a PID and kills 
other PIDs if the observed PID dies.

Example:
--------

watchdog.py -o 29322 -k 29345,29346,29348 -s 2

The command checks PID 29322 every 2 seconds and kills PIDs 29345, 29346, 29348 
and their children, if PID 29322 dies.

Requires:
----------

 * http://code.google.com/p/psutil/
 * http://pypi.python.org/pypi/python-daemon
"""
from optparse import OptionParser
import datetime as dt
import signal
import daemon
import logging
import psutil
import time
import sys
import os

class MyFormatter(logging.Formatter):
    converter=dt.datetime.fromtimestamp
    def formatTime(self, record, datefmt=None):
        ct = self.converter(record.created)
        if datefmt:
            s = ct.strftime(datefmt)
        else:
            t = ct.strftime("%Y-%m-%d %H:%M:%S")
            s = "%s,%03d" % (t, record.msecs)
        return s

def check_pid(pid):        
    """ Check For the existence of a unix / windows pid."""
    try:
        os.kill(pid, 0)   # Kill 0 raises OSError, if pid isn't there...
    except OSError:
        return False
    else:
        return True

def kill_process(logger, pid):
    try:
        psu_proc = psutil.Process(pid)
    except Exception, e:
        logger.debug('Caught Exception ["%s"] while looking up PID %s' % (e, pid))
        return False

    logger.debug('Sending SIGTERM to %s' % repr(psu_proc))
    psu_proc.send_signal(signal.SIGTERM)
    psu_proc.wait(timeout=None)
    return True

def watch_process(observe, kill, seconds=2):
    """Kill the process IDs listed in 'kill', when 'observe' dies."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    logfile = logging.FileHandler('%s/watchdog.log' % os.getcwd())
    logger.addHandler(logfile)
    formatter = MyFormatter(fmt='%(asctime)s %(message)s',datefmt='%Y-%m-%d,%H:%M:%S.%f')
    logfile.setFormatter(formatter)


    logger.debug('Start watchdog.watch_process')
    logger.debug('    observe pid = %s' % observe)
    logger.debug('    kill pids = %s' % kill)
    logger.debug('    seconds = %s' % seconds)
    children = list()

    # Get PIDs of all child processes...
    for childpid in kill.split(','):
        children.append(childpid)
        p = psutil.Process(int(childpid))
        for subpsu in p.get_children():
            children.append(str(subpsu.pid))

    # Poll observed PID...
    while check_pid(int(observe)):
        logger.debug('Poll PID: %s is alive.' % observe)
        time.sleep(seconds)
    logger.debug('Poll PID: %s is *dead*, starting kills of %s' % (observe, ', '.join(children)))

    for pid in children:
        # kill all child processes...
        kill_process(logger, int(pid))
    sys.exit(0) # Exit gracefully

def run(observe, kill, seconds):

    with daemon.DaemonContext(detach_process=True, 
        stdout=sys.stdout,
        working_directory=os.getcwd()):
        watch_process(observe=observe, kill=kill, seconds=seconds)

if __name__=='__main__':
    parser = OptionParser()
    parser.add_option("-o", "--observe", dest="observe", type="int",
                      help="PID to be observed", metavar="INT")
    parser.add_option("-k", "--kill", dest="kill",
                      help="Comma separated list of PIDs to be killed", 
                      metavar="TEXT")
    parser.add_option("-s", "--seconds", dest="seconds", default=2, type="int",
                      help="Seconds to wait between observations (default = 2)", 
                      metavar="INT")
    (options, args) = parser.parse_args()
    run(options.observe, options.kill, options.seconds)

Your problem is that script not detached from automation script after spawning, so it's working as child, and when parent dies it stay unmanageable.

To handle python script exit you can use atexit module. To monitor child processes exit you can use os.wait or handle SIGCHLD signal

How to store a crypto key securely?

7 votes

I'm looking at using a crypto lib such as pycrypto for encrypting/decrypting fields in my python webapp db. But encryption algorithms require a key. If I have an unencrypted key in my source it seems silly to attempt encryption of db fields as on my server if someone has access to the db files they will also have access to my python sourcecode.

Is there a best-practice method of securing the key used? Or an alternative method of encrypting the db fields (at application not db level)?

UPDATE: the fields I am trying to secure are oauth tokens.

UPDATE: I guess there is no common way to avoid this. I think I'll need to encrypt the fields anyway as it's likely the db files will get backed up and moved around so at least I'll reduce the issue to a single vulnerable location - viewing my source code.

UPDATE: The oauth tokens need to be used for api calls while the user is offline, therefore using their password as a key is not suitable in this case.

If you are encrypting fields that you only need to verify (not recall), then simple hash with SHA or one-way encrypt with DES, or IDEA using a salt to prevent a rainbow table to actually reveal them. This is useful for passwords or other access secrets.

Python and webapps makes me think of GAE, so you may want something that is not doing an encrypt/decrypt on every DB transaction since these are already un-cheap on GAE.

Best practice for an encrypted databased is to encrypt the fields with the users own secret, but to include an asymmetric backdoor that encrypts the users secret key so you (and not anyone who has access to the DB source files, or the tables) can unencrypt the users key with your secret key, should recovery or something else necessitate.

In that case, the user (or you or trusted delegate) can retireve and unencrypt their own information only. You may want to be more stringent in validating user secrets if you are thinking you need to secure their fields by encryption.

In this regards, a passphrase (as opposed to a password) of some secret words such "in the jungle the mighty Jungle" is a good practice to encourage.

EDIT: Just saw your update. The best way to store OAuth is to give them a short lifespan, only request resources your need and re-request them over getting long tokens. It's better to design around getting authenticated, getting your access and getting out, than leaving the key under the backdoor for 10 years.

Since, if you need to recall OAuth when the user comes online, you can do as above and encrypt with a user specfic secret. You could also keygen from an encrypted counter (encrypted with the user secret) so the actual encryption key changes at each transaction, and the counter is stored in plaintext. But check specific crypto algo discussion of this mode before using. Some algorithms may not play nice with this.

Count overlapping regex matches once again

7 votes

How can I obtain the number of overlapping regex matches using Python?

I've read and tried the suggestions from this, that and a few other questions, but found none that would work for my scenario. Here it is:

  • input example string: akka
  • search pattern: a.*k

A proper function should yield 2 as the number of matches, since there are two possible end positions (k letters).

The pattern might also be more complicated, for example a.*k.*a should also be matched twice in akka (since there are two k's in the middle).

Yes, it is ugly and unoptimized but it seems to be working. This is a simple try of all possible but unique variants

def myregex(pattern,text,dir=0):
    import re
    m = re.search(pattern, text)
    if m:
        yield m.group(0)
        if len(m.group('suffix')):
            for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[1:],m.group('end')),1):
                yield r
            if dir<1 :
                for r in myregex(pattern, "%s%s%s" % (m.group('prefix'),m.group('suffix')[:-1],m.group('end')),-1):
                    yield r


def myprocess(pattern, text):    
    parts = pattern.split("*")    
    for i in range(0, len(parts)-1 ):
        res=""
        for j in range(0, len(parts) ):
            if j==0:
                res+="(?P<prefix>"
            if j==i:
                res+=")(?P<suffix>"
            res+=parts[j]
            if j==i+1:
                res+=")(?P<end>"
            if j<len(parts)-1:
                if j==i:
                    res+=".*"
                else:
                    res+=".*?"
            else:
                res+=")"
        for r in myregex(res,text):
            yield r

def mycount(pattern, text):
    return set(myprocess(pattern, text))

test:

>>> mycount('a*b*c','abc')
set(['abc'])
>>> mycount('a*k','akka')
set(['akk', 'ak'])
>>> mycount('b*o','bboo')
set(['bbo', 'bboo', 'bo', 'boo'])
>>> mycount('b*o','bb123oo')
set(['b123o', 'bb123oo', 'bb123o', 'b123oo'])
>>> mycount('b*o','ffbfbfffofoff')
set(['bfbfffofo', 'bfbfffo', 'bfffofo', 'bfffo'])

Python is slow when iterating over a large list

7 votes

I am currently selecting a large list of rows from a database using pyodbc. The result is then copied to a large list, and then i am trying to iterate over the list. Before I abandon python, and try to create this in C#, I wanted to know if there was something I was doing wrong.

clientItems.execute("Select ids from largetable where year =?", year);
allIDRows = clientItemsCursor.fetchall() #takes maybe 8 seconds.

for clientItemrow in allIDRows:
    aID = str(clientItemRow[0])
    # Do something with str -- Removed because I was trying to determine what was slow
    count = count+1

Some more information:

  • The for loop is currently running at about 5 loops per second, and that seems insanely slow to me.
  • The total rows selected is ~489,000.
  • The machine its running on has lots of RAM and CPU. It seems to only run one or two cores, and ram is 1.72GB of 4gb.

Can anyone tell me whats wrong? Do scripts just run this slow?

Thanks

This should not be slow with Python native lists - but maybe ODBC's driver is returning a "lazy" object that tries to be smart but just gets slow. Try just doing

allIDRows = list(clientItemsCursor.fetchall())

in your code and post further benchmarks.

(Python lists can get slow if you start inserting things in its middle, but just iterating over a large list should be fast)

Using __new__ on classes derived from Django's models does not work

4 votes

This is something which is puzzling me, but I cannot get a definitive answer. Using the __new__ method (or more accurately, static method) within classes derived from DJango model.

This is how __new__ should be ideally used (since we are using Django, we can assume that version 2.x of python is being used):

class A(object):
  def __new__(self, *args, **kwargs):
    print ("This is A's new function")
    return super(A, self).__new__(self, *args, **kwargs)

  def __init__(self):
    print ("This is A's init function")

Instantiating an object from the above class works as expected. Now, when one tries something like this on classes derived from Django models, something unexpected happens:

class Test(models.Model):
  def __new__(self, *args, **kwargs):
    return super(Test, self).__new__(self, *args, **kwargs)

Instantiating an object from the above class results in this error: TypeError: unbound method __new__() must be called with Test instance as first argument (got ModelBase instance instead).

I can't understand why this is happening (although I know some class magic is happening behind the scenes due to the Django framework).

Any answers will be appreciated.

__new__ doesn't receive an instance as its first parameter. How could it when (a) it's a static method, as you note, and (b) its job is to create an instance and return it! The first parameter of __new__ is conventionally called cls, as it is the class.

Which makes the error message you quote very weird; it is normally the error message you'd get when you called an unbound method (i.e. what you get by accessing ClassName.methodName) with something other than an instance of that class as the self parameter. However, staticmethods (including __new__) don't become unbound methods, they're just simple functions that happen to be attributes of a class:

>>> class Foo(object):
    def __new__(cls, *args, **kwargs):
        return object.__new__(cls)
    def method(self):
        pass

>>> class Bar(object):
    pass

>>> Foo.method
<unbound method Foo.method>

>>> Foo.__new__
<function __new__ at 0x0000000002DB1C88>

>>> Foo.method(Bar())
Traceback (most recent call last):
  File "<pyshell#36>", line 1, in <module>
    Foo.method(Bar())
TypeError: unbound method method() must be called with Foo instance as first argument (got Bar instance instead)

>>> Foo.__new__(Bar)
<__main__.Bar object at 0x0000000002DB4F28>

You can see from this that __new__ should never be an unbound method. Furthermore (unlike a normal method) it doesn't care that you're consistent in what you pass it; I have actually managed to construct an instance of Bar by calling Foo.__new__ because both it and Bar.__new__ are ultimately implemented the same way (deferring all actual work to object.__new__).

However, this led me to look at the source code of Django itself, briefly. Django's Model class has a metaclass, ModelBase. Which is quite complex, and I didn't figure out what it's doing entirely, but I did notice something very interesting.

ModelBase.__new__ (the metaclass __new__, which is the function that creates the class at the end of your class block) invokes its super __new__ without passing it your class dictionary. It instead passes a dictionary with only the __module__ attribute set. Then, after doing a whole bunch of processing, it does the following:

    # Add all attributes to the class.
    for obj_name, obj in attrs.items():
        new_class.add_to_class(obj_name, obj)

(attrs is the dictionary containing all your definitions in your class block, including your __new__ function; add_to_class is a metaclass method that is mostly just setattr).

I'm now 99% certain that the problem is here, because __new__ is a weird implicitly static method. So unlike every other static method, you haven't applied the staticmethod decorator to it. Python (at some level) just recognises the __new__ method and processes it as a static method rather than a normal method[1]. But I'll bet this only happens when you define __new__ in a class block, not when you set it using setattr.

So your __new__, which should be a static method but hasn't been processed by the staticmethod decorator, is being converted into a normal instance method. Then when Python calls it passing the class Test, as per the normal instance creation protocol, it complains that it's not getting an instance of Test.

If that's all correct, then:

  1. This fails because Django is a bit broken, but only in that it fails to take Python's inconsistency about __new__ being static into account.
  2. You could probably make this work by applying @staticmethod to your __new__ method, even though you shouldn't have to.

[1] I believe this is a historical quirk of Python, since __new__ was introduced before the staticmethod decorator, but __new__ can't take an instance, since there's no instance to call it on.