Best python questions in April 2011

Can we shed some definitive light on how python packaging and import works ?

32 votes

I had my fair chance of getting through the python management of modules, and every time is a challenge: packaging is not what people do every day, and it becomes a burden to learn, and a burden to remember, even when you actually do it, since this happens normally once.

I would like to collect here the definitive overview of how import, package management and distribution works in python, so that this question becomes the definitive explanation for all the magic that happens under the hood. Although I understand the broad level of the question, these things are so intertwined that any focused answer will not solve the main problem: understand how all works, what is outdated, what is current, what are just alternatives for the same task, what are the quirks.

The list of keywords to refer to is the following, but this is just a sample out of the bunch. There's a lot more and you are welcome to add additional details.

  • PyPI
  • setuptools / Distribute
  • distutils
  • eggs
  • egg-link
  • pip
  • zipimport
  • site.py
  • site-packages
  • .pth files
  • virtualenv
  • handling of compiled modules in eggs (with and without installation via easy_install)
  • use of get_data()
  • pypm
  • bento
  • PEP 376
  • the cheese shop
  • eggsecutable

Linking to other answers is probably a good idea. As I said, this question is for the high-level overview.

For the most part, this is an attempt to look at the packaging/distribution side, not the mechanics of import. Unfortunately, packaging is the place where Python provides way more than one way to do it. I'm just trying to get the ball rolling, hopefully others will help fill what I miss or point out mistakes.

First of all there's some messy terminology here. A directory containing an __init__.py file is a package. However, most of what we're talking about here are specific versions of packages published on PyPI, one of it's mirrors, or in a vendor specific package management system like Debian's Apt, Redhat's Yum, Fink, Macports, Homebrew, or ActiveState's pypm.

These published packages are what folks are trying to call "Distributions" going forward in an attempt to use "Package" only as the Python language construct. You can see some of that usage in PEP-376 PEP-376.

Now, your list of keywords relate to several different aspects of the Python Ecosystem:

Finding and publishing python distributions:

  • PyPI (aka the cheese shop)
  • PyPI Mirrors
  • Various package management tools / systems: apt, yum, fink, macports, homebrew
  • pypm (ActiveState's alternative to PyPI)

The above are all services that provide a place to publish Python distributions in various formats. Some, like PyPI mirrors and apt / yum repositories can be run on your local machine or within your companies network but folks typically use the official ones. Most, if not all provide a tool (or multiple tools in the case of PyPI) to help find and download distributions.

Libraries used to create and install distributions:

  • setuptools / Distribute
  • distutils

Distutils is the standard infrastructure on which Python packages are compiled and built into distributions. There's a ton of functionality in distutils but most folks just know:

from distutils.core import setup

setup(name='Distutils',
      version='1.0',
      description='Python Distribution Utilities',
      author='Greg Ward',
      author_email='gward@python.net',
      url='http://www.python.org/sigs/distutils-sig/',
      packages=['distutils', 'distutils.command'],
 )

And to some extent that's a most of what you need. With the prior 9 lines of code you have enough information to install a pure Python package and also the minimal metadata required to publish that package a distribution on PyPI.

Setuptools provides the hooks necessary to support the Egg format and all of it's features and foibles. Distribute is an alternative to Setuptools that adds some features while trying to be mostly backwards compatible. I believe Distribute is going to be included in Python 3 as the successor to Distutil's from distutils.core import setup.

Both Setuptools and Distribute provide a custom version of the distutils setup command that does useful things like support the Egg format.

Python Distribution Formats:

Distributions are typically provided either as source archives (tarball or zipfile). The standard way to install a source distribution is by downloading and uncompressing the archive and then running the setup.py file inside.

For example, the following will download, build, and install the Pygments syntax highlighting library:

curl -O -G http://pypi.python.org/packages/source/P/Pygments/Pygments-1.4.tar.gz
tar -zxvf Pygments-1.4.tar.gz
cd Pygments-1.4
python setup.py build
sudo python setup.py install

Alternatively you can download the Egg file and install it. Typically this is accomplished by using easy_install or pip:

sudo easy_install pygments
or 
sudo pip install pygments

Eggs were inspired by Java's Jarfiles and they have quite a few features you should read about here

Python Package Formats:

  • uncompressed directories
  • zipimport (zip compressed directories)

A normal python package is just a directory containing an __init__.py file and an arbitrary number of additional modules or sub-packages. Python also has support for finding and loading source code within *.zip files as long as they are included on the PYTHONPATH (sys.path).

Installing Python Packages:

  • easy_install: the original egg installation tool, depends on setuptools
  • pip: currently the most popular way to install python packages. Similar to easy_install but more flexible and has some nice features like requirements files to help document dependencies and reproduce deployments.
  • pypm, apt, yum, fink, etc

Environment Management / Automated Deployment:

  • bento
  • buildout
  • virtualenv (and virtualenvwrapper)

The above tools are used to help automate and manage dependencies for a Python project. Basically they give you tools to describe what distributions your application requires and automate the installation of those specific versions of your dependencies.

Locations of Packages / Distributions:

  • site-packages
  • PYTHONPATH
  • the current working directory (depends on your OS and environment settings)

By default, installing a python distribution is going to drop it into the site-packages directory. That directory is usually something like /usr/lib/pythonX.Y/site-packages.

A simple programmatic way to find your site-packages directory:

from distuils import sysconfig
print sysconfig.get_python_lib()

Ways to modify your PYTHONPATH:

Python's import statement will only find packages that are located in one of the directories included in your PYTHONPATH.

You can inspect and change your path from within Python by accessing:

import sys
print sys.path
sys.path.append("/home/myname/lib")

Besides that, you can set the PYTHONPATH environment variable like you would any other environment variable on your OS or you could use:

  • .pth files: *.pth files located in directories that are already on your PYTHONPATH are read and each line of the *.pth file is added to your PYTHONPATH. Basically any time you would copy a package into a directory on your PYTHONPATH you could instead create a mypackages.pth. Read more about *.pth files: site module
  • egg-link files: Internal structure of python eggs they are a cross platform alternative to symbolic links. Creating an egg link file is similar to creating a pth file.
  • site.py modifications

To add the above /home/myname/lib to site-packages with a *.pth file you'd create a *.pth file. The name of the file doesn't matter but you should still probably choose something sensible.

Let's create myname.pth:

# myname.pth
/home/myname/lib

That's it. Drop that into sysconfig.get_python_lib() on your system or any other directory in your PYTHONPATH and /home/myname/lib will be added to the path.

Packaging and shipping a python library and scripts, the professional way

29 votes

I have the task of packaging and shipping a commercial application bundle, which will include:

  1. a python library (developed by us)
  2. some python programs depending on the library above
  3. additional libraries not developed by us, but which are dependencies of our library.
  4. a complete python installation (python 2.6)
  5. additional stuff, libs and programs in other languages. Not a concern here, as they are not hooked into the above machinery, and the current shipping process works already.

The bundle is shipped to Linux, OSX and Windows. On Linux, it's distributed as a simple tar.gz. The user just unpacks the tar.gz and source a provided bash script in .bashrc, so that the environment is correctly set. On mac, it's a dmg. On windows, I have no idea. The windows guy is not here today, but what I see is that an exe is created somehow.

I will now explain in more detail the above points.

Our Python Library

We don't want to give out sources, so we want to provide only compiled python files. A better strategy to make them even more tamper-proof is welcome, even if it involves some deep hacking (e.g. I once saw magic done importing stuff from a .zip which was "corrupted" ad-hoc). The library at the moment does not have C level code or similar platform dependent code, but this is going to change soon. We will therefore have to provide platform-specific compiled .so together with the pyc.

Clearly, this library will be shipped in the package, together with the rest of our application. It will therefore be installed on the downloaded bundle. For this reason, it must be fully relocatable, and the user must in some way (either manually or via our env script) add the location of the untarred package to PYTHONPATH, so that the interpreter can find it.

Our Python Programs

We will ship applications in our bundle, and these applications will depend on our library. The code of these applications must be either visible by the user (so that he can learn how to use the library interface), or not visible (for those utilities we want to keep closed-source), so a double approach is called for.

Additional Libraries

Our library depends on 3rd party libraries we will have to ship, so that the user is up and running without any dependency hunting. Clearly, these libraries will be installed by us in the bundle, but we must hope these don't store the install path somewhere during the build, because that would make them non relocatable.

Our python

We will ship our version of python, which we assume the user will run in order to access our script. This is because we want to be sure of the python version running. Also, we may tinker a bit the executable or the standard library. We may have a concern about the interaction of this python with the standard python, and if the user wants a specific library on our python it will have to install it within our bundled package, and not on the standard place for libraries.

Request

I need to make my mind around this task. I've seen it done, but never done it personally, so I need your point of view. What I presented above is how I think things should work, according to how things are working right now, but it may be wrong. Any hint, quirk, suggestion, or strategy for a successful deployment is welcome. Given the complexity of the question, I already announce a high bounty on it, according to the best answer I can get.

This is not a complete answer but just a bunch of ideas. I wrote an installer for a client that incorporated some ideas that might be useful to you.

It was Linux only so I focussed on just that. We needed to ship specific custom versions of mySQL, lighttpd, python, memcached, a few 3rd party Python modules and some custom scripts. We needed to launch all these services without any problems and let the user control them using regular initscripts. It should work fine on a bunch of popular distros and therefore shouldn't rely on distro specific stuff.

What I did was as follows.

  1. Created a 500MB (I'm don't recollect the size) file and formatted it as an ext3fs file system.
  2. Mounted it at a point using a loopback device.
  3. Ran deb-bootstrap on the mountpoint to create a custom Debian install.
  4. Chrooted inside the partition and then ran a bunch of scripts which did an apt-get install on all our dependencies, installed all the eggs and other packages which were necessary for the app, installed the app itself in /opt (inside the chroot), installed supervisord (to do process management) and set things up. Now, this partition was a completely self contained Linux filesystem that contained the application and everything needed to run it. You could dump it anywhere, chroot inside it and launch the app. The only dependency it had with the outside world were the ports it would use for its services and the supervisord control socket. This was the main point. We were able to include exactly what we needed (compiled files, .pycs only etc.) for a few of the applications and didn't have to bother with any limitations in standard installation tools.
  5. After this, we packaged a few extra scripts that would go into the external operating system. These were custom made for each distro that we would have to support. This part was distro specific. There were scripts that would go into /etc/init.d and some scripts that would setup the database and stuff at the beginning.
  6. We then created an archive of the entire filesystem using makeself. It would checksum stuff and all that and provide a self extracting archive which if run would untar the whole thing into /opt on the host machine, chroot inside the directory and run a setup script that would ask the user a few questions like db username/password etc. and set things up. After that, it would fetch the scripts I mentioned in step 5 and put them on the host OS.

The initscripts would simply chroot into the partition and start supervisord. It would then take care of launching all the services we cared about. Shutting down the application was simply a matter of connecting to running supervisord and running a command. We wrapped this in the initscript so that the user experience was UNIX like.

Now, we'd give clients the self extracting .run file. They'd run it, get asked a few questions and it would create a directory under /opt which contained our app and all it's dependencies. The init scripts would be modified to start our app on bootup and things would work as expected.

I think step 4 gives you the freedom to install whatever you want, however you want so that things would work fine.

Algorithm for finding the busiest period?

19 votes

I have some data like this:

1: 2 - 10
2: 3 - 15
3: 4 - 9
4: 8 - 14
5: 7 - 13
6: 5 - 10
7: 11 - 15

I will attempt a representation to make it clearer:

        1     2     3     4     5     6     7     8     9     10     11     12     13     14     15
1             |--------------------------------------X---------|
2                   |--------------------------------X--------------------------------------------|
3                         |--------------------------X---|
4                                                  |-X-------------------------------------|
5                                           |--------X------------------------------|
6                               |--------------------X----------|
7                                                                     |---------------------------|

So in the example case, 8-9 is the critical period if the second scheme is used because all the points are active. What is a fast and good way to solving this problem in python? I am thinking of using dynamic programming but are there other approaches that are suggested?

My approach until now:

I was thinking more from a real-time perspective. So, whenever I get a new point, I do this: Assume I already got 2-10 and I get 3-15 then I pick the max of start and min of end so this case it is 3-10 and increment this interval's count to 2. Then the third point comes in 4-9, pick the max which is 4 and the min is 9 and update the value 3-10 to 4-9 and update count to 3. Now when 8-14 comes in, I pick the start of this interval is greater than 4-9 and the end of this interval is less than 4-9. In this case, it is not true so I will create a new bucket 8-14 and I put the count to 1. This is not the entire algorithm but should give a high-level idea of what I am doing here. I will see if I can sketch the pseudo-code.

        1     2     3     4     5     6     7     8     9     10     11     12     13     14     15
1             |--------------------------------------X---------|
2                   |--------------------------------X--------------------------------------------|
3                         |--------------------------X---|
4                                                  |-X-------------------------------------|
5                                           |--------X------------------------------|
6                               |--------------------X----------|
7                                                                     |---------------------------|

             +1    +1     +1   +1           +1     +1    -1    -2     +1           -1     -1     -2
              1     2     3     4           5       6    5      3     4             3      2      0
                                                     ^^^^

Get it?

So you need to transform this:

1: 2 - 10
2: 3 - 15
3: 4 - 9
4: 8 - 14
5: 7 - 13
6: 5 - 10
7: 11 - 15

into:

[(2,+), (3,+), (4,+), (5,+), (7,+), (8,+), (9,-), (10,-), (10,-), (11,+), (13,-), (14,-), (15,-), (15,-)]

and then you simply iterate through, counting up when you see a + and counting down on -. The busiest interval will be when the count is maximum.

So in code:

intervals = [(2, 10), (3, 15), (4, 9), (8, 14), (7, 13), (5, 10), (11, 15)]
intqueue = sorted([(x[0], +1) for x in intervals] + [(x[1], -1) for x in intervals])
rsum = [(0,0)]
for x in intqueue: 
    rsum.append((x[0], rsum[-1][1] + x[1]))
busiest_start = max(rsum, key=lambda x: x[1])
# busiest_end = the next element in rsum after busiest_start 

# instead of using lambda, alternatively you can do:
#     def second_element(x):
#         return x[1]
#     busiest_start = max(rsum, key=second_element)
# or:
#     import operator
#     busiest_start = max(rsum, key=operator.itemgetter(1))

runtime complexity is (n+n)*log(n+n)+n+n or O(n*log(n))

It is also possible to convert this idea into an online algorithm if you don't have the complete list of intervals at the start of the program but is guaranteed that incoming intervals will never be scheduled for a past point. Instead of sorting you will use a priority queue, each time an interval comes, you push in two items, the start point and the end point, each with a +1 and -1 respectively. And then you pop off and count and keep track of the peak hour.

Why is this program faster in Python than Objective-C?

18 votes

I got interested in this small example of an algorithm in Python for looping through a large word list. I am writing a few "tools" that will allow my to slice a Objective-C string or array in a similar fashion as Python.

Specifically, this elegant solution caught my attention for executing very quickly and it uses a string slice as a key element of the algorithm. Try and solve this without a slice!

I have reproduced my local version using the Moby word list below. You can use /usr/share/dict/words if you do not feel like downloading Moby. The source is just a large dictionary-like list of unique words.

#!/usr/bin/env python

count=0
words = set(line.strip() for line in   
           open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl"))
for w in words:
    even, odd = w[::2], w[1::2]
    if even in words and odd in words:
        count+=1

print count      

This script will a) be interpreted by Python; b) read the 4.1 MB, 354,983 word Moby dictionary file; c) strip the lines; d) place the lines into a set, and; e) and find all the combinations where the evens and the odds of a given word are also words. This executes in about 0.73 seconds on a MacBook Pro.

I tried to rewrite the same program in Objective-C. I am a beginner at this language, so go easy please, but please do point out the errors.

#import <Foundation/Foundation.h>

NSString *sliceString(NSString *inString, NSUInteger start, NSUInteger stop, 
        NSUInteger step){
    NSUInteger strLength = [inString length];

    if(stop > strLength) {
        stop = strLength;
    }

    if(start > strLength) {
        start = strLength;
    }

    NSUInteger capacity = (stop-start)/step;
    NSMutableString *rtr=[NSMutableString stringWithCapacity:capacity];    

    for(NSUInteger i=start; i < stop; i+=step){
        [rtr appendFormat:@"%c",[inString characterAtIndex:i]];
    }
    return rtr;
}

NSSet * getDictWords(NSString *path){

    NSError *error = nil;
    NSString *words = [[NSString alloc] initWithContentsOfFile:path
                         encoding:NSUTF8StringEncoding error:&error];
    NSCharacterSet *sep=[NSCharacterSet newlineCharacterSet];
    NSPredicate *noEmptyStrings = 
                     [NSPredicate predicateWithFormat:@"SELF != ''"];

    if (words == nil) {
        // deal with error ...
    }
    // ...

    NSArray *temp=[words componentsSeparatedByCharactersInSet:sep];
    NSArray *lines = 
        [temp filteredArrayUsingPredicate:noEmptyStrings];

    NSSet *rtr=[NSSet setWithArray:lines];

    NSLog(@"lines: %lul, word set: %lul",[lines count],[rtr count]);
    [words release];

    return rtr;    
}

int main (int argc, const char * argv[])
{
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];

    int count=0;

    NSSet *dict = 
       getDictWords(@"/Users/andrew/Downloads/Moby/mwords/354984si.ngl");

    NSLog(@"Start");

    for(NSString *element in dict){
        NSString *odd_char=sliceString(element, 1,[element length], 2);
        NSString *even_char=sliceString(element, 0, [element length], 2);
        if([dict member:even_char] && [dict member:odd_char]){
            count++;
        }

    }    
    NSLog(@"count=%i",count);

    [pool drain];
    return 0;
}

The Objective-C version produces the same result, (13,341 words), but takes almost 3 seconds to do it. I must be doing something atrociously wrong for a compiled language to be more than 3X slower than a scripted language, but I'll be darned if I can see why.

The basic algorithm is the same: read the lines, strip them, and put them in a set.

My guess of what is slow is the processing of the NSString elements, but I do not know an alternative.

Edit

I edited the Python to be this:

#!/usr/bin/env python
import codecs
count=0
words = set(line.strip() for line in 
     codecs.open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl",
          encoding='utf-8'))
for w in words:
    if w[::2] in words and w[1::2] in words:
        count+=1

print count 

For the utf-8 to be on the same plane as the utf-8 NSString. This slowed the Python down to 1.9 secs.

I also switch the slice test to short-circuit type as suggested for both the Python and obj-c version. Now they are close to the same speed. I also tried using C arrays rather than NSStrings, and this was much faster, but not as easy. You also loose utf-8 support doing that.

Python is really cool...

Edit 2

I found a bottleneck that sped things up considerably. Instead of using the [rtr appendFormat:@"%c",[inString characterAtIndex:i]]; method to append a character to the return string, I used this:

for(NSUInteger i=start; i < stop; i+=step){
    buf[0]=[inString characterAtIndex:i];
    [rtr appendString:[NSString stringWithCharacters:buf length:1]];
}

Now I can finally claim that the Objective-C version is faster than the Python version -- but not by much.

Keep in mind that the Python version has been written to move a lot of the heavy lifting down into highly optimised C code when executed on CPython (especially the file input buffering, string slicing and the hash table lookups to check whether even and odd are in words).

That said, you seem to be decoding the file as UTF-8 in your Objective-C code, but leaving the file in binary in your Python code. Using Unicode NSString in the Objective-C version, but 8-bit byte strings in the Python version isn't really a fair comparison - I would expect the performance of the Python version to drop noticeably if you used codecs.open() to open the file with a declared encoding of "utf-8".

You're also making a full second pass to strip the empty lines in your Objective-C, while no such step is present in the Python code.

Pythonic method of determining a list's contents change from odd to even values

16 votes

Writing some test cases and my mind wanders, assuming there is a better way to write something like this. I have a list, its numbers transition from all odd values to all even, doesn't matter where. I need to assert this is the case, here's what I came up with:

values = [1, 3, 5, 7, 5, 3, 5, 3, 5, 7, 4, 6, 8, 4, 2, 2, 8, 6]

# find all the indexes of odd and even values
odds = [i for (i, v) in enumerate(values) if v % 2 == 1]
evens = [i for (i, v) in enumerate(values) if v % 2 == 0]

# indexes should be a continuous sequence: 0, 1, 2, 3 ... n
assert odds + evens == range(evens[-1] + 1)

Seems like a long way to go. Suggestions on how this could be reduced?

[x for x in values if x % 2 == 1] + [x for x in values if x % 2 == 0] == values

This is only true, if values starts with all of it's own odd values, followed by all of its even values.

How do I start and stop a Linux program using the subprocess module in Python?

14 votes

I’m writing a web app that uses Selenium to screen-scrape another website. This screen-scraping only happens once a day, so I’d rather not leave Selenium and Xvfb running all the time.

I’m trying to figure out how to start Xvfb and Selenium from Python, and then stop them once the screen-scraping’s done.

If I was doing it manually, I’d start them at the command line, and hit CTRL C to stop them. I’m trying to do the same thing from Python.

I seem to be able to successfully start Xvfb like this:

xvfb = Popen('Xvfb :99 -nolisten tcp', shell=True)

But when I’ve tried to terminate it:

xvfb.terminate()

and then tried to start it again (by repeating my initial command), it tells me it’s already running.

I don't know why you want to run Xvfb as root. Your usual X server only needs to run as root (on many but not all unices) only so that it can access the video hardware; that's not an issue for Xvfb by definition.

tempdir = tempfile.mkdtemp()
xvfb = subprocess.Popen(['Xvfb', ':99', '-nolisten', 'tcp', '-fbdir', tempdir])

When you terminate the X server, you may see a zombie process. This is in fact not a process (it's dead), just an entry in the process table that goes away when the parent process either reads the child's exit status or itself dies. Zombies are mostly harmless, but it's cleaner to call wait to read the exit status.

xvfb.terminate()
# At this point, `ps -C Xvfb` may still show a running process
# (because signal delivery is asynchronous) or a zombie.
xvfb.wait()
# Now the child is dead and reaped (assuming it didn't catch SIGTERM).

Why is an MD5 hash created by Python different from one created using echo and md5sum in the shell?

14 votes

A Python MD5 hash is different than the one created by the md5sum command on the shell. Why?

  >>> import hashlib
  >>> h = hashlib.md5()
  >>> h.update("mystringforhash")
  >>> print h.hexdigest()

    86b6423cb6d211734fc7d81bbc5e11d3 # result from python

$ echo mystringforhash | md5sum # on the shell

    686687dd68c5de717b34569dbfb8d3c3  -

echo appends a \n. Use the -n argument to omit the trailing linebreak and it will print the same checksum as your python script:

> echo -n mystringforhash |md5sum
86b6423cb6d211734fc7d81bbc5e11d3  -

Detect File Change Without Polling

14 votes

I'm trying to use a method within a Python program to detect whether a file on the file system has been modified. I know that I could have something run on an every-5-seconds to check the last modification date off of the system, but I was curious as to whether there's an easier method for doing this, without needing to require my program to check repeatedly.

Does anyone know of such a method?

For linux, there is pyinotify.

From the homepage:

Pyinotify is a Python module for monitoring filesystems changes. Pyinotify relies on a Linux Kernel feature (merged in kernel 2.6.13) called inotify. inotify is an event-driven notifier, its notifications are exported from kernel space to user space through three system calls. pyinotify binds these system calls and provides an implementation on top of them offering a generic and abstract way to manipulate those functionalities.

Thus it is obviously not cross-platform and relies on a new enough kernel version. However, as far as I can see, requiring kernel support would be true about any non-polling mechanism.

Dangerous Python Keywords?

13 votes

I am about to get a bunch of python scripts from an untrusted source.

I'd like to be sure that no part of the code can hurt my system, meaning:

(1) the code is not allowed to import ANY MODULE

(2) the code is not allowed to read or write any data, connect to the network etc

(the purpose of each script is to loop through a list, compute some data from input given to it and return the computed value)

before I execute such code, I'd like to have a script 'examine' it and make sure that there's nothing dangerous there that could hurt my system.

I thought of using the following approach: check that the word 'import' is not used (so we are guaranteed that no modules are imported)

yet, it would still be possible for the user (if desired) to write code to read/write files etc (say, using open).

Then here comes the question:

(1) where can I get a 'global' list of python methods (like open)?

(2) Is there some code that I could add to each script that is sent to me (at the top) that would make some 'global' methods invalid for that script (for example, any use of the keyword open would lead to an exception)?

I know that there are some solutions of python sandboxing. but please try to answer this question as I feel this is the more relevant approach for my needs.

EDIT: suppose that I make sure that no import is in the file, and that no possible hurtful methods (such as open, eval, etc) are in it. can I conclude that the file is SAFE? (can you think of any other 'dangerous' ways that built-in methods can be run?)

I'd check for eval() first, as you can obfuscate import with it. eval() evaluates code, so you can run this code:

eval('imp{0}rt os'.format('o')) # 'imp{0}rt os'.format('o') -> 'import os'

Which imports the os module without explicitly having the import statement within the script. But as @MK suggested in his comment, use a sandboxed Python installation. If something breaks, it's inside of the "box", not your system.

I'd look into PyPy's sandboxing capabilities here, as I've heard good stuff about it: http://codespeak.net/pypy/dist/pypy/doc/sandbox.html.

But to be completely assured that you're not going to have any troubles whatsoever, run a Virtual Machine within your computer. A Virtual Machine is a completely isolated operating system installation within your normal operating system: for example, you can run Mac OS X from within Ubuntu Linux, and they are completely isolated from one another.

Basically, it's just a miniature screen within a window (man, am I going overkill with this VM thing):

enter image description here

I'd suggest VirtualBox, as it's free and really simple to setup. The only downside to virtualizing an OS is that you actually have to install the operating system into the virtual machine, which takes as much time as installing the operating system onto a normal computer.

Either way, it's up to you, but if you're worried about security, virtualizing is the way to go.

End of nonblocking file

13 votes

How is end of file detected for a file in nonblocking mode?

At least on POSIX (including Linux), the obvious answer is that nonblocking regular files don't exist. Regular files ALWAYS block, and O_NONBLOCK is silently ignored.

Similarly, poll()/select() et al. will always tell you that a fd pointing to a regular file is ready for I/O, regardless of whether the data is ready in the page cache or still on disk (mostly relevant for reading).

EDIT And, since O_NONBLOCK is a no-op for regular files, a read() on a regular file will never set errno to EAGAIN, contrary to what another answer to this question claims.

EDIT2 References:

From the POSIX (p)select() specification: "File descriptors associated with regular files shall always select true for ready to read, ready to write, and error conditions."

From the POSIX poll() specification: "Regular files shall always poll TRUE for reading and writing."

The above suffices to imply that while perhaps not strictly prohibited, non-blocking regular files doesn't make sense as there would be no way to poll them except busy-waiting.

Beyond the above, there is at least some circumstantial evidence

From the POSIX open() specification: The behavior for file descriptors referring to pipes, block special files, and character special files is defined. "Otherwise, the behavior of O_NONBLOCK is unspecified."

Some related links:

http://tinyclouds.org/iocp-links.html

http://www.remlab.net/op/nonblock.shtml

http://davmac.org/davpage/linux/async-io.html

And, even here on stackoverflow:

Can regular file reading benefited from nonblocking-IO?

As the answer by R. points out, due to how page caching works, non-blocking for regular files is not very easily defined. E.g. what if by some mechanism you find out that data is ready for reading in the page cache, and then before you read it the kernel decides to kick that page out of cache due to memory pressure? It's different for things like sockets and pipes, because correctness requires that data is not discarded just like that.

Also, how would you select/poll for a seekable file descriptor? You'd need some new API that supported specifying which byte range in the file you're interested in. And the kernel implementation of that API would tie in to the VM system, as it would need to prevent the pages you're interested in from being kicked out. Which would imply that those pages would count against the process locked pages limit (see ulimit -l) in order to prevent a DOS. And, when would those pages be unlocked? And so on.

Discontinuous slice in python list

13 votes

Hi folks

I'm looking for an efficient way of achieving this, which I think is a slicing-like operation:

>>> mylist = range(100)
>>>magicslicer(mylist, 10, 20)
[0,1,2,3,4,5,6,7,8,9,30,31,32,33,34,35,36,37,38,39,60,61,62,63......,97,98,99]

the idea is: the slicing gets 10 elements, then skips 20 elements, then gets next 10, then skips next 20, and so on.

I think I should not use loops if possible, for the very reason to use slice is (I guess) to do the "extraction" efficiently in a single operation.

Thanks for reading.

itertools.compress (new in 2.7/3.1) nicely supports use cases like this one, especially when combined with itertools.cycle:

from itertools import cycle, compress
seq = range(100)
criteria = cycle([True]*10 + [False]*20) # Use whatever pattern you like
>>> list(compress(seq, criteria))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

Python 2.7 timing (relative to Sven's explicit list comprehension):

$ ./python -m timeit -s "a = range(100)" "[x for start in range(0, len(a), 30) for x in a[start:start+10]]"
100000 loops, best of 3: 4.96 usec per loop

$ ./python -m timeit -s "from itertools import cycle, compress" -s "a = range(100)" -s "criteria = cycle([True]*10 + [False]*20)" "list(compress(a, criteria))"
100000 loops, best of 3: 4.76 usec per loop

Python 3.2 timing (also relative to Sven's explicit list comprehension):

$ ./python -m timeit -s "a = range(100)" "[x for start in range(0, len(a), 30) for x in a[start:start+10]]"
100000 loops, best of 3: 7.41 usec per loop

$ ./python -m timeit -s "from itertools import cycle, compress" -s "a = range(100)" -s "criteria = cycle([True]*10 + [False]*20)" "list(compress(a, criteria))"
100000 loops, best of 3: 4.78 usec per loop

As can be seen, it doesn't make a great deal of difference relative to the in-line list comprehension in 2.7, but helps significantly in 3.2 by avoiding the overhead of the implicit nested scope.

A similar difference can also be seen in 2.7 if the aim is to iterate over the resulting sequence rather than turn it into a fully realised list:

$ ./python -m timeit -s "a = range(100)" "for x in (x for start in range(0, len(a), 30) for x in a[start:start+10]): pass"
100000 loops, best of 3: 6.82 usec per loop
$ ./python -m timeit -s "from itertools import cycle, compress" -s "a = range(100)" -s "criteria = cycle([True]*10 + [False]*20)" "for x in compress(a, criteria): pass"
100000 loops, best of 3: 3.61 usec per loop

For especially long patterns, it is possible to replace the list in the pattern expression with an expression like chain(repeat(True, 10), repeat(False, 20)) so that it never has to be fully created in memory.

Algorithm Implementations in Python

12 votes

A few months ago I stumbled across a post here that linked to a site that had a long list of open source standalone implementations of various algorithms in python (e.g. Dijkstra's shortest path, Floyd-Warshall, Prim's, Kruskal's, Knapsack, Subset-sum, various DP problems, sorts, etc.. basically a lot of the fundamental algorithms). I've been searching far and wide for that pointer, but have not been able to find it. Can anyone point me to a library like this?

Thanks.

Another good source in Rosetta Code.

Are there advantages to use the Python/C interface instead of Cython ?

12 votes

Hi,

I want to extend python and numpy by writing some modules in C or C++, using BLAS and LAPACK. I also want to be able to distribute the code as standalone C/C++ libraries. I would like this libraries to use both single and double precision float. Some examples of functions I will write are conjugate gradient for solving linear systems or accelerated first order methods. Some functions will need to call a Python function from the C/C++ code.

After playing a little with the Python/C API and the Numpy/C API, I discovered that many people advocate the use of Cython instead (see for example this question or this one). I am not an expert about Cython, but it seems that for some cases, you still need to use the Numpy/C API and know how it works. Given the fact that I already have (some little) knowledge about the Python/C API and none about Cython, I was wondering if it makes sense to keep on using the Python/C API, and if using this API has some advantages over Cython. In the future, I will certainly develop some stuff not involving numerical computing, so this question is not only about numpy. One of the thing I like about the Python/C API is the fact that I learn some stuff about how the Python interpreter is working.

Thanks.

First, there is one point in your question I don't get:

[...] also want to be able to distribute the code as standalone C/C++ libraries. [...] Some functions will need to call a Python function from the C/C++ code.

How is this supposed to work?

Next, as to your actual question, there are certainly advantages of using the Python/C API directly:

  • Most likely, you are more familar with writing C code than writing Cython code.

  • Writing your code in C gives you maximum control. To get the same performance from Cython code as from equivalent C code, you'll have to be very careful. You'll not only need to make sure to declare the types of all variables, you'll also have to set some flags adequately -- just one example is bounds checking. You will need intimate knowledge how Cython is working to get the best performance.

  • Cython code depends on Python. It does not seem to be a good idea to write code that should also be distributed as standalone C library in Cython

Custom dictionary lookup in Python

12 votes

Hi, if I have a dictionary like this

>>> d = {10: 3, 100: 2, 1000: 1}

I can type something like:

>>> d.get(10), d.get(100), d.get(1000)
(3, 2, 1)

Though I want that if the given key is not found, the value corresponding to the nearest key respect the given key is returned:

>>> d.get(20), d.get(60), d.get(200)
(3, 2, 2)

Instead the result in Python is

(None, None, None)

What's a Pythonic way to implement the behavior I described?

Thanks

You can derive from dict to change the behaviour of the get() method:

class ClosestDict(dict):
    def get(self, key):
        key = min(self.iterkeys(), key=lambda x: abs(x - key))
        return dict.get(self, key)

d = ClosestDict({10: 3, 100: 2, 1000: 1})
print (d.get(20), d.get(60), d.get(200))

prints

(3, 2, 2)

Note that the complexity of get() no longer is O(1), but O(n).

How to detect gestures in OpenKinect (with python wrappers)

11 votes

I've started looking into OpenKinect development, and to start, I'm trying to figure out how to look for certain gestures done by the person.

Are there any tutorials out there on how to do this? Or what would be a good place to start?

I'm just trying to do things like know when a person turns their hand in one direction or the other, for example. Although, I'd certainly appreciate any sort of help at all!

Thanks!

UPDATE: From what I can tell, I'm going to be using the OpenNI/NITE frameworks most likely, in addition with the ONIPY Python wrappers. So unless there's a better framework, I just need to figure out how to make my own gestures now

I'm not sure that it will be exactly what you want, but my brother has used the OpenNI/NITE library to recognize some gestures on the Kinect using Ruby. I saw a demo that my brother did where the computer recognized him giving the computer a wave.

There are Python bindings for that library with the onipy project, but I haven't personally used it. I suspect that it still may need some work, but I would certainly look into it. You'll probably want to read some documentation on the OpenNI website too.

Is there an interactive graphing library for python

10 votes

Hi, I'm looking for an interactive graphing library for Python.

By "graph", I meant a set of nodes connected by a set of vertices (not a plot of values over x-y axis, nor a grid of pixels).

By "interactive", I meant I can drag-and-drop the nodes around and I need to be able to click on the nodes/vertices and have the library pass the nodes/vertices to my callbacks, which may add/remove nodes/vertices or display information (I cannot load the full graph at startup as the dataset is too large/complex; instead I'll be loading only the necessary slices of data depending on user inputs).

By Python, I meant the programming language Python, the graphing library should have CPython binding. I have Python 2.7 and Python 3.1, but can downgrade to 2.6 if necessary. This language requirement is because the dataset I'm working with only have Python binding.

The graphing library must support directed graph and be able to layout the nodes automatically. I need to put labels on the nodes.

Preferably, the layouting algorithm should place adjacent nodes near each other. It should be able to handle from 100-1000 nodes and about 300-4000 vertices reasonably in my 4 year old laptop (I typically start with around 100 nodes, but the number might expand depending on user input). Preferably it should be a library with not too many dependencies (except perhaps for Gnome). Open source is preferred.

I have already written a simple prototype of my program using Tkinter Canvas, but I need a more serious graphing library to expand the program. I've looked at graphviz and matplotlib, but apparently they're only for working with static graphs and apparently would need significant amount of work to do interactive manipulations (correct me if I'm wrong, I've only looked at them briefly). I've also tried generating the graph to an SVG file and using Inkscape to view it, but it's too slow and takes too much memory and because of the sheer number of vertices it becomes a tangled mess.

Looks like Nodebox might be what you want:

http://nodebox.net/code/index.php/Graph Mac OSX

http://www.cityinabottle.org/nodebox/ Windows (using OpenGL)

Nodebox screenshot

The graph object has functionality for mouse interaction as well, bundled in the graph.events object. It has the following properties:

  • graph.events.hovered: None or the node over which the mouse hovers.
  • graph.events.pressed: None or the node on which the mouse is pressing down.
  • graph.events.dragged: None or the node being dragged.
  • graph.events.clicked: None or the last node clicked.
  • graph.events.popup: when True, will display a popup window over the hovered node.

Also came accross Gephi, looks like that might have the functionality you want as well.

http://gephi.org/ Windows, Linux and Mac OSX

Gephi is an interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.

gephi screenshot

Why is this shell script calling itself as python script?

9 votes

Obviously this shell script is calling itself as a Python script:

#!/bin/sh
## repo default configuration
##
REPO_URL='git://android.git.kernel.org/tools/repo.git'
REPO_REV='stable'

magic='--calling-python-from-/bin/sh--'
"""exec" python -E "$0" "$@" """#$magic"
if __name__ == '__main__':
  import sys
  if sys.argv[-1] == '#%s' % magic:
    del sys.argv[-1]
del magic
:
:

(Whole script: http://android.git.kernel.org/repo)

Can anyone explain

  • the purpose of calling it this way?
    Why not having #!/usr/bin/env python in the first line so it gets interpreted as Python script from the beginning?

  • the purpose of adding that magic last command line argument, that is removed afterwards in the beginning of the Python code?

Your first question: this is done to fix unix systems (or emulations thereof) that do not handle the #! correctly or at all. The high art is to make a script that is coreect in shell as well as in the other language. For perl, one often sees something like:

exec "/usr/bin/perl"
   if 0;

The exec is interpreted and executed by the shell, but the perl interpreter sees a conditional statetment (.... if ...) and does nothing because the condition is false.

Using MongoDB as our master database, should I use a separate graph database to implement relationships between entities?

9 votes

We're currently in the process of implementing a CRM-like solution internally for a professional firm. Due to the nature of the information stored, and the varying values and keys for the information we decided to use a document storage database, as it suited the purposes perfectly (In this case we chose MongoDB).

As part of this CRM solution we wish to store relationships and associations between entities, examples include storing conflicts of interest information, shareholders, trustees etc. Linking all these entities together in the most effective way we determined a central model of "relationship" was necessary. All relationships should have history information attached to them ( commencement and termination dates), as well as varying meta data; for example a shareholder relationship would also contain number of shares held.

As traditional RDBMS solutions didn't suit our former needs, using them in our current situation is not viable. What I'm trying to determine is whether using a graph database is more pertinent in our case, or if in fact just using mongo's built-in relational information is appropriate.

The relationship information is going to be used quite heavily throughout the system. An example of some of the informational queries we wish to perform are:

  • Get all 'key contact' people of companies who are 'clients' of 'xyz limited'
  • Get all other 'shareholders' of companies where 'john' is a shareholder
  • Get all 'Key contact' people of entities who are 'clients' of 'abc limited' and are clients of 'trust us bank limited'

Given this "tree" structure of relationships, is using a graph database (such as Neo4j) more appropriate?

Mike,

you should be able to store your relationship data in the graph database. Its high performance on traversing big graphs comes from locality, i.e. you don't run queries globally but rather start a a set of nodes (which equal documents in your case, which are looked up by an index. you might even store start-node-ids for quick access in your mongo documents). From there you can traverse arbitrarily large paths in constant time (wrt data set size).

What are your other requirements (i.e. data set size, # of concurrent accesses etc, relationship/graph complexity).

Your queries are a really good fit for the graph database and easily expressable in its terms.

I'd suggest that you just grab a graphdb like neo4j and do a quick spike with your domain to verify the general feasibility and also find out additional questions you would like to have answered before investing in the second technology.

P.S. If you hadn't started yet, you could also have gone with a pure graphdb approach as graph databases are a superset of document databases. And you'd rather talk domain in your case anyway than just generic documents. (E.g. structr is a CMS built on top of Neo4j).

MySQL -- Joins Between Databases On Different Servers Using Python?

9 votes

In MySQL, I have two different databases -- let's call them A and B.

Database A resides on server server1, while database B resides on server server2.

Both servers {A, B} are physically close to each other, but are on different machines and have different connection parameters (different username, different password etc).

In such a case, is it possible to perform a join between a table that is in database A, to a table that is in database B?

If so, how do I go about it, programatically, in python? (I am using python's MySQLDB to separately interact with each one of the databases).

Try to use FEDERATED Storage Engine.

Full text searching XML data with Python: best practices, pros & cons

8 votes

Task

I want to use Python for doing full text searches of XML data.

Example data

<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>

Basic functionality

The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.

Ideal functionality

The solution should be flexible and scalable. I am looking for possible combinations of these features:

  • search nested elements (infinite depth)
  • search attributes
  • search for sentences and paragraphs
  • search using wildcards
  • search using fuzzy matching
  • return precise matching info
  • good search speed for large XML files

Question

I don't expect a solution with all of the ideal functionality, I'll have to combine different existing functionalities and code the rest myself. But first I would like to know more about what there is out there, which libraries and approaches you would usually use for this, what their pros and cons are.

EDIT: Thanks for the answers so far, I added detail and started a bounty.

Not sure if that will be enough for your needs, but lxml has support for regular expressions in xpath (meaning: you can use xpath 1.0 plus the EXSLT extension functions for regular expressions)

Compared to the feature list that was added later:

  • search nested elements (infinite depth): yes
  • search attributes: yes
  • search for sentences and paragraphs: no. Assuming that "paragraphs" are actual xml elements, then yes. But "sentences" as such, no.
  • search using wildcards: yes (regular expressions)
  • search using fuzzy matching: no (assuming stemming, synonyms and so on...)
  • return precise matching info: yes
  • good search speed for large XML files: yes, except when your files are so extremely large that you would actually need a fulltext index to get good speed anyway.

The only way to satisfy all your request that I see, would be to load your files into a native xml database that supports "real" fulltext search (via XQuery Fulltext probably) and use that. (can't help you much further with that, maybe Sedna, which seems to have a python API and seems to supports fulltext search?)