Best python questions in September 2010

Peak detection in a 2D array

264 votes

I'm helping a veterinary clinic measure the pressure under a dog's paw. I use Python for my data analysis and now I'm stuck trying to divide the paws into (anatomical) subregions.

I made a 2D array of each paw, that consists of the maximal values for each sensor that has been loaded by the paw over time. Here's an example of one paw, where I used Excel to draw the areas I want to 'detect'. These are 2 by 2 boxes around the sensors with local maxima that together have the largest sum.

So I tried some experimenting and decided to simply look for the maximums of each column and row (can't look in one direction due to the shape of the paw). This seems to 'detect' the location of the separate toes fairly well, but it also marks neighboring sensors.

So what would be the best way to tell Python which of these maximums are the ones I want?

Note: The 2x2 squares can't overlap, since they have to be separate toes!

Also, I took 2x2 as a convenience; any more advanced solution is welcome, but I'm simply a human movement scientist, so I'm neither a real programmer nor a mathematician, so please keep it 'simple'.

Edit: Here's a link to my array with the four average paws. I used pickle to write it to the file as was suggested here

Here's a version that can be loaded with np.loadtxt


Results

So I tried @jextee's solution (see the results below). As you can see, it works very well on the front paws, but it works less well for the hind legs.

More specifically, it can't recognize the small peak that's the fourth toe. This is obviously inherent to the fact that the loop looks top down towards the lowest value, without taking into account where this is.

Would anyone know how to tweak @jextee's algorithm, so that it might be able to find the 4th toe too?

Since I haven't processed any other trials yet, I can't supply any other samples. But the data I gave before were the averages of each paw. This file is an array with the maximal data of 9 paws in the order they made contact with the plate.

This image shows how they were spatially spread out over the plate.

Update:

I have set up a blog for anyone interested and I have set up a SkyDrive with all the raw measurements. So to anyone requesting more data: more power to you!


New update:

So after the help I got with my questions regarding paw detection and paw sorting, I was finally able to check the toe detection for every paw! Turns out, it doesn't work so well in anything but paws sized like the one in my own example. Of course, in hindsight, it's my own fault for choosing the 2x2 so arbitrarily.

Here's a nice example of where it goes wrong: a nail is being recognized as a toe and the 'heel' is so wide, it gets recognized twice!

The paw is too large, so taking a 2x2 size with no overlap causes some toes to be detected twice. The other way around, in small dogs it often fails to find a 5th toe, which I suspect is caused by the 2x2 area being too large.

After trying the current solution on all my measurements I came to the staggering conclusion that for nearly all my small dogs it didn't find a 5th toe and that in over 50% of the impacts for the large dogs it would find more!

So clearly I need to change it. My own guess was changing the size of the neighborhood to something smaller for small dogs and larger for large dogs. But generate_binary_structure wouldn't let me change the size of the array.

Therefore, I'm hoping that anyone else has a better suggestion for locating the toes, perhaps having the toe area scale with the paw size?

I detected the peaks using a local maximum filter. Here is the result on your first dataset of 4 paws (image: peaks detection result).

I also ran it on the second dataset of 9 paws and it worked as well.

Here is how you do it:

import numpy as np
# the scipy.ndimage.filters / .morphology submodules are deprecated;
# the same functions are available directly from scipy.ndimage
from scipy.ndimage import maximum_filter, generate_binary_structure, binary_erosion
import matplotlib.pyplot as pp

#for some reason I had to reshape. Numpy ignored the shape header.
paws_data = np.loadtxt("paws.txt").reshape(4,11,14)

#getting a list of images
paws = [p.squeeze() for p in np.vsplit(paws_data,4)]


def detect_peaks(image):
    """
    Takes an image and detects the peaks using the local maximum filter.
    Returns a boolean mask of the peaks (i.e. 1 when
    the pixel's value is the neighborhood maximum, 0 otherwise)
    """

    # define an 8-connected neighborhood
    neighborhood = generate_binary_structure(2,2)

    #apply the local maximum filter; all pixel of maximal value 
    #in their neighborhood are set to 1
    local_max = maximum_filter(image, footprint=neighborhood)==image
    #local_max is a mask that contains the peaks we are 
    #looking for, but also the background.
    #In order to isolate the peaks we must remove the background from the mask.

    #we create the mask of the background
    background = (image==0)

    #a little technicality: we must erode the background in order to 
    #successfully subtract it from local_max, otherwise a line will 
    #appear along the background border (artifact of the local maximum filter)
    eroded_background = binary_erosion(background, structure=neighborhood, border_value=1)

    #we obtain the final mask, containing only peaks, 
    #by removing the background from the local_max mask
    #(xor is used because recent NumPy versions reject "-" on boolean arrays)
    detected_peaks = local_max ^ eroded_background

    return detected_peaks


#applying the detection and plotting results
for i, paw in enumerate(paws):
    detected_peaks = detect_peaks(paw)
    pp.subplot(4,2,(2*i+1))
    pp.imshow(paw)
    pp.subplot(4,2,(2*i+2) )
    pp.imshow(detected_peaks)

pp.show()

All you need to do afterwards is use scipy.ndimage.label on the mask to label all distinct objects. Then you'll be able to play with them individually.

Note that the method works well because the background is not noisy. If it were, you would detect a bunch of other unwanted peaks in the background. Another important factor is the size of the neighborhood. You will need to adjust it if the peak size changes (they should remain roughly proportional).
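The labelling step and the effect of the neighborhood size can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code behind the plots: the helper name label_peaks, its size parameter and the synthetic demo array are made up here, a square size x size window stands in for generate_binary_structure, and the background is removed with a plain image > 0 test (which assumes the background is exactly zero) instead of the erosion step.

```python
import numpy as np
from scipy import ndimage


def label_peaks(image, size=3):
    """Find local maxima in a square window and label them (hypothetical helper)."""
    # A pixel is a candidate peak if it equals the maximum of its size x size window.
    local_max = ndimage.maximum_filter(image, size=size) == image
    # Assumption: the background is exactly zero, so a simple test removes it.
    peaks = local_max & (image > 0)
    # Give every connected group of peak pixels its own integer label.
    labels, count = ndimage.label(peaks)
    # One (row, col) position per label, weighted by the pressure values.
    coords = ndimage.center_of_mass(image, labels, list(range(1, count + 1)))
    return labels, count, coords


# Synthetic 7x7 "paw": two toes three sensors apart plus one further away.
demo = np.zeros((7, 7))
demo[1, 1] = 5.0
demo[1, 4] = 4.0
demo[5, 3] = 9.0

for size in (3, 7):
    labels, count, coords = label_peaks(demo, size=size)
    print(size, count, coords)
```

With the small window all three synthetic toes are kept; with the large window the weaker of the two neighbouring toes is absorbed by the stronger one. That is the same trade-off as choosing a neighborhood that is too large for a small paw.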

Why are scripting languages (e.g. Perl, Python, Ruby) not suitable as shell languages?

82 votes

What are the differences between shell languages like bash, zsh, fish and the scripting languages above that makes them more suitable for the shell?

When using the command line the shell languages seem to be much easier. It feels much smoother to me to use bash, for example, than to use the shell profile in ipython, despite reports to the contrary. I think most will agree with me that a large portion of medium to large scale programming is easier in Python than in bash. I use Python as the language I am most familiar with; the same goes for Perl and Ruby.

I have tried to articulate the reason but am unable to, aside from assuming that the treatment of strings differently in both has something to do with it.

The reason for this question is that I am hoping to develop a language usable in both. If you know of such a language, please post it as well.

Edit: As S.Lott explains, the question needs some clarification. I am asking about the features of the shell language versus that of scripting languages. So the comparison is not about the characteristics of various interactive (REPL) environments such as history and command line substitution. An alternative expression for the question would be:

Can a programming language that is suitable for design of complex systems be at the same time able to express useful one-liners that can access the file system or control jobs? Can a programming language usefully scale up as well as scale down?

There are a couple of differences that I can think of. (Just thoughtstreaming here, there's no particular order to those.)

  1. Python & Co. are designed to be good at scripting. Bash & Co. are designed to be only good at scripting, with absolutely no compromise. IOW: Python is designed to be good both at scripting and non-scripting, Bash cares only about scripting.
  2. Bash & Co. are untyped, Python & Co. are strongly typed, which means that the number 123, the string 123 and the file 123 are quite different. They are, however, not statically typed, which means they need to have different literals for those, in order to keep them apart. Example:

    • Ruby: 123 (number), Bash: 123
    • Ruby: '123' (string), Bash: 123
    • Ruby: /123/ (regexp), Bash: 123
    • Ruby: File.open('123') (file), Bash: 123
    • Ruby: IO.open('123') (file descriptor), Bash: 123
    • Ruby: URI.parse('123') (URI), Bash: 123
    • Ruby: `123` (command), Bash: 123
  3. Python & Co. are designed to scale up to 10000, 100000, maybe even 1000000 line programs, Bash & Co. are designed to scale down to 10 character programs.

  4. In Bash & Co., files, directories, file descriptors, processes are all first-class objects, in Python, only Python objects are first-class, if you want to manipulate files, directories etc., you have to wrap them in a Python object first.
  5. Shell programming is basically dataflow programming. Nobody realizes that, not even the people who write shells, but it turns out that shells are quite good at that, and general-purpose languages not so much. In the general-purpose programming world, dataflow seems to be mostly viewed as a concurrency model, not so much as a programming paradigm.
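To make point 5 a little more concrete, here is a rough sketch of how a shell-style pipeline maps onto Python generators. The function names (read_lines, grep, count) and the in-memory log data are invented for this illustration; the point is only that each stage pulls items from the previous one, which is the dataflow style the shell gives you with a single | character.

```python
def read_lines(source):
    """Stage 1: emit lines one at a time, roughly like `cat`."""
    for line in source:
        yield line.rstrip("\n")


def grep(needle, lines):
    """Stage 2: pass through only lines containing `needle`, roughly like `grep -F`."""
    return (line for line in lines if needle in line)


def count(lines):
    """Stage 3: consume the stream and count it, roughly like `wc -l`."""
    return sum(1 for _ in lines)


# Hypothetical log data standing in for a file on disk.
log = ["ok start\n", "error disk\n", "ok run\n", "error net\n"]

# Rough equivalent of: cat log | grep error | wc -l
n_errors = count(grep("error", read_lines(log)))
print(n_errors)
```

Even in this small case the Python version needs three named functions and nested calls where the shell needs three words and two pipes, which is the scale-down problem from point 3.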

I have the feeling that trying to address these points by bolting features or DSLs onto a general-purpose programming language doesn't work. At least, I have yet to see a convincing implementation of it. There is RuSH (Ruby shell), which tries to implement a shell in Ruby, there is rush, which is an internal DSL for shell programming in Ruby, there is Hotwire, which is a Python shell, but IMO none of those come even close to competing with Bash, Zsh, fish and friends.

Actually, IMHO, the best current shell is Microsoft PowerShell, which is very surprising considering that for several decades now, Microsoft has continually had the worst shells evar. I mean, COMMAND.COM? Really? (Unfortunately, they still have a crappy terminal. It's still the "command prompt" that has been around since, what? Windows 3.0?)

PowerShell was basically created by ignoring everything Microsoft has ever done (COMMAND.COM, CMD.EXE, VBScript, JScript) and instead starting from the Unix shell, then removing all backwards-compatibility cruft (like backticks for command substitution) and massaging it a bit to make it more Windows-friendly (like using the now unused backtick as an escape character instead of the backslash which is the path component separator character in Windows). After that, is when the magic happens.

They address problem 1 and 3 from above, by basically making the opposite choice compared to Python. Python cares about large programs first, scripting second. Bash cares only about scripting. PowerShell cares about scripting first, large programs second. A defining moment for me was watching a video of an interview with Jeffrey Snover (PowerShell's lead designer), when the interviewer asked him how big of a program one could write with PowerShell and Snover answered without missing a beat: "80 characters." At that moment I realized that this is finally a guy at Microsoft who "gets" shell programming (probably related to the fact that PowerShell was neither developed by Microsoft's programming language group (i.e. lambda-calculus math nerds) nor the OS group (kernel nerds) but rather the server group (i.e. sysadmins who actually use shells)), and that I should probably take a serious look at PowerShell.

Number 2 is solved by having arguments be statically typed. So, you can write just 123 and PowerShell knows whether it is a string or a number or a file, because the cmdlet (which is what shell commands are called in PowerShell) declares the types of its arguments to the shell. This has pretty deep ramifications: unlike Unix, where each command is responsible for parsing its own arguments (the shell basically passes the arguments as an array of strings), argument parsing in PowerShell is done by the shell. The cmdlets specify all their options and flags and arguments, as well as their types and names and documentation(!) to the shell, which then can perform argument parsing, tab completion, IntelliSense, inline documentation popups etc. in one centralized place. (This is not revolutionary, and the PowerShell designers acknowledge shells like the DIGITAL Command Language (DCL) and the IBM OS/400 Command Language (CL) as prior art. For anyone who has ever used an AS/400, this should sound familiar. In OS/400, you can write a shell command and if you don't know the syntax of certain arguments, you can simply leave them out and hit F4, which will bring a menu (similar to an HTML form) with labelled fields, dropdown, help texts etc. This is only possible because the OS knows about all the possible arguments and their types.) In the Unix shell, this information is often duplicated three times: in the argument parsing code in the command itself, in the bash-completion script for tab-completion and in the manpage.

Number 4 is solved by the fact that PowerShell operates on strongly typed objects, which includes stuff like files, processes, folders and so on.

Number 5 is particularly interesting, because PowerShell is the only shell I know of, where the people who wrote it were actually aware of the fact that shells are essentially dataflow engines and deliberately implemented it as a dataflow engine.

Another nice thing about PowerShell is its naming conventions: all cmdlets are named Action-Object and moreover, there are also standardized names for specific actions and specific objects. (Again, this should sound familiar to OS/400 users.) For example, everything which is related to receiving some information is called Get-Foo. And everything operating on (sub-)objects is called Bar-ChildItem. So, the equivalent to ls is Get-ChildItem (although PowerShell also provides builtin aliases ls and dir – in fact, whenever it makes sense, they provide both Unix and CMD.EXE aliases as well as abbreviations (gci in this case)).

But the killer feature IMO is the strongly typed object pipelines. While PowerShell is derived from the Unix shell, there is one very important distinction: in Unix, all communication (both via pipes and redirections as well as via command arguments) is done with untyped, unstructured, ASCII strings. In PowerShell, it's all strongly typed, structured objects. This is so incredibly powerful that I seriously wonder why no one else has thought of it. (Well, they have, but they never became popular.) In my shell scripts, I estimate that up to one third of the commands are only there to act as an adapter between two other commands that don't agree on a common textual format. Many of those adapters go away in PowerShell, because the cmdlets exchange structured objects instead of unstructured text. And if you look inside the commands, then they pretty much consist of three stages: parse the textual input into an internal object representation, manipulate the objects, convert them back into text. Again, the first and third stage basically go away, because the data already comes in as objects.
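The three stages can be illustrated with a small Python sketch (Python rather than PowerShell, purely to show the shape of the argument). Everything here is hypothetical: the FileInfo record, the "size name" text format and the helper names are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical structured record, standing in for a typed pipeline object.
FileInfo = namedtuple("FileInfo", ["name", "size"])


def parse_listing(lines):
    """Stage 1 (text world only): turn 'size name' lines into objects."""
    for line in lines:
        size, name = line.split(None, 1)
        yield FileInfo(name=name, size=int(size))


def largest(files, n):
    """Stage 2: the actual work, done on objects."""
    return sorted(files, key=lambda f: f.size, reverse=True)[:n]


def format_listing(files):
    """Stage 3 (text world only): turn objects back into text."""
    return ["%d %s" % (f.size, f.name) for f in files]


# Text pipeline: all three stages are needed.
text_in = ["120 a.txt", "4096 b.bin", "15 c.cfg"]
text_out = format_listing(largest(parse_listing(text_in), 2))

# Object pipeline: only the middle stage remains.
objects_in = [FileInfo("a.txt", 120), FileInfo("b.bin", 4096), FileInfo("c.cfg", 15)]
objects_out = largest(objects_in, 2)

print(text_out)
print(objects_out)
```

In the object version, parse_listing and format_listing simply disappear, which is the sense in which the adapters "go away".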

However, the designers have taken great care to preserve the dynamicity and flexibility of shell scripting through what they call an Adaptive Type System.

Anyway, I don't want to turn this into a PowerShell commercial. There are plenty of things that are not so great about PowerShell, although most of those have to do either with Windows or with the specific implementation, and not so much with the concepts. (E.g. the fact that it is implemented in .NET means that the very first time you start up the shell can take up to several seconds if the .NET framework is not already in the filesystem cache due to some other application that needs it. Considering that you often use the shell for well under a second, that is completely unacceptable.)

The most important point I want to make is that if you want to look at existing work in scripting languages and shells, you shouldn't stop at Unix and the Ruby/Python/Perl/PHP family. For example, Tcl was already mentioned. Rexx would be another scripting language. Emacs Lisp would be yet another. And in the shell realm there are some of the already mentioned mainframe/midrange shells such as the OS/400 command line and DCL. Also, Plan9's rc.

while (1) Vs. for while(True) -- Why is there a difference?

34 votes

Intrigued by this question about infinite loops in perl: http://stackoverflow.com/questions/885908/while-1-vs-for-is-there-a-speed-difference, I decided to run a similar comparison in python. I expected that the compiler would generate the same byte code for while(True): pass and while(1): pass, but this is actually not the case in python2.7.

The following script:

import dis

def while_one():
    while 1:
        pass

def while_true():
    while True:
        pass

print("while 1")
print("----------------------------")
dis.dis(while_one)

print("while True")
print("----------------------------")
dis.dis(while_true)

produces the following results:

while 1
----------------------------
  4           0 SETUP_LOOP               3 (to 6)

  5     >>    3 JUMP_ABSOLUTE            3
        >>    6 LOAD_CONST               0 (None)
              9 RETURN_VALUE        
while True
----------------------------
  8           0 SETUP_LOOP              12 (to 15)
        >>    3 LOAD_GLOBAL              0 (True)
              6 JUMP_IF_FALSE            4 (to 13)
              9 POP_TOP             

  9          10 JUMP_ABSOLUTE            3
        >>   13 POP_TOP             
             14 POP_BLOCK           
        >>   15 LOAD_CONST               0 (None)
             18 RETURN_VALUE        

Using while True is noticeably more complicated. Why is this?

In other contexts, python acts as though True equals 1:

>>> True == 1
True

>>> True + True
2

Why does while distinguish the two?

I noticed that python3 does evaluate the statements using identical operations:

while 1
----------------------------
  4           0 SETUP_LOOP               3 (to 6) 

  5     >>    3 JUMP_ABSOLUTE            3 
        >>    6 LOAD_CONST               0 (None) 
              9 RETURN_VALUE         
while True
----------------------------
  8           0 SETUP_LOOP               3 (to 6) 

  9     >>    3 JUMP_ABSOLUTE            3 
        >>    6 LOAD_CONST               0 (None) 
              9 RETURN_VALUE         

Is there a change in python3 to the way booleans are evaluated?

In Python 2.x, True is not a keyword, but just a built-in global constant that is defined as 1 in the bool type. Therefore, the interpreter still has to load the contents of True. In other words, True is reassignable:

Python 2.7 (r27:82508, Jul  3 2010, 21:12:11) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> True = 4
>>> True
4

In Python 3.x, it truly becomes a keyword and a real constant:

Python 3.1.2 (r312:79147, Jul 19 2010, 21:03:37) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> True = 4
  File "<stdin>", line 1
SyntaxError: assignment to keyword

thus the interpreter can replace the while True: loop with an infinite loop.
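The same facts can be checked from a script instead of the interactive prompt. This is a small sketch that assumes it is run under Python 3; the "<demo>" filename passed to compile is just a placeholder, and the last comparison is printed rather than asserted because bytecode details are specific to the CPython version in use.

```python
import keyword

# In Python 3, True is a reserved word rather than an ordinary global name.
print(keyword.iskeyword("True"))

# Assigning to it is rejected at compile time, before any code runs.
try:
    compile("True = 4", "<demo>", "exec")
    assignment_allowed = True
except SyntaxError:
    assignment_allowed = False
print(assignment_allowed)

# Compare the raw bytecode the compiler emits for both loop forms.
same_bytecode = (
    compile("while 1: pass", "<demo>", "exec").co_code
    == compile("while True: pass", "<demo>", "exec").co_code
)
print(same_bytecode)
```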

How to properly determine current script directory in Python?

21 votes

I would like to see what is the best way to determine the current script directory in Python.

I discovered that, due to the many ways of calling Python code, it is hard to find a good solution.

Here are some problems:

  • __file__ is not defined if the script is executed with exec, execfile
  • __module__ is defined only in modules

Use cases:

  • ./myfile.py
  • python myfile.py
  • ./somedir/myfile.py
  • python somedir/myfile.py
  • exec('myfile.py') (from another script, which can be located in another directory and can have another current directory)

I know that there is no perfect solution that covers every case, but I'm looking for the best approach that solves most of them.

The most used approach is os.path.dirname(os.path.abspath(__file__)) but this really doesn't work if you execute the script from another one with exec().

If you really want to cover the case that a script is called via execfile(...), you can use the inspect module to deduce the filename (including the path). As far as I am aware, this will work for all cases you listed:

import inspect
import os
filename = inspect.getframeinfo(inspect.currentframe()).filename
path = os.path.dirname(os.path.abspath(filename))
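Wrapped in a function, the same idea might look like the sketch below. The name current_script_dir is made up for this example, and two things are assumptions worth checking for your setup: inspect.currentframe() is taken to return a real frame (it can return None on interpreters without frame support), and os.path.realpath is used so that symlinks are resolved, which the two-line version above does not do.

```python
import inspect
import os


def current_script_dir():
    """Directory of the file that defines this function (hypothetical helper)."""
    frame = inspect.currentframe()  # assumed not to be None (true on CPython)
    try:
        filename = inspect.getframeinfo(frame).filename
    finally:
        # Drop the frame reference promptly to avoid a reference cycle.
        del frame
    return os.path.dirname(os.path.realpath(filename))


print(current_script_dir())
```

Note that the frame inspected is the helper's own, so if this function lives in a shared utility module it reports that module's directory, not the caller's.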

What is the Perl version of a Python iterator?

21 votes

I am learning Perl at my work and enjoying it. I usually do my work in Python but boss wants Perl.

Most of the concepts in Python and Perl match nicely: Python dictionary=Perl hash; Python tuple=Perl list; Python list=Perl array; etc.

Question: Is there a Perl version of the Python form of an Iterator / Generator?

An example: A Classic Python way to generate the Fibonacci numbers is:

#!/usr/bin/python

def fibonacci(mag):
     a, b = 0, 1
     while a<=10**mag:
         yield a
         a, b = b, a+b

for number in fibonacci(15):  
     print("%17d" % number)

Iterators are also useful if you want to generate a subsection of a much larger list as needed. Perl 'lists' seem more static - more like a Python tuple. In Perl, can foreach be dynamic or is only based on a static list?

The Python form of Iterator is a form that I have gotten used to, and I do not find it documented in Perl... Other than writing this in loops or recursively or generating a huge static list, how do I (for ex) write the Fibonacci subroutine in Perl? Is there a Perl yield that I am missing?

Specifically -- how do I write this:

#!/usr/bin/perl
use warnings; use strict; # yes -- i use those!

sub fibonacci {
   # What goes here other than returning an array or list? 
}

foreach my $number (fibonacci(15)) { print $number . "\n"; }

Thanks in advance for being kind to the newbie...

For an even more flexible solution than Python's generators, I have written the module List::Gen on CPAN which provides random access lazy generator arrays:

use List::Gen;

my $fib; $fib = cache gen {$_ < 2  ? $_ : $$fib[$_ - 1] + $$fib[$_ - 2]};

say "@$fib[0 .. 15]";  #  0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610

Since generators pretend to be arrays, they can mix seamlessly with normal perl code. There is also an object oriented approach:

my $fib; $fib = cache gen {$_ < 2 ? $_ : $fib->get($_ - 1) + $fib->get($_ - 2)};

say join ' ' => $fib->slice(0 .. 15);

In each case, the generator is lazy, calculating nothing upon creation, and then calculating only those values required to satisfy the slices. The recursive definition of the Fibonacci sequence calls itself many times, so the cache function is used to make sure each value is only calculated once.

You can also use generators as iterators:

while (defined(my $num = $fib->next)) {   # defined(): the first value, 0, is false
    last if $num > 10**15;
    print "$num\n";
}

$fib->next can also be written $fib->(). Since the generator is still random access, you can $fib->reset() or $fib->index = 10;

Let me know if you have any questions.

Update:

I have released a new version of the module (0.80) that makes it easier to use iterative algorithms in generators. Here is an example that closely mirrors the OP's example:

use List::Gen '*';

sub fibonacci {
    my $limit   = 10**shift;
    my ($x, $y) = (0, 1);

    While {$_ < $limit} gather {
        ($x, $y) = ($y, take($x) + $y)
    }
}

say for @{fibonacci 15};

If you add use bigint; before or at the top of the sub, you can of course do:

say for @{fibonacci 400}; # or more

Generating all 5 card poker hands

21 votes

This problem sounds simple at first glance, but turns out to be a lot more complicated than it seems. It's got me stumped for the moment.

There are C(52, 5) = 2,598,960 ways to choose 5 cards from a 52 card deck. However, since suits are interchangeable in poker, many of these are equivalent - the hand 2H 2C 3H 3S 4D is equivalent to 2D 2S 3D 3C 4H - simply swap the suits around. According to Wikipedia, there are 134,459 distinct 5 card hands once you account for possible suit recolorings.

The question is, how do we efficiently generate all these possible hands? I don't want to generate all hands, then eliminate duplicates, as I want to apply the problem to larger numbers of cards, and the number of hands to evaluate quickly spirals out of control. My current attempts have centered around either generating depth-first, and keeping track of the currently generated cards to determine what suits and ranks are valid for the next card, or breadth-first, generating all possible next cards, then removing duplicates by converting each hand to a 'canonical' version by recoloring. Here's my attempt at a breadth-first solution, in Python:

# A card is represented by an integer. The low 2 bits represent the suit, while
# the remainder represent the rank.
suits = 'CDHS'
ranks = '23456789TJQKA'

def make_canonical(hand):
  suit_map = [None] * 4
  next_suit = 0
  for i in range(len(hand)):
    suit = hand[i] & 3
    if suit_map[suit] is None:
      suit_map[suit] = next_suit
      next_suit += 1
    hand[i] = hand[i] & ~3 | suit_map[suit]
  return hand

def expand_hand(hand, min_card):
  used_map = 0
  for card in hand:
    used_map |= 1 << card

  hands = set()
  for card in range(min_card, 52):
    if (1 << card) & used_map:
      continue
    new_hand = list(hand)
    new_hand.append(card)
    make_canonical(new_hand)
    hands.add(tuple(new_hand))
  return hands

def expand_hands(hands, num_cards):
  for i in range(num_cards):
    new_hands = set()
    for j, hand in enumerate(hands):
      min_card = hand[-1] + 1 if i > 0 else 0
      new_hands.update(expand_hand(hand, min_card))
    hands = new_hands
  return hands

Unfortunately, this generates too many hands:

>>> len(expand_hands(set([()]), 5))
160537

Can anyone suggest a better way to generate just the distinct hands, or point out where I've gone wrong in my attempt?

Your overall approach is sound. I'm pretty sure the problem lies with your make_canonical function. You can try printing out the hands with num_cards set to 3 or 4 and look for equivalencies that you've missed.

I found one, but there may be more:

# The inputs are equivalent and should return the same value
print(make_canonical([8, 12 | 1])) # returns [8, 13]
print(make_canonical([12, 8 | 1])) # returns [12, 9]

For reference, below is my solution (developed prior to looking at your solution). I used a depth-first search instead of a breadth-first search. Also, instead of writing a function to transform a hand to canonical form, I wrote a function to check if a hand is canonical. If it's not canonical, I skip it. I defined rank = card % 13 and suit = card // 13. None of those differences are important.

import collections

def canonical(cards):
    """
    Rules for a canonical hand:
    1. The cards are in sorted order

    2. The i-th suit must have at least as many cards as all later suits.  If a
       suit isn't present, it counts as having 0 cards.

    3. If two suits have the same number of cards, the ranks in the first suit
       must be lower or equal lexicographically (e.g., [1, 3] <= [2, 4]).

    4. Must be a valid hand (no duplicate cards)
    """

    if sorted(cards) != cards:
        return False
    by_suits = collections.defaultdict(list)
    for suit in range(0, 52, 13):
        by_suits[suit] = [card%13 for card in cards if suit <= card < suit+13]
        if len(set(by_suits[suit])) != len(by_suits[suit]):
            return False
    for suit in range(13, 52, 13):
        suit1 = by_suits[suit-13]
        suit2 = by_suits[suit]
        if not suit2: continue
        if len(suit1) < len(suit2):
            return False
        if len(suit1) == len(suit2) and suit1 > suit2:
            return False
    return True

def deal_cards(permutations, n, cards):
    if len(cards) == n:
        permutations.append(list(cards))
        return
    start = 0
    if cards:
        start = max(cards) + 1
    for card in range(start, 52):
        cards.append(card)
        if canonical(cards):
            deal_cards(permutations, n, cards)
        del cards[-1]

def generate_permutations(n):
    permutations = []
    deal_cards(permutations, n, [])
    return permutations

for cards in generate_permutations(5):
    print(cards)

It generates the correct number of permutations:

Cashew:~/$ python2.6 /tmp/cards.py | wc
134459
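As a cross-check on the equivalence that make_canonical misses, a brute-force canonical form can be sketched by trying all 24 suit relabellings and keeping the smallest result. This is not the approach used in the solution above and it is far too slow for generating 5 card hands at scale; it is only meant as a reference to test a faster function against. It assumes the question's card encoding (low 2 bits are the suit, the remaining bits the rank), and the name canonical_form is made up for the example.

```python
from itertools import combinations, permutations


def canonical_form(hand):
    """Smallest sorted tuple reachable by relabelling suits (brute force)."""
    best = None
    for perm in permutations(range(4)):
        # Keep the rank bits, replace the suit bits according to this relabelling.
        relabeled = tuple(sorted((card & ~3) | perm[card & 3] for card in hand))
        if best is None or relabeled < best:
            best = relabeled
    return best


# The two equivalent hands from the example above now agree.
print(canonical_form([8, 13]), canonical_form([12, 9]))

# Sanity check on a small case: distinct 2 card hands up to suit relabelling.
two_card_classes = len({canonical_form(h) for h in combinations(range(52), 2)})
print(two_card_classes)
```

For two cards the count should match the familiar number of distinct starting hands (13 pairs, 78 suited and 78 unsuited combinations).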

strange python behaviour with mixing globals/parameters and function named 'top'

18 votes

The following code (not run directly in an interpreter, but executed as a file)

def top(deck):
    pass

def b():
    global deck

produces the error

SyntaxError: name 'deck' is local and global

on python2.6.4 and

SyntaxError: name 'deck' is parameter and global

on python 3.1

python2.4 seems to accept this code, so does the 2.6.4 interactive interpreter.

This is already odd; why is 'deck' conflicting if it's a global in one method and a parameter in the other?

But it gets weirder. Rename 'top' to basically anything else, and the problem disappears.

Can someone explain this behaviour? I feel like I'm missing something very obvious here. Is the name 'top' somehow affecting certain scoping internals?

Update

This indeed appears to be a bug in the python core. I have filed a bug report.

It looks like it is a bug in the symbol table handling. Python/symtable.c has some code that (although somewhat obfuscated) does indeed treat 'top' as a special identifier:

if (!GET_IDENTIFIER(top) ||
    !symtable_enter_block(st, top, ModuleBlock, (void *)mod, 0)) {
    PySymtable_Free(st);
    return NULL;
}

followed somewhat later by:

if (name == GET_IDENTIFIER(top))
    st->st_global = st->st_cur->ste_symbols;

Further up the file there's a macro:

#define GET_IDENTIFIER(VAR) \
    ((VAR) ? (VAR) : ((VAR) = PyString_InternFromString(# VAR)))

which uses the C preprocessor to initialise the variable top to an interned string with the name of the variable.

I think the symbol table must be using the name 'top' to refer to the top level code, but why it doesn't use something that can't conflict with a real variable I have no idea.

I would report it as a bug if I were you.

Writing a parser for regular expressions

15 votes

Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.

So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.

To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.

The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.

EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.

Writing an implementation of a regular expression engine is indeed quite a complex task.

But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.
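To give a flavour of the state-set approach the article argues for, here is a minimal sketch. It is not code from the article: it assumes a deliberately tiny, made-up pattern syntax (literal characters, '.', and the postfix operators '*' and '?', with no groups, alternation or escapes), and the names parse, closure and fullmatch are hypothetical. Instead of backtracking, it keeps the set of every pattern position that could currently be active and advances all of them one input character at a time.

```python
def parse(pattern):
    """Split the pattern into (char, quantifier) items; quantifier is '', '*' or '?'."""
    items = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        quant = ''
        if i + 1 < len(pattern) and pattern[i + 1] in '*?':
            quant = pattern[i + 1]
            i += 1
        items.append((ch, quant))
        i += 1
    return items


def closure(items, states):
    """Add every position reachable without consuming input (skipping x* or x?)."""
    seen = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        if s < len(items) and items[s][1] in ('*', '?') and s + 1 not in seen:
            seen.add(s + 1)
            stack.append(s + 1)
    return seen


def fullmatch(pattern, text):
    """Return True if the whole text matches the whole pattern."""
    items = parse(pattern)
    states = closure(items, {0})
    for c in text:
        nxt = set()
        for s in states:
            if s < len(items):
                ch, quant = items[s]
                if ch == '.' or ch == c:
                    # A starred item may match again; anything else moves on.
                    nxt.add(s if quant == '*' else s + 1)
        states = closure(items, nxt)
        if not states:
            return False
    return len(items) in states


print(fullmatch("a*b", "aaab"))   # True
print(fullmatch("a?b", "aab"))    # False
# The kind of pattern that hurts a backtracking matcher stays cheap here.
print(fullmatch("a?" * 25 + "a" * 25, "a" * 25))   # True
```

Because the set of active positions can never be larger than the pattern, the work per input character is bounded by the pattern length, which is the property the article contrasts with backtracking.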

Why don't they implement python and ruby on the web browsers?

15 votes

I wonder, why don't they implement other languages like Python and Ruby in web browsers?

Aren't they suitable as client-side programming languages, or did it just happen that JavaScript was the first one to be implemented, and they then kept on supporting only JavaScript because it just worked?

I mean... I really hate JavaScript compared to Ruby. No matter how hard I try to like it, as soon as I see Ruby code I want to cry for JavaScript for being so ugly.

Will there be no chance at all for Ruby and Python on the browser side (without having to use Silverlight) for the next 10 years?

At the time JavaScript was designed, Python would have been at a very immature stage (1.2-ish) and Ruby wouldn't have existed at all. What we consider a modern scripting language now didn't exist then. Python didn't gain Unicode support (vital for a web browser) until version 1.6, several years later; Ruby... well, yeah.

The dominant scripting language then was Perl. Let us be thankful Eich didn't copy that.

Technically, a language for execution on the client side needs strong sandboxing capabilities that CPython and Ruby don't have. Whilst Python can be integrated into IE via the Windows Scripting Host, doing so completely hoses your security. It is not trivial to create a sandboxed version of a language that wasn't designed for it.

"Will there be no chance at all for Ruby and Python on the browser side?"

No, none whatsoever, even in a restricted form that solved the security problems. Even Microsoft couldn't make VBScript for the web catch on. JavaScript is the language that works everywhere; you aren't going to be able to beat that inertia.

At this point we must concentrate on improving the language. The standardisation of ECMAScript Fifth Edition is a big step forward, offering new methods that really help with writing terse code that passes around functions like Ruby blocks. And Mozilla's JavaScript implementation offers some interesting new features like Python-style generators. (On the other hand, it also supports E4X, a vile pox on the language, so whatever.)

JS is not so bad, written tastefully. Of course, the majority of code out there, and in tutorials, is anything but tasteful. But that's hardly a problem limited to JS.

Why do you have to call .iteritems() when iterating over a dictionary in python?

14 votes

Why do you have to call iteritems() to iterate over key, value pairs in a dictionary? I.e.:


dic = {'one':'1', 'two':'2'}
for k, v in dic.iteritems():
    print k, v

Why isn't that the default behavior of iterating over a dictionary?


for k, v in dic:
    print k, v

For every Python container C, the expectation is that

for item in C:
    assert item in C

will pass just fine -- wouldn't you find it astonishing if one sense of in (the loop clause) had a completely different meaning from the other (the presence check)? I sure would! It naturally works that way for lists, sets, tuples, ...

So, when C is a dictionary, if in were to yield key/value tuples in a for loop, then, by the principle of least astonishment, in would also have to take such a tuple as its left-hand operand in the containment check.

How useful would that be? Pretty useless indeed, basically making if (key, value) in C a synonym for if C.get(key) == value -- which is a check I believe I may have performed, or wanted to perform, 100 times more rarely than what if k in C actually means, checking the presence of the key only and completely ignoring the value.

On the other hand, wanting to loop just on keys is quite common, e.g.:

for k in thedict:
    thedict[k] += 1

having the value as well would not help particularly:

for k, v in thedict.items():
    thedict[k] = v + 1

actually somewhat less clear and less concise. (Note that items was the original spelling of the "proper" methods to use to get key/value pairs: unfortunately that was back in the days when such accessors returned whole lists, so to support "just iterating" an alternative spelling had to be introduced, and iteritems it was -- in Python 3, where backwards compatibility constraints with previous Python versions were much weakened, it became items again).
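The symmetry argued for above can be checked directly. This is only an illustrative sketch: the dictionary contents come from the question, the variable names are made up, and it uses items() so that it also runs on Python 3 (on Python 2, iteritems() avoids building the intermediate list).

```python
d = {'one': '1', 'two': '2'}

# Looping yields keys, and every yielded key passes the membership test.
for k in d:
    assert k in d

# Key/value pairs have to be asked for explicitly.
pairs = sorted(d.items())
print(pairs)

# The rarely needed "is this exact pair present?" check, spelled two ways.
assert ('one', '1') in d.items()
assert d.get('one') == '1'
```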

Understanding __get__ and __set__ and Python descriptors.

14 votes

I am trying to understand what Python's descriptors are and what they can be useful for. However, I am failing at it. I understand how they work, but here are my doubts. Consider the following code:

class Celsius(object):
    def __init__(self, value=0.0):
        self.value = float(value)
    def __get__(self, instance, owner):
        return self.value
    def __set__(self, instance, value):
        self.value = float(value)


class Temperature(object):
    celsius = Celsius()

  1. Why do I need the descriptor class? Please explain using this example or the one you think is better.

  2. What are instance and owner here (in __get__)? In particular, what is the purpose of the third parameter?

  3. How would I call/use this example?

Sorry for being such a noob, but I can't really understand how to get this working.

The descriptor is how Python's property type is implemented. A descriptor simply implements __get__, __set__, etc. and is then added to another class in its definition (as you did above with the Temperature class). For example:

temp=Temperature()
temp.celsius #calls Celsius.__get__

Accessing the property you assigned the descriptor to (celsius in the above example) calls the appropriate descriptor method.

instance in __get__ is the instance of the class (so above, __get__ would receive temp), while owner is the class with the descriptor (so it would be Temperature).

You need to use a descriptor class to encapsulate the logic that powers it. That way, if the descriptor is used to cache some expensive operation (for example), it could store the value on itself and not its class.

An article about descriptors can be found at http://martyalchin.com/2007/nov/23/python-descriptors-part-1-of-2/

EDIT: As jchl pointed out in the comments, if you simply try Temperature.celsius, instance will be None.
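A minimal sketch of how the pieces fit together. It deliberately differs from the question's code: the value is kept on each instance (under a hypothetical '_celsius' key) rather than on the descriptor, because a value stored on the descriptor is shared by every Temperature object. Treat it as one possible arrangement, not the only correct one.

```python
class Celsius(object):
    """Descriptor that stores the value on each instance, not on itself."""

    def __get__(self, instance, owner):
        if instance is None:
            # Accessed on the class (Temperature.celsius): there is no instance.
            return self
        return instance.__dict__.get('_celsius', 0.0)

    def __set__(self, instance, value):
        instance.__dict__['_celsius'] = float(value)


class Temperature(object):
    celsius = Celsius()


a = Temperature()
b = Temperature()
a.celsius = 25       # goes through Celsius.__set__ with instance=a
print(a.celsius)     # goes through Celsius.__get__ with instance=a, owner=Temperature
print(b.celsius)     # b has its own value, untouched by the assignment to a
```

Here a.celsius reads back as 25.0 while b.celsius stays at the 0.0 default, and Temperature.celsius returns the descriptor object itself because instance is None in that case.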

Installing SetupTools on 64-bit Windows

13 votes

I'm running Python 2.7 on Windows 7 64-bit, and when I run the installer for setuptools it tells me that Python 2.7 is not installed. The specific error message is:

`Python Version 2.7 required which was not found in the registry`

My installed version of Python is:

`Python 2.7 (r27:82525, Jul  4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on win32`

I'm looking at the setuptools site and it doesn't mention any installers for 64-bit Windows. Have I missed something or do I have to install this from source?

Apparently (having faced related 64- and 32-bit issues on OS X) there is a bug in the Windows installer. I stumbled across this workaround, which might help?

Why there are no ++ and -- operators in Python?

12 votes

Why there are no ++/-- operators in Python?

It's not because it doesn't make sense; it makes perfect sense to define "x++" as "x += 1, evaluating to the previous binding of x".

If you want to know the original reason, you'll have to either wade through old Python mailing lists or ask somebody who was there (e.g. Guido), but it's easy enough to justify after the fact:

Simple increment and decrement aren't needed as much as in other languages. You don't write things like for(int i = 0; i < 10; ++i) in Python very often; instead you do things like for i in range(0, 10).

Since it's not needed nearly as often, there's much less reason to give it its own special syntax; when you do need to increment, += is usually just fine.

It's not a decision of whether it makes sense, or whether it can be done--it does, and it can. It's a question of whether the benefit is worth adding to the core syntax of the language. Remember, this is four operators--postinc, postdec, preinc, predec, and each of these would need to have its own class overloads; they all need to be specified, and tested; it would add opcodes to the language (implying a larger, and therefore slower, VM engine); every class that supports a logical increment would need to implement them (on top of += and -=).

This is all redundant with += and -=, so it would become a net loss.
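As a small illustration of why += usually suffices, here is a sketch with made-up variable names. It also shows one thing worth knowing: ++x is accepted by the parser, but only as two unary plus operators, so it does not increment anything.

```python
x = 5
x += 1            # the usual spelling of an increment

# ++x parses as +(+x): two unary plus operators, no assignment.
y = ++x           # y gets the current value of x; x is unchanged

# Counting loops rarely need a manual increment at all.
names = ['a', 'b', 'c']
indexed = list(enumerate(names))
print(x, y, indexed)
```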

Pythonic way to check if a list is sorted or not

12 votes

Is there a Pythonic way to check if a list is already sorted in ASC or DESC order?

listtimestamps=[1,2,3,5,6,7]

something like listtimestamps.isSorted() that returns True or False.

EDIT: I want to input a list of timestamps for some messages and check if the transactions appeared in the correct order.

and I can write a custom function too :) , Thanks

EDIT: Thanks for pointing out camel case issues :)

EDIT: Also, I am sorry if I wasn't clear: I don't need to sort it afterwards, I just need to check if the list is sorted. I am using it to solve a decision problem.

Actually we are not giving the answer anijhaw is looking for. Here is the one liner:

all(l[i] <= l[i+1] for i in xrange(len(l)-1))
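The one-liner can be wrapped in a small function that also covers the descending case asked about. This is a sketch: the name is_sorted and the reverse parameter are my own choices, and range is used instead of xrange so that it also runs on Python 3.

```python
def is_sorted(seq, reverse=False):
    """Return True if seq is in non-decreasing (or, with reverse, non-increasing) order."""
    if reverse:
        return all(seq[i] >= seq[i + 1] for i in range(len(seq) - 1))
    return all(seq[i] <= seq[i + 1] for i in range(len(seq) - 1))


listtimestamps = [1, 2, 3, 5, 6, 7]
print(is_sorted(listtimestamps))                 # True
print(is_sorted(listtimestamps, reverse=True))   # False
```

Equal neighbours count as sorted here; swap <= for < if repeated timestamps should be rejected.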

PEP8 and PyQt, how to reconcile

11 votes

I'm starting to use PyQt in some projects and I'm running into a stylistic dilemma. PyQt's functions use camel case, but PEP8, which I prefer to follow, says to use underscores and all lowercase for function names. So on the one hand, I can continue to follow PEP8, meaning that my code will have mixed function calls to camel case and underscore functions, and even my classes will have mixed function names, since I'll need to be overriding functions like mousePressEvent. Or, I can break PEP8 and adopt camel case for all my function names in the name of consistency.

I realize this is subjective and it's really just what I personally prefer, but I'd like to hear from others about what they do and why they chose to do it that way.

In your shoes, I wouldn't fight your framework, just like, as a general principle, I don't fight City Hall;-). I happen to share your preference for lowercase-with-underscore function names as PEP 8 specifies, but when I'm programming in a framework that forces a different capitalization style, I resign myself to adopting that style too, since I can't convince the framework to adopt the "better" style, and style inconsistencies (haphazard mixes of different styles) are really worse.

Of course, some mixage is inevitable if you're using more than one framework... e.g., PyQt with its camelcase, and standard Python library functions with their lowercase and underscores!-). But since frameworks like Qt are often intended to be extended by subclassing, while the standard Python library has fewer aspects of such a design, in most cases where the capitalization style is forced (because you need to override a method, so you can't choose a different capitalization), it will be forced to camelcase (by Qt), only rarely to lowercase (by the standard Python library). So, I think that adopting Qt style in this case is still the lesser evil.

Is Flask recommended for inexperienced Python programmers?

11 votes

Regarding Flask, the basic docs look cool, but I understand that in order to use it efficiently, I would have to use Werkzeug libraries.

I don't know if I would be able to understand all those different components.

Please indicate whether Flask is something that will really help me understand things, and whether it would be a good place to start for a totally inexperienced amateur like me.

If you are new to web development in Python then Flask is probably one of the best places to start - period, end of story.

It is still small enough that you can learn about WSGI from its (excellent and extensively documented) source code -- and it's powerful enough, and has enough batteries included that you don't have to spend time trying to pick a good library to use for X or Y. (It includes bindings for Jinja2 by default and has a good extension for SQLAlchemy, for example.)

Django, and other large frameworks are daunting because they include all of the batteries up front (since you are working on a complex website with a deadline -- otherwise, why would you be using them) and are therefore a bit more difficult to pick up. Web.py and other really-micro-frameworks are daunting for the exact opposite reason -- they leave almost everything up to you (since you probably already know what you are doing and really just need the web framework to get out of your way.)

Flask does include everything you need to start building something more complex than a "Hello World" app -- it integrates a templating engine (Jinja2) for you so you don't have to decide whether you would be better off using Brevé, Genshi, Cheetah or Mako (though you could use any of the above if you wanted to). It does not include bash and .bat scripts to set up your project workspace, powerful web-based administrative management systems or an ORM so you can dive right in and start hacking without having to stop for 4 hours to read up on a new concept you had never heard of before.

Now, to be fair to all sides of the spectrum (Django and Web.py alike) they are all great systems for getting things done -- and once you've started learning you might find that you learn quicker with the leaner systems (like Web.py) or that you prefer the convenience of the full-stack frameworks (like Django). But for starting out, for learning the basics of WSGI and Python web development in particular and of dynamic web development in general, I do not know of any web framework that gives a better introduction to the concepts underlying it than Flask.
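Since the answer puts so much weight on learning WSGI, it may help to see how small the interface is. The sketch below uses no framework at all, only a plain function; it is not Flask code, and the names app and fake_start_response are made up for illustration. A WSGI application is any callable that takes the request environment and a start_response callback and returns an iterable of byte strings.

```python
def app(environ, start_response):
    """A bare WSGI application: report success and return the body."""
    body = b"Hello, World!"
    headers = [("Content-Type", "text/plain"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


# Call it by hand, roughly the way a WSGI server would.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers

result = app({"REQUEST_METHOD": "GET", "PATH_INFO": "/"}, fake_start_response)
print(captured["status"], result)

# To serve it for real, the standard library is enough:
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

Frameworks such as Flask build routing, request objects and templating on top of exactly this calling convention.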

Recommended .gitignore file for Python projects?

9 votes

I'm trying to collect some of my default settings, and one thing I realized I don't have a standard for is .gitignore files. There's a great thread showing a good .gitignore for Visual Studio projects, but I don't see many recommendations for Python and related tools (PyGTK, Django).

So far, I have...

*.pyc
*.pyo

...for the compiled objects and...

build/
dist/

...for the setuptools output.

Any more recommendations for me?

When using buildout I have the following in .gitignore (along with *.pyo and *.pyc):

.installed.cfg
bin
develop-eggs
dist
downloads
eggs
parts
src/*.egg-info
lib
lib64

Thanks to Jacob Kaplan-Moss

Also I tend to put .svn in since we use several SCM-s where I work.

Is python good enough for big applications

8 votes

From the moment I first encountered Python, the only thing I can say about it is "It is awesome". I am using the Django framework with it and I am amazed how quickly things happen and how developer-friendly this language is. But from many sides I hear that Python is a scripting language, and very useful for small things, experiments etc.

So the question is: can a big and heavily loaded application be built in Python (and Django)? I am mainly focused on web development, so as examples I can give Stack Overflow, Facebook, Amazon etc.

P.S. According to many of the answers, maybe I have to rephrase the question. There are several big applications running on Python (the best example being YouTube), so it can handle them, but why then is it not as popular for large projects as (for example) Java, C++ and .NET?

Python is a pleasure to work with on big applications. Compared to other enterprise-popular languages you get:

  • No compilation time. If you ever worked on a large C++ project, you know how time-consuming this can get
  • A concise and clean syntax that makes reading code easier, which is also a big time saver when reading someone else's code, or even your own when it was written a long time ago
  • Portability at the core level. If it's important for your app to run on more than one platform, it certainly helps
  • It's fast enough for most things, and when it's not, rewriting hot spots in C is trivial with tools such as Cython and numpy. People advocating against dynamic languages for speed reasons have forgotten the 80-20 rule (or never heard about it). The important thing to consider when choosing a language for a performance-critical application IMHO is how easily you can gain access to the C level when needed, and Python is great for that

It's not a magic language however, you need to use the same techniques used for big projects in other languages: TDD (some may argue that it's more important than in other languages because of the lack of type checking, but that's not a win for other languages, unit tests are always important in big projects), clean OO design, etc... or maintaining your application will become a nightmare.

The main reason for its lack of acceptance in enterprise compared to .NET, Java et al. is probably not having herds of consultants and "certified specialists" bragging about their tool being the best thing on Earth. I also heard Java was easily accepted because its syntax resembled C++... that may not be such a silly idea considering C# also chose to take this route.

What is the best way of running shell commands from a web based interface?

7 votes

Imagine a web application that allows a logged in user to run a shell command on the web server at the press of a button. This is relatively simple in most languages via some standard library os tools.

But if that command is long-running you don't want your UI to hang. Again this is relatively easy to deal with using some sort of background process or putting the command to be executed onto a message queue (and maybe saving the output and status somewhere for later consumption). Just return quickly, saying we'll run that and get back to you.

What I'd like to do is show the output of said web ui triggered shell command as it happens. So vertically scrolling text like when running in a terminal.

I have a vague idea of how I might approach this, streaming the output to a websocket perhaps and simply printing the output to screen.

What I'd like to ask is:

Are there any plugins, libraries or applications that already do this? Something I can either use or read the source of. Ideally an open source python/django or ruby/rails tool, but other stacks would be interesting too.

So, I've tried to answer my own question with code as I couldn't find anything to quite fit the bill. Hopefully it's useful to anyone coming across the same problem.

Redbeard 0X0A pointed me in the general direction, and I was able to get a standalone Ruby script doing what I wanted using popen. Extending this to use EventMachine (as it provided a convenient way of writing a websocket server) and using its built-in popen method solved my problem.

More details here http://morethanseven.net/2010/09/09/Script-running-web-interface-with-websockets.html and the code at http://github.com/garethr/bolt/
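The linked solution is written in Ruby with EventMachine. For readers on the Python side, the popen half of the idea can be sketched as below. This is an assumption-laden illustration rather than a port of that code: stream_command is a hypothetical name, the websocket half is left out entirely, and the demo command is just a child Python process standing in for a long-running job.

```python
import subprocess
import sys


def stream_command(args):
    """Yield the command's output one line at a time, as it is produced."""
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,    # interleave stderr with stdout
        universal_newlines=True,     # decode bytes to text
        bufsize=1,                   # line-buffered on the reading side
    )
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()


# Demo: each yielded line is what would be pushed to the browser.
demo = [sys.executable, "-c", "print('step 1'); print('step 2')"]
lines = list(stream_command(demo))
print(lines)
```

In a real setup each yielded line would be forwarded to the client as it arrives instead of being collected into a list. Note that the child program may still buffer its own output when writing to a pipe, so lines can arrive in bursts unless the child flushes regularly.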

Strategies for Encryption with Django + Postgres?

6 votes

I'm going to be storing a few sensitive pieces of information (SSN, Bank Accounts, etc) so they'll obviously need to be encrypted. What strategies do you recommend?

Should I do all the encryption/decryption in the web app itself? Should I use something like pgcrypto and have the conversions done on the DB side? Something else entirely?

Also, if you think I should do encryption on the web app side, what Python libraries would you recommend?

What are you protecting against? If an attacker got access to your DB/filesystem, he would find out how you decrypt the data, along with the keys. Hiding your encryption key is not an easy task (and is rarely implemented in "usual" applications).

I would spend more time on protecting the server and fixing all general security issues.