Best python questions in August 2010

How to translate between programming languages

41 votes

I am setting out to do a side project that has the goal of translating code from one programming language to another. The languages I am starting with are PHP and Python (Python to PHP should be easier to start with), but ideally I would be able to add other languages with (relative) ease. The plan is:

  • This is geared towards web development. The original and target code will be be sitting on top of frameworks (which I will also have to write). These frameworks will embrace an MVC design pattern and follow strict coding conventions. This should make translation somewhat easier.

  • I am also looking at IOC and dependency injection, as they might make the translation process easier and less error prone.

  • I'll make use of Python's parser module, which lets me fiddle with the Abstract Syntax Tree. Apparently the closest I can get with PHP is token_get_all(), which is a start.

  • From then on I can build the AST, symbol tables and control flow.

Then I believe I can start outputting code. I don't need a perfect translation. I'll still have to review the generated code and fix problems. Ideally the translator should flag problematic translations.

Before you ask "What the hell is the point of this?" The answer is... It'll be an interesting learning experience. If you have any insights on how to make this less daunting, please let me know.


EDIT:

I am more interested in knowing what kinds of patterns I could enforce on the code to make it easier to translate (ie: IoC, SOA ?) the code than how to do the translation.

I've been building tools (DMS Software Reengineering Toolkit) to do this kind of thing since 1995, supported by a strong team of computer scientists. It provides generic parsing, AST building, symbol tables, control and data flow analysis, application of translation rules, regeneration of source text with comments, etc., all parameterized by explicit definitions of computer languages.

The amount of machinery you need to do this well is vast, and then you need reliable parsers for langauges with unreliable definitions (PHP is perfect example of this).

There's nothing wrong with you thinking about it or attempting it, but I think you'll find this a much bigger task than you expect. We have some 100 man-years invested in just DMS, and another 6-12 months in each "reliable" language definition (including the one we painfully built for PHP), much more for nasty languages such as C++. It will be a "hell of a learning experience"; it has been for us. (You might find the technical Papers section at the above website interesting to jump start that learning).

People often attempt to build some kind of generalized machinery by starting with some piece of technology with which they are familiar, that does a part of the job. (Python ASTs are great example). The good news, is that part of the job is done. The bad news is that machinery has a zillion assumptions built into it, most of which you won't discover until you try to wrestle it into doing something else. At that point you find out the machinery is wired to do what it originally does, and will really, really resist your attempt to make it do something else. (I suspect trying to get the Python AST to model PHP is going to be a lot of fun).

The reason I started to build DMS originally was to build foundations that had very few such assumptions built in. It has some that give us headaches. So far, no black holes. (The hardest part of my job over the last 15 years is to try to prevent such assumptions from creeping in).

Lots of folks also make the mistake of assuming that if they can parse (and perhaps get an AST), they are well on the way to doing something complicated. One of the hard lessons is that you need symbol tables and flow analysis to do good program analysis or transformation. ASTs are necessary but not sufficient. This is the reason that Aho&Ullman's compiler book doesn't stop at chapter 2. (The OP has this right in that he is planning to build additional machinery beyond the AST).

The remark about "I don't need a perfect translation" is troublesome. What weak translators do is convert the "easy" 80% of the code, leaving the hard 20% to do by hand. If the applications you intend to convert are pretty small, well, then that 20% is OK. If you attempt to convert 100K SLOC then 20% is 20,000 original lines of code that are hard to translate to understand and modify in the context of another 80,000 lines of program you don't understand. That takes a huge amount of effort. At the million line level, this is simply impossible in practice.

What you have to shoot for to translate large-scale systems is high nineties percentage conversion rates, or it is likely that you can't complete the manual part of the translation activity.

I consider our tools to be extremely good (but then, I'm pretty biased). And it is still very hard to build a good translator. The difference is that with this much machinery, we succeed considerably more often than we fail.

*args and **kwargs ?

23 votes

So I have difficulty with the concept of *args and **kwargs.

So far I have learned that:

  • *args = list of arguments -as positional arguments
  • **kwargs = dictionary - whose keys become separate keyword arguments and the values become values of these arguments.

??

To be honest I don't understand and don't get for what programming task this would helpful. (I am sure there is, but I can't get an understanding of it.)

Maybe:

I think to enter lists and dictionaries as arguments of a function AND at the same time as a wildcard, so I can pass ANY argument?

Is there a simple example on which to explain how *args and **kwargs are used?

Also the tutorial I run through used just the "*" and a variable name.

Is *args and **kwargs just a placeholder or do you use exactly "args" and "*kwargs" in the code?

The syntax is the * and **. The names *args and **kwargs are only by convention but there's no need not to use them.

You would use *args when you're not sure how many arguments might be passed to your function, i.e. it allows you pass an arbitrary number of arguments to your function. For example:

>>> def print_everything(*args):
...     count = 1
...     for thing in args:
...         print "%d. %s" % (count,thing)
...         count += 1
...
>>> print_everything('apple', 'banana', 'cabbage')
1. apple
2. banana
3. cabbage

Similarly, **kwargs allows you to handle named arguments that you have not defined in advance:

>>> def table_things(**kwargs):
...     for name,value in kwargs.items():
...         print name, "=", value
...
>>> table_things(apple = 'fruit', cabbage = 'vegetable')
cabbage = vegetable
apple = fruit

You can use these along with named arguments too. The explicit arguments get values first and then everything else is passed to *args and *kwargs. The named arguments come first in the list. For example:

def table_things(titlestring, **kwargs)

You can also use both in the same function definition but *args must occur before **kwargs.

You can also use the * and ** syntax when calling a function. For example:

>>> def print_three_things(a, b, c):
...     print "a =", a, "& b =", b, "& c =", c
...
>>> mylist = ['aardvark', 'baboon', 'cat']
>>> print_three_things(*mylist)
a = aardvark & b = baboon & c = cat

As you can see in this case it takes the list (or tuple) of items and matches them to the arguments in the function. Of course, you could have a * both in the function definition and in the function call.

Real world typo statistics?

22 votes

Where can I find some real world typo statistics?

I'm trying to match people's input text to internal objects, and people tend to make spelling mistakes.
There are 2 kinds of mistakes:

  1. typos - "Helllo" instead of "Hello" / "Satudray" instead of "Saturday" etc.
  2. Spelling - "Shikago" instead of "Chicago"

I use Damerau-Levenshtein distance for the typos and Double Metaphone for spelling (Python implementations here and here).

I want to focus on the Damerau-Levenshtein (or simply edit-distance). The textbook implementations always use '1' for the weight of deletions, insertions substitutions and transpositions. While this is simple and allows for nice algorithms it doesn't match "reality" / "real-world probabilities".

Examples:

  • I'm sure the likelihood of "Helllo" ("Hello") is greater than "Helzlo", yet they are both 1 edit distance away.
  • "Gello" is closer than "Qello" to "Hello" on a QWERTY keyboard.
  • Unicode transliterations: What is the "real" distance between "München" and "Munchen"?

What should the "real world" weights be for deletions, insertions, substitutions, and transpositions?

Even Norvig's very cool spell corrector uses non-weighted edit distance.

BTW- I'm sure the weights need to be functions and not simple floats (per the above examples)...

I can adjust the algorithm, but where can I "learn" these weights? I don't have access to Google-scale data...

Should I just guess them?

EDIT - trying to answer user questions:

  • My current non-weighted algorithm fails often when faced with typos for the above reasons. "Return on Tursday": every "real person" can easily tell Thursday is more likely than Tuesday, yet they are both 1-edit-distance away! (Yes, I do log and measure my performance).
  • I'm developing an NLP Travel Search engine, so my dictionary contains ~25K destinations (expected to grow to 100K), Time Expressions ~200 (expected 1K), People expressions ~100 (expected 300), Money Expressions ~100 (expected 500), "glue logic words" ("from", "beautiful", "apartment") ~2K (expected 10K) and so on...
  • Usage of the edit distance is different for each of the above word-groups. I try to "auto-correct when obvious", e.g. 1 edit distance away from only 1 other word in the dictionary. I have many other hand-tuned rules, e.g. Double Metaphone fix which is not more than 2 edit distance away from a dictionary word with a length > 4... The list of rules continues to grow as I learn from real world input.
  • "How many pairs of dictionary entries are within your threshold?": well, that depends on the "fancy weighting system" and on real world (future) input, doesn't it? Anyway, I have extensive unit tests so that every change I make to the system only makes it better (based on past inputs, of course). Most sub-6 letter words are within 1 edit distance from a word that is 1 edit distance away from another dictionary entry.
  • Today when there are 2 dictionary entries at the same distance from the input I try to apply various statistics to better guess which the user meant (e.g. Paris, France is more likely to show up in my search than Pārīz, Iran).
  • The cost of choosing a wrong word is returning semi-random (often ridiculous) results to the end-user and potentially losing a customer. The cost of not understanding is slightly less expensive: the user will be asked to rephrase.
  • Is the cost of complexity worth it? Yes, I'm sure it is. You would not believe the amount of typos people throw at the system and expect it to understand, and I could sure use the boost in Precision and Recall.

Possible source for real world typo statistics would be in the Wikipedia's complete edit history.

http://download.wikimedia.org/

Also, you might be interested in the AWB's RegExTypoFix

http://en.wikipedia.org/wiki/Wikipedia:AWB/T

Use cases for the 'setdefault' dict method

17 votes

The addition of collections.defaultdict in Python 2.5 greatly reduced the need for dict's setdefault method. This question is for our collective education:

  1. What is setdefault still useful for, today in Python 2.6/2.7?
  2. What popular use cases of setdefault were superseded with collections.defaultdict?

You could say defaultdict is useful for settings defaults before filling the dict and setdefault is useful for setting defaults while or after filling the dict.

Probably the most common use case: Grouping items (in unsorted data, else use itertools.groupby)

# really verbose
new = {}
for (key, value) in data:
    if key in new:
        new[key].append( value )
    else:
        new[key] = [value]


# easy with setdefault
new = {}
for (key, value) in data:
    group = new.setdefault(key, []) # key might exist already
    group.append( value )


# even simpler with defaultdict 
new = defaultdict(list)
for (key, value) in data:
    new[key].append( value ) # all keys have a default already

Sometimes you want to make sure that specific keys exist after creating a dict. defaultdict doesn't work in this case, because it only creates keys on explicit access. Think you use something HTTP-ish with many headers -- some are optional, but you want defaults for them:

headers = parse_headers( msg ) # parse the message, get a dict
# now add all the optional headers
for headername, defaultvalue in optional_headers:
    headers.setdefault( headername, defaultvalue )

Java or any other language: Which method/class invoked mine?

16 votes

I would like to write a code internal to my method that print which method/class has invoked it.

(My assumption is that I can't change anything but my method..)

How about other programming languages?

EDIT: Thanks guys, how about JavaScript? python? C++?

This is specific to Java.

You can use Thread.currentThread().getStackTrace(). This will return an array of StackTraceElements.

The 2nd element in the array will be the calling method.

Example:

public void methodThatPrintsCaller() {
    StackTraceElement elem = Thread.currentThread.getStackTrace()[2];
    System.out.println(elem);

    // rest of you code
}

Distribute points on a circle as evenly as possible

16 votes

Problem statement

I have the following problem: I have a circle with a certain number (zero or more) of points on it. These positions are fixed. Now I have to position another set of points on the circle, such as all points together are as evenly distributed around the circle as possible.

Goal

My goal is now to develop a algorithm taking a list of angles (representing the fixed points) and an int value (representing how many additional points should be placed) and returning a list of angles again (containing only the angles where the additional points should lie).

The points dont have to be really evenly distributed (all same distance from each other), but rather as evenly as it is just possible. A perfect solution may not exist most of the time, as certain points are fixed.

The range of all angles lie in between -pi and +pi.

Examples

Some examples of what I am trying to archieve:

fixed_points = [-pi, -pi/2, pi/2]

 v         v                   v
 |---------|---------|---------|---------|
-pi     -pi/2        0        pi/2       pi

fill_circle(fixed_points, 1)
# should return: [0]

fill_circle(fixed_points, 2)
# should return: [-pi/6, pi/6]

or:

fixed_points = [-pi, -pi*3/4, -pi/4]

 v    v         v
 |---------|---------|---------|---------|
-pi     -pi/2        0        pi/2       pi

fill_circle(fixed_points, 6)

This last example should return something like: One point is to set right in between -pi*3/4 and -pi/4, that is: -pi/2 and distribute the other 5 points between -pi/4 and +pi (remember it is a circle, so in this case -pi = +pi):

 v    v    x    v   x   x    x   x    x
 |---------|---------|---------|---------|
-pi     -pi/2        0        pi/2       pi

Previous try

I started with a recursive algorithm that first searches for the largest interval between two points and sets the new point right in between. However it doesnt give satisfying results. Consider for example this configuration, with two points needed to be inserted:

 v         v                   v
 |---------|---------|---------|---------|
-pi     -pi/2        0        pi/2       pi

first step: insert right in the middle of largest interval
 v         v         x         v
 |---------|---------|---------|---------|
-pi     -pi/2        0        pi/2       pi

second step: insert right in the middle of largest interval 
-> all intervals are evenly large, so one of them will be taken
 v    x    v         v         v
 |---------|---------|---------|---------|
-pi     -pi/2        0        pi/2       pi

Not a very nice solution, as it could have been much better distributed (see above: -pi/6 and +pi/6).

Sorry for the long question, I hope you understand what I want to archieve.

I dont need a complete working algorithm, but rather the right idea for developing one. Maybe some pseudocode if you like. Would be very grateful for some hints to push me in the right direction. Thanks in advance!

Update: Completed algorithm:

Thank you all for your answers! It showed up I basically just needed a non-greedy version of my already existing algorithm. I really liked haydenmuhls idea to simplify the problem a little bit by encapsulating an interval/segment class:

class Segment:
    def __init__(self, angle, prev_angle, wrap_around):
        self.angle = angle
        self.length = abs(angle - prev_angle + \
                          (2*math.pi if wrap_around else 0))
        self.num_points = 0

    def sub_length(self):
        return self.length / (self.num_points + 1)

    def next_sub_length(self):
        return self.length / (self.num_points + 2)

    def add_point(self):
        self.num_points += 1

This makes the actual algorithm incredibly easy and readable:

def distribute(angles, n):
    # No points given? Evenly distribute them around the circle
    if len(angles) == 0:
        return [2*math.pi / n * i - math.pi for i in xrange(n)]

    # Sort the angles and split the circle into segments
    s, pi, ret = sorted(angles), math.pi, []
    segments = [Segment(s[i], s[i-1], i == 0) for i in xrange(len(s))]

    # Calculate the length of all subsegments if the point
    # would be added; take the largest to add the point to
    for _ in xrange(n):
        max(segments, key = lambda x: x.next_sub_length()).add_point()

    # Split all segments and return angles of the points
    for seg in segments:
        for k in xrange(seg.num_points):
            a = seg.angle - seg.sub_length() * (k + 1)
            # Make sure all returned values are between -pi and +pi
            ret.append(a - 2*pi if a > pi else a + 2*pi if a < -pi else a)

    return ret

You could use an Interval object. An interval is an arc of the circle between two of the original, immovable points.

The following is just pseudo-code. Don't expect it to run anywhere.

class Interval {

  private length;
  private point_count;

  constructor(length) {
    this.length = length;
    this.point_count = 0;
  }

  public add_point() {
    this.point_count++;
  }

  public length() {
    return this.length;
  }

  // The current length of each sub-interval
  public sub_length() {
    return this.length / (this.point_count + 1);
  }

  // The sub-interval length if you were to add another point
  public next_sub_length() { 
    return this.length / (this.point_count + 2);
  }

  public point_count() {
    return this.point_count;
  }
}

Create a list of these objects corresponding to the intervals between points on your circle. Each time you add a point, select the interval with the largest next_sub_length(). When you're done, it won't be hard to reconstruct the new circle.

This should give you the spacing with the the largest possible minimum interval. That is, if you score a solution by the length of its smallest interval, this will give you the highest score possible. I think that's what you've been shooting for.

Edit: Just realized that you specifically asked about this in Python. I'm quite a Python n00b, but you should be able to convert this to a Python object easily enough, though you won't need the getters, since everything in Python is public.

is there need for a more declarative way of expressing regular expressions ? :)

14 votes

I am trying to create a Python function that can take an plain English description of a regular expression and return the regular expression to the caller.

Currently I am thinking of the description in YAML format. So, we can store the description as a raw string variable, which is passed on to this another function and output of that function is then passed to the 're' module. Following is a rather simplistic example:

# a(b|c)d+e*
re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(getRegex(re1))
myre.search(...)

etc.

Does anyone think something of this sort would be of wider use? Do you know already existing packages that can do it? What are the limitations that you see to this approach? Does anyone think, having the declarative string in code, would make it more maintainable?

For developers trying to write regular expressions that are easy to grok and maintain, I wonder whether this sort of approach would offer anything that re.VERBOSE does not provide already.

For beginners, your idea might have some appeal. However, before you go down this path, you might try to mock up what your declarative syntax would look like for more complicated regular expressions using capturing groups, anchors, look-ahead assertions, and so forth. One challenge is that you might end up with a declarative syntax that is just as difficult to remember as the regex language itself.

You might also think about alternative ways to express things. For example, the first thought that occurred to me was to express a regex using functions with short, easy-to-remember names. For example:

from refunc import *

pattern = Compile(
    'a',
    Capture(
        Choices('b', 'c'),
        N_of( 'd', 1, Infin() ),
        N_of( 'e', 0, Infin() ),
    ),
    Look_ahead('foo'),
)

But when I see that in action, it looks like a pain to me. There are many aspects of regex that are quite intuitive -- for example, + to mean "one or more". One option would be a hybrid approach, allowing your user to mix those parts of regex that are already simple with functions for the more esoteric bits.

pattern = Compile(
    'a',
    Capture(
        '[bc]',
        'd+',
        'e*',
    ),
    Look_ahead('foo'),
)

I would add that in my experience, regular expressions are about leaning a thought process. Getting comfortable with the syntax is the easy part.

Explain C++ SFINAE to a Python Guy

13 votes

What is SFINAE in C++? Can you please explain it in words understandable to a Python programmer? Also, what concept in Python does SFINAE in C++ correspond to?

Warning: this is a really long explanation, but hopefully it really explains not only what SFINAE does, but gives some idea of when and why you might use it.

Okay, to explain this we probably need to back up and explain templates a bit. As we all know, Python uses what's commonly referred to as duck typing -- for example, when you invoke a function, you can pass an object X to that function as long as X provides all the operations used by the function.

In C++, a normal (non-template) function requires that you specify the type of a parameter. If you defined a function like:

int plus1(int x) { return x + 1; }

You can only apply that function to an int. The fact that it uses x in a way that could just as well apply to other types like long or float makes no difference -- it only applies to an int anyway.

To get something closer to Python's duck typing, you can create a template instead:

template <class T>
T plus1(T x) { return x + 1; }

Now our plus1 is a lot more like it would be in Python -- in particular, we can invoke it equally well to an object x of any type for which x + 1 is defined.

Now, consider, for example, that we want to write some objects out to a stream. Unfortunately, some of those objects get written to a stream using stream << object, but others use object.write(stream); instead. We want to be able to handle either one without the user having to specify which. Now, template specialization allows us to write the specialized template, so if it was one type that used the object.write(stream) syntax, we could do something like:

template <class T>
std::ostream &write_object(T object, std::ostream &os) {
    return os << object;
}

template <>
std::ostream &write_object(special_object object, std::ostream &os) { 
    return object.write(os);
}

That's fine for one type, and if we wanted to badly enough we could add more specializations for all the types that don't support stream << object -- but as soon as (for example) the user adds a new type that doesn't support stream << object, things break again.

What we want is a way to use the first specialization for any object that supports stream << object;, but the second for anything else (though we might sometime want to add a third for objects that use x.print(stream); instead).

We can use SFINAE to make that determination. To do that, we typically rely on a couple of other oddball details of C++. One is to use the sizeof operator. sizeof determines the size of a type or an expression, but it does so entirely at compile time by looking at the types involved, without evaluating the expression itself. For example, if I have something like:

int func() { return -1; }

I can use sizeof(func()). In this case, func() returns an int, so sizeof(func()) is equivalent to sizeof(int).

The second interesting item that's frequently used is the fact that the size of an array must to be positive, not zero.

Now, putting those together, we can do something like this:

// stolen, more or less intact from: 
//     http://stackoverflow.com/questions/2127693/sfinae-sizeof-detect-if-expression-compiles
template<class T> T& ref();
template<class T> T  val();

template<class T>
struct has_inserter
{
    template<class U> 
    static char test(char(*)[sizeof(ref<std::ostream>() << val<U>())]);

    template<class U> 
    static long test(...);

    enum { value = 1 == sizeof test<T>(0) };
    typedef boost::integral_constant<bool, value> type;
};

Here we have two overloads of test. The second of these takes a variable argument list (the ...) which means it can match any type -- but it's also the last choice the compiler will make in selecting an overload, so it'll only match if the first one does not. The other overload of test is a bit more interesting: it defines a function that takes one parameter: an array of pointers to functions that return char, where the size of the array is (in essence) sizeof(stream << object). If stream << object isn't a valid expression, the sizeof will yield 0, which means we've created an array of size zero, which isn't allowed. This is where the SFINAE itself comes into the picture. Attempting to substitute the type that doesn't support operator<< for U would fail, because it would produce a zero-sized array. But, that's not an error -- it just means that function is eliminated from the overload set. Therefore, the other function is the only one that can be used in such a case.

That then gets used in the enum expression below -- it looks at the return value from the selected overload of test and checks whether it's equal to 1 (if it is, it means the function returning char was selected, but otherwise, the function returning long was selected).

The result is that has_inserter<type>::value will be l if we could use some_ostream << object; would compile, and 0 if it wouldn't. We can then use that value to control template specialization to pick the right way to write out the value for a particular type.

What’s a good Python profanity filter library?

13 votes

Like http://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services.

(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know profanity filtering can’t pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn’t a particularly big issue. I know you need some human input to deal with issues of content. I’d just like to find a good library, and see what use I can make of it.)

I didn't found any Python profanity library, so I made one myself.

Parameters


filterlist

A list of regular expressions that match a forbidden word. Please do not use \b, it will be inserted depending on inside_words.

Example: ['bad', 'un\w+']

ignore_case

Default: True

Self-explanatory.

replacements

Default: "$@%-?!"

A string with characters from which the replacements strings will be randomly generated.

Examples: "%&$?!" or "-" etc.

complete

Default: True

Controls if the entire string will be replaced or if the first and last chars will be kept.

inside_words

Default: False

Controls if words are searched inside other words too. Disabling this

Module source


(examples at the end)

"""
Module that provides a class that filters profanities

"""

__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re

class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!", 
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
        matches words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?

        """

        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.

        """
        return ''.join([random.choice(self.replacements) for i in
                  range(length)])

    def __replacer(self, match):
        value = match.group()
        if self.complete:
            return self._make_clean_word(len(value))
        else:
            return value[0]+self._make_clean_word(len(value)-2)+value[-1]

    def clean(self, text):
        """Cleans a string from profanity."""

        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
            }

        regexp = (regexp_insidewords[self.inside_words] % 
                  '|'.join(self.badwords))

        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)

        return r.sub(self.__replacer, text)


if __name__ == '__main__':

    f = ProfanitiesFilter(['bad', 'un\w+'], replacements="-")    
    example = "I am doing bad ungood badlike things."

    print f.clean(example)
    # Returns "I am doing --- ------ badlike things."

    f.inside_words = True    
    print f.clean(example)
    # Returns "I am doing --- ------ ---like things."

    f.complete = False    
    print f.clean(example)
    # Returns "I am doing b-d u----d b-dlike things."

non-technical benefits of having string-type immutable

13 votes

I am wondering about the benefits of having the string-type immutable from the programmers point-of-view.

Technical benefits (on the compiler/language side) can be summarized mostly that it is easier to do optimisations if the type is immutable. Read here for a related question.

Also, in a mutable string type, either you have thread-safety already built-in (then again, optimisations are harder to do) or you have to do it yourself. You will have in any case the choice to use a mutable string type with built-in thread safety, so that is not really an advantage of immutable string-types. (Again, it will be easier to do the handling and optimisations to ensure thread-safety on the immutable type but that is not the point here.)

But what are the benefits of immutable string-types in the usage? What is the point of having some types immutable and others not? That seems very inconsistent to me.

In C++, if I want to have some string to be immutable, I am passing it as const reference to a function (const std::string&). If I want to have a changeable copy of the original string, I am passing it as std::string. Only if I want to have it mutable, I am passing it as reference (std::string&). So I just have the choice about what I want to do. I can just do this with every possible type.

In Python or in Java, some types are immutable (mostly all primitive types and strings), others are not.

In pure functional languages like Haskell, everything is immutable.

Is there a good reason why it make sense to have this inconsistency? Or is it just purely for technical lower level reasons?

What is the point of having some types immutable and others not?

Without some mutable types, you'd have to go the whole hog to pure functional programming -- a completely different paradigm than the OOP and procedural approaches which are currently most popular, and, while extremely powerful, apparently very challenging to a lot of programmers (what happens when you do need side effects in a language where nothing is mutable, and in real-world programming of course you inevitably do, is part of the challenge -- Haskell's Monads are a very elegant approach, for example, but how many programmers do you know that fully and confidently understand them and can use them as well as typical OOP constructs?-).

If you don't understand the enormous value of having multiple paradigms available (both FP one and ones crucially relying on mutable data), I recommend studying Haridi's and Van Roy's masterpiece, Concepts, Techniques, and Models of Computer Programming -- "a SICP for the 21st Century", as I once described it;-).

Most programmers, whether familiar with Haridi and Van Roy or not, will readily admit that having at least some mutable data types is important to them. Despite the sentence I've quoted above from your Q, which takes a completely different viewpoint, I believe that may also be the root of your perplexity: not "why some of each", but rather "why some immutables at all".

The "thoroughly mutable" approach was once (accidentally) obtained in a Fortran implementation. If you had, say,

  SUBROUTINE ZAP(I)
  I = 0
  RETURN

then a program snippet doing, e.g.,

  PRINT 23
  ZAP(23)
  PRINT 23

would print 23, then 0 -- the number 23 had been mutated, so all references to 23 in the rest of the program would in fact refer to 0. Not a bug in the compiler, technically: Fortran had subtle rules about what your program is and is not allowed to do in passing constants vs variables to procedures that assign to their arguments, and this snippet violates those little-known, non-compiler-enforceable rules, so it's a but in the program, not in the compiler. In practice, of course, the number of bugs caused this way was unacceptably high, so typical compilers soon switched to less destructive behavior in such situations (putting constants in read-only segments to get a runtime error, if the OS supported that; or, passing a fresh copy of the constant rather than the constant itself, despite the overhead; and so forth) even though technically they were program bugs allowing the compiler to display undefined behavior quite "correctly";-).

The alternative enforced in some other languages is to add the complication of multiple ways of parameter passing -- most notably perhaps in C++, what with by-value, by-reference, by constant reference, by pointer, by constant pointer, ... and then of course you see programmers baffled by declarations such as const foo* const bar (where the rightmost const is basically irrelevant if bar is an argument to some function... but crucial instead if bar is a local variable...!-).

Actually Algol-68 probably went farther along this direction (if you can have a value and a reference, why not a reference to a reference? or reference to reference to reference? &c -- Algol 68 put no limitations on this, and the rules to define what was going on are perhaps the subtlest, hardest mix ever found in an "intended for real use" programming language). Early C (which only had by-value and by-explicit-pointer -- no const, no references, no complications) was no doubt in part a reaction to it, as was the original Pascal. But const soon crept in, and complications started mounting again.

Java and Python (among other languages) cut through this thicket with a powerful machete of simplicity: all argument passing, and all assignment, is "by object reference" (never reference to a variable or other reference, never semantically implicit copies, &c). Defining (at least) numbers as semantically immutable preserves programmers' sanity (as well as this precious aspect of language simplicity) by avoiding "oopses" such as that exhibited by the Fortran code above.

Treating strings as primitives just like numbers is quite consistent with the languages' intended high semantic level, because in real life we do need strings that are just as simple to use as numbers; alternatives such as defining strings as lists of characters (Haskell) or as arrays of characters (C) poses challenges to both the compiler (keeping efficient performance under such semantics) and the programmer (effectively ignoring this arbitrary structuring to enable use of strings as simple primitives, as real life programming often requires).

Python went a bit further by adding a simple immutable container (tuple) and tying hashing to "effective immutability" (which avoids certain surprises to the programmer that are found, e.g., in Perl, with its hashes allowing mutable strings as keys) -- and why not? Once you have immutability (a precious concept that saves the programmer from having to learn about N different semantics for assignment and argument passing, with N tending to increase with time;-), you might as well get full mileage out of it;-).

Reason why there is no "if Empty" in Python

12 votes

The Zen of Python says "Explicit is better than implicit."

I find that an is Empty to check whether some sequence is empty is so much more explicit than implicit booleanness.

if some_sequence is Empty:
  some_sequence.fill_sequence()

Compare with

if not some_sequence:
  some_sequence.fill_sequence()

This gets even more confusing with some unfavorably choosen variable names.

if saved:
  mess_up()

Compare with

if saved is not Empty:
  mess_up()

See also: "Python: What is the best way to check if a list is empty?". I find it ironic that the most voted answer claims that implicit is pythonic.

So is there a higher reason why there is no is Empty in Python?

Polymorphism in if foo: and if not foo: isn't a violation of "implicit vs explicit": it explicitly delegates to the object being checked the task of knowing whether it's true or false. What that means (and how best to check it) obviously does and must depend on the object's type, so the style guide mandates the delegation -- having application-level code arrogantly asserts it knows better than the object would be the height of folly.

Moreover, X is Whatever always, invariably means that X is exactly the same object as Whatever. Making a totally unique exception for Empty or any other specific value of Whatever would be absurd -- hard to imagine a more unPythonic approach. And "being exactly the same object" is obviously transitive -- so you could never any more have distinct empty lists, empty sets, empty dicts... congratulations, you've just designed a completely unusable and useless language, where every empty container crazily "collapses" to a single empty container object (just imagine the fun when somebody tries to mutate an empty container...?!).

Why use Abstract Base Classes in Python?

12 votes

Being used to the old ways of duck typing in Python, I failed to understand the need for ABC (abstract base classes). The help is good on how to use them.

I tried to read the rationale in the PEP, but it went over my head. If I was looking for a mutable sequence container, I would check for __setitem__, or more likely tried to use it (EAFP). I haven't come across a real life use for the numbers module, which does use ABCs, but that is the closest I have to understanding.

Can anyone explain to me the rationale, please?

Short version: ABCs offer a higher level of semantic contract between clients and the implemented classes.

Long version:

There is a contract between a class and its callers. The class promises to do certain things and have certain properties.

There are different levels to the contract.

At a very low level, the contract might include the name of a method or its number of parameters.

In a staticly-typed language, that contract would actually be enforced by the compiler. In Python, you can use EAFP or introspection to confirm that the unknown object meets this expected contract.

But there are also higher-level, semantic promises in the contract.

For example, if there is a __str__() method, it is expected to return a string representation of the object. It could delete all contents of the object, commit the transaction and spit a blank page out of the printer... but there is a common understanding of what it should do, described in the Python manual.

That's a special case, where the semantic contract is described in the manual. What should the print() method do? Should it write the object to a printer or a line to the screen, or something else? It depends - you need to read the comments to understand the full contract here. A piece of client code that simply checks that the print() method exists has confirmed part of the contract - that a method call can be made, but not that there is agreement on the higher level semantics of the call.

Defining an Abstract Base Class (ABC) is a way of producing a contract between the class implementers and the callers. It isn't just a list of method names, but a shared understanding of what those methods should do. If you inherit from this ABC, you are promising to follow all the rules described in the comments, including the semantics of the print() method.

Python's duck-typing has many advantages in flexibility over static-typing, but it doesn't solve all the problems. ABCs offer an intermediate solution between the free-form of Python and the bondage-and-discipline of a staticly-typed language.

Performance differences between Python and C

11 votes

Working on different projects I have the choice of selecting different programming languages, as long as the task is done.

I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.

The tasks to be done are pretty varied, e.g. sorting textfiles, disk access, network access, textfile parsing.

Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?

And in your experience, given the power of current CPU's (i7), is it really a noticeable difference (Consider that its a program that doesnt bring the system to its knees).

Thanks! :)

Use python until you have a performance problem. If you ever have one figure out what the problem is (often it isn't what you would have guessed up front). Then solve that specific performance problem which will likely be an algorithm or data structure change. In the rare case that your problem really needs C then you can write just that portion in C and use it from your python code.

Why does pip install matplotlib version 0.91.1 when PyPi shows version 1.0.0?

11 votes

PyPi shows matplotlib 1.0.0. However, when I install matplotlib via pip into a virtualenv, version 0.91.1 is installed.

  • Why the difference in versions?
  • Is there a way to pip install matplotlib 1.0.0?

Research

It appears that matplotlib's DOAP record on PyPi is pointing to the correct version. Below is the DOAP record for reference:

<?xml version="1.0" encoding="UTF-8" ?>
<rdf:RDF xmlns="http://usefulinc.com/ns/doap#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><Project><name>matplotlib</name>
<shortdesc>Python plotting package</shortdesc>
<description>matplotlib strives to produce publication quality 2D graphics
      for interactive graphing, scientific publishing, user interface
      development and web application servers targeting multiple user
      interfaces and hardcopy output formats.  There is a 'pylab' mode
      which emulates matlab graphics</description>
<download-page>https://sourceforge.net/projects/matplotlib/files/matplotlib/matplotlib-1.0</download-page>
<homepage rdf:resource="http://matplotlib.sourceforge.net" />
<maintainer><foaf:Person><foaf:name>John D. Hunter</foaf:name>
<foaf:mbox_sha1sum>4b099b4a7f50a1f39642ce59c2053c00d4de6416</foaf:mbox_sha1sum></foaf:Person></maintainer>
<release><Version><revision>1.0.0</revision></Version></release>
</Project></rdf:RDF>

Configuration

  • OS: Mac OS X 10.6.6
  • Python 2.7
  • virtualenv 1.5.1
  • pip 0.8.1

Update 24-Aug-10 7:09 AM

Installing from the PyPi mirror also installs version 0.91.1:

$ pip install -i http://d.pypi.python.org/simple matplotlib

Update January 14, 2011 4:54 PM

Even though matplotlib 1.0.1 has been release, this issue still persists.

I've experienced the same problem. I have no idea why it happens, but I do have a fix; use the -f option in pip to tell it where to find the matplotlib sources. (This works in requirements.txt as well).

pip install -f http://downloads.sourceforge.net/project/matplotlib/matplotlib/matplotlib-1.0/matplotlib-1.0.0.tar.gz matplotlib

How should I represent a bit flags int field in django admin?

10 votes

I have a data model with a bitfield defined something like this:

alter table MemberFlags add column title varchar(50) not null default '';
alter table MemberFlags add column value integer( 3) not null default 0;

insert into MemberFlags (title, value) values
    ("Blacklisted",             1),
    ("Special Guest",           2),
    ("Attend Ad-hoc Sessions",  4),
    ("Attend VIP Sessions",     8),
    ("Access Facility A",      16),
    ("Access Facility B",      32)

And used like this:

alter table Membership add column title varchar(50) not null default '';
alter table Membership add column flags integer( 3) not null default 0;

insert into Membership (title, flags) values
    ("Guest Pass",          4+2 ),
    ("Silver Plan",    16+  4   ),
    ("Gold Plan",   32+16+  4+2 ),
    ("VIP Pass",    32+16+8+4+2 )

My questions are:

A) What's the easiest way to represent the different bitflags as separate items in the admin site? Should I override the template, or do something with forms?

B) How about the search list? I could create functions in the model to represent each bit, but how would searching and sorting be done?

I'm new to Django.

Working off the snippet in Andrew's answer, here are the changes you'd need to make:

from django.db import models
from django import forms

class BitFlagFormField(forms.MultipleChoiceField):
    widget = forms.CheckboxSelectMultiple

    def __init__(self, *args, **kwargs):
        super(BitFlagFormField, self).__init__(*args, **kwargs)

class BitFlagField(models.Field):
    __metaclass__ = models.SubfieldBase

    def get_internal_type(self):
        return "Integer"

    def get_choices_default(self):
        return self.get_choices(include_blank=False)

    def _get_FIELD_display(self, field):
        value = getattr(self, field.attname)
        choicedict = dict(field.choices)

    def formfield(self, **kwargs):
        # don't call super, as that overrides default widget if it has choices
        defaults = {'required': not self.blank, 'label': capfirst(self.verbose_name), 
                    'help_text': self.help_text, 'choices':self.choices}
        if self.has_default():
            defaults['initial'] = self.get_default()
        defaults.update(kwargs)
        return BitFlagFormField(**defaults)

    def get_db_prep_value(self, value):
        if isinstance(value, int):
            return value
        elif isinstance(value, list):
            return sum(value)

    def to_python(self, value):
        result = []
        n = 1
        while value > 0:
            if (value % 2) > 0:
                result.append(n)
            n *= 2
            value /= 2
        return sorted(result)


    def contribute_to_class(self, cls, name):
        super(BitFlagField, self).contribute_to_class(cls, name)
        if self.choices:
            func = lambda self, fieldname = name, choicedict = dict(self.choices):" and ".join([choicedict.get(value,value) for value in getattr(self,fieldname)])
            setattr(cls, 'get_%s_display' % self.name, func)

A Python 2.7, 3.0 Question for Pythonistas - Best Practices?

10 votes

I am learning to program in Python and we have version 2.7 installed at work. Whenever I try to dive deep into Python, I never liked the dea of diving into a deprecated version (2.7). My work is not ready for 3.0 just yet.

My question is : I want to write code in 2.7 which can easily be converted to 3.0 using 2to3. Honestly, I want to style my 2.7 code in a 3.0 manner.

Any helpful pointers, books that exist to achieve this style? I prefer the cutting edge features of 3.0 by not compromising my 2.7 knowledge required at work. Possible?

I really appreciate your answers (:

3.0 is totally obsolete and should be avoided like the plague; the current version of Python 3 is 3.1, and soon there will be a 3.2. Please have nothing to do with 3.0!

2.7 is easily written in a way that helps 2to3 rewrite the code easily for 3.x -- Mark Pilgrim explains the details well, so I guess you could say that his "Dive Into Python 3" online book come reasonably close to meeting your request (and is also easy to check out, since it is a free online book;-).

Generate multiple random numbers to equal a value in python

10 votes

So here is the deal: I want to (for example) generate 4 pseudo-random numbers, that when added together would equal 40. How could this be dome in python? I could generate a random number 1-40, then generate another number between 1 and the remainder,etc, but then the first number would have a greater chance of "grabbing" more.

b = random.randint(2, 38)
a = random.randint(1, b - 1)
c = random.randint(b + 1, 39)
return [a, b - a, c - b, 40 - c]

(I assume you wanted integers since you said "1-40", but this could be easily generalized for floats.)

Here's how it works:

  • cut the total range in two randomly, that's b. The odd range is because there are going to be at least 2 below the midpoint and at least 2 above. (This comes from your 1 minimum on each value).
  • cut each of those ranges in two randomly. Again, the bounds are to account for the 1 minimum.
  • return the size of each slice. They'll add up to 40.

How to rewrite python script output in terminal?

8 votes

I have a python script and I want to make it display a increasing number from 0 to 100% in the terminal, I know how to print the numbers on the terminal but how can I "rewrite" them so 0 turns into 1, 1 into 2, and so on until 100?

Printing a carriage return (\r) without a newline resets the cursor to the beginning of the line, making the next print overwriting what's already printed:

import time
import sys
for i in range(100):
    print i,
    sys.stdout.flush()
    time.sleep(1)
    print "\r",

Simulate Mouse Clicks on Python

7 votes

I'm currently in the process of making my Nintendo Wiimote (Kinda sad actually) to work with my computer as a mouse. I've managed to make the nunchuk's stick control actually move the mouse up and down, left and right on the screen! This was so exciting. Now I'm stuck.

I want to left/right click on things via python when i click A, When I went to do a search, All it came up with was tkinker?

So my question is, What do I call to make python left/right click on the desktop, and if it's possible, maybe provide a snippet?

Thank you for your help!

NOTE: I guess I forgot to mention that this is for Linux.

python-uinput is very easy to use.

http://codegrove.org/projects/python-uinput

Here's an example http://github.com/tuos/python-uinput/blob/master/examples/mouse.py

Python 2.5 Windows Binaries?

6 votes

I need to test an issue that occurs on Windows with Python 2.5, but the releases page doesn't link to a binary for 2.5.

Is there anywhere I could find a copy?

It's on their FTP server still, it's just the link that's gone:

http://www.python.org/ftp/python/2.5/

You'll want one of the MSI files, depending on your Windows version (32-bits or 64-bits).