Best python questions in March 2012

Seeking clarification on apparent contradictions regarding weakly typed languages

52 votes

I think I understand strong typing, but every time I look for examples for what is weak typing I end up finding examples of programming languages that simply coerce/convert types automatically.

For instance, in this article named Typing: Strong vs. Weak, Static vs. Dynamic says that Python is strongly typed because you get an exception if you try to:

Python

1 + "1"
Traceback (most recent call last):
File "", line 1, in ? 
TypeError: unsupported operand type(s) for +: 'int' and 'str'

However, such thing is possible in Java and in C#, and we do not consider them weakly typed just for that.

Java

  int a = 10;
  String b = "b";
  String result = a + b;
  System.out.println(result);

C#

int a = 10;
string b = "b";
string c = a + b;
Console.WriteLine(c);

In this another article named Weakly Type Languages the author says that Perl is weakly typed simply because I can concatenate a string to a number and viceversa without any explicit conversion.

Perl

$a=10;
$b="a";
$c=$a.$b;
print $c; #10a

So the same example makes Perl weakly typed, but not Java and C#?.

Gee, this is confusing enter image description here

The authors seem to imply that a language that prevents the application of certain operations on values of different types is strongly typed and the contrary means weakly typed.

Therefore, at some point I have felt prompted to believe that if a language provides a lot of automatic conversions or coercion between types (as perl) may end up being considered weakly typed, whereas other languages that provide only a few conversions may end up being considered strongly typed.

I am inclined to believe, though, that I must be wrong in this interepretation, I just do not why or how to explain it.

So, my questions are:

  • What does it really mean for a language to be truly weakly typed?
  • Could you mention any good examples of weakly typing that are not related to automatic conversion/automatic coercion done by the language?
  • Can a language be weakly typed and strongly typed at the same time?

Thanks in advance for any references, use cases or examples that you provide that can lead me into the right direction.

What does it really mean for a language to be "weakly typed"?

It means "this language uses a type system that I find distasteful". A "strongly typed" language by contrast is a language with a type system that I find pleasant.

The terms are essentially meaningless and you should avoid them. Wikipedia lists eleven different meanings for "strongly typed", several of which are contradictory. This indicates that the odds of confusion being created are high in any conversation involving the term "strongly typed" or "weakly typed".

All that you can really say with any certainty is that a "strongly typed" language under discussion has some additional restriction in the type system, either at runtime or compile time, that a "weakly typed" language under discussion lacks. What that restriction might be cannot be determined without further context.

Instead of using "strongly typed" and "weakly typed", you should describe in detail what kind of type safety you mean. For example, C# is a statically typed language and a type safe language and a memory safe language, for the most part. C# allows all three of those forms of "strong" typing to be violated. The cast operator violates static typing; it says to the compiler "I know more about the runtime type of this expression than you do". If the developer is wrong, then the runtime will throw an exception in order to protect type safety. If the developer wishes to break type safety or memory safety, they can do so by turning off the type safety system by making an "unsafe" block. In an unsafe block you can use pointer magic to treat an int as a float (violating type safety) or to write to memory you do not own. (Violating memory safety.)

C# imposes type restrictions that are checked at both compile-time and at runtime, thereby making it a "strongly typed" language compared to languages that do less compile-time checking or less runtime checking. C# also allows you to in special circumstances do an end-run around those restrictions, making it a "weakly typed" language compared with languages which do not allow you to do such an end-run.

Which is it really? It is impossible to say; it depends on the point of view of the speaker and their attitude towards the various language features.

In Python, why tuples, an immutable type, can contain mutable items such as lists?

50 votes

In Python, why tuples, an immutable type, can contain mutable items such as lists?

It is seemingly a contradiction that when a mutable item such as a list does get modified, the tuple it belongs to maintains being immutable.

That's an excellent question.

Tuples are characterized less by their immutability and more by their intended purpose. Tuples are Python's way of collecting heterogenous pieces of information under one roof. For example, s = ('www.python.org', 80) brings together a string and a number so that the host/port pair can be passed around as a socket, a composite object. Viewed in that light, it is perfectly reasonable to have mutable components.

Immutability goes hand-in-hand with another property, hashability. But hashability isn't an absolute property. If one of the tuple's components isn't hashable, then the overall tuple isn't hashable either. For example, t = ('red', [10, 20, 30]) isn't hashable.

The last example shows a 2-tuple that contains a string and a list. The tuple itself isn't mutable (i.e. it doesn't have any methods that for changing its contents). Likewise, the string is immutable because strings don't have any mutating methods. The list object does have mutating methods, so it can be changed. This shows that mutability is a property of an object type -- some objects have mutating methods and some don't. This doesn't change just because the objects are nested.

Remember, immutability is not magic. It is just an object that is read-only (ie. it doesn't have any update methods).

Hope, this was useful to you :-)

Library to check if two regular expressions are equal/isomorphic

25 votes

I need a library which will take in two regular expressions and determine whether they are isomorphic (i.e. match exactly the same set of strings or not) For example a|b is isomorphic to [ab]

As I understand it, a regular expression can be converted to an NFA which in some cases can be efficiently converted to a DFA. The DFA can then be converted to a minimal DFA, which, if I understand it correctly, is unique and so these minimal DFA's can then be compared for equality. I realize that not all regular expression NFA's can be efficently transformed into DFA's (especially when they were generate from Perl Regexps which are not truly "regular") in which case ideally the library would just return an error or some other indication that the conversion is not possible.

I see tons of articles and academic papers on-line about doing this (and even some programming assignments for classes asking students to do this) but I can't seem to find a library which implements this functionality. I would prefer a Python and/or C/C++ library, but a library in any language will do. Does anyone know if such a library? If not, does someone know of a library that gets close that I can use as a starting point?

Haven't tried it, but Regexp:Compare for Perl looks promising: two regex's are equivalent if the language of the first is a subset of the second, and vice verse.

Porting Python to an embedded system

20 votes

I am working with an ARM Cortex M3 on which I need to port Python (without operating system). What would be my best approach? I just need the core Python and basic I/O.

There are a few projects that have attempted to port Python to the situation you mention, take a look at python-on-a-chip, PyMite or tinypy. These are aimed at lower power microcontrollers without an OS and tend to focus on slightly older versions of the Python language and reduced library support.

Is writing a daemon in Python a good idea?

15 votes

I have to write a daemon program that constantly runs in the background and performs some simple tasks. The logic is not complicated at all, however it has to run for extended periods of time and be stable.

I think C++ would be a good choice for writing this kind of application, however I'm also considering Python since it's easier to write and test something quickly in it.

The problem that I have with Python is that I'm not sure how its runtime environment is going to behave over extended periods of time. Can it eat up more and more memory because of some GC quirks? Can it crash unexpectedly? I've never written daemons in Python before, so if anyone here did, please share your experience. Thanks!

I've written a number of daemons in Python for my last company. The short answer is, it works just fine. As long as the code itself doesn't have some huge memory bomb, I've never seen any gradual degradation or memory hogging. Be mindful of anything in the global or class scopes, because they'll live on, so use del more liberally than you might normally. Otherwise, like I said, no issues I can personally report.

And in case you're wondering, they ran for months and months (let's say 6 months usually) between routine reboots with zero problems.

Determine non-convex hull of collection of line segments

15 votes

I have a computational geometry problem that I feel should have a relatively simple solution, but I can't quite figure it out.

I need to determine the non-convex outline of a region defined by several line segments.

I'm aware of various non-convex hull algorithms (e.g. alpha shapes), but I don't need a fully general algorithm, as the line segments define a unique solution in most cases.


As @Jean-FrançoisCorbett has pointed out, there are cases where there are multiple solutions. I clearly need to think more about my definition.

However, what I'm trying to do is reverse-engineer and use a proprietary file format so that I can run basic analyses on data collected by myself and others. The file format is simple enough, but determining the algorithm they use to define the boundary is considerably harder.

Putting in many of the edge cases that would result in a non-unique solution causes the software in question to either crash without warning or silently fail to read the file.

Therefore, when there are multiple solutions, either generating one of the acceptable solutions or being able to determine that there are multiple solutions would be acceptable.


Problem Definition:

The polygon's outline should never cross any of the segments and should be formed of lines joining all of the segments' endpoints. All segments must lie entirely inside or along the boundary of the polygon. No endpoint may be used more than once in the outline (Ignoring "closing" the polygon by adding the first point at the end for software libraries that require polygons to close.).

In cases where there are multiple solutions that meet this criteria, any one of those solutions would be acceptable. (It would be nice to be able to determine when the solution is non-unique, but this isn't strictly necessary.)


Examples:

As an example, I have something along these lines: Segments Defining the Area

And I'd like to delineate the following area: Desired Outline

It also should work for non-intersecting segments. E.g.

enter image description here enter image description here

I think (?) there's a unique solution in either case, subject to the criteria outline earlier. (Edit: There isn't a unique solution in general, as @Jean-FrançoisCorbett pointed out. However, I'm still interested in an algorithm that would either generate one of the acceptable solutions.)

Test Cases

For a test case, here's the code to generate the above figures. I'm using python here, but the question is language-agnostic.

import matplotlib.pyplot as plt

def main():
    test1()
    test2()
    plt.show()

def test1():
    """Intersecting segments."""
    segments = [[(1, 1), (1, 3)],
                [(3.7, 1), (2, 4)],
                [(2, 0), (3.7, 3)],
                [(4, 0), (4, 4)],
                [(4.3, 1), (4.3, 3)],
                [(0, 2), (6, 3)]]

    desired_outline = [segments[0][0], segments[5][0], segments[0][1], 
                       segments[1][1], segments[2][1], segments[3][1],
                       segments[4][1], segments[5][1], segments[4][0],
                       segments[3][0], segments[1][0], segments[2][0],
                       segments[0][0]]

    plot(segments, desired_outline)

def test2():
    """Non-intersecting segments."""
    segments = [[(0, 1), (0, 3)],
                [(1, 0), (1, 4)],
                [(2, 1), (2, 3)],
                [(3, 0), (3, 4)]]

    desired_outline = [segments[0][0], segments[0][1], segments[1][1],
                       segments[2][1], segments[3][1], segments[3][0], 
                       segments[2][0], segments[1][0], segments[0][0]]

    plot(segments, desired_outline)


def plot(segments, desired_outline):
    fig, ax = plt.subplots()
    plot_segments(ax, segments)
    ax.set_title('Segments')

    fig, ax = plt.subplots()
    ax.fill(*zip(*desired_outline), facecolor='gray')
    plot_segments(ax, segments)
    ax.set_title('Desired Outline')

def plot_segments(ax, segments):
    for segment in segments:
        ax.plot(*zip(*segment), marker='o', linestyle='-')
    xmin, xmax, ymin, ymax = ax.axis()
    ax.axis([xmin - 0.5, xmax + 0.5, ymin - 0.5, ymax + 0.5])

if __name__ == '__main__':
    main()

Any ideas?

I'm beginning to suspect that the software whose results I'm trying to reproduce uses a radial-sweep algorithm in some sort of "internal" coordinate system (e.g. A coordinate system with x-prime and y-prime scaled and rotated along the principal axes defined by the spread of points. This makes the problem more "circular".) However, this produces solutions where the outline intersects line segments in many cases. It's easy enough to detect this and brute force it from there, but surely there's a better way?

  1. Pick a safe starting point. Can be e.g. the endpoint with maximum x.
  2. March along the line segment.
  3. Upon encountering any intersection, always turn left and march along this new segment.
  4. Upon encountering an endpoint, record it. Goto 2.
  5. Stop when you have returned to your starting point. Your list of recorded endpoints now makes up the ordered list of vertices of your concave hull.

NB: This will fail if there is a "free-floating" outlying line segment that does not intersect any other line segment. However, you specify that "the bars uniquely define a solution", which rules out this fail condition. (Outlying segments make possible two unique solutions.)

EDIT ... or rather, outlying segments can make two unique solutions possible -- depending on the exact layout. Proof: Below is an example where the yellow segment that I added makes two solutions possible (blue and grey horribly hand-drawn lines). Were the yellow segment orientated perpendicular to the way it's drawn now, only one solution would be possible. Sounds like your problem is poorly defined.

enter image description here

EDIT Actually this can also fail if your segment collection is "very concave", i.e. if there are endpoints tucked away in recluse corners of your pile of segments. In the figure below I added a black segment. My algorithm would illegally join its endpoint to another endpoint (dashed grey line). I'll leave my answer be in case others are inclined to build upon it.

EDIT after giving this some more thought: Even in the "very concave" case, this solution will definitely give you all of the points of your concave hull in the proper order, but they may be interspersed with extra, inappropriate points such as the black one. So there may be too many points.

The answer is then, of course, to do some pruning. It would be fairly complicated pruning especially if you can have multiple, consecutive "recluse points" like the black one, so I don't have a smart algorithm in mind. But even blind, brute force could be feasible. Each point can be either accepted or rejected (boolean), so if you have N properly ordered candidate points in your concave hull, then there are only 2^N possibilities to check. This is way, way fewer possibilities than brute force for your original problem of permutations, which would have SUM of (n!/(n-k)!) for k=1:(n-1) possibilities (pardon my notation). So this algorithm narrows down your problem significantly.

I think this is the way to go.

enter image description here

Why is it possible to iterate along a string?

13 votes

I'm trying to understand why I can iterate along the string. What I see in the documentation is:

One method needs to be defined for container objects to provide iteration support:

container.__iter__()

Return an iterator object. The object is required to support the iterator protocol described below. If a container supports different types of iteration, additional methods can be provided to specifically request iterators for those iteration types. (An example of an object supporting multiple forms of iteration would be a tree structure which supports both breadth-first and depth-first traversal.) This method corresponds to the tp_iter slot of the type structure for Python objects in the Python/C API.

The iterator objects themselves are required to support the following two methods, which together form the iterator protocol:

iterator.__iter__()

Return the iterator object itself. This is required to allow both containers and iterators to be used with the for and in statements. This method corresponds to the tp_iter slot of the type structure for Python objects in the Python/C API.

iterator.next()

Return the next item from the container. If there are no further items, raise the StopIteration exception. This method corresponds to the tp_iternext slot of the type structure for Python objects in the Python/C API.

But...

>>> dir('aa')
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
 '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
 '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__',
 '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__',
 '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__',
 '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
 '_formatter_field_name_split', '_formatter_parser', 'capitalize',
 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs',
 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower',
 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip',
 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition',
 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip',
 'swapcase', 'title', 'translate', 'upper', 'zfill']

I don't see here any __iter__() or next(). So why does it work?

Iterators were new in Python 2.2. The old method was the sequence protocol (implements __getitem__ with 0-based indices) and still works.

In practice, what are the main uses for the new "yield from" syntax in Python 3.3?

13 votes

I'm having a hard time wrapping my brain around PEP 380.

  1. What are the situations where "yield from" is useful?
  2. What is the classic use case?
  3. Why is it compared to micro-threads?

Thanks in advance!

[ update ]

Now I understand the cause of my difficulties. I've used generators, but never used coroutines (introduced by PEP-342). Despite some similarities, generators and coroutines are basically two different concepts. Understanding coroutines (not only generators) is the key to understanding the new syntax.

IMHO coroutines are the most obscure Python feature, books make it look useless and uninteresting.

Thanks for the great answers, but special thanks to agf for the link to David Beazley presentations. David rocks.

What are the situations where "yield from" is useful?

Every situation where you have a loop like that in your code:

for x in subgenerator:
  yield x

As the PEP describes, this is a rather naive attempt at using the subgenerator, it's missing several aspects, especially the proper handling of the throw()/send()/close() mechanisms introduced by PEP 342. To do this properly, rather complicated code is necessary.

What is the classic use case?

For example, if you want to extract information from a recursive data structure. Let's say we want to get all leaf nodes in a tree:

def traverse_tree(node):
  if not node.children:
    yield node
  for child in node.children:
    yield from traverse_tree(child)

Even more important is the fact that today, there doesn't exist a simple method of refactoring generator code. Suppose you have a (senseless) generator like this:

def get_list_values(lst):
  for item in lst:
    yield int(item)
  for item in lst:
    yield str(item)
  for item in lst:
    yield float(item)

Now you decide to factor out these loops into separate generators. Without yield from, this is ugly, up to the point where you think twice whether you actually want to do it. With yield from, it's actually nice to look at:

def get_list_values(lst):
  for sub in [get_list_values_as_int, 
              get_list_values_as_str, 
              get_list_values_as_float]:
    yield from sub(lst)

Why is it compared to micro-threads?

I think what this section in the PEP is talking about is that every generator does have it's own, isolated execution context. Together with the fact that execution is switched between the generator-iterator and the caller using yield and __next__(), respectively, this is similar to threads, where the operating system switches the executing thread from time to time, along with the execution context (stack, registers, ...).

The effect of this is also comparable: Both the generator-iterator and the caller progress in their execution state at the same time. For example, if the generator does some kind of computation and the caller prints out the results, you'll see the results as soon as they're available. This is a form of parallelization.

That analogy isn't anything specific to yield from, though, it's rather a general property of generators in Python.

How to handle utf8 on the command line (using Perl or Python)?

12 votes

How can I handle utf8 using Perl (or Python) on the command line?

I am trying to split the characters in each word, for example. This is very easy for non-utf8 text, for example:

$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c   d e f

But with utf8 it doesn't work, of course:

$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5>   <D0> <B7> <D0> <B0>

because it doesn't know about the 2-byte characters.

It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.

The "-C" flag controls some of the Perl Unicode features (see perldoc perlrun):

$ echo "одобрение за" | perl -C -pe 's/.\K/ /g'
о д о б р е н и е   з а 

To specify encoding used for stdin/stdout you could use PYTHONIOENCODING environment variable:

$ echo "одобрение за" | PYTHONIOENCODING=utf-8 python -c'import sys
for line in sys.stdin:
    print " ".join(line.decode(sys.stdin.encoding)),
'
о д о б р е н и е   з а 

If you'd like to split the text on characters (grapheme) boundaries (not on codepoints as the code above) then you could use /\X/ regular expression:

$ echo "одобрение за" | perl -C -pe 's/\X\K/ /g'
о д о б р е н и е   з а 

See Grapheme Cluster Boundaries

In Python \X is supported by regex module.

What is Python's equivalent of "perl -V"

11 votes

The output produced by running perl -V is packed with useful information (see example below). Is there anything like it for Python?


Example output:

% perl -V
Summary of my perl5 (revision 5 version 10 subversion 1) configuration:

  Platform:
    osname=linux, osvers=2.6.32-5-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux brahms 2.6.32-5-amd64 #1 smp tue jun 14 09:42:28 utc 2011 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.5', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.11.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1
    gnulibc_version='2.11.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'


Characteristics of this binary (from libperl): 
  Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
                        PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP USE_64_BIT_ALL
                        USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
                        USE_PERLIO USE_REENTRANT_API
  Locally applied patches:
    DEBPKG:debian/arm_thread_stress_timeout - http://bugs.debian.org/501970 Raise the timeout of ext/threads/shared/t/stress.t to accommodate slower build hosts
    DEBPKG:debian/cpan_config_path - Set location of CPAN::Config to /etc/perl as /usr may not be writable.

    <snip-- iow patches galore --you get the picture>

    DEBPKG:fixes/safe-reval-rdo-cve-2010-1447 - [PATCH] Wrap by default coderefs returned by rdo and reval
    DEBPKG:patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.10.1-17squeeze2 in patchlevel.h
  Built under linux
  Compiled at Jun 30 2011 22:28:00
  @INC:
    /etc/perl
    /usr/local/lib/perl/5.10.1
    /usr/local/share/perl/5.10.1
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    .

Not to be confused with the much less informative perl -v:

% perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 53 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

python -c 'import sysconfig, pprint; pprint.pprint(sysconfig.get_config_vars())'

Why do python exceptions typically not print offending values?

11 votes

For instance, consider the case:

>>> a = []
>>> a[12]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range

the exception does not print the value that was out of range.

My guess is that we don't know if the __str__ function of whatever was passed in raises an exception, so we don't touch it?

The value could be printed, it just isn't.

However, there is a tb module on PyPI that actually will print the values of the variables in the stack trace.

Convert rgb color to english color name, like 'green'

11 votes

I want to convert a color tuple to a color name, like 'yellow' or 'blue'

>>> im = Image.open("test.jpg")
>>> n, color = max(im.getcolors(im.size[0]*im.size[1]))
>>> print color
(119, 172, 152)

Is there a simple way in python to do this?

It looks like webcolors will allow you to do this:

rgb_to_name(rgb_triplet, spec='css3')

Convert a 3-tuple of integers, suitable for use in an rgb() color triplet, to its corresponding normalized color name, if any such name exists; valid values are html4, css2, css21 and css3, and the default is css3.

Example:

>>> rgb_to_name((0, 0, 0))
'black'

it is vice-versa-able:

>>> name_to_rgb('navy')
(0, 0, 128)

To find the closest colour name:

However webcolors raises an exception if it can't find a match for the requested colour. I've written a little fix that delivers the closest matching name for the requested RGB colour. It matches by Euclidian distance in the RGB space.

import webcolors

def closest_colour(requested_colour):
    min_colours = {}
    for key, name in webcolors.css3_hex_to_names.items():
        r_c, g_c, b_c = webcolors.hex_to_rgb(key)
        rd = (r_c - requested_colour[0]) ** 2
        gd = (g_c - requested_colour[1]) ** 2
        bd = (b_c - requested_colour[2]) ** 2
        min_colours[(rd + gd + bd)] = name
    return min_colours[min(min_colours.keys())]

def get_colour_name(requested_colour):
    try:
        closest_name = actual_name = webcolors.rgb_to_name(requested_colour)
    except ValueError:
        closest_name = closest_colour(requested_colour)
        actual_name = None
    return actual_name, closest_name

requested_colour = (119, 172, 152)
actual_name, closest_name = get_colour_name(requested_colour)

print "Actual colour name:", actual_name, ", closest colour name:", closest_name

Output:

Actual colour name: None , closest colour name: cadetblue

Why is math.factorial much slower in Python 2.x than 3.x?

11 votes

I get the following results on my machine:

Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('factorial(10000)', 'from math import factorial', number=100)
1.9785256226699202
>>>

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> timeit.timeit('factorial(10000)', 'from math import factorial', number=100)
9.403801111593792
>>>

I thought this might have something to do with int/long conversion, but factorial(10000L) isn't any faster in 2.7.

Python 2 uses the naive factorial algorithm:

1121 for (i=1 ; i<=x ; i++) {
1122     iobj = (PyObject *)PyInt_FromLong(i);
1123     if (iobj == NULL)
1124         goto error;
1125     newresult = PyNumber_Multiply(result, iobj);
1126     Py_DECREF(iobj);
1127     if (newresult == NULL)
1128         goto error;
1129     Py_DECREF(result);
1130     result = newresult;
1131 }

Python 3 uses the divide-and-conquer factorial algorithm:

1229 * factorial(n) is written in the form 2**k * m, with m odd. k and m are
1230 * computed separately, and then combined using a left shift.

See the Python Bugtacker issue for the discussion. Thanks DSM for pointing that out.

Why raising a tuple works if first element is an Exception?

11 votes

I have a hard time figuring this one out, it's about mistakes that can be done when raising an exception in Python 2.7:

try:
  raise [1, 2, 3, 4]
except Exception as ex:
  print ex

the message here is "exceptions must be old-style classes or derived from BaseException, not list" - This part is ok, but when I change it to tuple, I am getting confused:

try:
  raise (1, 2, 3, 4)
except Exception as ex:
  print ex

the message here is "exceptions must be old-style classes or derived from BaseException, not int" - why is it interpreted as raising an int, not a tuple?

Futhermore:

try:
  raise (Exception, 'a message')
except Exception as ex:
  print ex

Here we are actually rising an Exception (consistent behaviour when compared with previous example, where we were raising an int) - I briefly thought that this is just an alternate way for this:

try:
  raise Exception, 'a message'
except Exception as ex:
  print ex

But in this case, 'a message' is being passed to Exceptions ctor (as documented on docs.python.org)

Can someone explain the 2nd and 3rd cases, and possible point me to code in interpreter that is responsible for this?

As documented in the python reference, the raise statement takes up to 3 expressions to create the exception being raised:

raise_stmt ::= "raise" [expression ["," expression ["," expression]]]

In python 2, if the first expression is a tuple, python will 'unwrap' the tuple recursively, taking the first element until it finds something other than a tuple. This behavior is being removed from Python 3 (see PEP 3109. The following is legal:

>>> raise ((Exception, 'ignored'), 'ignored'), 'something', None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: something

The documentation explains the rest in more detail, but the raise statement expects the first value to be a Exception class, the second value is seen as the value of the exception (the message) and the third value is a traceback. Python fills in None for the latter two values if missing.

If the first value is a instance instead, the second value must be None:

>>> raise Exception('something'), 'something', None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: instance exception may not have a separate value

If you use a tuple of more than 3 items, it'll raise a syntax error:

>>> raise Exception, 'something', None, None
  File "<stdin>", line 1
    raise Exception, 'something', None, None
                                      ^
SyntaxError: invalid syntax

In your case however, you raised neither a class nor an instance, so that's what Python found to be incorrect first; if I use a string it'll complain too:

>>> raise 'not an exception', 'something', None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: exceptions must be old-style classes or derived from BaseException, not str

The correct syntax is of course:

>>> raise Exception, 'something', None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Exception: something

Bytes in a unicode Python string

11 votes

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Which then decoded from UTF-8 gives the initial string with bytes in them, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

In Python 2, Unicode strings may contain both unicode and bytes:

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).

There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.

List modification in a loop

10 votes

I'm a newcomer to Python but I understand that things should not be done this way, so please consider the following code snippets as purely educational :-)

I'm currently reading 'Learning Python' and trying to fully understand the following example:

>>> L = [1, 2, 3, 4, 5]
>>> for x in L:
...    x += 1
...
>>> L
[1, 2, 3, 4, 5]

I did not understand if this behavior was somewhat related to the immutability of the numeric types, so I've run the following test:

>>> L = [[1], [2], [3], [4], [5]]
>>> for x in L:
...    x += ['_']
...
>>> L
[[1, '_'], [2, '_'], [3, '_'], [4, '_'], [5, '_']]

Question: what makes the list unchanged in the first code and changed in the second ?

My intuition is that the syntax is misleading and that:

  • x += 1 for an integer really means x = x + 1 (thus assigning a new reference)
  • x += ['_'] for a list really means x.extend('_') (thus changing the list in place)

Question: what makes the list unchanged in the first code and changed in the second ?

In the first code, the list is a sequence of (immutable) integers. The loop sets x to refer to each element of the sequence in turn. x += 1 changes x to refer to a different integer that is one more than the value x previously referred to. The element in the original list is unchanged.

In the second code, the list if a sequence of (mutable) lists. The loop sets x to refer to each element of the sequence in turn. x += ['_'] as x refers to a list, this extends the list referred to by x with ['_'].

Good sound libraries?

8 votes

I need to take an audio signal, and extract overlapping audio frames from it. I then need to convert these to frequency data (FFT stuff / like a spectrogram) and analyze the frequency information.

For example, if I have a 1 minute mp3 file, I want split the file into smaller files, from 00:00.000 to 00:03.000, from 00:00.010 to 00:03.010. Then I need to see the frequency breakdown of each sub-file.

Which programming languages have good audio tools that could help me do this? Are there linux command-line tools I could use? Bonus points for Node.js (yeah right) or Haskell, which I'm most familiar with.

Haskell:

http://hackage.haskell.org/package/hsndfile. Then it's mainly just math, I'd imagine, with hmatrix and soforth.

What OCR options exist beyond Tesseract?

7 votes

I've used Tesseract a bit and it's results leave much to be desired. I'm currently detecting very small images (35x15, without border, but have tried adding one with imagemagick with no ocr advantage); they range from 2 chars to 5 and are a pretty reliable font, however the characters are variable enough that simply using an image size checksum or such is not going to work.

I actually have tried: http://www.free-ocr.co.uk/ and surprisingly it has 100% accuracy. The problem I have with utilizing it is that I cannot rely on another outside service's reliability for this particular use case. I need to be able to control uptime to a higher degree.

What options exist for OCR besides sticking with Tesseract or doing a complete custom training of it? Also, it would be VERY helpful if this were compatible with Heroku style hosting (at least where I can compile the bins and shove them over).

I have successfully used GOCR in the past for small image OCR. I would say accuracy was around 85%, after getting the grayscale options set properly, on fairly regular fonts. It fails miserably when the fonts get complicated and has trouble with multiline layouts.

Also have a look at Ocropus, which is maintained by Google. Its related to Tesseract, but from what I understand, its OCR engine is different. With just the default models included, it achieves near 99% accuracy on high-quality images, handles layout pretty well and provides HTML output with information concerning formatting and lines. However, in my experience, its accuracy is very low when the image quality is not good enough. That being said, training is relatively simple and you might want to give it a try.

Both of them are easily callable from the command line. GOCR usage is very straightforward; just type gocr -h and you should have all the information you need. Ocropus is a bit more tricky; here's a usage example, in Ruby:

require 'fileutils'
tmp = 'directory'
file = 'file.png'

`ocropus book2pages #{tmp}/out #{file}`
`ocropus pages2lines #{tmp}/out`
`ocropus lines2fsts #{tmp}/out`
`ocropus buildhtml #{tmp}/out > #{tmp}/output.html`

text = File.read("#{tmp}/output.html")
FileUtils.rm_rf(tmp)

What is the best way to get a semi long unique id (non sequential) key for Database objects

7 votes

Iam building a web app and I would like my URL scheme to look something like this:

someurl.com/object/FJ1341lj

Currently I just use the primary key from my SQL Alchemy objects, but the problem is that I dont want the Urls to be sequential or low numbers. For instance my URLs look like this:

someurl.com/object/1
someurl.com/object/2

Encoding the integers

You could use a reversible encoding for your integers:

def int_str(val, keyspace):
    """ Turn a positive integer into a string. """
    assert val >= 0
    out = ""
    while val > 0:
        val, digit = divmod(val, len(keyspace))
        out += keyspace[digit]
    return out[::-1]

def str_int(val, keyspace):
    """ Turn a string into a positive integer. """
    out = 0
    for c in val:
        out = out * len(keyspace) + keyspace.index(c)
    return out

Quick testing code:

keyspace = "fw59eorpma2nvxb07liqt83_u6kgzs41-ycdjh" # Can be anything you like - this was just shuffled letters and numbers, but...
assert len(set(keyspace)) == len(keyspace) # each character must occur only once

def test(v):
    s = int_str(v, keyspace)
    w = str_int(s, keyspace)
    print "OK? %r -- int_str(%d) = %r; str_int(%r) = %d" % (v == w, v, s, s, w)

test(1064463423090)
test(4319193500)
test(495689346389)
test(2496486533)

outputs

OK? True -- int_str(1064463423090) = 'antmgabi'; str_int('antmgabi') = 1064463423090
OK? True -- int_str(4319193500) = 'w7q0hm-'; str_int('w7q0hm-') = 4319193500
OK? True -- int_str(495689346389) = 'ev_dpe_d'; str_int('ev_dpe_d') = 495689346389
OK? True -- int_str(2496486533) = '1q2t4w'; str_int('1q2t4w') = 2496486533

Obfuscating them and making them non-continuous

To make the IDs non-contiguous, you could, say, multiply the original value with some arbitrary value, add random "chaff" as the digits-to-be-discarded - with a simple modulus check in my example:

def chaffify(val, chaff_size = 150, chaff_modulus = 7):
    """ Add chaff to the given positive integer.
    chaff_size defines how large the chaffing value is; the larger it is, the larger (and more unwieldy) the resulting value will be.
    chaff_modulus defines the modulus value for the chaff integer; the larger this is, the less chances there are for the chaff validation in dechaffify() to yield a false "okay".
    """
    chaff = random.randint(0, chaff_size / chaff_modulus) * chaff_modulus
    return val * chaff_size + chaff

def dechaffify(chaffy_val, chaff_size = 150, chaff_modulus = 7):
    """ Dechaffs the given chaffed value. The chaff_size and chaff_modulus parameters must be the same as given to chaffify() for the dechaffification to succeed.
    If the chaff value has been tampered with, then a ValueError will (probably - not necessarily) be raised. """
    val, chaff = divmod(chaffy_val, chaff_size)
    if chaff % chaff_modulus != 0:
        raise ValueError("Invalid chaff in value")
    return val

for x in xrange(1, 11):
    chaffed = chaffify(x)
    print x, chaffed, dechaffify(chaffed)

outputs (with randomness):

1 262 1
2 440 2
3 576 3
4 684 4
5 841 5
6 977 6
7 1197 7
8 1326 8
9 1364 9
10 1528 10

EDIT: On second thought, the randomness of the chaff may not be a good idea, as you lose the canonicality of each obfuscated ID -- this lacks the randomness but still has validation (changing one digit will likely invalidate the whole number if chaff_val is Large Enough).

def chaffify2(val, chaff_val = 87953):
    """ Add chaff to the given positive integer. """
    return val * chaff_val

def dechaffify2(chaffy_val, chaff_val = 87953):
    """ Dechaffs the given chaffed value. chaff_val must be the same as given to chaffify2(). If the value does not seem to be correctly chaffed, raises a ValueError. """
    val, chaff = divmod(chaffy_val, chaff_val)
    if chaff != 0:
        raise ValueError("Invalid chaff in value")
    return val

Putting it all together

document_id = random.randint(0, 1000000)
url_fragment = int_str(chaffify(document_id))
print "URL for document %d: http://example.com/%s" % (document_id, url_fragment)
request_id = dechaffify(str_int(url_fragment))
print "Requested: Document %d" % request_id

outputs (with randomness)

URL for document 831274: http://example.com/w840pi
Requested: Document 831274

Sort list of tuples considering locale

6 votes

Apparently PostgreSQL 8.4 and Ubuntu 10.04 cannot handle the updated way to sort W and V for Swedish alphabet. That is, it's still ordering them as the same letter like this (old definition for Swedish ordering):

  • Wa
  • Vb
  • Wc
  • Vd

it should be (new definition for Swedish ordering):

  • Vb
  • Vd
  • Wa
  • Wc

I need to order this correctly for a Python/Django website I'm building. I have tried various ways to just order a list of tuples created from a Django QuerySet using *values_list*. But since it's Swedish also å, ä and ö letters needs to be correctly ordered. Now I have either one or the other way both not both..

list_of_tuples = [(u'Wa', 1), (u'Vb',2), (u'Wc',3), (u'Vd',4), (u'Öa',5), (u'äa',6), (u'Åa',7)]

print '########## Ordering One ##############'
ordered_list_one = sorted(list_of_tuples, key=lambda t: tuple(t[0].lower()))
for item in ordered_list_one:
    print item[0]

print '########## Ordering Two ##############'
locale.setlocale(locale.LC_ALL, "sv_SE.utf8")
list_of_names = [u'Wa', u'Vb', u'Wc', u'Vd', u'Öa', u'äa', u'Åa']
ordered_list_two = sorted(list_of_names, cmp=locale.strcoll)
for item in ordered_list_two:
    print item

The examples gives:

########## Ordering One ##############
Vb
Vd
Wa
Wc
äa
Åa
Öa
########## Ordering Two ##############
Wa
Vb
Wc
Vd
Åa
äa
Öa

Now, What I want is a combination of these so that both V/W and å,ä,ö ordering are correct. To be more precise. I want Ordering One to respect locale. By then using the second item (object id) in each tuple I could fetch the correct object in Django.

I'm starting to doubt that this will be possible? Would upgrading PostgreSQL to a newer version that handles collations better and then use raw SQL in Django be possible?

When running LC_ALL=sv_SE.UTF-8 sort on your example on Ubuntu-10.04, it comes out with Wa before Vb (the "old way"), so Ubuntu does not seem to agree with the "new way". Since PostgreSQL relies on the operating system for this, it will behave just the same as the OS given the same lc_collate.

There is actually a patch in debian glibc related to this particular sort issue: http://sourceware.org/bugzilla/show_bug.cgi?id=9724 But it was objected to and not accepted. If you only need this behavior on a system you administer, you can still apply the change of the patch to /usr/share/i18n/locales/sv_SE and rebuild the se_SV locale by running locale-gen sv_SE.UTF-8. Or better yet, create your own alternative locale derived from it to avoid messing with the original.