Best python questions in April 2012

Which classes cannot be subclassed?

46 votes

Is there any rule about which built-in and standard library classes are not subclassable ("final")?

As of Python 3.3, here are a few examples:

  • bool
  • function
  • operator.itemgetter
  • slice

I found a question which deals with the implementation of "final" classes, both in C and pure Python.

I would like to understand what reasons may explain why a class is chosen to be "final" in the first place.

There seems to be two reasons for a class to be "final" in Python.

1. Violation of Class Invariant

Classes that follow Singleton pattern have an invariant that there's a limited (pre-determined) number of instances. Any violation of this invariant in a subclass will be inconsistent with the class' intent, and would not work correctly. Examples:

  • bool: True, False; see Guido's comments
  • NoneType: None
  • NotImplementedType: NotImplemented
  • ellipsis: Ellipsis

There may be cases other than the Singleton pattern in this category but I'm not aware of any.

2. No Persuasive Use Case

A class implemented in C requires additional work to allow subclassing (at least in CPython). Doing such work without a convincing use case is not very attractive, so volunteers are less likely to come forward. Examples:

Note 1:

I originally thought there were valid use cases, but simply insufficient interest, in subclassing of function and operator.itemgetter. Thanks to @agf for pointing out that the use cases offered here and here are not convincing (see @agf comments to the question).

Note 2:

My concern is that another Python implementation might accidentally allow subclassing a class that's final in CPython. This may result in non-portable code (a use case may be weak, but someone might still write code that subclasses function if their Python supports it). This can be resolved by marking in Python documentation all built-in and standard library classes that cannot be subclassed, and requiring that all implementations follow CPython behavior in that respect.

Note 3:

The message produced by CPython in all the above cases is:

TypeError: type 'bool' is not an acceptable base type

It is quite cryptic, as numerous questions on this subject show. I'll submit a suggestion to add a paragraph to the documentation that explains final classes, and maybe even change the error message to:

TypeError: type 'bool' is final (non-extensible)

Difference between the built-in pow() and math.pow() for floats, in Python?

32 votes

Is there a difference in the results returned by Python's built-in pow(x, y) (no third argument) and the values returned by math.pow()?

Edit: I am only interested in the case of two float arguments.

I am asking this question because the documentation for math.pow() implies that pow(x, y) (i.e. x**y) is essentially the same as math.pow(x, y):

math.pow(x, y)

Return x raised to the power y. Exceptional cases follow Annex ‘F’ of the C99 standard as far as possible. In particular, pow(1.0, x) and pow(x, 0.0) always return 1.0, even when x is a zero or a NaN. If both x and y are finite, x is negative, and y is not an integer then pow(x, y) is undefined, and raises ValueError.

Changed in version 2.6: The outcome of 1**nan and nan**0 was undefined.

Note the last line: the documentation implies that the behavior of math.pow() is that of the exponentiation operator ** (and therefore of pow(x, y)). Is this officially guaranteed?

Background: My goal is to provide an implementation of both the built-in pow() and of math.pow() for numbers with uncertainty that behaves in the same way as with regular Python floats (same numerical results, same exceptions, same results for corner cases, etc.). I have already implemented something that works quite well, but there are some corner cases that need to be handled.

Quick Check

From the signatures, we can tell that they are different:

pow(x, y[, z])

math.pow(x, y)

Also, trying it in the shell will give you a quick idea:

>>> pow is math.pow
False

Testing the differences

Another way to understand the differences in behaviour between the two functions is to test for them:

import math
import traceback
import sys

inf = float("inf")
NaN = float("nan")

vals = [inf, NaN, 0.0, 1.0, 2.2, -1.0, -0.0, -2.2, -inf, 1, 0, 2]

tests = set([])

for vala in vals:
  for valb in vals:
    tests.add( (vala, valb) )
    tests.add( (valb, vala) )


for a,b in tests:
  print("math.pow(%f,%f)"%(a,b) )
  try:
    print("    %f "%math.pow(a,b))
  except:
    traceback.print_exc()

  print("__builtins__.pow(%f,%f)"%(a,b) )
  try:
    print("    %f "%__builtins__.pow(a,b))
  except:
    traceback.print_exc()

We can then notice some subtle differences. For example:

math.pow(0.000000,-2.200000)
    ValueError: math domain error

__builtins__.pow(0.000000,-2.200000)
    ZeroDivisionError: 0.0 cannot be raised to a negative power

There are other differences, and the test list above is not complete (no long numbers, no complex, etc...), but this will give us a pragmatic list of how the two functions behave differently. I would also recommend extending the above test to check for the type that each function returns. You could probably write something similar that creates a report of the differences between the two functions.

math.pow()

math.pow() handles its arguments very differently from the builtin ** or pow(). This comes at the cost of flexibility. Having a look at the source, we can see that the arguments to math.pow() are cast directly to doubles:

static PyObject *
math_pow(PyObject *self, PyObject *args)
{
    PyObject *ox, *oy;
    double r, x, y;
    int odd_y;

    if (! PyArg_UnpackTuple(args, "pow", 2, 2, &ox, &oy))
        return NULL;
    x = PyFloat_AsDouble(ox);
    y = PyFloat_AsDouble(oy);
/*...*/

The checks are then carried out against the doubles for validity, and then the result is passed to the underlying C math library.

builtin pow()

The built-in pow() (same as the ** operator) on the other hand behaves very differently, it actually uses the Objects's own implementation of the ** operator, which can be overridden by the end user if need be by replacing a number's __pow__(), __rpow__() or __ipow__(), method.

For built-in types, it is instructive to study the difference between the power function implemented for two numeric types, for example, floats, long and complex.

Overridding the default behaviour

Emulating numeric types is described here. essentially, if you are creating a new type for numbers with uncertainty, what you will have to do is provide the __pow__(), __rpow__() and possibly __ipow__() methods for your type. This will allow your numbers to be used with the operator:

class Uncertain:
  def __init__(self, x, delta=0):
    self.delta = delta
    self.x = x
  def __pow__(self, other):
    return Uncertain(
      self.x**other.x, 
      Uncertain._propagate_power(self, other)
    )
  @staticmethod
  def _propagate_power(A, B):
    return math.sqrt(
      ((B.x*(A.x**(B.x-1)))**2)*A.delta*A.delta +
      (((A.x**B.x)*math.log(B.x))**2)*B.delta*B.delta
    )

In order to override math.pow() you will have to monkey patch it to support your new type:

def new_pow(a,b):
    _a = Uncertain(a)
    _b = Uncertain(b)
    return _a ** _b

math.pow = new_pow

Note that for this to work you'll have to wrangle the Uncertain class to cope with an Uncertain instance as an input to __init__()

why does python `any` return a bool instead of the value

21 votes

and and or return the last element they evaluated, but why doesn't Python's built-in function any?

I mean it's pretty easy to implement oneself like this, but I'm still left wondering why.

def any(l):
    for x in l:
        if x:
            return x
    return x

edit:

To add to the answers below, here's an actual quote from that same mailing list of ye mighty emperor on the issue:

Whether to always return True and False or the first faling / passing element? I played with that too before blogging, and realized that the end case (if the sequence is empty or if all elements fail the test) can never be made to work satisfactory: picking None feels weird if the argument is an iterable of bools, and picking False feels weird if the argument is an iterable of non-bool objects.

Guido van Rossum (home page: http://www.python.org/~guido/)

This very issue came up up on the Python developer's mailing list in 2005, when Guido Van Rossum proposed adding any and all to Python 2.5.

Bill Janssen requested that they be implemented as

def any(S):
    for x in S:
        if x:
            return x
    return S[-1]

def all(S):
    for x in S:
        if not x:
            return x
    return S[-1]

Raymond Hettinger, who implemented any and all, responded specifically addressing why any and all don't act like and and or:

Over time, I've gotten feedback about these and other itertools recipes. No one has objected to the True/False return values in those recipes or in Guido's version.

Guido's version matches the normal expectation of any/all being a predicate. Also, it avoids the kind of errors/confusion that people currently experience with Python's unique implementation of "and" and "or".

Returning the last element is not evil; it's just weird, unexpected, and non-obvious. Resist the urge to get tricky with this one.

The mailing list largely concurred, leaving the implementation as you see it today.

Any reason not to use + to concatenate two strings?

20 votes

A common antipattern in Python is to concatenate a sequence of strings using + in a loop. This is bad because the Python interpreter has to create a new string object for each iteration, and it ends up taking quadratic time. (Recent versions of CPython can apparently optimize this in some cases, but other implementations can't, so programmers are discouraged from relying on this.) ''.join is the right way to do this.

However, I've heard it said (including here on Stack Overflow) that you should never, ever use + for string concatenation, but instead always use ''.join or a format string. I don't understand why this is the case if you're only concatenating two strings. If my understanding is correct, it shouldn't take quadratic time, and I think a + b is cleaner and more readable than either ''.join((a, b)) or '%s%s' % (a, b).

Is it good practice to use + to concatenate two strings? Or is there a problem I'm not aware of?

There is nothing wrong in concatenating two strings with +. Indeed it's easier to read than ''.join(a, b).

You are right though that concatenating more than 2 strings with + is an O(n^2) operation (compared to O(n) for join) and thus becomes inefficient. However this has not to do with using a loop. Even a + b + c + ... is O(n^2), the reason being that each concatenation produces a new string.

CPython2.4 and abone try to mitigate that, but it's still advisable to use join when concatenating more than 2 strings.

Is there a Python equivalent of range(n) for multidimensional ranges?

20 votes

On Python, range(3) will return [0,1,2]. Is there an equivalent for multidimensional ranges?

range((3,2)) # [(0,0),(0,1),(1,0),(1,1),(2,0),(2,1)]

So, for example, looping though the tiles of a rectangular area on a tile-based game could be written as:

for x,y in range((3,2)):

Note I'm not asking for an implementation. I would like to know if this is a recognized pattern and if there is a built-in function on Python or it's standard/common libraries.

In numpy, it's numpy.ndindex. Also have a look at numpy.ndenumerate.

E.g.

import numpy as np
for x, y in np.ndindex((3,2)):
    print x, y

This yields:

0 0
0 1
1 0
1 1
2 0
2 1

Python Literal r'\' Not Accepted

17 votes

r'\' in Python does not work as expected. Instead of returning a string with one character (a backslash) in it, it raises a SyntaxError. r"\" does the same.

This is rather cumbersome if you have a list of Windows paths like these:

paths = [ r'\bla\foo\bar',
          r'\bla\foo\bloh',
          r'\buff',
          r'\',
          # ...
        ]

Is there a good reason why this literal is not accepted?

The answer to my question ("Why is a backslash not allowed as last character in raw strings?") actually to me seems to be "That's a design decision", furthermore a questionable one.

Some answers tried to reason that the lexer and some syntax highlighters are simpler this way. I don't agree (and I have some background on writing parsers and compiler as well as IDE development). It would be simpler to define raw strings with the semantics that a backslash has no special meaning whatsoever. Both lexer and IDE would benefit from this simplification.

The current situation also is a wart: In case I want a quote in a raw string, I cannot use this anyway. I only can use it if I happen to want a backslash followed by a quote inside my raw string.

I would propose to change this, but I also see the problem of breaking existing code :-/

Why is creating a class in Python so much slower than instantiating a class?

17 votes

I found that creation of a class is way slower than instantiation of a class.

>>> from timeit import Timer as T
>>> def calc(n):
...     return T("class Haha(object): pass").timeit(n)

<<After several these 'calc' things, at least one of them have a big number, eg. 100000>>

>>> calc(9000)
15.947055101394653
>>> calc(9000)
17.39099097251892
>>> calc(9000)
18.824054956436157
>>> calc(9000)
20.33335590362549

Yeah, create 9000 classes took 16 secs, and becomes even slower in the subsequent calls.

And this:

>>> T("type('Haha', b, d)", "b = (object, ); d = {}").timeit(9000)

gives similar results.

But instantiation don't suffer:

>>> T("Haha()", "class Haha(object): pass").timeit(5000000)
0.8786070346832275

5000000 instances in less than one sec.

What makes the creation this expensive?

And why the creation process become slower?

EDIT:

How to reproduce:

start a fresh python process, the initial several "calc(10000)"s give a number of 0.5 on my machine. And try some bigger values, calc(100000), it can't end in even 10secs, interrupt it, and calc(10000), gives a 15sec.

EDIT:

Additional fact:

If you gc.collect() after 'calc' becomes slow, you can get the 'normal' speed at beginning, but the timing will increasing in subsequent calls

>>> from a import calc
>>> calc(10000)
0.4673938751220703
>>> calc(10000)
0.4300072193145752
>>> calc(10000)
0.4270968437194824
>>> calc(10000)
0.42754602432250977
>>> calc(10000)
0.4344758987426758
>>> calc(100000)
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "a.py", line 3, in calc
    return T("class Haha(object): pass").timeit(n)
  File "/usr/lib/python2.7/timeit.py", line 194, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
KeyboardInterrupt
>>> import gc
>>> gc.collect()
234204
>>> calc(10000)
0.4237039089202881
>>> calc(10000)
1.5998330116271973
>>> calc(10000)
4.136359930038452
>>> calc(10000)
6.625348806381226

This might give you the intuition:

>>> class Haha(object): pass
...
>>> sys.getsizeof(Haha)
904
>>> sys.getsizeof(Haha())
64

Class object is much more complex and expensive structure that an instance of that class.

What is a generative method?

16 votes

I'm familiar with Python generators, however I've just come across the term "generative method" which I am not familiar with and cannot find a satisfactory definition.

To put it in context, I found the term in SQLAlchemy's narrative documentation:

Full control of the “autocommit” behavior is available using the generative Connection.execution_options() method provided on Connection, Engine, Executable, using the “autocommit” flag which will turn on or off the autocommit for the selected scope.

What is a generative method? Trying to iterate the object returned by Connection.execution_options() doesn't work so I'm figuring it's something other than a standard generator.

It doesn't appear to be a common database concept, but SQLAlchemy uses the term generative in the sense "generated by your program iteratively at runtime". (So, no connection to python generators). An example from the tutorial:

The Query object is fully generative, meaning that most method calls return a new Query object upon which further criteria may be added. For example, to query for users named “ed” with a full name of “Ed Jones”, you can call filter() twice, which joins criteria using AND:

>>> for user in session.query(User).\
...   filter(User.name=='ed').\
...   filter(User.fullname=='Ed Jones'):
...     print user

This call syntax is more commonly known as "method chaining", and the design that allows it as a "fluent interface".

So, in the case of Connection.execution_options(), "generative" means that it returns the modified connection object, so that you can chain the calls as above.

Why do -1 and -2 both hash to -2 in Python?

14 votes

Possible Duplicate:
When is a python object's hash computed and why is the hash of -1 different?

Why do -1 and -2 both hash to the same number if Python?

Since they do, how does Python tell these two numbers apart?

>>> -1 is -2
False
>>> hash(-1) is hash(-2)
True
>>> hash(-1)
-2
>>> hash(-2)
-2

Update - i think this now a wiki, please feel free to add more info.

-1 is a reserved value at the C level (of CPython - see comments from DSM about this being "as expected" in ironpython and pypy) .

See this Quora answer:

If you write a type in a C extension module and provide a tp_hash method, you have to avoid -1 -- if you return -1, Python will assume you meant to throw an error.

If you write a class in pure Python and provide a __hash__ method, there's no such requirement, thankfully. But that's because the C code that invokes your __hash__ method does that for you -- if your __hash__ returns -1, then hash() applied to your object will actually return -2.

which really just repackages info from the effbot:

The hash value -1 is reserved (it’s used to flag errors in the C implementation). If the hash algorithm generates this value, we simply use -2 instead.

Also, from agf in the comments (where people are asking for more info), the source.


Since they do, how does Python tell these two numbers apart?

Since all hash functions map a large input space to a smaller input space, collisions are always expected, no matter how good the hash function is. Think of hashing strings, for example. If hash codes are 32-bit integers, you have 2^32 (a little more than 4 billion) hash codes. If you consider all ASCII strings of length 6, you have (2^7)^6 (just under 4.4 trillion) different items in your input space. With only this set, you are guaranteed to have many, many collisions no matter how good you are. Add Unicode characters and strings of unlimited length to that!

Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, the hash code gives you "bucket" number in which to search for the value. However, all set items with the same hash code are in the bucket. For this, you also need an equality test to distinguish between all candidates in the bucket.

This hash code and equality duality is hinted at in the CPython documentation on hashable objects. In other languages/frameworks, there is a guideline/rule that if you provide a custom hash code function, you must also provide a custom equality test (performed on the same fields as the hash code function).


Indeed, the Python release today address exactly this, with a security patch that addresses the efficiency issue when this (identical hash values, but on a massive scale) is used as a denial of service attack - http://mail.python.org/pipermail/python-list/2012-April/1290792.html

String immutability in CPython violated

14 votes

This is more of an 'interesting' phenomena I encountered in a Python module that I'm trying to understand, rather than a request for help (though a solution would also be useful).

>>> import fuzzy
>>> s = fuzzy.Soundex(4)
>>> a = "apple"
>>> b = a
>>> sdx_a = s(a)
>>> sdx_a
'A140'
>>> a
'APPLE'
>>> b
'APPLE'

Yeah, so the fuzzy module totally violates the immutability of strings in Python. Is it able to do this because it is a C-extension? And does this constitute an error in CPython as well as the module, or even a security risk?

Also, can anyone think of a way to get around this behaviour? I would like to be able to keep the original capitalisation of the string.

Cheers,

Alex

This bug was resolved back in February; update your version.

To answer your question, yes, there are several ways to modify immutable types at the C level. The security implications are unknown, and possibly even unknowable, at this point.

Python: Why do int.numerator and int.denominator exist?

13 votes

int.numerator and int.denominator are a mystery to me.

help(int.numerator) states:

the numerator of a rational number in lowest terms

But as far as I know, int is not a rational number. So why do these properties exist?

See http://docs.python.org/library/numbers.html - int (numbers.Integral) is a subtype of numbers.Rational.

>>> import numbers
>>> isinstance(1337, numbers.Integral)
True
>>> isinstance(1337, numbers.Rational)
True
>>> issubclass(numbers.Integral, numbers.Rational)
True

The denominator of an int is always 1 while its numerator is the value itself.

In PEP 3141 you find details about the implementation of the various number types, e.g. proving the previous statement:

@property
def numerator(self):
    """Integers are their own numerators."""
    return +self

@property
def denominator(self):
    """Integers have a denominator of 1."""
    return 1

Parsing large (20GB) text file with python - reading in 2 lines as 1

13 votes

I'm parsing a 20Gb file and outputting lines that meet a certain condition to another file, however occasionally python will read in 2 lines at once and concatenate them.

inputFileHandle = open(inputFileName, 'r')

row = 0

for line in inputFileHandle:
    row =  row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

I've checked the line endings in the source file and they check out as line feeds (ascii char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some python limitation here? The position in the file of the first anomaly is around the 4GB mark.

Quick google search for "python reading files larger than 4gb" yielded many many results. See here for such an example and another one which takes over from the first.

It's a bug in Python.

Now, the explanation of the bug; it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread(). In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668 The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32bit DWORD. [The fix is easy; do you see it?] At this point, the function thinks that the next read() will return the LF, but it won't because the file pointer was not moved back.

And the work-around:

But note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().

Python good programming practice for enumerating lists

12 votes

I'm pretty new to Python and programming in general, and I was wondering if it is a good programming practice to write long statements with many logic operators - for example, in a for loop.

For example, here's a function I made that gets all the vowels from a word and returns a list containing those vowels.

def getVowels(word):
    vowel_list = []
    index = 0
    for i in word:
        if i == "a" or i == "e" or i == "i" or i == "o" or i == "u" or i == "A" or i == "E" or i == "I" or i == "O" or i == "U":
            vowel_list.append(word[index])
        index += 1
    return vowel_list

As you can see, the if statement has gotten very long. Is it considered good programming? If it's not, is there a better way to code this function?

No it is not considered good practice, there are always better ways :D

if i.upper() in "AEIOU"

Here is a much shorter version of your function using list comprehensions:

def get_vowels(word):
    return [c for c in word if c.upper() in 'AEIOU']

Rolling out a web authentication system

12 votes

I'm building a site that requires user authentication and authorization. My initial idea was to write the application using the Flask framework. However, I learned that Flask doesn't have a built in authentication system. It does have an extension flask-login, but I haven't gotten any confirmation as to whether this extension is well-written. I've read that it's very easy to get authentication wrong.

  1. Might it be a good idea to switch to something like Django or web2py, which do have built in authentication systems?

  2. Do good web programmers typically choose to use tried and true authentication systems as opposed to trying to roll out their own?

  3. Is there a good guide that demonstrates best practices with respect building an authentication/authorization system?

Thanks guys!

If you want to add user login capability to your website (that is, authentication and authorization) use an existing framework. Unless you're an expert in security, the probability of writing secure a system is close to 0. This applies regardless of the language you program in and regardless of the framework you are using.

Unfortunately, I have seen people answer this question as follows: "This is not that hard, and fun to code, as a beginner. You need a place to store your data (let's say a mysql database). You should store ENCRYPTED versions of the passwords, etc. etc." This answer is COMPLETELY INCORRECT.

There are free authentication frameworks available, USE THEM. For example, Django and web2py have built-in authentication systems. Many frameworks do not come with built-in authentication (e.g. Flask, webpy); this does not mean you should roll out your own framework. It just means you should find 3rd party authentication software. For Flask, there is Flask-login.

So, the answer to my original questions are:

1.) Yes, or integrate a verified 3rd-party software system.

2.) Yes, the best practices says DO NOT WRITE YOUR OWN AUTHENTICATION SYSTEM. Use one that's already been built.

3.) You shouldn't need a guide to building an authentication system, because you shouldn't be building one. However, OWASP is a great security resource.

Here's a brief summary: Do not write your own authentication system. Use a pre-built authentication system that is trusted. If anyone has any recommendations for authentication systems that can be used with Python, feel free to post them.

How do you count cardinality of very large datasets efficiently in Python?

12 votes

I have been playing at work with some very very large sets of data, typically several billions of elements, that are all maintained in a memcached cloud and periodically dumped into files, and for one of my tasks I'm trying to count the cardinality of this set.

For some context, each item contains an IP and some other attributes identifying a person and is encoded in base64, item size is 20 bytes. Reducing the size of an item by removing some fields is not a possibility.

Here is something that emulates my dataset as an in-memory version (thanks to this post for string generation):

import base64, os

dataset_size = 10000000000 # that's 10 billion, be careful if you run it !
big_dataset = [base64.b64encode(os.urandom(10)) for i in range(dataset_size)]

My first approach was to use a hashset like this:

uniques = set(big_dataset)
print "Cardinality: %d" % len(uniques)

While this in theory works fine on a small dataset, as you can guess there is a hiccup:

  • I can't make any assumption on the uniqueness of my data. I could have 50% of my dataset that is unique, or I could have 100% just as well. This is generated dynamically at regular time intervals and varies depending on a lot of factors (time of day for example)
  • Dataset size in 10 billion. Each item encoded in base 64 is 20 bytes, times 10 billion is a few hundredids gigabytes in average. Unfortunately, I don't have access to a machine with that much RAM !

I've done my homework and found at best some research papers, or some obscure libraries, but part of the goal of this is to understand what approach works and why.

So I'm calling to you Python users, do you know of any algorithm that would help me estimate cardinality efficiently? By complexity I mean I don't care that much about running time complexity, but I'm more focused about space complexity. I don't mind sacrificing a bit of accuracy if it boosts performance tremendously (so I don't necessarily need to know the exact number of uniques, even if that would be ideal, but probably not a viable approach). I would say up to 5% would be acceptable. I'm looking for something specifically in Python for this project.

Thanks for any help you can provide !

As some people noted, I could use Hadoop/MR, but for this specific projects we don't want to go the MR way, and would like to explore algorithms to do this on a single machine efficiently, as this could be applied to a few other different projects.

I would recommend the usage of Hash Sketches, namely (Super)Log Log sketches or Hyper Log Sketches.

You can check and perhaps use and improve the simple python implementation that I made: https://github.com/goncalvesnelson/Log-Log-Sketch

12 votes

I got one puzzle and i want to solve it using Python.

Puzzle:

A merchant has a 40 kg weight which he used in his shop. Once, it fell from his hands and was broken into 4 pieces. But surprisingly, now he can weigh any weight between 1 kg to 40 kg with the combination of these 4 pieces.

So question is, what are weights of those 4 pieces?

Now i wanted to solve this in Python.

The only constraint i got from the puzzle is that sum of 4 pieces is 40. With that i could filter all the set of 4 values whose sum is 40.

import itertools as it

weight = 40
full = range(1,41)
comb = [x for x in it.combinations(full,4) if sum(x)==40]

length of comb = 297

Now i need to check each set of values in comb and try all the combination of operations.

Eg if (a,b,c,d) is the first set of values in comb, i need to check a,b,c,d,a+b,a-b, .................a+b+c-d,a-b+c+d........ and so on.

I tried a lot, but i am stuck at this stage, ie how to check all these combination of calculations to each set of 4 values.

Question :

1) I think i need to get a list all possible combination of [a,b,c,d] and [+,-].

2) does anyone have a better idea and tell me how to go forward from here?

Also, i want to do it completely without help of any external libraries, need to use only standard libraries of python.

EDIT : Sorry for a late info. Its answer is (1,3,9,27), which i found a few years back. I have checked and verified the answer.

EDIT : At present, fraxel's answer work perfect with time = 0.16 ms. A more better and faster approach is always welcome.

Regards

ARK

Earlier walk-through anwswer:

We know a*A + b*B + c*C + d*D = x for all x between 0 and 40, and a, b, c, d are confined to -1, 0, 1. Clearly A + B + C + D = 40. The next case is x = 39, so clearly the smallest move is to remove an element (it is the only possible move that could result in successfully balancing against 39):

A + B + C = 39, so D = 1, by neccessity.

next:

A + B + C - D = 38

next:

A + B + D = 37, so C = 3

then:

A + B = 36

then:

A + B - D = 35

A + B - C + D = 34

A + B - C = 33

A + B - C - D = 32

A + C + D = 31, so A = 9

Therefore B = 27

So the weights are 1, 3, 9, 27

Really this can be deduced immediately from the fact that they must all be multiples of 3.

Interesting Update:

So here is some python code to find a minimum set of weights for any dropped weight that will span the space:

def find_weights(W):
    weights = []
    i = 0
    while sum(weights) < W:
        weights.append(3 ** i)
        i += 1
    weights.pop()
    weights.append(W - sum(weights))
    return weights

print find_weights(40)
#output:
[1, 3, 9, 27]

To further illustrate this explaination, one can consider the problem as the minimum number of weights to span the number space [0, 40]. It is evident that the number of things you can do with each weight is trinary /ternary (add weight, remove weight, put weight on other side). So if we write our (unknown) weights (A, B, C, D) in descending order, our moves can be summarised as:

    ABCD:   Ternary:
40: ++++     0000
39: +++0     0001
38: +++-     0002
37: ++0+     0010
36: ++00     0011
35: ++0-     0012
34: ++-+     0020
33: ++-0     0021
32: ++--     0022
31: +0++     0100
etc.

I have put ternary counting from 0 to 9 alongside, to illustrate that we are effectively in a trinary number system (base 3). Our solution can always be written as:

3**0 + 3**1 +3**2 +...+ 3**N >= Weight

For the minimum N that this holds true. The minimum solution will ALWAYS be of this form.

Furthermore, we can easily solve the problem for large weights and find the minimum number of pieces to span the space:

A man drops a known weight W, it breaks into pieces. His new weights allow him to weigh any weight up to W. How many weights are there, and what are they?

#what if the dropped weight was a million Kg:
print find_weights(1000000)
#output:
[1, 3, 9, 27, 81, 243, 729, 2187, 6561, 19683, 59049, 177147, 531441, 202839]

Try using permutations for a large weight and unknown number of pieces!!

How does Python distinguish callback function which is a member of a class?

11 votes

Please look at the simple example:

class A:
  def __init__(self, flag):
    self.flag = flag

  def func(self):
    print self.flag

a = A(1)
b = A(2)
callback_a = a.func
callback_b = b.func

callback_a()
callback_b()

The result is:

1
2

It runs as expected. But I have a question. In C, the callback function is passed as a pointer. In Python, it should have a similar way to do this, so the caller knows the address of the function. But in my example, not only the function pointer is passed, but also the parameter (self) is passed, because the same method of the same class prints different results. So my questions are:

  1. Does such a method in Python only has one copy in memory? My meaning is that the code of any method only has one copy, and in my example the method won't be cloned itself. I think it should have only one copy, but here I still make this question in order to get more inputs.

  2. I remember everything in Python is an object. So in my example, are there two function instances with different parameters but only one copy of code?

Specific to CPython, there is only one copy of the function object. During instance creation, the class wraps the unbound functions in its namespace as bound methods. But they all wrap the same function.

Here's your example expanded to show what's going on.

class A(object):
  def __init__(self, flag):
    self.flag = flag

  def func(self):
    print self.flag

a = A(1)
b = A(2)

callback_a = a.func
callback_b = b.func

print "typeof(callback_a) = {0}".format(type(callback_a))
print "typeof(callback_b) = {0}".format(type(callback_b))

print "typeof(callback_a.__func__) = {0}".format(type(callback_a.__func__))
print "typeof(callback_b.__func__) = {0}".format(type(callback_b.__func__))

print "'callback_a.__func__ is callback_b.__func__'  is {0}".format(callback_a.__func__ is callback_b.__func__)

callback_a()
callback_b()

This code outputs

typeof(callback_a) = <type 'instancemethod'>
typeof(callback_b) = <type 'instancemethod'>
typeof(callback_a.__func__) = <type 'function'>
typeof(callback_b.__func__) = <type 'function'>
'callback_a.__func__ is callback_b.__func__'  is True

You can clearly see, using the is operator, that both instancemethod classes are sharing the same function object.

When should I implement __call__

11 votes

In python you can make instances callable by implementing the __call__ method. For example

class Blah:
    def __call__(self):
        print "hello"

obj = Blah()
obj()

But I can also implement a method of my own, say 'run':

class Blah:
    def run(self):
        print "hello"

obj = Blah()
obj.run()

When should I implement __call__?

This is hard to answer. My opinion is that you should never define __call__ unless your actual goal is to create a function. It's not something you would do after you've already created a traditional object.

In other words, if you're starting out thinking "I'm going to create an object" you should never end up implementing __call__. If, on the other hand, you start out thinking "I'm going to create a function... but I wish I could use the object framework to let my function have some state" then you could create the function as an object with state, and define __call__ to make it act like a function.

I want to include one final bit of advice which was provided by @Paul Manta in a comment to the question, where he wrote:

If you're unsure if you need __call__, then you don't need __call__

I think that's sound advice.

Impossible lookbehind with a backreference

10 votes

From my understanding,

(.)(?<!\1)

should never match. Actually, php's preg_replace even refuses to compile this and so does ruby's gsub. The python re module seems to have a different opinion though:

import re
test = 'xAAAAAyBBBBz'
print (re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))

Result:

(x)AAAA(A)(y)BBB(B)(z)

Can anyone provide a reasonable explanation for this behavior?

Update

This behavior appears to be a limitation in the re module. The alternative regex module seems to handle groups in assertions correctly:

import regex

test = 'xAAAAAyBBBBz'

print (regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz

print (regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)

Note that unlike pcre, regex also allows variable-width lookbehinds:

print (regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)

Eventually, regex is going to be included in the standard library, as mentioned in PEP 411.

This does look like a limitation (nice way of saying "bug", as I learned from a support call with Microsoft) in the Python re module.

I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but it's not clever enough to figure out that \1 will always be fixed-length. Why it doesn't complain about this when compiling the regex, I can't say.

Funnily enough:

>>> print (re.sub(r'.(?<!\0)', r'(\g<0>)', test))
(x)(A)(A)(A)(A)(A)(y)(B)(B)(B)(B)(z)
>>>
>>> re.compile(r'(.*)(?<!\1)') # This should trigger an error but doesn't!
<_sre.SRE_Pattern object at 0x00000000026A89C0>

So better don't use backreferences in lookbehind assertions in Python. Positive lookbehind isn't much better (it also matches here as if it was a positive lookahead):

>>> print (re.sub(r'(.)(?<=\1)', r'(\g<0>)', test))
x(A)(A)(A)(A)Ay(B)(B)(B)Bz

And I can't even guess what's going on here:

>>> print (re.sub(r'(.+)(?<=\1)', r'(\g<0>)', test))
x(AA)(A)(A)Ay(BB)(B)Bz

Training Naive Bayes Classifier on ngrams

6 votes

I've been using the Ruby Classifier library to classify privacy policies. I've come to the conclusion that the simple bag-of-words approach built into this library is not enough. To increase my classification accuracy, I want to train the classifier on n-grams in addition to individual words.

I was wondering whether there's a library out there for preprocessing documents to get relevant n-grams (and properly deal with punctuation). One thought was that I could preprocess the documents and feed pseudo-ngrams into the Ruby Classifier like:

wordone_wordtwo_wordthree

Or maybe there's a better way to be doing this, such as a library that has ngram based Naive Bayes Classification built into it from the getgo. I'm open to using languages other than Ruby here if they get the job done (Python seems like a good candidate if need be).

If you're ok with python, I'd say nltk would be perfect for you.

For example:

>>> import nltk
>>> s = "This is some sample data.  Nltk will use the words in this string to make ngrams.  I hope that this is useful.".split()
>>> model = nltk.NgramModel(2, s)
>>> model._ngrams
set([('to', 'make'), ('sample', 'data.'), ('the', 'words'), ('will', 'use'), ('some', 'sample'), ('', 'This'), ('use', 'the'), ('make', 'ngrams.'), ('ngrams.', 'I'), ('hope', 'that'
), ('is', 'some'), ('is', 'useful.'), ('I', 'hope'), ('this', 'string'), ('Nltk', 'will'), ('words', 'in'), ('this', 'is'), ('data.', 'Nltk'), ('that', 'this'), ('string', 'to'), ('
in', 'this'), ('This', 'is')])

You even have a method nltk.NaiveBayesClassifier