Best string questions in May 2012

str performance in python

68 votes

While profiling a piece of python code (python 2.6 up to 3.2), I discovered that the str method to convert an object (in my case an integer) to a string is almost an order of magnitude slower than using string formatting.

Here is the benchmark

>>> from timeit import Timer
>>> Timer('str(100000)').timeit()
0.3145311339386332
>>> Timer('"%s"%100000').timeit()
0.03803517023435887

Does anyone know why this is the case? Am I missing something?

'%s' % 100000 is evaluated by the compiler and is equivalent to a constant at run-time.

>>> import dis
>>> dis.dis(lambda: str(100000))
  8           0 LOAD_GLOBAL              0 (str)
              3 LOAD_CONST               1 (100000)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE        
>>> dis.dis(lambda: '%s' % 100000)
  9           0 LOAD_CONST               3 ('100000')
              3 RETURN_VALUE        

% with a run-time expression is not (significantly) faster than str:

>>> Timer('str(x)', 'x=100').timeit()
0.25641703605651855
>>> Timer('"%s" % x', 'x=100').timeit()
0.2169809341430664

Do note that str is still slightly slower, as @DietrichEpp said, this is because str involves lookup and function call operations, while % compiles to a single immediate bytecode:

>>> dis.dis(lambda x: str(x))
  9           0 LOAD_GLOBAL              0 (str)
              3 LOAD_FAST                0 (x)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE        
>>> dis.dis(lambda x: '%s' % x)
 10           0 LOAD_CONST               1 ('%s')
              3 LOAD_FAST                0 (x)
              6 BINARY_MODULO       
              7 RETURN_VALUE        

Of course the above is true for the system I tested on (CPython 2.7); other implementations may differ.

Why is this valid C# code?

54 votes

This is valid C# code

var bob = "abc" + null + null + null + "123";  // abc123

This is not valid C# code

var wtf = null.ToString(); // compiler error

Why is the first statement valid?

The reason for first one working:

From MSDN:

In string concatenation operations,the C# compiler treats a null string the same as an empty string, but it does not convert the value of the original null string.

More information on the + binary operator:

The binary + operator performs string concatenation when one or both operands are of type string.

If an operand of string concatenation is null, an empty string is substituted. Otherwise, any non-string argument is converted to its string representation by invoking the virtual ToString method inherited from type object.

If ToString returns null, an empty string is substituted.

The reason of the error in second is:

null (C# Reference) - The null keyword is a literal that represents a null reference, one that does not refer to any object. null is the default value of reference-type variables.

String.Format - how it works and how to implement custom formatstrings

45 votes

With String.Format() it is possible to format for example DateTime objects in many different ways. Every time I am looking for a desired format I need to search around on Internet. Almost always I find an example I can use. For example:

String.Format("{0:MM/dd/yyyy}", DateTime.Now);          // "09/05/2012"

But I don't have any clue how it works and which classes support these 'magic' additional strings.

So my questions are:

  1. How does String.Format map the additional information MM/dd/yyyy to a string result?
  2. Do all Microsoft objects support this feature?
    Is this documented somewhere?
  3. Is it possible to do something like this:
    String.Format("{0:MyCustomFormat}", new MyOwnClass())

String.Format matches each of the tokens inside the string ({0} etc) against the corresponding object: http://msdn.microsoft.com/en-us/library/system.string.format.aspx

A format string is optionally provided:

{ index[,alignment][ : formatString] }

If formatString is provided, the corresponding object must implement IFormattable and specifically the ToString method that accepts formatString and returns the corresponding formatted string: http://msdn.microsoft.com/en-us/library/system.iformattable.tostring.aspx

An IFormatProvider may also used which can be used to capture basic formatting standards/defaults etc. Examples here and here.

So the answers to your questions in order:

  1. It uses the IFormattable interface's ToString() method on the DateTime object and passes that the MM/dd/yyyy format string. It is that implementation which returns the correct string.

  2. Any object that implement IFormattable supports this feature. You can even write your own!

  3. Yes, see above.

Why isn't string.Normalize consistent depending on the context?

16 votes

I have the following code:

string input = "ç";
string normalized = input.Normalize(NormalizationForm.FormD);
char[] chars = normalized.ToCharArray();

I build this code with Visual studio 2010, .net4, on a 64 bits windows 7.

I run it in a unit tests project (platform: Any CPU) in two contexts and check the content of chars:

  • Visual Studio unit tests : chars contains { 231 }.
  • ReSharper : chars contains { 231 }.
  • NCrunch : chars contains { 99, 807 }.

In the msdn documentation, I could not find any information presenting different behaviors.

So, why do I get different behaviors? For me the NCrunch behavior is the expected one, but I would expect the same for others.

Edit: I switched back to .Net 3.5 and still have the same issue.

In String.Normalize(NormalizationForm) documentation it says that

binary representation is in the normalization form specified by the normalizationForm parameter.

which means you'd be using FormD normalization on both cases, so CurrentCulture and such should not really matter.

The only thing that could change, then, what I can think of is the "ç" character. That character is interpreted as per character encoding that is either assumed or configured for Visual Studio source code files. In short, I think NCrunch is assuming different source file encoding than the others.

Based on quick searching on NCrunch forum, there was a mention of some UTF-8 -> UTF-16 conversion, so I would check that.

Secure string compare function

15 votes

I just came across this code in the HTTP Auth library of the Zend Framework. It seems to be using a special string compare function to make it more secure. However, I don't quite understand the comments. Could anybody explain why this function is more secure than doing $a == $b?

/**
 * Securely compare two strings for equality while avoided C level memcmp()
 * optimisations capable of leaking timing information useful to an attacker
 * attempting to iteratively guess the unknown string (e.g. password) being
 * compared against.
 *
 * @param string $a
 * @param string $b
 * @return bool
 */
protected function _secureStringCompare($a, $b)
{
    if (strlen($a) !== strlen($b)) {
        return false;
    }
    $result = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        $result |= ord($a[$i]) ^ ord($b[$i]);
    }
    return $result == 0;
}

It looks like they're trying to prevent timing attacks.

In cryptography, a timing attack is a side channel attack in which the attacker attempts to compromise a cryptosystem by analyzing the time taken to execute cryptographic algorithms. Every logical operation in a computer takes time to execute, and the time can differ based on the input; with precise measurements of the time for each operation, an attacker can work backwards to the input.

Basically, if it takes a different amount of time to compare a correct password and an incorrect password, then you can use the timing to figure out how many characters of the password you've guessed correctly.

Consider an extremely flawed string comparison (this is basically the normal string equality function, with an obvious wait added):

function compare(a, b) {
    if(len(a) !== len(b)) { 
        return false;
    }
    for(i = 0; i < len(a); ++i) {
        if(a[i] !== b[i]) {
            return false;
        }
        wait(10); // wait 10 ms
    }
    return true;
}

Say you give a password and it (consistently) takes some amount of time for one password, and about 10 ms longer for another. What does this tell you? It means the second password has one more character correct than the first one.

This lets you do movie hacking -- where you guess a password one character at a time (which is much easier than guessing every single possible password).

In the real world, there's other factors involved, so you have to try a password many, many times to handle the randomness of the real world, but you can still try every one character password until one is obviously taking longer, then start on two character password, and so on.

This function still has a minor problem here:

if(strlen($a) !== strlen($b)) { 
    return false;
}

It lets you use timing attacks to figure out the correct length of the password, which lets you not bother guessing any shorter or longer passwords. In general, you want to hash your passwords first (which will create equal-length strings), so I'm guessing they didn't consider it to be a problem.

Converting a string representation of a list into an actual list object

11 votes

I have a string that looks identical to a list, let's say:

fruits = "['apple', 'orange', 'banana']"

What would be the way to convert that to a list object?

>>> fruits = "['apple', 'orange', 'banana']"
>>> import ast
>>> fruits = ast.literal_eval(fruits)
>>> fruits
['apple', 'orange', 'banana']
>>> fruits[1]
'orange'

As pointed out in the comments ast.literal_eval is safe. From the docs:

Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.

This can be used for safely evaluating strings containing Python expressions from untrusted sources without the need to parse the values oneself.

Unicode, regular expressions and PyPy

6 votes

I wrote a program to add (limited) unicode support to Python regexes, and while it's working fine on CPython 2.5.2 it's not working on PyPy (1.5.0-alpha0 1.8.0, implementing Python 2.7.1 2.7.2), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my uses of u" and ur". The full source is here, and the relevant bits are:

# -*- coding:utf-8 -*-
import re

# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
    ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
    ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
    ...
    ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
    ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
    ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}

def hack_regexp(regexp_string):
    for (k,v) in unicode_categories.items():
        regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
    return regexp_string

def regex(regexp_string,flags=0):
    """Shortcut for re.compile that also translates and add the UNICODE flag

    Example usage:
        >>> from unicode_hack import regex
        >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
        >>> print result.group(0)
        áÇñ
        >>> 
    """
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(on PyPy there is no match in the "Example usage", so result is None)

Reiterating, the program works fine (on CPython): the Unicode data seems correct, the replace works as intended, the usage example runs ok (both via doctest and directly typing it in the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.

Any ideas of what PyPy does "different" that is breaking my code? Many things came to my head (unrecognized coding header, different encodings in the command line, different interpretations of r and u) but as far as my tests go, both CPython and PyPy seems to behave identically, so I'm clueless about what to try next.

Seems PyPy has some encoding problems, both when reading a source file (unrecognized coding header, maybe) and when inputting/outputting in the command line. I replaced my example code with the following:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0) == u'áÇñ'
True
>>>

And it kept working on CPython and failing on PyPy. Replacing the "áÇñ" for its escaped characters - u'\xe1\xc7\xf1' - OTOH did the trick:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'\xe1\xc7\xf1123')
>>> print result.group(0) == u'\xe1\xc7\xf1'
True
>>>

That worked fine on both. I believe the problem is restricted to these two scenarios (source loading and command line), since trying to open an UTF-8 file using codecs.open works fine. When I try to input the string "áÇñ" in the command line, or when I load the source code of "unicode_hack.py" using codecs, I get the same result on CPython:

>>> u'áÇñ'
u'\xe1\xc7\xf1'
>>> import codecs
>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

but different results on PyPy:

>>>> u'áÇñ'
u'\xa0\u20ac\xa4'
>>>> import codecs
>>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

Update: Issue1139 submitted on PyPy bug tracking system, let's see how that turns out...