Best python questions in June 2012

Is there any pythonic way to combine two dicts?

28 votes

For example I have two dicts:

Dict A: {'a':1, 'b':2, 'c':3}
Dict B: {'b':3, 'c':4, 'd':5}

I need a pythonic way of 'combining' two dicts such that the result is :

{'a':1, 'b':5, 'c':7, 'd':5}

That is to say: if a key appears in both dicts, add their values; if it appears in only one dict, keep its value.

Thanks in advance.

Use collections.Counter:

>>> from collections import Counter
>>> A = Counter({'a':1, 'b':2, 'c':3})
>>> B = Counter({'b':3, 'c':4, 'd':5})
>>> A + B
Counter({'c': 7, 'b': 5, 'd': 5, 'a': 1})

Counters are basically a subclass of dict, so you can still do everything else with them you'd normally do with that type, such as iterate over their keys and values.
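One caveat with this approach: Counter addition keeps only keys whose summed count is positive, so a key whose values add up to zero or less disappears from the result. If that matters, a plain-dict merge works too. The sketch below assumes numeric values; merge_sum is a hypothetical helper name, not something from the standard library.

```python
def merge_sum(a, b):
    """Return a new dict whose values are the per-key sums of a and b."""
    merged = dict(a)  # copy so neither input is modified
    for key, value in b.items():
        merged[key] = merged.get(key, 0) + value
    return merged


A = {'a': 1, 'b': 2, 'c': 3}
B = {'b': 3, 'c': 4, 'd': 5}
combined = merge_sum(A, B)

# Unlike Counter addition, a zero total is kept.
cancelled = merge_sum({'x': 1}, {'x': -1})
```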

Determining the number of parameters in a lambda

25 votes

I am wondering if there is a way to determine (given a variable containing a lambda) the number of parameters that lambda takes. The reason is that I wish to call a function conditionally, depending on the number of parameters.

What I'm looking for

def magic_lambda_parameter_counting_function(lambda_function):
    """Returns the number of parameters in lambda_function

    Args:
        lambda_function - A lambda of unknown number of parameters
    """

So I can do something like

def my_method(lambda_function):

    # ... 
    # (say I have variables i and element)

    parameter_count = magic_lambda_parameter_counting_function(lambda_function)

    if parameter_count == 1:
        lambda_function(i)
    elif parameter_count == 2:
        lambda_function(i, element)

I'm skipping the part about how to count the arguments, because I don't know how you want to consider varargs and keywords. But this should get you started.

>>> import inspect
>>> foo = lambda x, y, z: x + y + z
>>> inspect.getargspec(foo)
ArgSpec(args=['x', 'y', 'z'], varargs=None, keywords=None, defaults=None)
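For readers on Python 3: inspect.getargspec was removed in Python 3.11, and inspect.signature is the usual replacement. A minimal sketch follows, assuming you only want to count parameters that can be passed positionally and want to ignore *args and **kwargs; count_positional_params is a hypothetical name.

```python
import inspect


def count_positional_params(func):
    """Count parameters that can be passed positionally.

    *args, **kwargs and keyword-only parameters are not counted.
    """
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    params = inspect.signature(func).parameters.values()
    return sum(1 for p in params if p.kind in positional_kinds)


one = count_positional_params(lambda i: i)
two = count_positional_params(lambda i, element: (i, element))
```

How defaults and varargs should count is a design decision, as the answer above says, so adjust the filter to taste.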

Python: What's the difference between __builtin__ and __builtins__?

22 votes

I was coding today and noticed something. If I open a new interpreter session (IDLE) and check what's defined with the dir function, I get this:

$ python
>>> dir()
['__builtins__', '__doc__', '__name__', '__package__']
>>> dir(__builtins__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError', 'BytesWarning', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError', 'None', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'ReferenceError', 'RuntimeError', 'RuntimeWarning', 'StandardError', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '_', '__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs', 'all', 'any', 'apply', 'basestring', 'bin', 'bool', 'buffer', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'cmp', 'coerce', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'execfile', 'exit', 'file', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'intern', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'long', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'raw_input', 'reduce', 'reload', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip']
>>> import __builtin__
>>> dir(__builtin__)
['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BufferError', 'BytesWarning', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'NameError', 'None', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'ReferenceError', 'RuntimeError', 'RuntimeWarning', 'StandardError', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'ZeroDivisionError', '_', '__debug__', '__doc__', '__import__', '__name__', '__package__', 'abs', 'all', 'any', 'apply', 'basestring', 'bin', 'bool', 'buffer', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'cmp', 'coerce', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'execfile', 'exit', 'file', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'intern', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'long', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'raw_input', 'reduce', 'reload', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip']
>>> dir(__builtin__) == dir(__builtins__) # They seem to have the same things
True

Please note the last line.

So, my question is:

  • Is either one an alias of the other?

  • Are the Python guys planning to get rid of one of those?

  • What should I use for my own programs?

  • What about Python 3?

  • Any information is valuable!

Thank you!

Important:

I'm using Python 2.7.2+ on Ubuntu.

Straight from the python documentation: http://docs.python.org/reference/executionmodel.html

By default, when in the __main__ module, __builtins__ is the built-in module __builtin__ (note: no 's'); when in any other module, __builtins__ is an alias for the dictionary of the __builtin__ module itself.

__builtins__ can be set to a user-created dictionary to create a weak form of restricted execution.

CPython implementation detail: Users should not touch __builtins__; it is strictly an implementation detail. Users wanting to override values in the builtins namespace should import the __builtin__ (no 's') module and modify its attributes appropriately. The namespace for a module is automatically created the first time a module is imported.
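On the Python 3 part of the question: the module was renamed to builtins there, while the __builtins__ name keeps the same dual behaviour in CPython. A small sketch, assuming Python 3; builtins_namespace is a hypothetical helper that normalises the two possible forms.

```python
import builtins


def builtins_namespace(value):
    """Return the builtins dict whether value is the module or its dict."""
    return value if isinstance(value, dict) else vars(value)


# __builtins__ may be the module (in __main__) or its dict (elsewhere),
# so normalise it before comparing.
namespace = builtins_namespace(__builtins__)
same_namespace = namespace is vars(builtins)
```

In your own programs, importing builtins (or __builtin__ on Python 2) is the supported route, as the quoted documentation says.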

Is there any nicer way to write successive "or" statements in Python?

20 votes

Simple question to which I can't find any "nice" answer by myself:

Let's say I have the following condition:

if 'foo' in mystring or 'bar' in mystring or 'hello' in mystring:
    # Do something
    pass

where the number of or clauses can grow much longer depending on the situation.

Is there a "nicer" (more Pythonic) way of writing this, without sacrificing performance ?

I thought of using any(), but it takes a list of boolean-like elements, so I would have to build that list first (giving up short-circuit evaluation in the process), so I guess it's less efficient.

Thank you very much.

A way could be

if any(s in mystring for s in ('foo', 'bar', 'hello')):
    pass

The thing you iterate over is a tuple, which is built upon compilation of the function, so it shouldn't be inferior to your original version.

If you fear that the tuple will become too long, you could do

def mystringlist():
    yield 'foo'
    yield 'bar'
    yield 'hello'
if any(s in mystring for s in mystringlist()):
    pass
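To confirm that the generator-expression form keeps short-circuit evaluation, here is a small sketch that records which substrings actually get tested. checked_needles and probe are hypothetical names used only for this demonstration.

```python
def checked_needles(text, needles):
    """Return (found, list of needles that were actually tested)."""
    checked = []

    def probe(needle):
        checked.append(needle)  # record every membership test
        return needle in text

    # any() stops pulling from the generator at the first True.
    found = any(probe(n) for n in needles)
    return found, checked


found, checked = checked_needles("say hello", ("foo", "hello", "bar"))
```

Here 'bar' is never tested, because any() stops as soon as 'hello' matches.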

Find longest repetitive sequence in a string

19 votes

I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

then I would like the value "helloworld" to be returned.

I know of a few ways of accomplishing this but the problem I'm facing is that the actual string is absurdly large so I'm really looking for a method that can do it in a timely fashion.

This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.
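If no suffix-tree library is at hand, a much simpler (and much slower) alternative is to sort all suffixes and compare neighbours: a substring occurs at least three times exactly when it is a common prefix of three adjacent sorted suffixes. This is a sketch of that swapped-in technique, not the O(n) suffix-tree algorithm. It builds every suffix in memory, so it is only suitable for small inputs, and it counts overlapping occurrences. The function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_repeated(text, min_count=3):
    """Longest substring occurring at least min_count times (overlaps allowed)."""
    # Sorting puts suffixes that share a prefix next to each other.
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    # Every suffix in a window of min_count adjacent suffixes starts with
    # the common prefix of the window's first and last member.
    for first, last in zip(suffixes, suffixes[min_count - 1:]):
        n = common_prefix_len(first, last)
        if n > len(best):
            best = first[:n]
    return best


sample = ("fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1a"
          "helloworldgafgfa4564534321fadghelloworld")
result = longest_repeated(sample)  # "helloworld"
```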

Hope this helps!

What are the rules regarding chaining of "==" and "!=" in Python

16 votes

This morning, I found myself writing something like:

if (a == b == c):
  # do something

And was surprised that it gave me the expected result.

I thought it would behave as:

if ((a == b) == c):
  # do something

But it obviously didn't. It seems Python is treating the first statement differently from the second, which is nice but I couldn't find any documentation or explanation regarding this.

I tested and got this:

In [1]: 2 == 2 == 2
Out[1]: True

In [2]: (2 == 2) == 2
Out[2]: False

Would someone care to explain the rules regarding such "chaining" of == (or !=)?

Thank you very much.

This works with all comparison operators - eg, you can also do:

>>> 4 < 5 < 6
True
>>> 4 < 5 != 2
True

In general, according to the documentation, a op1 b op2 c where op1 and op2 are any of: <, >, !=, ==, <=, >=, is, is not, in or not in will give the same result as:

a op1 b and b op2 c

The docs also say that this can work with arbitrarily many comparisons, so:

>>> 5 != '5' != 'five' != (3+2)
True

This can be a slightly confusing result, since it seems to say 5 != (3+2). Each operand is only compared with the ones immediately adjacent to it, rather than with every other operand in the chain. That might not be clear from examples using only ==, since it won't affect the answer if everything defines __eq__ sanely.
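One detail the expansion a op1 b and b op2 c hides is that, in a chain, the middle operand is evaluated only once. A short sketch to illustrate both that and the adjacent-only comparison; middle and calls are hypothetical names used for counting evaluations.

```python
calls = []


def middle():
    """Middle operand that records each time it is evaluated."""
    calls.append(1)
    return 5


chained = 4 < middle() < 6                # middle() runs once
calls_for_chain = len(calls)

expanded = 4 < middle() and middle() < 6  # middle() runs twice
calls_for_expansion = len(calls) - calls_for_chain

# Only neighbours are compared: 5 != '5' and '5' != 5, so this is True
# even though the first and last operands are equal.
adjacent_only = 5 != '5' != 5
```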

Python: How to toggle between two values

14 votes

I want to toggle between two values in Python, that is, between 0 and 1.

For example, when I run a function the first time, it yields the number 0. Next time, it yields 1. Third time it's back to zero, and so on.

Sorry if this doesn't make sense, but does anyone know a way to do this?

You can accomplish that with a generator like this:

>>> def alternate():
...   while True:
...     yield 0
...     yield 1
...
>>>
>>> alternator = alternate()
>>>
>>> alternator.next()
0
>>> alternator.next()
1
>>> alternator.next()
0
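The same idea is available ready-made as itertools.cycle, and on Python 3 the built-in next() replaces the .next() method used above. A minimal sketch, assuming you just want an endless 0, 1, 0, 1, ... stream:

```python
from itertools import cycle

toggle = cycle([0, 1])  # repeats 0, 1, 0, 1, ... forever
first_five = [next(toggle) for _ in range(5)]

# For a plain variable, simple arithmetic also flips between 0 and 1.
value = 0
value = 1 - value
```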

Tuples readability : [0,0] vs (0,0)

14 votes

I have been using Python for some time and am discovering the "pythonic" way to code. I use a lot of tuples in my code; most of them are polar or Cartesian positions.

I found myself writing this :

window.set_pos([18,8])

instead of this :

window.set_pos((18,8))

to get rid of the double parentheses, which I found hard to read.

It seems that python is automatically doing the type conversion from list to tuple, as my code works properly.

But is it a good way to code? Do you have any presentation tips I could use to write readable code?

Thank you in advance for your surely enlightening answers.

I'd be careful deciding to eschew tuples in favor of lists everywhere. Have you ever used the dis module? Watch what Python is doing at the bytecode level when you make a list versus making a tuple:

>>> def f():
...     x = [1,2,3,4,5,6,7]
...     return x
... 
>>> def g():
...     x = (1,2,3,4,5,6,7)
...     return x
... 
>>> import dis
>>> dis.dis(f)
  2           0 LOAD_CONST               1 (1)
              3 LOAD_CONST               2 (2)
              6 LOAD_CONST               3 (3)
              9 LOAD_CONST               4 (4)
             12 LOAD_CONST               5 (5)
             15 LOAD_CONST               6 (6)
             18 LOAD_CONST               7 (7)
             21 BUILD_LIST               7
             24 STORE_FAST               0 (x)

  3          27 LOAD_FAST                0 (x)
             30 RETURN_VALUE     
>>>
>>>   
>>> dis.dis(g)
  2           0 LOAD_CONST               8 ((1, 2, 3, 4, 5, 6, 7))
              3 STORE_FAST               0 (x)

  3           6 LOAD_FAST                0 (x)
              9 RETURN_VALUE   

Though it will probably never be an issue in a GUI application (as your example seems to be), for performance reasons you may want to be careful about doing it everywhere in your code.

EXP(ORT) ciphers and M2Crypto/OpenSSL

13 votes

I am having a hard time running an M2Crypto SSLServer with EXPORT grade ciphers.

LOW/MEDIUM/HIGH grade ciphers work without any problems, but EXPORT just won't. Also, when OpenSSL is run in server mode from the command line, it accepts EXPORT grade ciphers without any problems.

So, either I am missing something or there is a problem in the M2Crypto module. Any help is appreciated.

The Python code used (ssl-server.py) looks like this:

import M2Crypto
import socket

CERTFILE = "dummy_cert.pem"
KEYFILE = "dummy_key.pem"
PROTOCOL = "sslv3"
HOST = "0.0.0.0"
PORT = 4433

def main():
    print "[i] Initializing context ..."
    ctx = M2Crypto.SSL.Context(protocol=PROTOCOL, weak_crypto=True)
    ctx.load_cert_chain(certchainfile=CERTFILE, keyfile=KEYFILE)
    ctx.set_options(M2Crypto.m2.SSL_OP_ALL)
    ctx.set_cipher_list("ALL")

    print "[i] Initializing socket ..."
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((HOST, PORT))
    sock.listen(1)
    conn, addr = sock.accept()

    print "[i] SSL handshake ..."
    ssl_conn = M2Crypto.SSL.Connection(ctx=ctx, sock=conn)
    ssl_conn.setup_ssl()
    try:
        ssl_conn_res = ssl_conn.accept_ssl()
    except Exception, ex:
        print "[x] SSL connection failed: '%s'" % str(ex)
    else:
        if ssl_conn_res == 1:
            print "[i] SSL connection accepted"
        else:
            print "[x] SSL handshake failed: '%s'" % ssl_conn.ssl_get_error(ssl_conn_res)

if __name__ == "__main__":
    main()

Symptoms are:

$ uname -a
Linux XYZ 2.6.38-15-generic #59-Ubuntu SMP Fri Apr 27 16:03:32 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=11.04
DISTRIB_CODENAME=natty
DISTRIB_DESCRIPTION="Ubuntu 11.04"

$ python -c "import M2Crypto;print M2Crypto.version_info"
(0, 20, 1)

$ openssl version
OpenSSL 0.9.8o 01 Jun 2010

1) NOT OK
SERVER (terminal 1): $ python ssl-server.py
CLIENT (terminal 2): $ openssl s_client -connect localhost:4433 -cipher EXPORT
CONNECTED(00000003)
28131:error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure:s23_clnt.c:602:

2) OK
SERVER (terminal 1): $ openssl s_server -cert dummy_cert.pem -key dummy_key.pem -ssl3 -no_tls1 -no_ssl2 -cipher EXPORT
CLIENT (terminal 2): $ openssl s_client -connect localhost:4433 -cipher EXPORT
CONNECTED(00000003)
depth=0 C = BE, CN = www.example.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = BE, CN = www.example.com
verify error:num=27:certificate not trusted
verify return:1
depth=0 C = BE, CN = www.example.com
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:/C=BE/CN=www.example.com
   i:/C=BE/CN=test-ca
---
Server certificate
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
subject=/C=BE/CN=www.example.com
issuer=/C=BE/CN=test-ca
---
No client certificate CA names sent
---
SSL handshake has read 1141 bytes and written 242 bytes
---
New, TLSv1/SSLv3, Cipher is EXP-EDH-RSA-DES-CBC-SHA
Server public key is 1024 bit
Secure Renegotiation IS supported
Compression: zlib compression
Expansion: zlib compression
SSL-Session:
    Protocol  : SSLv3
    Cipher    : EXP-EDH-RSA-DES-CBC-SHA
    Session-ID: B052D5D5A436F9A0B9D3FB24F2E32A8A06A0B6828230621C4CFAEB82A0A9AE0C
    Session-ID-ctx: 
    Master-Key:     47F6E3720D06518B961FE389F13BCDE42C37F703099ABBB9B3DA35383C420F519D4F4773D35E470CF6FF7BB243B29069
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Compression: 1 (zlib compression)
    Start Time: 1340644713
    Timeout   : 300 (sec)
    Verify return code: 21 (unable to verify the first certificate)
---

The content of dummy_cert.pem is as follows:

-----BEGIN CERTIFICATE-----
MIICkTCCAfqgAwIBAgIBAjANBgkqhkiG9w0BAQUFADAfMQswCQYDVQQGEwJCRTEQ
MA4GA1UEAxMHdGVzdC1jYTAeFw0xMjA1MDYwODQyNDlaFw0yMjA1MDMwODQyNDla
MCcxCzAJBgNVBAYTAkJFMRgwFgYDVQQDEw93d3cuZXhhbXBsZS5jb20wgZ8wDQYJ
KoZIhvcNAQEBBQADgY0AMIGJAoGBAL7OBv9wRwtNjN984XSy22/rw6tHM6Lq/Ccf
NoHKbqwC+PsxgmgJJiGBGewrzBR42toqHJi7EjHhuvrgqV9s2duPQBAANh7tzY1h
6VekrwhIIt4o1h0F2KB16VXA8s918d+8pRGt2T11GUh/QT3m9yY1VzqdIBeAfklC
ET6ncPK/AgMBAAGjgdQwgdEwCQYDVR0TBAIwADARBglghkgBhvhCAQEEBAMCBkAw
KwYJYIZIAYb4QgENBB4WHFRpbnlDQSBHZW5lcmF0ZWQgQ2VydGlmaWNhdGUwHQYD
VR0OBBYEFNGQArEZPKprJTn7A64qEFfl0m4xME8GA1UdIwRIMEaAFFuITOUJlGrJ
9lKufs8cm1MpwXrroSOkITAfMQswCQYDVQQGEwJCRTEQMA4GA1UEAxMHdGVzdC1j
YYIJALimgW7YUgdrMAkGA1UdEgQCMAAwCQYDVR0RBAIwADANBgkqhkiG9w0BAQUF
AAOBgQDWh8A0eBxI9XHy68xdjFsk2oerJeV6qqlcmtPZgz3GlarRcWcKsRJOyLLL
dCOe7tY5isWQAoLt6XALzDWjbQkTJnxBaKHif1MIikuajaYKT7LA1MvFn50Qrm6n
f9hG7gvdTpm1rlPcs0qibp1vJVubkU51mT6JT4UnLfeVIjtL7Q==
-----END CERTIFICATE-----

The content of dummy_key.pem is as follows:

-----BEGIN RSA PRIVATE KEY-----
MIICXgIBAAKBgQC+zgb/cEcLTYzffOF0sttv68OrRzOi6vwnHzaBym6sAvj7MYJo
CSYhgRnsK8wUeNraKhyYuxIx4br64KlfbNnbj0AQADYe7c2NYelXpK8ISCLeKNYd
BdigdelVwPLPdfHfvKURrdk9dRlIf0E95vcmNVc6nSAXgH5JQhE+p3DyvwIDAQAB
AoGBAIZldIRkP4Z0n2+j9OJQQUS6Wl7AjlyJDAc6cxhE0GOUzG+S1foVx6f92ZaC
2wLoha75zp691fkQuLWRnXu7nk9QwxQdOppKijIPHdL2cYtUc9UCedN5rExjpcOP
4Hjwf17YOxK2J0zzmG1djTBB47BKGUedSQ7E1QxGcrESS2XxAkEA+6ey2jy8etWi
QmCdJJIxXwKRVHCmt5LVwj+IOk/u3sr1AGfBm7spKGU3boCiFt4FmjGMax7B9r/e
zPaMb34guwJBAMIZX7Vv5gfjvWtgp6pyE/UkjRSOKBpuy9gyiqtLBJwehj/qsBqr
O6tFmjMFiudVusnVSrEFGAPLV52xf0U4580CQQDkEQ1UH2spX2dYBLslo6A+3NLc
1eMhx18WVgGd50cyfnkfzuh1vF8GjwR3jvhXBQvKvFDn284pU6YV1vNbL9F1AkEA
o2CwSwyRV3q+6i9Fchbr7aCCkBbIctdoBeclCeHvU2nuHsbwzMHtS9EeZmv365kh
zNoYMMDU4fy7FyVct2ua0QJASXtIwYKZ2CAP+lAQqfh+knRRqtqdLt4Lt0mpML5m
UtsECS8frKeF3mynXfsyRkvC8F2WFiJVJ3+D+y3zYNGlZg==
-----END RSA PRIVATE KEY-----

Update: at the low level, the handshake packets seem to be the same except for the random[32] field, which makes this even stranger.

SSL dump (ssldump -a -A -H -i lo) for both cases can be found here:

http://pastebin.com/YuC7d8zg (NOT OK case)

http://pastebin.com/U6YGQmv9 (OK case)

I needed the following two tweaks to the python script to make it work with export cipher suites:

PROTOCOL = "sslv23"
...
    print "[i] Initializing context ..."
    ctx = M2Crypto.SSL.Context(protocol=PROTOCOL, weak_crypto=True)
    ctx.load_cert_chain(certchainfile=CERTFILE, keyfile=KEYFILE)
    ctx.set_options(M2Crypto.m2.SSL_OP_ALL)
    ctx.set_tmp_rsa(M2Crypto.RSA.gen_key(512, 65537))
    ctx.set_cipher_list("ALL")

That is:

  1. Use SSLv23 as protocol identifier (SSLv2/v3 compat mode). Not sure why it is needed in this case, but it seems not to work otherwise.
  2. Set a temporary, ephemeral RSA key on the context using set_tmp_rsa(). This is required because with export ciphers, the provided (1024-bit) RSA key is only used for authentication (signing), while a temporary, export-crippled 512-bit RSA key is used for confidentiality (encryption). OpenSSL requires you to set up this key on the context (see the documentation of SSL_set_tmp_rsa()).

Weirdly enough, it also works in SSLv2-only mode (using -ssl2 on openssl s_client when testing) without setting a temporary RSA key (call to set_tmp_rsa commented out in the script). I have no idea why.

In general, some cipher suites require special keys be added to the context, e.g. suites using DH (group parameters) or ECDH (curve). To see exactly what is used for each cipher suite, openssl ciphers -v can be insightful, e.g.:

% openssl ciphers -v EXPORT
EXP-EDH-RSA-DES-CBC-SHA SSLv3 Kx=DH(512)  Au=RSA  Enc=DES(40)   Mac=SHA1 export
EXP-EDH-DSS-DES-CBC-SHA SSLv3 Kx=DH(512)  Au=DSS  Enc=DES(40)   Mac=SHA1 export
EXP-ADH-DES-CBC-SHA     SSLv3 Kx=DH(512)  Au=None Enc=DES(40)   Mac=SHA1 export
EXP-DES-CBC-SHA         SSLv3 Kx=RSA(512) Au=RSA  Enc=DES(40)   Mac=SHA1 export
EXP-RC2-CBC-MD5         SSLv3 Kx=RSA(512) Au=RSA  Enc=RC2(40)   Mac=MD5  export
EXP-RC2-CBC-MD5         SSLv2 Kx=RSA(512) Au=RSA  Enc=RC2(40)   Mac=MD5  export
EXP-ADH-RC4-MD5         SSLv3 Kx=DH(512)  Au=None Enc=RC4(40)   Mac=MD5  export
EXP-RC4-MD5             SSLv3 Kx=RSA(512) Au=RSA  Enc=RC4(40)   Mac=MD5  export
EXP-RC4-MD5             SSLv2 Kx=RSA(512) Au=RSA  Enc=RC4(40)   Mac=MD5  export

EDIT in response to the question about DSS cipher suites:

DSS/DSA cipher suites need DH parameters, and of course a DSS/DSA based server certificate instead of (only) an RSA one. This is true not only for export cipher suites, but for all suites using DSS/DSA for authenticity. DSS/DSA can by design only be used for signatures, not for encryption, in order to allow for export into untrusted countries. Because DSS/DSA can only be used for signatures, it needs an ephemeral Diffie-Hellman key exchange to establish a shared session key. That's what the EDH in the cipher suite stands for. To set up DH parameters, you'd use the M2Crypto equivalents of the OpenSSL SSL_set_tmp_dh() API.

Note that OpenSSL allows loading both an RSA and a DSA/DSS cert/keypair into the same SSL context.

Avoiding repeat of code after loop?

13 votes

I often end up writing a bit of code twice when using loops. For example, while going over the Udacity computer science course, I wrote this code (for a function to find the most sequentially repeated element):

def longest_repetition(l):
    if not l:
        return None
    most_reps = count = 0 
    longest = prv = None
    for i in l:
        if i == prv:
            count += 1
        else:
            if count > most_reps:
                longest = prv
                most_reps = count
            count = 1
        prv = i
    if count > most_reps:
        longest = prv
    return longest

In this case, I'm checking twice if the count is greater than the previously most repeated element. This happens both when the current element is different from the last and when I've reached the end of the list.

I've also run into this a few times when parsing a string character by character. There have also been a few times where it's been up to about 5 lines of code. Is this common, or a result of the way I think/code? What should I do?

edit: Similarly, in a contrived string splitting example:

def split_by(string, delimeter):
    rtn = []
    tmp = ''
    for i in string:
        if i == delimeter:
            if tmp != '':
                rtn.append(tmp)
                tmp = ''
        else:
            tmp += i
    if tmp != '':
        rtn.append(tmp)
    return rtn

edit: The exam this was from was written for students of the course who are not expected to have any outside knowledge of Python; only what was taught in the previous units. Although I do have prior experience in Python, I'm trying to adhere to these restrictions to get the most out of the course. Things like str.split, lists, and a lot of the fundamentals of Python were taught, but nothing yet on imports, and especially not things like groupby. That being said, how should it be written without any of the language features that probably wouldn't be taught in an introductory programming course?

Since you tagged this language-agnostic, I assume you won't be much interested in Python-specific features you could use to make your code efficient, compact and readable. For the same reason, I am not going to show how beautifully the code can be written in Python.

In some cases the extra if at the end can be avoided, depending on your algorithm, but in most cases it's like "If it exists, it should be significant and/or efficient." I don't know how the Python interpreter works, but in compiled languages like C/C++ the compiler performs various kinds of loop optimisations, including moving if-blocks out of a loop when they do the same thing.

I ran and compared the running time of various snippets:

  • @JFSebastian - 8.9939801693
  • @srgerg - 3.13302302361
  • yours - 2.8182990551.

It's not a general rule that a trailing if gives you the best time. My point is: just follow your algorithm, and try to optimise it. There's nothing wrong with an if at the end. The alternative solutions are probably more expensive.

About the second example you added: the check tmp != '' is done to ensure only non-empty strings are returned. That is actually an additional condition on top of your splitting algorithm. In any case, you need an additional rtn.append after the loop because there's still something beyond the last delimiter. You could always push an if condition inside the loop, like if curCharIndex == lastIndex: push items to list, but it would execute in every iteration, and it's sort of the same case again.

My answer in short:

  • Your code is as efficient as the algorithm you have in mind.
  • The ifs at the end are encountered in many cases. There is no need to worry about them; they may make the code more efficient than alternative approaches without such an if (examples are right here).
  • Additionally, compilers can spot such blocks and move them around in your code.
  • If there's a language feature/library that makes your code fast and at the same time readable, use it. (Other answers here point out what python offers :))
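For completeness, one common way to fold the trailing check into the loop is to append a sentinel that can never equal a real element, so the final run is flushed by the same else branch. This is only a sketch of that alternative, using nothing beyond lists and object(); whether it reads better than the trailing if is a matter of taste, and it copies the input list.

```python
def longest_repetition(items):
    """Return the element with the longest consecutive run (first one wins ties)."""
    sentinel = object()  # unique marker, never equal to a real element
    longest, most_reps = None, 0
    prev, count = sentinel, 0
    # The extra sentinel forces one last "run ended" step inside the loop.
    for item in list(items) + [sentinel]:
        if item is not sentinel and item == prev:
            count += 1
        else:
            if count > most_reps:
                longest, most_reps = prev, count
            prev, count = item, 1
    return longest
```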

Is it always faster to use string as key in a dict?

12 votes

On this page, I see something interesting:

Note that there is a fast-path for dicts that (in practice) only deal with str keys; this doesn't affect the algorithmic complexity, but it can significantly affect the constant factors: how quickly a typical program finishes.

So what does it exactly mean?

Does it mean using string as the key is always faster?

If yes, why?

Update:

Thanks for the suggestions about optimization! But I'm actually more interested in the plain truth, than whether or when we should do optimization.

Update 2:

Thanks for the great answers, I'll cite the content from the link provided by @DaveWebb here:

" ...

ma_lookup is initially set to the lookdict_string function (renamed to lookdict_unicode in 3.0), which assumes that both the keys in the dictionary and the key being searched for are standard PyStringObject's. It is then able to make a couple of optimizations, such as mitigating various error checks, since string-to-string comparisons never raise exceptions. There is also no need for rich object comparisons either, which means we avoid calling PyObject_RichCompareBool, and always use _PyString_Eq directly.

... "

Also, for the experiment numbers, I think the size of the difference will be even bigger if there is no int-to-string conversion

The C code that underlies the Python dict is optimised for string keys. You can read about this here (and in the book the blog refers to).

If the Python runtime knows your dict only contains string keys, it can do things such as not catering for errors that won't happen with a string-to-string comparison, and ignoring the rich comparison operators. This will make the common case of the string-key-only dict a little faster. (Update: timing shows it to be more than a little.)

However, it is unlikely that this would make a significant change to the run time of most Python programs. Only worry about this optimisation if you have measured and found dict lookups to be a bottleneck in your code. As the famous quote says, "Premature optimization is the root of all evil."

The only way to see how much faster things really are is to time them:

>>> import timeit
>>> timeit.timeit('a["500"]','a ={}\nfor i in range(1000): a[str(i)] = i')
0.06659698486328125
>>> timeit.timeit('a[500]','a ={}\nfor i in range(1000): a[i] = i')
0.09005999565124512

So using string keys is about 30% faster even compared to int keys, and I have to admit I was surprised at the size of the difference.

Initializing empty Python data structures

11 votes

Is there any tangible difference between the two forms of syntax available for creating empty Python lists/dictionaries, i.e.

l = list()
l = []

and:

d = dict()
d = {}

I'm wondering if using one is preferable over the other.

The function form calls the constructor at runtime to return a new instance, whereas the literal form causes the compiler to "create" it (really, to emit bytecode that results in a new object) at compile time. The former can be useful if (for some reason) the classes have been locally rebound to different types.

>>> import dis
>>> def f():
...   []
...   list()
...   {}
...   dict()
... 
>>> dis.dis(f)
  2           0 BUILD_LIST               0
              3 POP_TOP             

  3           4 LOAD_GLOBAL              0 (list)
              7 CALL_FUNCTION            0
             10 POP_TOP             

  4          11 BUILD_MAP                0
             14 POP_TOP             

  5          15 LOAD_GLOBAL              1 (dict)
             18 CALL_FUNCTION            0
             21 POP_TOP             
             22 LOAD_CONST               0 (None)
             25 RETURN_VALUE        
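To make the "locally rebound" remark concrete, here is a small sketch: the call form looks the name list up at runtime, so it follows a rebinding, while the literal always builds the built-in type. LoggedList is a hypothetical subclass used only for illustration.

```python
class LoggedList(list):
    """A list subclass standing in for 'a different type' (hypothetical)."""


def build():
    list = LoggedList   # local rebinding of the name 'list'
    from_call = list()  # name lookup at runtime, so this is a LoggedList
    from_literal = []   # BUILD_LIST always creates the built-in list
    return from_call, from_literal


from_call, from_literal = build()
```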

How does Python's Garbage Collector Detect Circular References?

10 votes

I'm trying to understand how Python's garbage collector detects circular references. When I look at the documentation, all I see is a statement that circular references are detected, except when the objects involved have a __del__ method.

If this happens, my understanding (possibly faulty) is that the gc module acts as a failsafe by (I assume) walking through all the allocated memory and freeing any unreachable blocks.

How does Python detect & free circular memory references before making use of the gc module?

I think I found the answer I'm looking for in some links provided by @SvenMarnich in comments to the original question:

Container objects are Python objects that can hold references to other Python objects. Lists, classes, tuples, etc. are container objects; integers, strings, etc. are not. So only container objects are at risk of being in a circular reference.

Each Python object has a field - *gc_ref*, which is (I believe) set to NULL for non-container objects. For container objects it is set equal to the number of non-container objects that reference it.

Any container object with a *gc_ref* count greater than 1 (? I would've thought 0, but OK for now ?) has references that are not container objects. So they are reachable and are removed from consideration of being unreachable memory islands.

Any container object reachable by an object known to be reachable (i.e. those we just recognized as having a *gc_ref* count greater than 1) also does not need to be freed.

The remaining container objects are not reachable (except by each other) and should be freed.

http://www.arctrix.com/nas/python/gc/ is a link providing a fuller explanation.

http://hg.python.org/cpython/file/2059910e7d76/Modules/gcmodule.c is a link to the source code, which has comments further explaining the thoughts behind the circular reference detection.
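The effect described above can be observed from Python itself. The sketch below assumes CPython, where reference counting alone cannot free a cycle and the gc module has to step in; Node and cycle_lifetime are hypothetical names, and a weak reference is used to watch the object without keeping it alive.

```python
import gc
import weakref


class Node:
    """Plain container object; instances can reference each other."""


def cycle_lifetime():
    """Return (alive before gc.collect(), alive after) for a two-node cycle."""
    was_enabled = gc.isenabled()
    gc.disable()                 # keep automatic collection out of the demo
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # a <-> b reference cycle
        probe = weakref.ref(a)   # does not count as a reference
        del a, b                 # refcounts stay above zero because of the cycle
        alive_before = probe() is not None
        gc.collect()             # cycle detector finds the unreachable island
        alive_after = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_before, alive_after


lifetime = cycle_lifetime()
```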

How do I get the whole content between two xml tags in Python?

7 votes

I am trying to get the whole content between an opening XML tag and its closing counterpart.

Getting the content in straight cases like title below is easy, but how can I get the whole content between the tags if mixed-content is used and I want to preserve the inner tags?

<?xml version="1.0" encoding="UTF-8"?>
<review>
  <title>Some testing stuff</title>
  <text sometimes="attribute">Some text with <extradata>data</extradata> in it.
  It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag> 
  or more</sometag>.</text>
</review>

What I want is the content between the two text tags, including any tags: Some text with <extradata>data</extradata> in it. It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag> or more</sometag>.

For now I use regular expressions, but it gets kind of messy and I don't like this approach. I lean towards an XML parser based solution. I looked over minidom, etree, lxml and BeautifulSoup but couldn't find a solution for this case (whole content, including inner tags).

from lxml import etree
t = etree.XML(
"""<?xml version="1.0" encoding="UTF-8"?>
<review>
  <title>Some testing stuff</title>
  <text>Some text with <extradata>data</extradata> in it.</text>
</review>"""
)
(t.text + ''.join(map(etree.tostring, t))).strip()

The trick here is that t is iterable, and when iterated, yields all child nodes. Because etree avoids text nodes, you also need to recover the text before the first child tag, with t.text.

In [50]: (t.text + ''.join(map(etree.tostring, t))).strip()
Out[50]: '<title>Some testing stuff</title>\n  <text>Some text with <extradata>data</extradata> in it.</text>'

Or:

In [6]: e = t.xpath('//text')[0]

In [7]: (e.text + ''.join(map(etree.tostring, e))).strip()
Out[7]: 'Some text with <extradata>data</extradata> in it.'
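The same idea also works with the standard library's xml.etree.ElementTree, for cases where lxml is not available. The sketch below assumes Python 3. The helper name inner_xml and the shortened sample document are made up for illustration. It relies on ElementTree.tostring() including the text that follows each child (its tail), which is what stitches the mixed content back together.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return everything between an element's opening and closing tags."""
    # Text before the first child (may be None if the element starts with a tag).
    parts = [element.text or ""]
    # Each serialized child carries its own tail text along with it.
    for child in element:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


DOC = (
    "<review><title>Some testing stuff</title>"
    '<text sometimes="attribute">Some text with <extradata>data</extradata> in it. '
    "It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag> or more</sometag>."
    "</text></review>"
)

root = ET.fromstring(DOC)
content = inner_xml(root.find("text"))
```

Note that the serializer may normalise details such as attribute quoting or entity escaping, so the result is equivalent XML rather than a byte-for-byte copy of the input.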

Is there any compiler that can convert a regexp to an FSM, or to human words?

7 votes

Something that can convert

r"a+|(?:ab+c)"

to

{
    (1, 'a') : [2, 3],
    (2, 'a') : [2],
    (3, 'b') : [4, 3],
    (4, 'c') : [5]
}

or something similar

and accepting in 2 or 5
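To make the intended meaning of such a table concrete, here is a minimal sketch that treats it as a nondeterministic automaton and tracks the set of states it could be in. The function name nfa_accepts is hypothetical, and the start state 1 is an assumption read off the table. This only runs a table that already exists; it does not build one from a regexp.

```python
def nfa_accepts(transitions, start, accepting, text):
    """Simulate an NFA given as {(state, char): [next_states]}."""
    current = {start}
    for char in text:
        # Follow every possible transition from every state we might be in.
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, char), [])
        }
        if not current:
            return False  # no state can consume this character
    return bool(current & accepting)


# The table from the question, for r"a+|(?:ab+c)".
TRANSITIONS = {
    (1, 'a'): [2, 3],
    (2, 'a'): [2],
    (3, 'b'): [4, 3],
    (4, 'c'): [5],
}
ACCEPTING = {2, 5}

matches_aaa = nfa_accepts(TRANSITIONS, 1, ACCEPTING, "aaa")
matches_abbc = nfa_accepts(TRANSITIONS, 1, ACCEPTING, "abbc")
matches_ab = nfa_accepts(TRANSITIONS, 1, ACCEPTING, "ab")
```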

i have some code that will do this. it's not well documented and it's not supported, but if you're interested you're welcome to look at it.

the library is called rxpy and the repository is http://code.google.com/p/rxpy

the routine that does parsing is parse_pattern at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/pattern.py#871

if you call repr(...) on the result from that you get a graph in the "dot language" - https://en.wikipedia.org/wiki/DOT_language

for example, see the tests at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/_test/parser.py#47

to show what i mean, let's look at the test at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/_test/parser.py#234 which is for 'ab*c':

"""digraph {
 0 [label="a"]
 1 [label="...*"]
 2 [label="b"]
 3 [label="c"]
 4 [label="Match"]
 0 -> 1
 1 -> 2
 1 -> 3
 3 -> 4
 2 -> 1
}"""

that starts at 0 which can match an "a" to go to state 1. from there you can match a "b" to go to state 2 or a "c" to go to state 3. state 2 then has a transition back to 1 that can consume another "b", etc etc. it's a bit ugly to read by hand, but when the test fails you get a little graph displayed on the screen.

the library also has various "engines" which will match strings against this graph (and so do regular expression matching). but it is much slower than the python library (because it is pure python).

this is not supported and may not be very clear - sorry - but i think it's close to what you want and you're welcome to use it if it's useful (MPL or LGPL licence).

When is it appropriate to use a database in Python?

6 votes

I am making a little add-on for a game, and it needs to store information on a player's username, IP address, location in game, and a list of alternate usernames that have come from that IP or alternate IP addresses that come from that username. I read an article a while ago that said that unless I am storing a large amount of information that cannot be held in RAM, I should not use a database. So I tried using the shelve module in Python, but I'm not sure if that is a good idea. When do you guys think it is a good idea to use a database, and when is it better to store information in another way? Also, what are some other ways to store information besides databases and flat-file databases?

Most importantly, unless you specifically need performance or high reliability, do whatever will make your code simplest/easiest to write.


If your data is extremely structured (and you know SQL or are willing to learn) then using a database like sqlite3 might be appropriate. (You should ignore the comment about database size and RAM: there are times when databases are appropriate for even very small data sets, because of how the data is structured.)
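As a rough illustration of why structure matters more than size here, the sketch below stores the asker's kind of data in sqlite3 and answers the "which other usernames share an IP with this one" question with a single query. The table name sightings, its columns, the sample rows and the helper alternate_usernames are all made up for the example. An in-memory database is used so the snippet leaves nothing on disk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to persist the data
conn.execute("CREATE TABLE sightings (username TEXT, ip TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?, ?)",
    [
        ("alice", "10.0.0.1", "spawn"),
        ("bob", "10.0.0.1", "castle"),
        ("carol", "10.0.0.2", "mine"),
        ("alice", "10.0.0.2", "mine"),
    ],
)


def alternate_usernames(conn, username):
    """Other usernames seen on any IP address this username has used."""
    rows = conn.execute(
        "SELECT DISTINCT other.username "
        "FROM sightings AS mine "
        "JOIN sightings AS other ON other.ip = mine.ip "
        "WHERE mine.username = ? AND other.username != ? "
        "ORDER BY other.username",
        (username, username),
    )
    return [row[0] for row in rows]


alts_for_alice = alternate_usernames(conn, "alice")
```

Doing the same lookup over nested dictionaries is possible, but the cross-referencing code tends to grow quickly as more relationships are added.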

If the data is relatively simple and you don't need the reliability that a database (normally) has then storing it in one of the builtin datatypes while the program is running is probably fine.

If you'd like the data stored on disk to be human readable (and editable, with a bit of effort), then a format like JSON (there is a builtin json module) is nice, since the basic Python objects serialise without any effort. If the data is not so simple then YAML is essentially an extended version of JSON (PyYAML is very good). Similarly, you could use CSV files (the csv module), although this is not nearly as good as JSON or YAML, or just a custom text format (but it takes quite a lot of effort to get error handling and so on implemented neatly).
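For the JSON route, a minimal round trip might look like the sketch below. The players dictionary and the file name players.json are invented for the example, and a temporary directory is used so nothing is left behind in the working directory.

```python
import json
import os
import tempfile

# Hypothetical record layout: plain dicts, lists and strings only.
players = {
    "alp": {
        "ips": ["10.0.0.1", "10.0.0.7"],
        "location": "spawn",
        "alternates": ["alp2"],
    }
}

path = os.path.join(tempfile.mkdtemp(), "players.json")

# indent and sort_keys make the file pleasant to read and diff by hand.
with open(path, "w") as handle:
    json.dump(players, handle, indent=2, sort_keys=True)

with open(path) as handle:
    loaded = json.load(handle)
```

Keep in mind that JSON only knows a handful of types: tuples come back as lists, and dictionary keys are always strings after loading.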

Finally, if your data contains more advanced objects (e.g. recursive dictionaries, or complicated custom datatypes) then using one of the builtin binary serialisation techniques (pickle, shelve etc.) might be appropriate. However, YAML can handle many of these things (including recursive data structures).

Some general points:

  • Plain text formats are nice, as they allow values to be tweaked easily and debugging/testing is easy
  • Binary formats are nice, as they mean that values can't be tweaked without a little bit of extra effort (this is not saying they can't be adjusted though), and the file size is smaller (probably not relevant)

Simulating /dev/random on Windows

5 votes

I'm trying to port Python code from Linux to Windows right now. In various places random numbers are generated by reading from /dev/random. Is there a way to simulate /dev/random on Windows?

I'm looking for a solution that would keep the code useable on linux...

If you are using Python, why do you care about the specific implementation? Just use the random module and let it deal with it.

Beyond that, (if you can't rely on software state) os.urandom provides os-based random values:

On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom.

(Note that random.SystemRandom provides a nice interface for this).
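A short sketch of both calls follows. The variable names are arbitrary, and the same code should run unchanged on Linux and Windows because the platform differences are handled inside os.urandom.

```python
import os
import random

# Raw bytes from the operating system's randomness source.
token = os.urandom(16)

# The familiar random-module interface, backed by the same OS source.
rng = random.SystemRandom()
roll = rng.randint(1, 6)
pick = rng.choice(["north", "south", "east", "west"])
```

Unlike the module-level functions in random, a SystemRandom instance cannot be seeded, so its output is not reproducible between runs.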

If you are really serious about it being cryptographically random, you might want to check out PyCrypto.

What to do with pyc files when Django or python is used with Mercurial?

5 votes

Just started to use Mercurial. Wow, nice application. I moved my database file out of the code directory, but I was wondering about the .pyc files. I didn't include them on the initial commit. The documentation about the .hgignore file includes an example to exclude *.pyc, so I think I'm on the right track.

I am wondering about what happens when I decide to roll back to an older fileset. Will I need to delete all the .pyc files then? I saw some questions on Stack Overflow about the issue, including one gentleman that found old .pyc files were being used. What is the standard way around this?

As mentioned in ms4py's answer, *.pyc are compiled files that will be regenerated on the fly. You wouldn't want to include these when distributing a project.

However, if modules that existed before are removed when you roll back changes and their *.pyc files are left lying around, strange bugs can appear, because a .pyc file can be executed even if the original Python file no longer exists. This has bitten me a few times in Django when adding and removing apps in a project and switching branches with git.

To clean things up, you can delete every compiled file by running the following shell command in your project's directory:

find . -name '*.pyc' -exec rm {} \;
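Where find is not available (on Windows, for instance), the same cleanup can be sketched in Python itself. The function name remove_pyc is made up, and the demo below runs against a throwaway temporary directory rather than a real project.

```python
import os
import tempfile


def remove_pyc(root):
    """Delete every *.pyc file under root and return the removed paths."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pyc"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed


# Demo on a scratch directory containing one source file and two stale .pyc files.
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, "app"))
for name in ("views.py", "views.pyc", os.path.join("app", "models.pyc")):
    open(os.path.join(demo_root, name), "w").close()

removed = remove_pyc(demo_root)
remaining = sorted(os.listdir(demo_root))
```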

python Socket.IO client for sending broadcast messages to TornadIO2 server

5 votes

I am building a realtime web application. I want to be able to send broadcast messages from the server-side implementation of my python application.

Here is the setup:

I can successfully send socket.io messages from the client to the server. The server handles these and can send a response. In the following i will describe how i did that.

Current Setup and Code

First, we need to define a Connection which handles socket.io events:

class BaseConnection(tornadio2.SocketConnection):
    def on_message(self, message):
        pass

    # will be run if client uses socket.emit('connect', username)
    @event
    def connect(self, username):
        # send answer to client which will be handled by socket.on('log', function)
        self.emit('log', 'hello ' + username)

Starting the server is done by a custom Django management command:

class Command(BaseCommand):
    args = ''
    help = 'Starts the TornadIO2 server for handling socket.io connections'

    def handle(self, *args, **kwargs):
        autoreload.main(self.run, args, kwargs)

    def run(self, *args, **kwargs):
        port = settings.SOCKETIO_PORT

        router = tornadio2.TornadioRouter(BaseConnection)

        application = tornado.web.Application(
            router.urls,
            socket_io_port = port
        )

        print 'Starting socket.io server on port %s' % port
        server = SocketServer(application)

Very well, the server runs now. Let's add the client code:

<script type="text/javascript">    
    var sio = io.connect('localhost:9000');

    sio.on('connect', function(data) {
        console.log('connected');
        sio.emit('connect', '{{ user.username }}');
    });

    sio.on('log', function(data) {
        console.log("log: " + data);
    });
</script>

Obviously, {{ user.username }} will be replaced by the username of the currently logged in user, in this example the username is "alp".

Now, every time the page gets refreshed, the console output is:

connected
log: hello alp

Therefore, invoking messages and sending responses works. But now comes the tricky part.

Problems

The response "hello alp" is sent only to the invoker of the socket.io message. I want to broadcast a message to all connected clients, so that they can be informed in realtime if a new user joins the party (for example in a chat application).

So, here are my questions:

  1. How can i send a broadcast message to all connected clients?

  2. How can i send a broadcast message to multiple connected clients that are subscribed on a specific channel?

  3. How can i send a broadcast message anywhere in my python code (outside of the BaseConnection class)? Would this require some sort of Socket.IO client for python or is this builtin with TornadIO2?

All these broadcasts should be done in a reliable way, so i guess websockets are the best choice. But i am open to all good solutions.

I've recently written a very similar application on a similar setup, so I do have several insights.

The proper way of doing what you need is to have a pub-sub backend. There's only so much you can do with simple ConnectionHandlers. Eventually, handling class-level sets of connections starts to get ugly (not to mention buggy).

Ideally, you'd want to use something like Redis, with async bindings to tornado (check out brukva). That way you don't have to mess with registering clients to specific channels - Redis has all that out of the box.
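To show what the pub-sub pattern buys you, here is a tiny in-process stand-in written in plain Python. It is not Redis and not brukva. The Broker class, its method names and the channel names are all invented for illustration, and it only works inside a single process, which is exactly the limitation a real backend removes.

```python
from collections import defaultdict


class Broker(object):
    """Minimal in-memory publish/subscribe hub (illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def unsubscribe(self, channel, callback):
        self._subscribers[channel].remove(callback)

    def publish(self, channel, message):
        # Deliver to everyone on the channel; the publisher needs no
        # knowledge of who (or how many) the receivers are.
        receivers = list(self._subscribers.get(channel, []))
        for callback in receivers:
            callback(message)
        return len(receivers)


broker = Broker()
alice_inbox, bob_inbox = [], []


def deliver_to_alice(message):
    alice_inbox.append(message)


def deliver_to_bob(message):
    bob_inbox.append(message)


broker.subscribe("chat", deliver_to_alice)
broker.subscribe("chat", deliver_to_bob)
broker.subscribe("admin", deliver_to_alice)

chat_receivers = broker.publish("chat", "a new user joined")
admin_receivers = broker.publish("admin", "server restarting")
```

Each connection handler plays the role of a callback here: it subscribes when the client connects, unsubscribes when it disconnects, and any other code can publish without holding references to the connections.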

Essentially, you have something like this:

import brukva
from sockjs.tornado import SockJSConnection

# subscribe and unsubscribe must name the same channel the publisher uses
CHANNEL = 'text_channel'

class ConnectionHandler(SockJSConnection):
    def __init__(self, *args, **kwargs):
        super(ConnectionHandler, self).__init__(*args, **kwargs)
        self.client = brukva.Client()
        self.client.connect()
        self.client.subscribe(CHANNEL)

    def on_open(self, info):
        # start forwarding channel messages once the client is connected
        self.client.listen(self.on_chan_message)

    def on_message(self, msg):
        # this is a message broadcast from the client
        # handle it as necessary (this implementation ignores them)
        pass

    def on_chan_message(self, msg):
        # this is a message received from redis
        # send it to the client
        self.send(msg.body)

    def on_close(self):
        self.client.unsubscribe(CHANNEL)
        self.client.disconnect()

Note that I used sockjs-tornado which I found to be much more stable than socket.io.

Anyway, once you have this sort of setup, sending messages from any other client (such as Django, in your case) is as easy as opening a Redis connection (redis-py is a safe bet) and publishing a message:

import redis
r = redis.Redis()
r.publish('text_channel', 'oh hai!')

This answer turned out pretty long, so I went the extra mile and made a blog post out of it: http://blog.y3xz.com/post/24691592929/a-modern-python-stack-for-a-real-time-web-application

Django: Pass variable to logging in settings file

5 votes

I am trying to add a variable to my log line through my settings.py file.

This is the code in settings (the logging part):

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'mail_admins': {
                        'level': 'CRITICAL',
                         'class': 'django.utils.log.AdminEmailHandler'
                        },

        'customhandler':{
                        'level':'DEBUG',
                        'class':'logging.handlers.RotatingFileHandler',
                        'formatter':'custom_format',
                        'filename':LOG_LOCATION
                        },
                 },

     'loggers': {
         'django.request': {
                        'handlers': ['mail_admins'],
                         'level': 'CRITICAL',
                        'propagate': True,
                            },
         'Logger_Custom1': {
                        'handlers':['customhandler'],
                        'level':'DEBUG',
                        'propagate':True
                           },
                 },

    'formatters': {
        'verbose': {
            'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s'
                     },
         'simple': {
            'format': '%(levelname)s %(message)s'
                     },
         'custom_format':{
            'format':'[%(asctime)s %(levelname)s T:%(threadName)s F:%(funcName)s ] %(message)s '
                         },
                 }
}

The above code is working fine, but now I would like each log message to have a variable at the end. Something like:

MyVariable = "Somelines" 
[%(asctime)s %(levelname)s T:%(threadName)s F:%(funcName)s ] %(message)s 'MyVariable

So my log would have that variable's contents at the end of each logging line. I know we can do that inside the view function like this: logging.warning('%s before you %s', 'Look', 'Leap'). But that would require us to include that line everywhere separately. Also, when we need to add or change that variable name, we would need to change that line everywhere in every file.

So I was wondering if there is any way to do that directly from settings.py, so that we can make one change and it will apply to all logging messages.

I found the solution myself. I don't know if this is good practice, but it works.

All I did was assign a variable:

testvar = "MyVariable"

And then append this variable, like this:

'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s ' + testvar

The output will have the variable in your log entry merged with the log format. Thank you. Please let me know if there are more ways to do it.
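One other way, sketched here under a few assumptions, is a logging filter that attaches the value to every record so the format string can refer to it by name. The class name StaticFieldFilter, the record attribute myvariable and the logger name are all made up, and the snippet uses plain Python 3 logging with an in-memory stream rather than a Django settings file. In a Django LOGGING dict the same filter would be declared under a 'filters' key and listed on the handler.

```python
import io
import logging


class StaticFieldFilter(logging.Filter):
    """Attach a fixed value to every record that passes through."""

    def __init__(self, value):
        super().__init__()
        self.value = value

    def filter(self, record):
        record.myvariable = self.value  # becomes %(myvariable)s in formats
        return True  # never drop the record


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s %(message)s %(myvariable)s")
)
handler.addFilter(StaticFieldFilter("Somelines"))

logger = logging.getLogger("Logger_Custom1_demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.warning("Look before you leap")
output = stream.getvalue()
```

Compared with concatenating onto the format string, this keeps working if the value contains a % character, and the value can be changed at runtime by updating the filter.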