Best django questions in May 2012

Django-Pinax : How do you use a pinax app apart from what you get with a pinax base project?

9 votes

I am trying to understand Pinax and plan to use it in my next project.

I have started with a pinax basic project, and now I have something to go with runserver.

Now, I understand that I can customize the initial setup that I got from pinax, including the profiles, themes, etc., as per my requirements.

But is that all that pinax provides ?

I am very confused here. For example, I want to use the pinax phileo app in my project, so how does pinax help me do that?

My Effort :

  • I searched and found that I have to install it with pip install phileo
  • Then, add it to INSTALLED_APPS and use it as required.

But what did pinax do in this ?

Pinax has phileo featured on its website, but why ? Since I could have used it just like any other app on my non-pinax django project.

So, my question in a nutshell is :

What does pinax provide after a base project and default templates that come with pinax ?

Right now, it feels like pinax just provides a base project with some apps already working with some default templates. [ That's it ? ]

Then, what about other apps featured on pinax's website that do not come with base projects ?

Please, help clear up the confusion !

Update: My question is essentially this: what is the significance of the pinax-ecosystem when the same apps are already listed somewhere like djangopackages.com?

Pinax is just django with a blend of other django plugins. You have to enable them and set them up individually. To use each individual app within pinax, you have to read that specific app's documentation and set it up appropriately (list of apps and repos which likely contain documentation here: http://pinaxproject.com/ecosystem/)

Some people like pinax, but I find that it's more of a hassle than a solution. In the end, pinax doesn't work out of the box. You have to customize everything, but at the same time you lock yourself into a bundle you don't need. I suggest instead starting a project and installing the packages you need individually, and even finding more here: http://djangopackages.com/. This is especially true for a big project, because if you bundle and set up everything on your own, you will know the ins and outs of it all.

Can the Django ORM store an unsigned 64-bit integer (aka ulong64 or uint64) in a reliably backend-agnostic manner?

7 votes

All the docs I've seen imply that you might be able to do that, but there isn't anything official w/r/t ulong64/uint64 fields. There are a few off-the-shelf options that look quite promising in this arena:

  • BigIntegerField ... almost, but signed,
  • PositiveIntegerField ... suspiciously 32-bit-looking,
  • DecimalField ... a fixed-point number represented with a Python decimal type, according to the docs -- which presumably turns into an analogously pedantic and slow database field when socked away, à la the DECIMAL or NUMERIC PostgreSQL types.

... all of which look like they might store a number like that. Except NONE OF THEM WILL COMMIT, much like every single rom-com character portrayed by Hugh Grant.

My primary criterion is that it works with Django's supported backends, without any if postgresql (...) elif mysql (...) type of special-case nonsense. After that, there is the need for speed -- this is for a model field in a visual-database application that will index image-derived data (e.g. perceptual hashes and extracted keypoint features), allowing ordering and grouping by the content of those images.

So: is there a good Django extension or app that furnishes some kind of PositiveBigIntegerField that will suit my purposes?

And, barring that: If there is a simple and reliable way to use Django's stock ORM to store unsigned 64-bit ints, I'd like to know it. Look, I'm no binary whiz; I have to do two's complement on paper -- so if this method of yours involves some bit-shifting trickery, don't hesitate to explain what it is, even if it strikes you as obvious. Thanks in advance.

I have not tested it, but you may wish to just subclass BigIntegerField. The original BigIntegerField looks like this (source here):

class BigIntegerField(IntegerField):
    empty_strings_allowed = False
    description = _("Big (8 byte) integer")
    MAX_BIGINT = 9223372036854775807

    def get_internal_type(self):
        return "BigIntegerField"

    def formfield(self, **kwargs):
        defaults = {'min_value': -BigIntegerField.MAX_BIGINT - 1,
                    'max_value': BigIntegerField.MAX_BIGINT}
        defaults.update(kwargs)
        return super(BigIntegerField, self).formfield(**defaults)

A derived PositiveBigIntegerField may look like this:

class PositiveBigIntegerField(BigIntegerField):
    empty_strings_allowed = False
    description = _("Big (8 byte) positive integer")

    def db_type(self, connection):
        """
        Returns MySQL-specific column data type. Make additional checks
        to support other backends.
        """
        return 'bigint UNSIGNED'

    def formfield(self, **kwargs):
        # Largest unsigned 64-bit value: 2 * MAX_BIGINT + 1 == 2**64 - 1
        defaults = {'min_value': 0,
                    'max_value': BigIntegerField.MAX_BIGINT * 2 + 1}
        defaults.update(kwargs)
        return super(PositiveBigIntegerField, self).formfield(**defaults)

You should test it thoroughly before using it, though. If you do, please share the results :)

EDIT:

I missed one thing - the internal database representation. This is based on the value returned by get_internal_type(), and the definition of the column type is stored e.g. here in the case of the MySQL backend and determined here. It looks like overriding db_type() will give you control over how the field is represented in the database. However, you will need to find a way to return a DBMS-specific value from db_type() by checking the connection argument.
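
For the backend-agnostic route the question asks about, one option that needs no custom column type is to shift every value down by 2**63 before saving, so the whole unsigned range fits into a stock (signed) BigIntegerField and the sort order is preserved. This is a minimal sketch and not part of the answer above; the helper names are hypothetical, and wiring them into a custom field's conversion hooks is left out:

```python
UINT64_MAX = 2 ** 64 - 1   # largest unsigned 64-bit value
OFFSET = 2 ** 63           # distance between the unsigned and signed ranges


def uint64_to_stored(value):
    """Map 0..2**64-1 onto -2**63..2**63-1 for a signed 64-bit column."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("value does not fit in an unsigned 64-bit integer")
    return value - OFFSET


def stored_to_uint64(stored):
    """Inverse mapping, applied to the value read back from the database."""
    if not -OFFSET <= stored <= OFFSET - 1:
        raise ValueError("stored value does not fit in a signed 64-bit integer")
    return stored + OFFSET
```

Because the mapping is a plain subtraction, a larger unsigned value always maps to a larger stored value, so ORDER BY on the column still sorts in unsigned order. The trade-off is that the raw column values are not human-readable without adding the offset back.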

Specific complex SQL query and Django ORM?

6 votes

I have a set of tables that contain content that is created and voted on by users.

Table content_a

id         /* the id of the content */
user_id    /* the user that contributed the content */
content    /* the content */

Table content_b

id
user_id
content

Table content_c

id
user_id
content

Table voting

user_id         /* the user that made the vote */
content_id      /* the content the vote was made on */
content_type_id /* the content type the vote was made on */
vote            /* the value of the vote, either +1 or -1 */

I want to be able to select a set of users and order them by the sum of the votes on the content they have produced. For example,

SELECT * FROM users ORDER BY <sum of votes on all content associated with user>

Is there a specific way this can be achieved using Django's ORM, or do I have to use a raw SQL query? And what would the most efficient way be to achieve this in raw SQL?

Update

Assuming the models are

from django.contrib.auth.models import User
from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models


class ContentA(models.Model):
    user = models.ForeignKey(User)
    content = models.TextField()

class ContentB(models.Model):
    user = models.ForeignKey(User)
    content = models.TextField()

class ContentC(models.Model):
    user = models.ForeignKey(User)
    content = models.TextField()

class GenericVote(models.Model):
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey()
    user = models.ForeignKey(User)
    vote = models.IntegerField(default=1)

Option A. Using GenericVote

GenericVote.objects.extra(select={'uid':"""
CASE
WHEN content_type_id = {ct_a} THEN (SELECT user_id FROM {ContentA._meta.db_table} WHERE id = object_id)
WHEN content_type_id = {ct_b} THEN (SELECT user_id FROM {ContentB._meta.db_table} WHERE id = object_id)
WHEN content_type_id = {ct_c} THEN (SELECT user_id FROM {ContentC._meta.db_table} WHERE id = object_id)
END""".format(
ct_a=ContentType.objects.get_for_model(ContentA).pk,
ct_b=ContentType.objects.get_for_model(ContentB).pk,
ct_c=ContentType.objects.get_for_model(ContentC).pk,
ContentA=ContentA,
ContentB=ContentB,
ContentC=ContentC
)}).values('uid').annotate(vc=models.Sum('vote')).order_by('-vc')

The above ValuesQuerySet (or use values_list()) gives you a sequence of User IDs ordered by descending vote total. You could then use it to fetch the top users.

Option B. Using User.objects.raw

When I use User.objects.raw, I get almost the same query as the answer given by forsvarir:

User.objects.raw("""
SELECT "{user_tbl}".*, SUM("gv"."vc") as vote_count from {user_tbl},
    (SELECT id, user_id, {ct_a} AS ct FROM {ContentA._meta.db_table} UNION
     SELECT id, user_id, {ct_b} AS ct FROM {ContentB._meta.db_table} UNION
     SELECT id, user_id, {ct_c} as ct FROM {ContentC._meta.db_table}
    ) as c,
   (SELECT content_type_id, object_id, SUM("vote") as vc FROM {GenericVote._meta.db_table} GROUP BY content_type_id, object_id) as gv
WHERE {user_tbl}.id = c.user_id
    AND gv.content_type_id = c.ct
    AND gv.object_id = c.id
GROUP BY {user_tbl}.id
ORDER BY vote_count DESC""".format(
    user_tbl=User._meta.db_table, ContentA=ContentA, ContentB=ContentB,
    ContentC=ContentC, GenericVote=GenericVote, 
    ct_a=ContentType.objects.get_for_model(ContentA).pk,
    ct_b=ContentType.objects.get_for_model(ContentB).pk,
    ct_c=ContentType.objects.get_for_model(ContentC).pk
))

Option C. Other possible ways

  • De-normalize vote_count onto the User model, a profile model (for example, UserProfile), or another related model, as suggested by Michael Dunn. This behaves much better if you access vote_count frequently on the fly.
  • Build a DB view which does the UNIONs for you, then map a model to it; this could make the query easier to construct.
  • Sort in Python. This is usually the best way to work with large-scale data, because of the dozens of toolkits and extension options available.
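
The "sort in Python" option can be sketched without any ORM at all. The sketch below assumes the content and vote rows have already been fetched as plain tuples (for example via values_list()); the function name and the tuple layouts are hypothetical:

```python
from collections import defaultdict


def rank_users_by_votes(content_rows, vote_rows):
    """Return [(user_id, vote_total), ...] ordered by descending total.

    content_rows: iterable of (content_type_id, object_id, user_id)
    vote_rows:    iterable of (content_type_id, object_id, vote)
    """
    # Map each piece of content to the user who contributed it.
    owner = {(ct, oid): uid for ct, oid, uid in content_rows}

    totals = defaultdict(int)
    for ct, oid, vote in vote_rows:
        uid = owner.get((ct, oid))
        if uid is not None:          # ignore votes on unknown content
            totals[uid] += vote

    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Users with no votes at all do not appear in the result; add them with a total of 0 beforehand if they need to be listed.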

You need some Django models mapping those tables before you can use the Django ORM to query them. Assuming there are User and Voting models matching the users and voting tables, you could then do

User.objects.annotate(v=models.Sum('voting__vote')).order_by('v')

How does pgBouncer help to speed up Django

5 votes

I have some management commands that are based on gevent. Since my management command makes thousands of requests, I can turn all socket calls into non-blocking calls using gevent. This really speeds up my application as I can make requests simultaneously.

Currently the bottleneck in my application seems to be Postgres. It seems that this is because the Psycopg library that is used for connecting to Django is written in C and does not support asynchronous connections.

I've also read that using pgBouncer can speed up Postgres by 2X. This sounds great but it would be great if someone could explain how pgBouncer works and helps?

Thanks

PgBouncer reduces the latency in establishing connections by serving as a proxy which maintains a connection pool. This may help speed up your application if you're opening many short-lived connections to Postgres. If you only have a small number of connections, you won't see much of a win.
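
To illustrate why a pool helps with many short-lived connections, here is a toy sketch that only counts how many times the (expensive) connection setup runs. It is not how PgBouncer is implemented; FakeConnection, Pool and both run_* functions are hypothetical stand-ins:

```python
import queue


class FakeConnection(object):
    """Stand-in for a real database connection; counts setups."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1   # the costly handshake would happen here


class Pool(object):
    """Opens a fixed number of connections once and hands them out."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(FakeConnection())

    def acquire(self):
        return self._idle.get()      # blocks if every connection is in use

    def release(self, conn):
        self._idle.put(conn)


def run_without_pool(requests):
    FakeConnection.opened = 0
    for _ in range(requests):
        FakeConnection()             # new connection per request
    return FakeConnection.opened


def run_with_pool(requests, size=5):
    FakeConnection.opened = 0
    pool = Pool(size)
    for _ in range(requests):
        conn = pool.acquire()        # reuse an already-open connection
        pool.release(conn)
    return FakeConnection.opened
```

With a thousand requests, the unpooled version pays for a thousand setups while the pooled version pays only for the pool size, which matches the answer's point that the win depends on how many short-lived connections you open.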

Enforce at least one value in a many-to-many relation, in Django?

5 votes

I have a many-to-many relation in a Django (1.4) model.

class UserProfile(models.Model):
    foos = models.ManyToManyField(Foo)

I want to enforce that each User(Profile) has at least one Foo. Foos can have zero or more User(Profile)s.

I would love this to be enforced at the model and admin levels, but just enforcing it in the admin would be sufficient.

If I understand correctly, 'many' in Django-speak is zero-or-more.

I want a ManyToOneOrMore relation. How can I do this?

Thanks,

Chris.

You can't enforce this at the model level, as @Greg details, but you can enforce it on a form by simply making the field required. This won't prevent anyone with shell-level access from manually creating a UserProfile without a Foo, but it will enforce the rule for anyone creating one through a browser-based form.

Maintain a large dictionary in memory for Django-Python?

5 votes

I have a big key-value dump that I need to look up from my Django/Python webapp.

So, I have following options:

  • Store it as json dump and load it as a python dict.
  • Store it in a dump.py and import the dict from it.
  • Use some targeted systems for this problem: [ Are these really meant for this usecase ? ]
    • Mem-cache
    • Redis
    • Any other option ?

Which from above is the right way to go ?

How will you compare memcache and redis ?

Update:

  • My dictionary is about 5 MB in size and will grow over time.
  • Using Redis/Memcache adds the overhead of hitting a socket every time, so would dump.py be better? It would take time to load into memory, but after that it would only do in-memory lookups.

  • My dictionary needs to be updated every day. Given that, dump.py will be a problem, since we have to restart the Django server to reload it, whereas I guess updates would be reflected on the fly in Redis and Memcache.

  • One uses a system like Redis only when there is a large amount of data that has to be looked up very frequently. In that case the socket adds overhead, so how do we gain the advantage?

Please share your experiences on this !

As for choosing between Memcache and Redis: both are capable of tens of thousands of requests per second on low-end hardware (e.g. 80,000 req/s for Redis on a C2D Q8300), with latencies well below 1 ms. You say you'll be doing something on the order of 20 requests a second, so performance-wise it's really a non-issue.

If you choose dump.py option, you don't need to restart Django to reload. You can make your own simple reloader:

dump.py:

[ dict code...]

mtime = 0

django code:

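import os
import dump  # this does nothing if it's already loaded

# dump_filename is the path to dump.py on disk
stat = os.stat(dump_filename)
if stat.st_mtime > dump.mtime:
    reload(dump)  # builtin in Python 2; use importlib.reload in Python 3
    dump.mtime = stat.st_mtime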
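
The same modification-time check also works for the JSON option from the question, which avoids reloading a Python module altogether. A minimal sketch, assuming the data lives in a JSON file; the class name ReloadingDict is hypothetical:

```python
import json
import os


class ReloadingDict(object):
    """Read-only view of a JSON file that reloads when the file changes."""

    def __init__(self, path):
        self._path = path
        self._mtime = None   # modification time of the copy held in memory
        self._data = {}

    def _refresh(self):
        mtime = os.stat(self._path).st_mtime
        if mtime != self._mtime:
            # File is new or has changed since the last load: read it again.
            with open(self._path) as handle:
                self._data = json.load(handle)
            self._mtime = mtime

    def get(self, key, default=None):
        self._refresh()
        return self._data.get(key, default)
```

Every lookup costs one os.stat() call. If that is too much, the check could be limited to once every few seconds, at the price of serving slightly stale data.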

Interpreting Django Source Code

4 votes

I was looking through some of the Django source code and came across this. What exactly does: encoding = property(lambda self: self.file.encoding) do?

There's nothing wrong with the other two answers, but they might be a little high-level. So here's the 101 version:

lambda

Although it's in their documentation for C#, I think Microsoft actually has the best explanation of the concept of lambda:

A lambda expression is an anonymous function that can contain expressions and statements

Most people without an official CS degree trip over lambda, but when you think of it as simply an "anonymous function", I think it becomes much easier to understand. The format for lambda in Python is:

lambda [argument]: [expression]

Where [argument] can be nothing, a single argument or a comma-delimited list of arguments and [expression] is essentially the method body. That's why @Jordan said the code you mentioned is roughly the equivalent of:

def encoding(self):
    return self.file.encoding

self is the argument passed into the method and the return value of the method (self.file.encoding) is the expression.

property

The property method allows you to create "getters" and "setters", basically, for an attribute on a class. In traditional OOP, "members", or the attributes of a class, are usually set as protected or private -- you never actually access the attribute directly. Instead, you access methods that in turn retrieve or manipulate the attribute. Chief among those would be the getter and the setter. As their names pretty much describe, they are methods that get and set the value of an attribute, respectively.

Now, Python OOP doesn't really have a concept of protected or private attributes in the truest sense. You are free to follow the rules, but there's nothing stopping you from accessing anything you want on a class. So, getters and setters are most normally, in Python, used in conjunction with property to "fake" an attribute, for lack of a better word. For example:

def get_foo(self):
    return self.bar

def set_foo(self, value):
    self.bar = value

foo = property(get_foo, set_foo)

With that I can now do things like instance.foo (no parentheses) and instance.foo = 'something'. And it works just as if foo were a regular attribute on the class.

In the code you mention, they're only setting a getter, but it works the same. encoding will act like an attribute on the class and returns the value of file.encoding.
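
Putting the two pieces together, here is a small self-contained sketch of the same pattern outside Django. FakeFile and both wrapper classes are hypothetical names made up for the illustration:

```python
class FakeFile(object):
    """Stand-in for a real file object; only the attribute we need."""
    encoding = "utf-8"


class LambdaWrapper(object):
    def __init__(self, file):
        self.file = file

    # Getter-only property built from an anonymous function.
    encoding = property(lambda self: self.file.encoding)


class ExplicitWrapper(object):
    def __init__(self, file):
        self.file = file

    def _get_encoding(self):
        return self.file.encoding

    # The same property, built from a named getter instead of a lambda.
    encoding = property(_get_encoding)


wrapper = LambdaWrapper(FakeFile())
print(wrapper.encoding)   # read like a plain attribute, no parentheses
```

Because no setter is supplied, assigning to wrapper.encoding raises an AttributeError in both versions.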

Python GIL: is django save() blocking?

4 votes

My django app saves django models to a remote database. Sometimes the saves are bursty. In order to free the main thread (thread_A) of the application from the time toll of saving multiple objects to the database, I thought of transferring the model objects to a separate thread (thread_B) using collections.deque and have thread_B save them sequentially.

Yet I'm unsure regarding this scheme. save() returns the id of the new database entry, so it "ends" only after the database responds, which is at the end of the transaction.

Does django.db.models.Model.save() really block GIL-wise and release other python threads during the transaction?

Django's save() does nothing special to the GIL. In fact, there is hardly anything you can do with the GIL in Python code -- when it is executed, the thread must hold the GIL.

There are only two ways the GIL could get released in save():

  • Python decides to switch threads (after sys.getcheckinterval() instructions)
  • Django calls a database interface routine that is implemented to release the GIL

The second point could be what you are looking for -- a SQL COMMIT is executed and during that execution, the SQL backend releases the GIL. However, this depends on the SQL interface, and I'm not sure if the popular ones actually release the GIL*.

Moreover, save() does a lot more than just running a few UPDATE/INSERT statements and a COMMIT; it does a lot of bookkeeping in Python, where it has to hold the GIL. In summary, I'm not sure that you will gain anything from moving save() to a different thread.


UPDATE: From looking at the sources, I learned that both the sqlite module and psycopg do release the GIL when they are calling database routines, and I guess that other interfaces do the same.
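
For reference, the hand-off scheme from the question can be sketched as follows. This is only an illustration: it swaps collections.deque for queue.Queue (so the worker can block while idle), the function names are hypothetical, and `save` stands in for whatever callable performs the actual obj.save():

```python
import queue
import threading

_STOP = object()   # sentinel telling the worker to exit


def start_saver(save):
    """Start thread_B; returns the queue to feed and the thread itself."""
    pending = queue.Queue()

    def worker():
        while True:
            obj = pending.get()      # blocks without burning CPU
            if obj is _STOP:
                break
            save(obj)                # in Django this would call obj.save()

    thread = threading.Thread(target=worker)
    thread.daemon = True
    thread.start()
    return pending, thread


def stop_saver(pending, thread):
    """Ask the worker to finish the queued saves and wait for it."""
    pending.put(_STOP)
    thread.join()
```

The main thread only pays for pending.put(obj). Whether the overall throughput improves still depends on how much of save() runs while holding the GIL, as discussed above.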

Python database WITHOUT using Django (for Heroku)

4 votes

To my surprise, I haven't found this question asked elsewhere. Short version, I'm writing an app that I plan to deploy to the cloud (probably using Heroku), which will do various web scraping and data collection. The reason it'll be in the cloud is so that I can have it be set to run on its own every day and pull the data to its database without my computer being on, as well as so the rest of the team can access the data.

I used to use AWS's SimpleDB and DynamoDB, but I found SDB's storage limitations to be too small and DDB's poor querying ability to be a problem, so I'm looking for a database system (SQL or NoSQL) that can store arbitrary-length values (and ideally arbitrary data structures) and that can be queried on any field.

I've found many database solutions for Heroku, such as ClearDB, but all of the information I've seen has shown how to set up Django to access the database. Since this is intended to be a script and not a site, I'd really prefer not to dive into Django if I don't have to.

Is there any kind of database that I can hook up to in Heroku with Python without using Django?

I'd use MongoDB. Heroku has support for it, so I think it will be really easy to start and scale out: https://addons.heroku.com/mongohq

About Python: MongoDB is a really easy database. The schema is flexible and fits really well with Python dictionaries. That's something really good.

You can use PyMongo

import datetime

# MongoClient replaces the older pymongo.Connection class
from pymongo import MongoClient
connection = MongoClient()

# Get your DB
db = connection.my_database

# Get your collection
cars = db.cars

# Create some objects
car = {"brand": "Ford",
       "model": "Mustang",
       "date": datetime.datetime.utcnow()}

# Insert it (insert_one replaces the older insert method)
cars.insert_one(car)

Pretty simple, uh?

Hope it helps.

EDIT:

As Endophage mentioned, another good option for interfacing with Mongo is mongoengine. If you have lots of data to store, you should take a look at that.

Why accept kwargs but not use them?

4 votes

I was looking at the Django source code today and I noticed this:

class DjangoTestSuiteRunner(object):
    def __init__(self, verbosity=1, interactive=True, failfast=True, **kwargs):
        self.verbosity = verbosity
        self.interactive = interactive
        self.failfast = failfast

Why would they accept kwargs in the constructor but then not do anything with them?

This pattern can make backwards/forwards compatibility easier. If a newer or older version of the code takes more or fewer parameters, you won't break everything.

Also, when you are inheriting this class (for example with mixins) it can be convenient to just accept everything.

Imho it's not a pretty pattern to use, but it works.
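
A small sketch of both points, using made-up classes rather than the real Django runner. BaseRunner, ReportingRunner, report_path and some_future_option are all hypothetical:

```python
class BaseRunner(object):
    def __init__(self, verbosity=1, interactive=True, **kwargs):
        # Unknown keyword arguments are accepted and silently ignored.
        self.verbosity = verbosity
        self.interactive = interactive


class ReportingRunner(BaseRunner):
    def __init__(self, report_path=None, **kwargs):
        # Take the option this subclass cares about, pass the rest along.
        super(ReportingRunner, self).__init__(**kwargs)
        self.report_path = report_path


# A caller written against a newer API can pass an option that this
# version does not know about without raising a TypeError.
runner = ReportingRunner(verbosity=2, report_path="out.xml",
                         some_future_option=True)
```

The downside is the one hinted at above: a misspelled keyword argument is swallowed just as quietly as an unknown one.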