Best database questions in August 2010

How should I represent a bit flags int field in django admin?

10 votes

I have a data model with a bitfield defined something like this:

alter table MemberFlags add column title varchar(50) not null default '';
alter table MemberFlags add column value integer( 3) not null default 0;

insert into MemberFlags (title, value) values
    ("Blacklisted",             1),
    ("Special Guest",           2),
    ("Attend Ad-hoc Sessions",  4),
    ("Attend VIP Sessions",     8),
    ("Access Facility A",      16),
    ("Access Facility B",      32)

And used like this:

alter table Membership add column title varchar(50) not null default '';
alter table Membership add column flags integer( 3) not null default 0;

insert into Membership (title, flags) values
    ("Guest Pass",          4+2 ),
    ("Silver Plan",    16+  4   ),
    ("Gold Plan",   32+16+  4+2 ),
    ("VIP Pass",    32+16+8+4+2 )

My questions are:

A) What's the easiest way to represent the different bitflags as separate items in the admin site? Should I override the template, or do something with forms?

B) How about the search list? I could create functions in the model to represent each bit, but how would searching and sorting be done?

I'm new to Django.

Working off the snippet in Andrew's answer, here are the changes you'd need to make:

from django.db import models
from django import forms
from django.utils.text import capfirst  # used in BitFlagField.formfield()

class BitFlagFormField(forms.MultipleChoiceField):
    widget = forms.CheckboxSelectMultiple

    def __init__(self, *args, **kwargs):
        super(BitFlagFormField, self).__init__(*args, **kwargs)

class BitFlagField(models.Field):
    __metaclass__ = models.SubfieldBase

    def get_internal_type(self):
        # must name a built-in field class so the column type can be looked up
        return "IntegerField"

    def get_choices_default(self):
        return self.get_choices(include_blank=False)

    def _get_FIELD_display(self, field):
        value = getattr(self, field.attname)
        choicedict = dict(field.choices)

    def formfield(self, **kwargs):
        # don't call super, as that overrides default widget if it has choices
        defaults = {'required': not self.blank, 'label': capfirst(self.verbose_name), 
                    'help_text': self.help_text, 'choices':self.choices}
        if self.has_default():
            defaults['initial'] = self.get_default()
        defaults.update(kwargs)
        return BitFlagFormField(**defaults)

    def get_db_prep_value(self, value):
        if isinstance(value, int):
            return value
        elif isinstance(value, list):
            return sum(value)

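    def to_python(self, value):
        # SubfieldBase also calls this on assignment, so a list may arrive here
        if isinstance(value, list):
            return sorted(value)
        result = []
        n = 1
        while value > 0:
            if (value % 2) > 0:
                result.append(n)
            n *= 2
            value //= 2  # floor division keeps value an integer
        return sorted(result)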


    def contribute_to_class(self, cls, name):
        super(BitFlagField, self).contribute_to_class(cls, name)
        if self.choices:
            func = lambda self, fieldname = name, choicedict = dict(self.choices):" and ".join([choicedict.get(value,value) for value in getattr(self,fieldname)])
            setattr(cls, 'get_%s_display' % self.name, func)
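
The integer/list conversion at the heart of that field does not depend on Django. As a minimal sketch in plain Python, using the flag values from the question (the helper names split_flags, join_flags and has_flag are made up for illustration and are not part of the answer's code):

```python
# Flag values taken from the MemberFlags table in the question.
FLAGS = {
    1: "Blacklisted",
    2: "Special Guest",
    4: "Attend Ad-hoc Sessions",
    8: "Attend VIP Sessions",
    16: "Access Facility A",
    32: "Access Facility B",
}


def split_flags(value):
    """Return the individual powers of two that are set in value."""
    result = []
    bit = 1
    while value > 0:
        if value & 1:
            result.append(bit)
        bit <<= 1
        value >>= 1
    return result


def join_flags(bits):
    """Combine a list of distinct flag values back into one integer."""
    total = 0
    for bit in bits:
        total |= bit
    return total


def has_flag(value, flag):
    """Return True if the given flag bit is set in value."""
    return (value & flag) == flag


silver_plan = 16 + 4
print([FLAGS[bit] for bit in split_flags(silver_plan)])
```

For searching, the same test can usually be pushed into the query as a bitwise condition such as flags & 4 = 4, although the exact operator varies between SQL dialects and such a condition generally cannot use an ordinary index.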

Database Access in Android

9 votes

I am creating an android app that is basically a listing of information on Mushrooms. I get this information from an sqlite database. I have a global singleton with a services class inside it in which I use to access my db. Almost every activity accesses the db. Is it better to leave my db open all the time or open and close it as I need the data?

If the best practice is to leave it open all the time, where do I need to make sure to close it and what is the worst case scenario if I left it open when the activity was destroyed?

Based on my past experience in Java, I would say it is better to close the connection. It probably doesn't matter in a small Android application, but if you have 10 applications running and all of them access the database, you have 10 pending connections. Start a few more and sooner or later another application will have to wait because the SQL server can't handle any more requests.

I guess you could think of it as a file on your computer. You read data from it, and then close it when you're done. Why keep a file open in your application?

Now, I'm very new to Android programming, so I haven't got around to implementing database calls yet. But when I faced the same problem in a Java application a few years ago, I implemented a database object that held the connection to the database. "Everyone else" (the classes) had to call the database object (singleton or final methods) to get data, sort of like stored procedures but in the application instead.

Because of this I knew when the calls were made and when they stopped. I then put in a timeout, so that if nothing happened for a few minutes, I would close the connection to the db. (This also took care of some timeout exceptions, because the timeout of the connection itself would never be reached.) When a new call came in, I could easily open a new connection and use that.

Basically I abstracted away SQL calls by having methods such as public Fungus[] getAllFungus() and public Fungus[] getFilteredFungus(String where).
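
The idea of one database object that opens lazily and closes after an idle period can be sketched as follows. This is only an illustration in Python with the standard sqlite3 module, not Android code: the class name FungusStore, the fungus table with a name column, the 120-second default and the injectable clock are all assumptions. The filter method takes a name prefix and binds it as a parameter instead of accepting a raw where string.

```python
import sqlite3
import time


class FungusStore:
    """Single access point for the database; opens lazily, closes when idle."""

    def __init__(self, path, idle_timeout=120.0, clock=time.monotonic):
        self._path = path
        self._idle_timeout = idle_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._conn = None
        self._last_used = 0.0

    def _connection(self):
        # Open on first use (or after an idle close) and record the access time.
        if self._conn is None:
            self._conn = sqlite3.connect(self._path)
        self._last_used = self._clock()
        return self._conn

    def get_all_fungus(self):
        return self._connection().execute(
            "SELECT name FROM fungus ORDER BY name"
        ).fetchall()

    def get_filtered_fungus(self, name_prefix):
        # Bind the filter as a parameter rather than splicing in raw SQL.
        return self._connection().execute(
            "SELECT name FROM fungus WHERE name LIKE ? ORDER BY name",
            (name_prefix + "%",),
        ).fetchall()

    def close_if_idle(self):
        """Call periodically, e.g. from a timer; closes a stale connection."""
        idle_for = self._clock() - self._last_used
        if self._conn is not None and idle_for >= self._idle_timeout:
            self._conn.close()
            self._conn = None

    @property
    def is_open(self):
        return self._conn is not None
```

Callers never see the connection, so the store alone decides when to open and close it; the next call after an idle close simply reconnects.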

Is DbContext the same as DataContext?

9 votes

I'm following a tutorial by Scott Gu that refers to a class named DbContext. I can't find it in any namespace in framework 4, and it seems to me it was renamed from CTP4 DbContext to .NET 4 System.Data.Linq.DataContext. Is my assumption correct?

DbContext is a new class that was added in the recent separate download by the EF team. It is currently not part of the core EF 4.0. However, going forward DbContext will be the preferred way to interact with EF. So how is it different from ObjectContext? Semantically they are exactly the same, but DbContext removes a lot of the extra noise that ObjectContext had. Exposing a set, for instance, required more work:

public ObjectSet<Customer> Customers
{
    get { return db.CreateObjectSet<Customer>(); }
}

In DbContext you can simply write public DbSet<Customer> Customers { get; set; }

Basically, on the ObjectContext, when you type a dot (.), everything is right there, which makes the list pretty huge. What the EF team actually wanted to expose on DbContext are the entities that are specific to your domain, with the rest of the framework's abilities tucked away under different properties. It just makes the programming experience easier. This means that if you are using ObjectContext right now, you can easily move to DbContext with a little bit of code.

Comparing SIFT features stored in a mysql database

9 votes

I'm currently extending an image library used to categorize images, and I want to find duplicate images, transformed images, and images that contain or are contained in other images.
I have tested the SIFT implementation from OpenCV and it works very well, but it would be rather slow for multiple images. To speed it up I thought I could extract the features and save them in a database, as a lot of other image-related metadata is already being held there.

What would be the fastest way to compare the features of a new image to the features in the database?
Usually comparison is done by calculating the Euclidean distance using kd-trees, FLANN, or with the Pyramid Match Kernel that I found in another thread here on SO but haven't looked into much yet.

Since I don't know of a way to save and search a kd-tree in a database efficiently, I'm currently only seeing three options:
* Let MySQL calculate the euclidean distance to every feature in the database, although I'm sure that that will take an unreasonable time for more than a few images.
* Load the entire dataset into memory at the beginning and build the kd-tree(s). This would probably be fast, but very memory intensive. Plus all the data would need to be transferred from the database.
* Saving the generated trees into the database and loading all of them. This would be the fastest method but would also generate high amounts of traffic, as the kd-trees would have to be rebuilt and sent to the server whenever new images arrive.

I'm using the SIFT implementation of OpenCV, but I'm not dead set on it. If there is a feature extractor more suitable for this task (and roughly equally robust) I'm glad if someone could suggest one.

So I basically did something very similar to this a few years ago. The algorithm you want to look into was proposed a few years ago by David Nister, the paper is: "Scalable Recognition with a Vocabulary Tree". They pretty much have an exact solution to your problem that can scale to millions of images.

Here is a link to the abstract; you can find a download link by googling the title. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1641018

The basic idea is to build a tree with a hierarchical k-means algorithm to model the features and then leverage the sparse distribution of features in that tree to quickly find your nearest neighbors... or something like that, it's been a few years since I worked on it. You can find a PowerPoint presentation on the author's webpage here: http://www.vis.uky.edu/~dnister/Publications/publications.html

A few other notes:

  • I wouldn't bother with the pyramid match kernel, it's really more for improving object recognition than duplicate/transformed image detection.

  • I would not store any of this feature stuff in an SQL database. Depending on your application it is sometimes more effective to compute your features on the fly since their size can exceed the original image size when computed densely. Histograms of features or pointers to nodes in a vocabulary tree are much more efficient.

  • SQL databases are not designed for doing massive floating point vector calculations. You can store things in your database, but don't use it as a tool for computation. I tried this once with SQLite and it ended very badly.

  • If you decide to implement this, read the paper in detail and keep a copy handy while implementing it, as there are many minor details that are very important to making the algorithm work efficiently.
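
To make the "histograms of features" remark concrete, here is a heavily simplified sketch. It is not the algorithm from the paper: it uses a flat, hand-picked vocabulary instead of a hierarchical k-means tree, applies no weighting, and works on toy 2-D points rather than 128-dimensional SIFT descriptors. All function names and data are made up for illustration, and math.dist needs Python 3.8 or later.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary centre closest to descriptor (Euclidean)."""
    return min(
        range(len(vocabulary)),
        key=lambda i: math.dist(descriptor, vocabulary[i]),
    )


def word_histogram(descriptors, vocabulary):
    """Count how many descriptors fall on each visual word."""
    return Counter(nearest_word(d, vocabulary) for d in descriptors)


def cosine_similarity(hist_a, hist_b):
    """Similarity of two sparse histograms; 0.0 if either one is empty."""
    dot = sum(count * hist_b.get(word, 0) for word, count in hist_a.items())
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-D "descriptors"; real SIFT descriptors have 128 dimensions.
vocabulary = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
image_a = [(0.5, 0.2), (9.0, 1.0), (9.5, -0.5)]
image_b = [(0.1, 0.4), (10.2, 0.3), (8.8, 0.9)]   # near-duplicate of image_a
image_c = [(0.3, 9.0), (1.0, 11.0), (-0.5, 10.0)]  # unrelated image

hist_a = word_histogram(image_a, vocabulary)
hist_b = word_histogram(image_b, vocabulary)
hist_c = word_histogram(image_c, vocabulary)
print(cosine_similarity(hist_a, hist_b), cosine_similarity(hist_a, hist_c))
```

Each image is reduced to a small sparse histogram, which is cheap to store and compare; the raw descriptors never need to live in the SQL database.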

Data object storage - Can table JOIN's do what single table SELECT's cannot?

8 votes

Now that "NOSQL" or "object only" storage systems like MongoDB or memcached are really picking up steam, I was wondering whether there are any requests that cannot be performed on them but can be performed using multiple object joins (in SQL, that is JOIN "table"). In other words, are there any multi-table queries that cannot be handled by several single-table queries in a row?

Basically, is there a use case where a multi-table join cannot be replicated by accessing one table at a time in object-based storage systems?

Here are some examples of normal 3NF queries using has_many and has_many_through relations. These aren't the most complex queries, but they should give you a starting point for the concept. Note that any value in {} means a value from the result of the previous query.


Company Has Many Users

SELECT user.*, company.name as company_name FROM user 
LEFT JOIN company ON company.id = user.company_id
WHERE user.id = 4

vs

SELECT * FROM user WHERE id = 4
SELECT * FROM company WHERE id = {user.company_id}

Club Has Many Students Through Memberships

SELECT student.* FROM student LEFT JOIN membership ON
membership.student_id = student.id WHERE membership.club_id = 5

vs

SELECT * FROM membership WHERE club_id = 5
SELECT * FROM student WHERE id = {membership.student_id}

The reason I'm wondering is because I want to know if Object-based systems (that rely on accessing single table objects at a time) can do what RDBMS databases like PostgreSQL or MySQL can do.

So far the only thing wrong seems to be that more queries are necessary.
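
The first pair of queries above can be tried side by side. This is a small sketch using Python's built-in sqlite3 module with an in-memory database; the table contents ('Acme', 'alice') are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, company_id INTEGER);
    INSERT INTO company VALUES (1, 'Acme');
    INSERT INTO user VALUES (4, 'alice', 1);
""")

# One round trip: the database matches the rows itself.
joined = conn.execute(
    "SELECT user.name, company.name FROM user "
    "LEFT JOIN company ON company.id = user.company_id "
    "WHERE user.id = ?",
    (4,),
).fetchone()

# Two round trips: the application carries company_id between the queries.
user_name, company_id = conn.execute(
    "SELECT name, company_id FROM user WHERE id = ?", (4,)
).fetchone()
(company_name,) = conn.execute(
    "SELECT name FROM company WHERE id = ?", (company_id,)
).fetchone()

print(joined, (user_name, company_name))
```

Both approaches return the same data here. The difference is that the second one runs as two separate statements, so nothing guarantees the company row is unchanged between them unless the application adds its own transaction handling.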

1 - running multiple separate queries leaves you with a concurrency mess - by the time you get something from table 1 it could have been deleted while it is still referenced in table 2 - now assume 5 correlated tables.

2 - running queries with at least moderately complex logic over fields that are not mythical ID

3 - controlling the amount of data fetched (you hardly ever need more than 50% of the data which is needed to deserialize/create valid objects and even worse whole trees of connected objects)

4 - correlated queries (nested selects) which SQL server will optimize like joins to additive complexity or better (|T1|+|T2|+|T3|+|T4|) while any ORM or nonSQL will have to keep repeating inner queries and giving rise to multiplicative complexity (|T1|*|T2|*|T3|*|T4|)

5 - dataset sizes, scalability not just in dataset sizes but also in handling concurrency under updates. Even ORM-s which maintain transactions make them so long that chances for deadlocks increase exponentially.

6 - blind updates (a lot more data touched for no reason) and their dependency and failure based on a blind instrument (mythical version which is realistically needed in say 1% of relational data model but ORM and alikes have to have it everywhere)

7 - lack of any standards and compatibility - this means that your system and data will always be at much higher risk and dependent on software changes driven by academic adventurism rather than any actual business responsibility and with expectation to invest a lot of resources just in testing changes.

8 - data integrity - oops some code just deleted half of today's order records from T1 since there was no foreign key to T2 to stop it. Perfectly normal thing to do with separated queries.

9 - negative maturity trend - keeps splintering instead of standardizing - give it 20 yr and maybe it will get stable

Last but not least - it doesn't reduce any complexity (the same correlation between data is still there) but it makes it very hard to track and manage complexity or have any realistic remedy or transparency when something goes wrong. And it adds the complexity of 1-2 layers. If something goes wrong in your SQL tables you have tools and queries to discover and even fix your data. What are you going to do when some ORM just tells you that it has "invalid pointer" and throws exception since surely you don't want "invalid object" ?

I think that's enough :-)
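
Point 8 is easy to demonstrate. The following sketch uses Python's sqlite3 module with made-up orders and order_lines tables; note that SQLite only enforces foreign keys after the PRAGMA shown, whereas most server databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_lines (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id)
    );
    INSERT INTO orders VALUES (1);
    INSERT INTO order_lines VALUES (10, 1);
""")

try:
    conn.execute("DELETE FROM orders WHERE id = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    # The order still has lines pointing at it, so the database refuses.
    delete_blocked = True

remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(delete_blocked, remaining)
```

With independent single-table stores there is no equivalent check unless the application implements one itself.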

Are all modern RDBMS row oriented? Why?

8 votes

If one of the relational database paradigms is to be tuple oriented, then that is the biggest limitation we have.

If one could design a column-oriented DB, that would improve performance a lot. Vector operations would work out of the box, as would indexing, hashing for simple symbol column lookups, and linked lists behind the scenes as the engine.

Memory mapping: dumps in huge chunks in microseconds as well as loading those disk images.
And one could still use a well-understood and standard language (SQL) that multiple vendors support.
Imagine how many tools could be designed for interfacing that thing, because of its simplicity.
Wouldn't it be more robust (and KISS at the same time)?

UPDATE
Thanks to all contributors.
The question has been unjustly closed, though I've found all your answers very informative.

Are all modern RDBMS row oriented?

No. They're designed for specific tasks, say OLTP vs OLAP. Even the popular ones like MySQL have column-store engines (ex: Infobright). And there are DBMS's that are built as a column-oriented DB from the ground up as well.

Here's a potentially interesting read for you: C-Store: A Column-oriented DBMS (PDF format)

LucidDB is a popular column-oriented database for data warehousing and BI:

LucidDB is the first and only open-source RDBMS purpose-built entirely for data warehousing and business intelligence. It is based on architectural cornerstones such as column-store, bitmap indexing, hash join/aggregation, and page-level multiversioning. Most database systems (both proprietary and open-source) start life with a focus on transaction processing capabilities, then get analytical capabilities bolted on as an afterthought (if at all). By contrast, every component of LucidDB was designed with the requirements of flexible, high-performance data integration and sophisticated query processing in mind. Moreover, comprehensiveness within the focused scope of its architecture means simplicity for the user: no DBA required.

See its list of features for those that overlap with your interests here: LucidDB Features

And one could still use a well-understood and standard language (SQL) that multiple vendors support.

You can use SQL with LucidDB.

In database design what do "n:m" and "1:n" mean?

7 votes

Title says it all :)

Thanks ;)

m:n is used to denote a many-to-many relationship (m objects on one side related to n on the other), while 1:n refers to a one-to-many relationship (1 object on one side related to n on the other).
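
In table terms, a 1:n relationship usually becomes a foreign key column on the "n" side, while an m:n relationship needs a separate junction table. A small sketch with Python's sqlite3 module (all table names and rows are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 1:n, one company has many employees; each employee has one company.
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        name TEXT,
        company_id INTEGER REFERENCES company(id)
    );

    -- m:n, a student joins many clubs and a club has many students,
    -- so the pairs live in a separate junction table.
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE club (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE membership (
        student_id INTEGER REFERENCES student(id),
        club_id INTEGER REFERENCES club(id),
        PRIMARY KEY (student_id, club_id)
    );

    INSERT INTO company VALUES (1, 'Acme');
    INSERT INTO employee VALUES (1, 'Ann', 1), (2, 'Bob', 1);
    INSERT INTO student VALUES (1, 'Cy'), (2, 'Di');
    INSERT INTO club VALUES (1, 'Chess'), (2, 'Go');
    INSERT INTO membership VALUES (1, 1), (1, 2), (2, 1);
""")

employees_of_acme = conn.execute(
    "SELECT name FROM employee WHERE company_id = 1 ORDER BY name"
).fetchall()
clubs_of_cy = conn.execute(
    "SELECT club.name FROM club "
    "JOIN membership ON membership.club_id = club.id "
    "WHERE membership.student_id = 1 ORDER BY club.name"
).fetchall()
print(employees_of_acme, clubs_of_cy)
```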

What database does Facebook use?

7 votes

Does it use any of the standard ones like Oracle, DB2, SQL Server, or have something of their own?

Considering the type of data (text + images + videos) that they have to manage, it would be interesting to know how they deal with it.

Is this information publicly available? Any links would also be helpful.

It should be no surprise that a site as high-scale as Facebook uses a variety of data management technology. Each database product has its strengths, and Facebook needs all of them.

They have also changed their data management from time to time, as they find solutions that meet their needs.

According to Exploring the software behind Facebook, the world’s largest site (2010/6/18):

  • MySQL
  • Memcached
  • Haystack for photo retrieval
  • Cassandra
  • Hadoop and Hive
  • Scribe for high-speed distributed logging

Keeping page changes history. A bit like SO does for revisions.

7 votes

I have a CMS system that stores data across tables like this:

Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+

Entries META Table
+----+----------+-------+-------+
| id | entry_id | value | param |
+----+----------+-------+-------+

Files Table
+----+----------+----------+
| id | entry_id | filename |
+----+----------+----------+

Entries-to-Tags Table
+----+----------+--------+
| id | entry_id | tag_id |
+----+----------+--------+

Tags Table
+----+-----+
| id | tag |
+----+-----+

I am trying to implement a revision system, a bit like SO has. If I were just doing it for the Entries Table, I was planning to just keep a copy of all changes to that table in a separate table. As I have to do it for at least 4 tables (the TAGS table doesn't need to have revisions), this doesn't seem like an elegant solution at all.

How would you guys do it?

Please notice that the Meta Tables are modeled in EAV (entity-attribute-value).

Thank you in advance.

Hi, I am currently working on a solution to a similar problem. I am solving it by splitting my tables into two: a control table and a data table. The control table contains a primary key and a reference into the data table; the data table contains an auto-increment revision key and the control table's primary key as a foreign key.

taking your entries table as an example

Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+

becomes

entries             entries_data
+----+----------+   +----------+----+--------+------+--------+--------+
| id | revision |   | revision | id |  title | text | index1 | index2 |
+----+----------+   +----------+----+--------+------+--------+--------+

to query

select * from entries join entries_data on entries.revision = entries_data.revision;

Instead of updating the entries_data table, you insert a new row into it and then update the entries table's revision with the new revision from the entries_data table.

The advantage of this system is that you can move to different revisions simply by changing the revision property within the entries table. The disadvantage is that you need to update your queries. I am currently integrating this into an ORM layer so the developers don't have to worry about writing SQL anyway. Another idea I am toying with is a centralised revision table which all the data tables use. This would allow you to describe the state of the database with a single revision number, similar to how Subversion revision numbers work.
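
The control-table/data-table split can be sketched end to end with Python's sqlite3 module. This is only an illustration of the scheme described above: the helper names save_revision and current are invented, and the columns are cut down to title and text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, revision INTEGER);
    CREATE TABLE entries_data (
        revision INTEGER PRIMARY KEY AUTOINCREMENT,
        id INTEGER NOT NULL REFERENCES entries(id),
        title TEXT,
        text TEXT
    );
""")


def save_revision(conn, entry_id, title, text):
    """Insert a new data row and point the control row at it."""
    with conn:  # both statements commit together or not at all
        conn.execute("INSERT OR IGNORE INTO entries (id) VALUES (?)", (entry_id,))
        cur = conn.execute(
            "INSERT INTO entries_data (id, title, text) VALUES (?, ?, ?)",
            (entry_id, title, text),
        )
        conn.execute(
            "UPDATE entries SET revision = ? WHERE id = ?",
            (cur.lastrowid, entry_id),
        )
        return cur.lastrowid


def current(conn, entry_id):
    """Return (title, text) of the revision the control row points at."""
    return conn.execute(
        "SELECT entries_data.title, entries_data.text FROM entries "
        "JOIN entries_data ON entries.revision = entries_data.revision "
        "WHERE entries.id = ?",
        (entry_id,),
    ).fetchone()


first = save_revision(conn, 1, "Hello", "first draft")
second = save_revision(conn, 1, "Hello", "second draft")
print(current(conn, 1))

# Reverting is just repointing the control row at an older revision.
with conn:
    conn.execute("UPDATE entries SET revision = ? WHERE id = 1", (first,))
print(current(conn, 1))
```

Old rows in entries_data are never modified, so the full history stays available and a revert is a single UPDATE on the control table.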

How to develop a web application compatible with multiple database management systems

7 votes

How do you design and manage the development of a web application that should be compatible with multiple database management systems, such as Oracle and MS SQL Server?

If you can't use ORM like NHibernate or EF, how do you maintain database schemas during the development?

My approach now is to have a development database on SQL Server and to port it to Oracle (with a tool) just before releasing a test patch, to test the software on both rdbms. (The tool also generates a file used by the application to upgrade the database)

Is it a good approach? What about a database project of Visual Studio, could it be a better way to keep my db schema?

EDIT: This question is not about designing the architecture of the application (I have already an abstract data access layer), but how to maintain database schemas for different kinds of rdbms during the development.

Model-driven architecture (MDA): use a generic Database modelling tool to design your database schema. You define the tables/relationships/primary keys/etc. in a generic fashion and then have the designer generate the necessary SQL script (most support output to a variety of databases). As you change the DB model, the tool will generate the necessary SQL code to update the database, or generate it from scratch. The tools also assist in generating documentation and assisting with database versioning, amongst many other things...

I use Context Database Designer and am exceptionally happy with the tool and price. Enterprise Architect also looks like an excellent tool, with the ability to generate and reverse-engineer code.

Is normalizing the gender table going too far?

6 votes

I am not a database guy, but am trying to clean up another database. So my question is would normalizing the gender table be going too far?

User table:
userid int pk,
genderid char(1) fk
etc...

gender table:
genderid char(1) pk,
gender varchar(20)

Now at first it seemed silly to me, but then I considered it because I can then have a constant data source to populate from or bind from. I will be using WPF. If it was another framework I would probably avoid it, but what do you think?

Whether or not you choose to normalize your table structure to accommodate gender is going to depend on the requirements of your application and your business requirements.

I would normalize if:

  • You want to be able to manage the "description" of a gender in the database, and not in code.
    • This allows you to quickly change the description from Man/Woman to Male/Female, for example.
  • Your application currently must handle, or will possibly handle in the future, localization requirements, i.e. being able to specify gender in different languages.
  • Your business requires that everything be normalized.

I would not normalize if:

  • You have a relatively simple application where you can easily manage the description of the gender in code rather than in the database.
  • You have tight programmatic control of the data going in and out of the gender field such that you can ensure consistency of the data in that field.
  • You only care about the gender field for information capture, meaning, you don't have a lot of programmatic need to update this field once it is set the first time.

Handling race condition in model.save() (Django)

6 votes

How should one handle a possible race condition in a model's save() method?

For example, the following example implements a model with an ordered list of related items. When creating a new Item the current list size is used as its position.

From what I can tell, this can go wrong if multiple Items are created concurrently.

class OrderedList(models.Model):
    # ....
    @property
    def item_count(self):
        return self.item_set.count()

class Item(models.Model):
    # ...
    name   = models.CharField(max_length=100)
    parent = models.ForeignKey(OrderedList)
    position = models.IntegerField()
    class Meta:
        unique_together = (('parent','position'), ('parent', 'name'))

    def save(self, *args, **kwargs):
        if not self.id:
            # use item count as next position number
            self.position = self.parent.item_count
        super(Item, self).save(*args, **kwargs)

I've come across @transaction.commit_on_success, but that seems to apply only to views. Even if it did apply to model methods, I still wouldn't know how to properly handle a failed transaction.

I am currently handling it like so, but it feels more like a hack than a solution:

def save(self, *args, **kwargs):
    while not self.id:
        try:
            self.position = self.parent.item_count
            super(Item, self).save(*args, **kwargs)
        except IntegrityError:
            # chill out, then try again
            time.sleep(0.5)

Any suggestions?

Update:

Another problem with the above solution is that the while loop will never end if IntegrityError is caused by a name conflict (or any other unique field for that matter).

For the record, here's what I have so far which seems to do what I need:

# needs: import random, time; from django.db import IntegrityError;
# from django.core.exceptions import ObjectDoesNotExist
def save(self, *args, **kwargs):
    # for object update, do the usual save
    if self.id:
        super(Item, self).save(*args, **kwargs)
        return

    # for object creation, assign a unique position
    while not self.id:
        try:
            self.position = self.parent.item_count
            super(Item, self).save(*args, **kwargs)
        except IntegrityError:
            try:
                rival = self.parent.item_set.get(position=self.position)
            except ObjectDoesNotExist:  # not a conflict on "position"
                raise IntegrityError
            else:
                time.sleep(random.uniform(0.5, 1))  # chill out, then try again

It may feel like a hack to you, but to me it looks like a legitimate, reasonable implementation of the "optimistic concurrency" approach -- try doing whatever, detect conflicts caused by race conditions, and if one occurs, retry a bit later. Some databases systematically use that instead of locking, and it can lead to much better performance except on systems under a lot of write load (which are quite rare in real life).

I like it a lot because I see it as a general case of the Hopper Principle: "it's easier to ask forgiveness than permission", which applies widely in programming (especially but not exclusively in Python -- the language Hopper is usually credited for is, after all, Cobol;-).

One improvement I'd recommend is to wait a random amount of time -- avoid a "meta-race condition" where two processes try at the same time, both find conflicts, and both retry again at the same time, leading to "starvation". time.sleep(random.uniform(0.1, 0.6)) or the like should suffice.

A more refined improvement is to lengthen the expected wait if more conflicts are met -- this is what is known as "exponential backoff" in TCP/IP (you wouldn't have to lengthen things exponentially, i.e. by a constant multiplier > 1 each time, of course, but that approach has nice mathematical properties). It's only warranted to limit problems for very write-loaded systems (where multiple conflicts during attempted writes happen quite often) and it may likely not be worth it in your specific case.
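
The two improvements, a random wait and a wait that grows with each conflict, can be combined in one small helper. This is a sketch under assumptions, not Django code: the Conflict exception stands in for IntegrityError, and retry_with_backoff, its parameters and the flaky_save demo are all made up for illustration. It also caps the number of attempts so the loop cannot run forever.

```python
import random
import time


class Conflict(Exception):
    """Stand-in for the database's IntegrityError in this sketch."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       factor=2.0, sleep=time.sleep):
    """Run operation(); on Conflict wait a randomised, growing delay and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Conflict:
            if attempt == max_attempts - 1:
                raise  # give up instead of looping forever
            # The upper bound grows by `factor` per failed attempt; the random
            # draw keeps two competing writers from retrying in lockstep.
            sleep(random.uniform(0, base_delay * factor ** attempt))


# Demo: an operation that collides twice before succeeding.
attempts = []
delays = []


def flaky_save():
    attempts.append(1)
    if len(attempts) < 3:
        raise Conflict()
    return "saved"


# Record the delays instead of actually sleeping, to keep the demo instant.
result = retry_with_backoff(flaky_save, sleep=delays.append)
print(result, len(attempts), len(delays))
```

In the save() method from the question, the operation would be the "compute position, then save" step, and only conflicts on the position should be translated into a retry.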

What is a good practice/tool for MSSQL database design and versioning?

6 votes

We work with a small development team and a system engineer who deploys the software at certain times.

At the moment we don't have any leading document for database changes. Changes are done in the staging database and added manually to production when the system engineer deploys a new release.

We started to work with DBDesigner Fork to create the design and generate the SQL create script, but we are still not happy with this solution. The solution is incomplete, because it's impossible to create views and stored procedures in DBDesigner Fork for example.

We would like to find a practice/tool to design the database, stored procedures, views, etc. and version the changes in Subversion.

When a new software release is created (tagged in SVN), we should be able to create a patch file for the database (a SQL script for example).

What is the best way to handle the issues described above?

We are a $2.5B solar manufacturing company that is using Visual Studio Database Edition for all our database management and versioning needs. It has been a very good tool for us. It version controls our databases with TFS integration, does schema and data compares (like Red-Gate), database validations and much more. We couldn't live without it now.

Django Deployment: Handling data in database

5 votes

Right now I am using git for Django deployment, which seems satisfying to me. My only problem is still how to handle the data in the database properly. E.g. I often need to edit data coming from the production site locally and put the data back on the production site (please note I'm talking about data changes and not schema migrations!). I think the workflow should be something like the following: Dump data on production site > download data > load data in db > make changes locally > dump data > make diff for data > upload diff & apply changes on production site.

Important to me would be that this also works for changes to existing database rows, deletions etc...

So if I start experimenting with that on my own: 1. Will this work with any of the data dump formats Django offers? 2. Is anybody else working like that, maybe with some (Fabric) script solutions for it ready already?

The tables I want to dump/change/restore are quite small and they are read-only via public interface. The following approach is used:

  1. The data is dumped with the ./manage.py dumpdata command on the server.
  2. The resulting file is committed to VCS on the server.
  3. I pull the changes and execute ./manage.py loaddata.
  4. After the changes are made, ./manage.py dumpdata is executed locally.
  5. The resulting file is committed to VCS and pushed back to the server.
  6. The ./manage.py loaddata command is executed on the server.

This can be automated with Fabric, e.g.

1 + 2 + 3 = fab dump_data:cities, 4+5+6 = fab push_data:cities

Diffs are produced internally by VCS. This approach won't work for everything but I found it useful for simple cases.