Best postgresql questions in October 2011

Is a 'blackhole' table evil?

14 votes

Reading to this question i've just learned the existence of the blackhole table trick: basically consist in using a single table to insert data, and then a trigger that split the data in many other tables.

Im wondering if this could cause problems, once the developers whos working on the project are aware of that.

What are the pro and cons of this tecnique?

Edit: The blink I got in mind when I saw the example, is about transactions: if for some reason the transaction fail, you'll find the blackhole row with the original data, for historical purpose and maybe a help with debug - but this seems to be the only +1 i can see with blackholes. Ideas?

I don't think blackhole has any real pros.

Writing the trigger code to move data around is probably not noticably less work than writing the code to insert the data in the right place in the first place.

As Christian Oudard writes, it doesn't reduce complexity - just moves it to a place where it's really hard to debug.

On the downside:

"Side effects" are usually a bad idea in software development. Triggers are side effects - I intend to do one thing (insert data in a table), and it actually does lots of other things. Now, when I'm debugging my code, I have to keep all the side effects in my head too - and the side effects could themselves have side effects.

most software spends far more time in maintenance than it does in development. Bringing new developers into the team and explaining the black hole trick is likely to increase the learning curve - for negligible benefit (in my view).

Because triggers are side effects, and it's relatively easy to set off a huge cascade of triggers if you're not careful, I've always tried to design my databases without a reliance on triggers; where triggers are clearly the right way to go, I've only let my most experienced developers create them. The black hole trick makes triggers into a normal, regular way of working. This is a personal point of view, of course.

SQL vs NOSQL: Which to use for this schema?

7 votes

I've got an upcoming project and I can't decide whether to stick with SQL or switch over to NoSQL. It's basically a reporting system with the main interface being reporting on the data entered in by users.

Here's the schema I've got mapped out:

enter image description here

Because this schema is so nested, I started thinking about NoSQL. With SQL, I'm afraid I'm going to have a crap-ton of joins to get to the bottom of the tree (the Record model).

My concerns, though, are two-fold:

  1. I'm only just starting to get into NoSQL and I'm worried my knowledge may limit me because of the tight timeframe.
  2. Although creating data at the bottom of the tree will probably be relatively simple, I'm worried that it may be hard to report on without getting into some heavy map/reduce stuff (that I have zero experience with)

My question: Given my concerns, do you think this schema -- because of how deeply nested it is -- lends itself more to NoSQL? If so, do you think the reporting on the "records" will be difficult?

I realize that it may be difficult to answer these questions without more info, so please let me know what other info may be helpful in coming up with an answer.

Thanks in advance for your help!

Just my opinion:

I Stared at diagram for approx 3 sec, this is clearly relational. Benefits of an RDBMS heavily outweigh a NoSQL solution here. Why would you want to use NoSQL? Are there 100,000+ records (may a million plus)? You need microsecond/millisecond performance?

NoSQL, as I understand, is not because you don't like lots of joins. It's because big systems for hierarchical data don't suit every situation. This suit this perfectly, however.

Django+Postgres: "current transaction is aborted, commands ignored until end of transaction block"

5 votes

I've started working on a Django/Postgres site. Sometimes I work in manage.py shell, and accidentally do some DB action that results in an error. Then I am unable to do any database action at all, because for any database action I try to do, I get the error:

current transaction is aborted, commands ignored until end of transaction block

My current workaround is to restart the shell, but I should find a way to fix this without abandoning my shell session.

(I've read this and this, but they don't give actionable instructions on what to do from the shell.)

You can try this:

from django.db import connection
connection._rollback()

The more detailed discussion of this issue can be found here: https://code.djangoproject.com/ticket/10813

approximate search in a database

4 votes

I have a large database with a list of institutions (universities, hospitals, etc). The names of institutions come from different sources and can be spelled differently for the same institution. They can be misspelled, for example, or words can be shortened ("uni", or "univ", or "university")

Given a name that I need to insert in to the database, is there a practical way to find if this institution is already in the database? This is not a research project, so I am looking for a solution that is reasonably fast.

I am using django and postgresql, but it does not matter I suppose.

This is the problem of record linkage. Many databases provide basic methods for this such as character-level n-gram matching, where a term like "university" is expanded into

["uni", "niv", "ive", "ver", "ers", ...]

for n = 3. The database would index all such n-grams and allow a search with some kind of weighted matching. pg_trgm seems to do exactly this, try it out.