Best database questions in October 2010

What are the Options for Storing Hierarchical Data in a Relational Database?

28 votes

Good Overviews

Generally speaking you're making a decision between fast read times (e.g. nested set) or fast write times (adjacency list). Usually you end up with a combination of the options below that best fit your needs. The following provides some in depth reading:

Options

Ones I am aware of and general features:

  1. Adjacency List:
    • Columns: ID, ParentID
    • Easy to implement.
    • Cheap node moves, inserts, and deletes.
    • Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Hierarchy combined with level column can solve), path (Lineage Column can solve).
    • Use Common Table Expressions in those databases that support them to traverse.
  2. Nested Set (a.k.a Modified Preorder Tree Traversal)
    • First described by Joe Celko - covered in depth in his book Trees and Hierarchies in SQL for Smarties
    • Columns: Left, Right
    • Cheap level, ancestry, descendants
    • Compared to Adjacency List, moves, inserts, deletes more expensive.
    • Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
  3. Nested Intervals
    • Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information.
  4. Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
    • Columns: ancestor, descendant
    • Stands apart from table it describes.
    • Can include some nodes in more than one hierarchy.
    • Cheap ancestry and descendants (albeit not in what order)
    • For complete knowledge of a hierarchy needs to be combined with another option.
  5. Flat Table
    • A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
    • Expensive move and delete
    • Cheap ancestry and descendants
    • Good Use: threaded discussion - forums / blog comments
  6. Lineage Column (a.k.a. Materialized Path, Path Enumeration)
    • Column: lineage (e.g. /parent/child/grandchild/etc...)
    • Limit to how deep the hierarchy can be.
    • Descendants cheap (e.g. LEFT(lineage, #) = '/enumerated/path')
    • Ancestry tricky (database specific queries)

Database Specific Notes

MySQL

Oracle

PostgreSQL

SQL Server

  • General summary
  • 2008 offers HierarchyId data type appears to help with Lineage Column approach and expand the depth that can be represented.

This is kind of a question that is still interesting even after all big 3 vendors implemented Recursive WITH clause. I'd suggest that different readers would be pleased with different answers.

  1. Comprehensive list of references by Troels Arvin (although it seems to be missing many recent fine articles mentioned in similar stackoverflow threads).
  2. For the lack of competition, introductory textbook by Joe Celko "Trees and Hierarchies in SQL for Smarties" can indeed be considered a classics.
  3. For mathematical sophistry and connections between various methods look up Tropashko publications.

How to implement "autoincrement" on Google AppEngine

10 votes

I have to label something in a "strong monotone increasing" fashion. Be it Invoice Numbers, shipping label numbers or the like.

  1. A number MUST NOT BE used twice
  2. Every number SHOULD BE used when exactly all smaller numbers have been used (no holes).

Fancy way of saying: I need to count 1,2,3,4 ... The number Space I have available are typically 100.000 numbers and I need perhaps 1000 a day.

I know this is a hard Problem in distributed systems and often we are much better of with GUIDs. But in this case for legal reasons I need "traditional numbering".

Can this be implemented on Google AppEngine (preferably in Python)?

If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.

Node.JS and MySQL drivers

9 votes

Is there a Node.JS Driver for MySQL that is commonly used other than http://github.com/masuidrive/node-mysql ? This driver is unstable (says the creator). It seems like there is not much activity with node.js database drivers. Is there a reason for this or is it just because Node.JS is so young?

Here are some options:

What is wrong with this MySQL Query?

8 votes

It's 12:30am and I have been coding for 9 hours straight. I really need to get this project done, but MySQL is messing with my deadline. Could you examine this snippet for me and see if you can find out what is wrong?

PHP/MySQL Query

$q = $this->db->query("SELECT * FROM bans WHERE ip='".$ip."'");

Keeps returning the following error...

MYSQL Error [Oct 6th, 2010 11:31pm CDT]
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '* FROM bans WHERE ip='206.53.90.231'' at line 1 (1064)

I do not see anything wrong with the query. I've even tried different methods of including the variable $ip but with no avail.

EDIT:
Just to add in here, the ip column in my database is a varchar(255).

EDIT 2:
Here is the whole affected code. Keep in mind that this is all in a class. If I'm missing something, let me know.

Line from another Function

if($this->isBanned($_SERVER['REMOTE_ADDR'])===true) { return json_encode(array('error'=>'You are banned from this ShoutBox.')); }

Affected Function

function isBanned($ip) {
    $q = $this->db->query("SELECT * FROM bans WHERE ip='".$ip."'"); $num = $this->db->affected_rows;
    if($num>0) { $row = $this->db->fetch_array($q); if(($row['expires'] < time()) && ($row['expires'] !== 0)) { $this->unbanUser($ip,'internal'); return false; } return true; } return false;
}

unbanUser function

function unbanUser($ip,$t='box') {
    $q = $this->db->query("SELECT * FROM bans WHERE ip='".$ip."'"); $num = $this->db->affected_rows; if($num>0) { $q = $this->db->query("DELETE * FROM bans WHERE ip='".$ip."'"); 
    return (($t=='box') ? json_encode(array('status'=>'removed')) : true); } else { return (($t=='box') ? json_encode(array('error'=>'Unable to locate the user.')) : true); }
}

I think it may be It is your DELETE statement which is causing the error.

Remove the * after the DELETE and it should be fine.

Does every table really need an auto-incrementing artificial primary key?

8 votes

Almost every table in every database I've seen in my 7 years of development experience has an auto-incrementing primary key. Why is this? If I have a table of U.S. states where each state where each state must have a unique name, what's the use of an auto-incrementing primary key? Why not just use the state name as the primary key? Seems to me like an excuse to allow duplicates disguised as unique rows.

This seems plainly obvious to me, but then again, no one else seems to be arriving at and acting on the same logical conclusion as me, so I must assume there's a good chance I'm wrong.

Is there any real, practical reason we need to use auto-incrementing keys?

No.

In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, so can SSN's and other items of information.

In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.

A primary key must be:

  • unique (100% guaranteed! Not just "almost" unique)
  • NON NULL

A primary key should be:

  • stable if ever possible (not change - or at least not too frequently)

State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.

So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since no naturally occuring data is really up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.

Transactional And Reporting Databases - How?

7 votes

When building a transactional system that has a highly normalized DB, running reporting style queries, or even queries to display data on a UI can involve several joins, which in a data heavy scenario can and usually does, impact performance. Joins are expensive.

Often, the guidance espoused is that you should never run these queries off your transactional DB model, rather you should use a denormalized flattened model that is tailored for specific UI views or reports which eliminates the need for many joins. Data duplication is not an issue in this scenario.

This concept makes perfect sense, but what I rarely see when experts make these statements is exactly HOW to implement this. For example, (and quite frankly I'd appreciate an example using any platform) in a mid sized system running on a sql server back-end you have a normalized transactional model. You also have some reports and a website that require queries. So, you create a "reporting" database that flattens up the normalized data. How do you keep this in sync? Transaction log shipping? If so, how do you transform the data to fit in the reporting model?

In our shop, we set up a continuous transactional replication from the OLTP system to another DB server used for reporting. You wouldn't want to use log shipping for this purpose as it requires an exclusive lock on the database every time it restores a log, which would prevent your users from running reports.

With the optimizer in SQL Server today, I think the notion that the joins on a normalized database are "too expensive" for reporting is a bit outdated. Our design is fully 3rd normal form, several million rows in our main tables, and we have no problems running any of our reports. Having said that, if push came to shove, you could look into creating some indexed views on your reporting server to help out.

When to use JCR (content repository) over other options?

7 votes

I'm trying to evaluate content repositories (JSR283) like Jackrabbit and ModShape but I must confess that I don't understand what problem resolves in first place and even if it is a good choice for the project. Which cases do you think is the best solution to apply? Is not the same thing as relational databases, except for the size? Why? Extra points for pointing real world examples.

Thanks in advance.

JCR repositories are different than RDBMSes, because a JCR repository:

  • is hierarchical, allowing your to organize your content in a structure that closely matches your needs and where related information is often stored close together and thus easily navigated
  • is flexible, allowing the content to adapt and evolve, using a node type system that can be completely "schemaless" to full-on restrictive (e.g., like a relational database)
  • uses a standard Java API (e.g., javax.jcr)
  • abstracts where the information is really stored: many JCR implementations can store content in a variety of relational databases and other stores, some can expose non-JCR stores through the JCR API, and some can federate multiple stores into a single, virtual repository.
  • supports queries and full-text search out of the box
  • supports events, locking, versioning, and other features

You certainly can build all or some of these features in your own application, but that likely gets further away from what your main purpose of your app.

What kind of applications can benefit from these features? Content management systems have used repositories for a long time, and JCR (and Jackrabbit) really grew out of the need for a common, standard API to access different content repositories (see JSR-170 and JSR-283).

Another example are document management systems, which manage electronic files (that are often images of paper documents) and provide search and query. DMSes have used repositories for some time.

Artifact management systems can use repositories to manage digital artifacts (often files) along with additional information (metadata). JCR works great here, because you can store the metadata in the same location as the files: those that understand these extra properties can see them, those that don't care don't have to see them. I know Artifactory is a Maven repository implementation that uses JCR. There are also repositories for managing web service artifacts, data service artifacts, and test artifacts.

But JCR repositories are not for managing files. JCR uses a simple notion of a hierarchy of nodes, where the nodes can contain named properties (with one or multiple values) and children. The properties and child node that are allowed are dictated entirely by node types, which can be changed and mixed in as needed on a node-by-node basis. JCR predefines some built-in node types that are commonly needed, like those used to represent files and folders in the repository. You can reuse these built-in types, extend them, or write your own. Many people advocate using mixins almost as facets or aspects, so that if a node needs to take on a facet you can simply add a mixin to the node.

JCR was designed to easily support importing XML content into the repository, where each element is mapped to a node and each attribute is mapped to an attribute. And lots of stuff is represented using XML (or YAML or JSON), and all of this can easily be represented and stored in a JCR repository. As an example, consider a JCR repository that stores configuration information (that might normally be stored in multiple XML files). JCR can version that information, allow access to it from multiple processes, enable querying and search, and notify the application(s) when content changes.

There are several good overviews of JCR with more detail and examples. A few of these are:

Database Design: replace a boolean column with a timestamp column?

7 votes

Earlier I have created tables this way:

create table workflow (
    id number primary key,
    name varchar2(100 char) not null,
    is_finished number(1) default 0 not null,
    date_finished date
);

Column is_finished indicates whether the workflow finished or not. Column date_finished is when the workflow was finished.

Then I had the idea "I don't need is_finished as I can just say: where data_finished is not null", and I designed without is_finished column:

create table workflow (
    id number primary key,
    name varchar2(100 char) not null,
    date_finished date
);

(We use Oracle 10)

Is it a good or bad idea? I've heard you can not have an index on a column with NULL values, so where data_finished is not null will be very slow on big tables.

There is a right way to index null values, and it doesn't use a FBI. Oracle will index null values, but it will NOT index null LEAF values in the tree. So, you could eliminate the column is_finished and create the index like this.

CREATE INDEX ON workflow (date_finished, 1);

Then, if you check the explain plan on this query:

SELECT count(*) FROM workflow WHERE date_finished is null;

You might see the index being used (if the optimizer is happy).

Back to the original question: looking at the variety of answers here, I think there is no right answer. I may have a personal preference to eliminate a column if it is unnecessary, but I also don't like overloading the meaning of columns either. There are two concepts here:

  1. The record has finished. is_finished
  2. The record finished on a particular date. date_finished

Maybe you need to keep these separate, maybe you don't. When I think about eliminating the is_finished column, it bothers me. Down the road, the situation may arise where the record finished, but you don't know precisely when. Perhaps you have to import data from another source and the date is unknown. Sure, that's not in the business requirements now, but things change. What do you do then? Well, you have to put some dummy value in the date_finished column, and now you've compromised the data a bit. Not horribly, but there is a rub there. The little voice in my head is shouting YOU'RE DOING IT WRONG when I do things like that.

My advice, keep it separate. You're talking about a tiny column and a very skinny index. Storage should not be an issue here.

Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.

-Eric S. Raymond

Can I save credit card secret code in the database ?

7 votes

I need to save the credit card numbers and secret codes of users in the database in plain text ( consensus behind obviously ) for automatic operation made from the server.

Is there some problems ?

What do I need to be aware of?

Most credit card processing agreements that I have seen do not allow you to store the code from the back of the card.

There are other security implications of storing plain text credit card numbers, but storing the code is usually specifically disallowed by your agreement. You will need to read yours to make sure you can do that.

As for storing the credit card number, that is also usually a very bad idea. If your database is compromised, you will be held liable and it could cost you a lot of money.

Unless you have a very good reason to store the credit card number and have a very good team working on security, I would not recommend storing any credit card data.

How to store "same as" data?

6 votes

I've got one model with 3 addresses: pickup, dropoff, and billing. I figure the billing address will usually be either the pickup or drop-off address, so from a UI perspective, I should have a "same as" option. But from a DB perspective, should I save the "same as" field, or should I duplicate the data?

You should have the same Id of a row from an Address table in two different columns, PickUp and DropOff. This way, you do not duplicate the address, do not use some sentinel address, and can easily query to see if the PickUp address is the same as the DropOff. If one of these changes in the future, you can always modify the Id value stored in its respective column to a new address.

Is there a dbunit-like framework that doesn't suck for java/scala?

6 votes

I was thinking of making a new, light-weight database population framework. I absolutely hate dbunit. Before I do, I want to know if someone already did it.

Things i dislike about dbunit:

1) The simplest format to write and get started is deprecated. They want you to use formats that are bloated. Some even require xml schemas. Yeah, whatever.

2) They populate rows not in the order you write them, but in the order tables are defined in the xml file. This is really bad because you can't order your data in such a way that foreign key constraints won't cause problems. This just forces you to go through the hassle of turning them off altogether.

This also wastes time and bloats up your junit base classes to include code to disable the foreign key constraints. You will probably have to test for the database type (hsqldb, etc.) and disable them in database-specific ways. This is way bad.

It could be better if dbunit helped in disabling foreign key constraints as part of their framework automatically, but they don't do this. They do keep track of dialects... so why not use them for this? Ultimately, all of this does is force the programmer to waste time and not get up and testing quickly.

3) XML is a pain to write. I don't need to say more about this. They also offer so many ways to do it, that I think it just complicates matters. Just offer one really solid way and be done with it.

4) When your data gets large, keeping track of the ids and their consistent/correct relationships is a royal pain.

Also, if you don't work on a project for a month, how are you to remember that user_id 1 was an admin, user_id 2 was a business user, user_id 3 was an engineer and user_id 4 was something else? Going back to check this is wasting more time. There should be a meaningful way to retrieve it other than an arbitrary number.

5) It's slow. I've found that unless hsqldb is used, it is painfully slow. It doesn't have to be. There are also numerous ways to mess up its configuration as it is not easy to do "out of the box". There is a hump that you must go through to get it working right. All this does is encourage people to not use it, or be pissed of when they do start to use it.

6) Some values tend to repeat a lot, likes dates. It'd be nice to specify defaults, or even have the framework put defaults in automatically, even without you telling it to put defaults in there. That way you can create objects just with the values you want, and leave the rest off. This sure beats specifying every nook and cranny of a column if it's not required.

7) Probably the most annoying thing is that the first entry must include ALL the values - even null placeholders - or future rows won't pick the columns that you actually specified.

DBunit doesn't have a sensible default for translating [NULL] to a real null value either. You have to manually add it. Tell me, who hasn't done this with dbunit? Everyone has. It shouldn't be like this!

What this means is that if you have a polymorphic object, you must declare all the foreign keys to the joining tables of each subclass in the first row, even though they are null. If you do a table for all subclasses pattern, you still have to specify all the fields on the first row. This is just awful.

Anything out there to satisfy me, or should I become the next framework developer of a much better database testing framework?

I'm not aware of any real alternative to DbUnit and none of the tools mentioned by @Joe are in my eyes:

  • Incanto: not DB agnostic
  • SQLUnit: a regression and unit testing harness for testing database stored procedures (that's not what DbUnit is about)
  • Cactus: a tool for In-container testing (I fail to see where it helps with databases)
  • Liquibase: a database migration tool (doesn't load/verify data)
  • ORMUnit: can initialize a database but that's all
  • JMock: doesn't compete with DbUnit at all

That being said, I've personally used DbUnit successfully several times, on small and huge projects, and I find it pretty usable, especially when using Unitils and its DbUnit module. This doesn't mean it's perfect and can't be improved but with decent tooling (either custom made or something like Unitils), using it has been a decent experience.

So let me answer some of your points:

1) The simplest format to write and get started is deprecated. They want you to use formats that are bloated. Some even require xml schemas. Yeah, whatever.

DbUnit supports flat or structured XML, XLS, CSV. What revolutionary format would you like to use? By the way, a DTD or schema is not mandatory when using XML. But it gives you nice things like validation and auto-completion, how is that bad? And Unitils can generate it easily for you, see Generate an XSD or DTD of the database structure.

It could be better if dbunit helped in disabling foreign key constraints as part of their framework automatically, but they don't do this. They do keep track of dialects... so why not use them for this? Ultimately, all of this does is force the programmer to waste time and not get up and testing quickly.

They are waiting for your patch.

Meanwhile, Unitils provides support to handle constraints transparently, see Disabling constraints and updating sequences.

3) XML is a pain to write. I don't need to say more about this. They also offer so many ways to do it, that I think it just complicates matters. Just offer one really solid way and be done with it.

I guess pain is subjective but I don't find it painful, especially when using a schema and autocompletion. What is the silver bullet you're suggesting?

4) When your data gets large, keeping track of the ids and their consistent/correct relationships is a royal pain.

Keep them small, that's a know best practice. You're going against a known best practice and then complain...

Also, if you don't work on a project for a month, how are you to remember that user_id 1 was an admin, user_id 2 was a business user, user_id 3 was an engineer and user_id 4 was something else? Going back to check this is wasting more time. There should be a meaningful way to retrieve it other than an arbitrary number.

Yes, task switching is counter productive. But since you're working with low level data, you have to know how they are represented, there is no magic solution unless you use a higher level API of course (but that's not the purpose of DbUnit).

5) It's slow. I've found that unless hsqldb is used, it is painfully slow. It doesn't have to be. There are also numerous ways to mess up its configuration as it is not easy to do "out of the box". There is a hump that you must go through to get it working right. All this does is encourage people to not use it, or be pissed of when they do start to use it.

That's inherent to databases and JDBC, not DbUnit. Use a fast database like H2 if you want things to be as fast as possible (if you have a better agnostic way to do things, I'd be glad to learn about it).

6) Probably the most annoying thing is that the first entry must include ALL the values - even null placeholders - or future rows won't pick the columns that you actually specified.

Not when using Unitils as mentioned in presentations like Unitils - Home - JavaPolis 2008 or Unit testing: unitils & dbmaintain.

Anything out there to satisfy me, or should I become the next framework developer of a much better database testing framework?

If you think you can make things better, maybe contribute to existing solutions. If that's not possible and if you think you can create the killer database testing framework, what can I say, do it. But don't forget, ranting is easy, coming up with solutions using your own solutions is less so.

Smallest database that supports indexes, high write volumes, and is ACID?

6 votes

I don't really care if it's NoSQL or SQL based - as long as it uses int indexes (and stores them in RAM for fast searching) so I can find my data with simple queries based on criteria like user_id, lat, status, or other common int fields. The actual records can be stored on disk.

My first guess would be SQLite - but it moves slowly when dealing with a lot of concurrent writes.

Second, it also needs to be able to run in very small amounts of RAM for VPS with limited resources. This excludes MongoDB since it spreads to fill all available RAM (well, the diskcache does really). I also can't use MySQL Innodb since it uses about 100MB of RAM just to load and MyIsam doesn't support ACID.

So are their any RDBMS or NoSQL databases that meet all four requirements?

Update: When I say small databases, I mean databases that only use 8-60MB of RAM. I understand that actual data will increase this but most of my datasets are usually under 1GB for the smaller sites with about 5MB of indexes that would need to be stored in RAM. So an ideal database would use about 30MB when running with a fully index dataset of about 1GB. Take this site for example, I doubt the whole stackoverflow site takes much more than 1GB to store.

Update: To clarify, a setup would ideally store all data on disk. However, it would also keep column indexes in RAM (just ints after all) which would contain the needed pointers to data on the disk. This would avoid two things 1) keeping unneeded rows in memory like redis and 2) keeping indexes on the hard drive slowing searches (SQLite).

An example is MySQL which can be configured to only keep primary and secondary indexed columns in memory and all other data on the hard drive. However, MySQL either uses 100MB extra RAM just to add InnoDB or you forgo ACID compliance and stick with Myisam which is not transaction safe.

Again, the target is systems that are limited in RAM and can't handle more than a couple Megabytes of cached indexes - but that still need to allow frequent writes/updates of normally small data sets in a safe manner.

Update: apparently finding something that meets all these requirements is a bit much. So, starting with the most important features let me list them in descending importance.

  1. Low memory usage
  2. Indexes (or something to mimic them)
  3. Handles concurrent writes
  4. ACID

Expanding on #1, it is more important that data can be written than that reads are fast. Which also means that the amount of RAM should not have any affect on the amount of data that can be stored.

Expanding on #2, ideally (given how small they are) indexes should be stored in RAM since indexes should be nothing more than int values that are compared to filter results before actually accessing the disk for the data.

I think PostgreSQL will work with your requirements:

  1. Low memory usage
    Postgres is very flexible — it has very tweekable limits for used memory — for example this configuration variables will make it use about 30MB of RAM:
    shared_buffers = 16MB # used for caching of indexes and data
    temp_buffers = 2MB # maximum temporary buffers used per open client session
    maintenance_work_mem = 8MB # used temporarily for maintenance
    
    It will also automatically use temporary files if limits will be too low for a complex query.
  2. Indexes (or something to mimic them)
    Postgres supports b-tree and hash indexes. Also indexes on multiple columns or partial or functional indexes. In my test it used 17MB b-tree index for 1000000 32bit integers.
  3. Handles concurrent writes
    Postgres is famous for scaling much better for concurrent reads and writes than for example MySQL. It supports transactions, savepoints (subtransactions), several transaction isolation levels, and also, as a bonus, transactional DDL.
  4. ACID
    PostgreSQL is ACID compliant by default. Supports transactions for atomicity, foreign keys, unique indexes and check contraints for consistency, transaction isolation levels and locking for isolation and write-ahead logs and synchronous commits for durability. It also supports replication for increased durability.

And it's fun, too.

What's the attraction of schemaless database systems?

5 votes

I've been hearing a lot of talk about schema-less (often distributed) database systems like MongoDB, CouchDB, SimpleDB, etc...

While I can understand they might be valuable for some purposes, in most of my applications I'm trying to persist objects that have a specific number of fields of a specific type, and I just automatically think in the relational model. I'm always thinking in terms of rows with unique integer ids, null/not null fields, SQL datatypes, and select queries to find sets.

While I'm attracted to the distributed nature and easy JSON/RESTful interfaces of these new systems, I don't understand how loosely typed key/value hashes will help me with my development. Why would a loose typed, schema-less system be good for keeping clean data sets? How can I for example, find all items with dates between x and y when they might not have dates? Is there any concept of a join?

I understand many systems have their own differences and strengths, but I'm wondering at the difference in paradigm. I suppose this is an open-ended question, but perhaps the community's answers and ways they have personally seen the advantages of these systems will help enlighten me and others about when I would want to make use of these (admittedly more hip) systems instead of the traditional RDBMS.

I'll just call out one or two common reasons (I'm sure people will be writing essay answers)

1) With highly distributed systems, any given data set may be spread across multiple servers. When that happens, the relational constraints which the DB engine can guarantee are greatly reduced. Some of your referential integrity will need to be handled in application code. When doing so, you will quickly discover several pain points:

  • your logic is spread across multiple layers (app and db)
  • your logic is spread across multiple languages (SQL and your app language of choice)

The outcome is that the logic is less encapsulated, less portable, and MUCH more expensive to change. Many devs find themselves writing more logic in app code and less in the database. Taken to the extreme, the database schema becomes irrelevant.

2) schema management - especially on systems where downtime is not an option - is difficult. reducing the schema complexity reduces that difficulty

Edit, adding a few more reasons:

3) ACID doesn't work very well for distributed systems (BASE, CAP, etc). The SQL language (and the entire relational model to a certain extent) is optimized for a transactional ACID world. So some of the SQL language features and best practices are useless while others are actually harmful. Some developers feel uncomfortable about "against the grain" and prefer to drop SQL entirely in favor of a language which was designed from the ground up for their requirements.

4) Cost: most RDBMS systems aren't free. The leaders in scaling (Oracle, Sybase, SQL Server) are all commercial products. When dealing with large ("web scale") systems, database licensing costs can meet or exceed the hardware costs! The costs are high enough to change the normal build/buy considerations drastically towards building a custom solution on top of an OSS offering (all the significant NOSQL offerings are OSS)