Best database-design questions in November 2010

Why use an auto-incrementing primary key when other unique fields exist?

22 votes

I'm taking a course called "database systems" and for our class project I have to design a website.

Here's an example of a table I created:

CREATE TABLE users
(
  uid INT NOT NULL AUTO_INCREMENT,
  username VARCHAR(60),
  passhash VARCHAR(255),
  email VARCHAR(60),
  rdate DATE,
  PRIMARY KEY(uid)
);

The professor told me "uid" (user id) was completely useless and unnecessary and I should have used the username as the primary key, since no two users can have the same username.

I told him it was convenient for me use a user id because when I call something like domain.com/viewuser?id=5 I just check the parameter with: is_numeric($_GET['id'])... needless to say he was not convinced.

Since I've seen user_id and other similar attributes (thread_id, comment_id, among others) on plenty of tutorials and looking at the database schema of popular software (eg. vbulletin) there must be plenty of other (stronger) reasons.

So my question is: How would you justify the need of a not null auto incrementing id as a primary key vs using another attribute like the username?

Auto-incrementing primary keys are useful for several reasons:

  • They allow duplicate user names as on Stack Overflow
  • They allow the user name (or email address, if that's used to login) to be changed (easily)
  • Selects, joins and inserts are faster than varchar primary keys as its much faster to maintain a numeric index
  • As you mentioned, validation becomes very simple: if ((int)$id > 0) { ... }
  • Sanitation of input is trivial: $id = (int)$_GET['id']
  • There is far less overhead as foreign keys don't have to duplicate potentially large string values

I would say trying to use any piece of string information as a unique identifier for a record is a bad idea when an auto-incrementing numeric key is so readily available.

Systems with unique user names are fine for very small numbers of users, but the Internet has rendered them fundamentally broken. When you consider the sheer number of people named "john" that might have to interact with a website, it's ridiculous to require each of them to use a unique display name. It leads to the awful system we see so frequently with random digits and letters decorating a username.

However, even in a system where you enforced unique usernames, it's still a poor choice for a primary key. Imagine a user with 500 posts: The foreign key in the posts table is going to contain the username, duplicated 500 times. The overhead is prohibitive even before you consider that somebody might eventually need to change their username.

Database indexes: A good thing, a bad thing, or a waste of time?

9 votes

Adding indexes is often suggested here as a remedy for performance problems.

(I'm talking about reading & querying ONLY, we all know indexes can make writing slower).

I have tried this remedy many times, over many years, both on DB2 and MSSQL, and the result were invariably disappointing.

My finding has been that no matter how 'obvious' it was that an index would make things better, it turned out that the query optimiser was smarter, and my cleverly-chosen index almost always made things worse.

I should point out that my experiences relate mostly to small tables (<100'000 rows).

Can anyone provide some down-to-earth guidelines on choices for indexing?

The correct answer would be a list of recommendations something like:

  • Never/always index a table with less than/more than NNNN records
  • Never/always consider indexes on multi-field keys
  • Never/always use clustered indexes
  • Never/always use more than NNN indexes on a single table
  • Never/always add an index when [some magic condition I'm dying to learn about]

Ideally, the answer will give some instructive examples.

Indexes are kind of like chemotherapy...too much and it kills you...too little and you die...do it the wrong way and you die. You gotta know just how much, how often, and what kind to make it not kill you.

Your hardware, platform, environment, load all play a role. So to answer your questions..

Yes, possibly sometimes.

When to Denormalize a Database Design

8 votes

I know that normalis(z)ation has been extensively discussed on Stack Overflow. I've read many of the previous discussions. I've got some additional questions though.

I'm working on a legacy system with at least 100 tables. The database is has some un-normalized structure, tables that contain a variety of disparate data, and other problems. I've been given the task of trying to improve it. I can't just start again but need to modify the existing schema.

In the past I have always tried to design normalized databases. Now the questions. A senior developer has suggested that in some cases we can't normalize:

1) With temporal data. For example an invoice is created that links to a product. If a customer asks for a copy of this invoice a year later we must be able to produce an exact copy of the original. What if the product price, name or description have been updated? The senior guy suggested that the price and other product information should be copied to the invoice table. I'm thinking maybe we should have another table such as productPrice that has a date field so we can track changes in price over time. We would need the same thing for the product description and name I guess? Seems complicated. What do you think?

2) The database is an accounting system. I'm not very familiar with accounting. At the moment some summary data is derived and stored in the database. For example total sales for the year. My senior associate has said that accountants like to check things are correct by comparing this value with data that is actually calculated from invoices etc to give them confidence that the application is working correctly. He said that at the moment for example we can tell if someone deleted an invoice from last year mistakenly because the totals will not be the same. He also pointed out that it could be quite slow to calculate these totals on the fly. Of course I said that data should not be duplicated and should always be calculated when needed. I suggested that we could use SQL Reporting Services or some other solution that will generate these reports overnight and cache them. Anyway he's not convinced. Any comments on this?

Thanks very much :)
Cheers
Mark

EDIT

Thanks for the excellent responses. It's a pity I can only flag one as the answer because there's lots of good suggestions here.

Your senior colleague is a developer, not a data modeller. You are better off starting from scratch, without them. Normalisation is complicated only to those who will not read books, and get their 'knowledge' from the amateurs at wiki. It is fair enough that he is makes you think, but some of the issues are absurd.

Your numbers:

  1. You need to appreciate the differences between actual online data, and historic data; then the difference between merely historic and archival needs. All of them are right for the specific business requirement, and wrong for all others, there is no universal right and wrong.

    • why is there no paper-based copy of the invoice ? In most countries that would be a legal and tax requirement, what exactly is the difficulty of fishing out the old invoice ?
    • where the database has the requirement of storing the closed invoices, then sure, as soon as the invoice is closed, you need a method of capturing that information.
    • ProductPrice (actually, I would call it ProductDate) is a good idea, but may not be necessary. But you are right, you need to evaluate the currency of data, in the full context of the whole database.
    • I cannot see how copying the product price to the invoice table would help (are not there many line items ?)
    • in modern databases, where the copy of the invoice is required to be regurgitated, the closed Invoice is additionally stored in a different form, eg XML. One customer saves the PDFs as BLOBs. So there is no messing around with what the product price was five years ago. But the basic invoice data is online and current, even for closed invoices; you just cannot recalculate ancient invoice using current prices.
    • some people use an archive_invoice table, but that has problems because now every code segment or user report tool has to look in two places (note that these days some users understand databases better than most developers)
    • Anyway, that is all discussion, for your understanding. None of the databases I have written in 30 years has ever had that sort of problem, and all of them comply with the legal and tax requirements.
      • The database serves current and archival purposes from the one set of tables (no "archive" tables
      • Once an Invoice is created, it is a legal document, and cannot be changed or deleted (it can be reversed or partially credited by a new Invoice, with negative values). They are marked IsIssued/IsPaid/Etc
      • Products cannot be deleted, they can be marked IsObsolete
      • There are separate tables for InvoiceHeader and InvoiceItem
      • InvoiceItem has FKs to both InvoiceHeader and Product
      • for many reasons (not only those you mention), the InvoiceItem row contains the NumUnits; ProductPrice; TaxAmount; ExtendedPrice. Sure, this looks like a "denormalisation" but it is not, because prices, taxation rates, etc, are subject to change. But more important, the legal requirement is that we can reproduce the old invoice on demand.
      • (where it can be reproduced from paper files, this is not required)
      • the InvoiceTotalAmount is a derived column, just SUM() of the InvoiceItems
        .
  2. That is rubbish. Accounting systems, and accountants do not "work" like that.

    • If it is a true accounting system, then it will have JournalEntries, or "double entry"; that is what a qualified account is required to use (by law).

      • Double Entry does not mean duplicate entry; it means every financial transaction (one amount) shall have a source account and target account that it is applied to; so there is no "denormalisation" or duplication. In a banking database, because the financial transactions are against single accounts, that is commonly rendered as two separate financial transactions (rows) within one Db Transaction. Ordinary commercial database constraints are used to ensure that there are two "sides" to every financial transaction.
        .
    • Ensuring that Invoices are not deleteable is a separate issue, to do with security, etc. if anyone is paranoid about things being deleted from their database, and their database was not secured by a qualified person, then they have more and different problems that have nothing to do with this question. Obtain a security audit, and do whatever they tell you.

    • There are a few people on this site who think wiki is a place that you can learn something. It isn't. It is a cesspool of "definitions" written by amateurs, and the "definitions" are constantly changed by other amateurs. No fixed definition that you can rely on. So don't worry about what wiki says or what people say wiki says, the moment they mention wiki, you know their "knowledge" come from reading not qualification; and what they are reading is an ever-changing cesspit. They will predictably argue about "definitions" because they have no actual experience; the experienced will just get on with the job

    • A Normalised database is always much faster than Unnormalised database. So it is very important to understand what Normalisation and Denormalisaion is, and what it isn't. The process is greatly hindered when people have fluid and amateur "definitions", it just leads to confusion and time-wasting "discussions". When you have fixed definitions, you can avoid all that, and just get on with the job.

    • Summary tables are quite normal, to save the time and processing power, of recalculating info that does not change, eg: YTD totals for every year but this year; MTD totals for every month in this year but not this month. "Always recalculating" data is a bit silly when (a) the info is very large and (b) does not change. Calculate for the current month only

      • In banking systems (millions of Trades per day), at EndOfDay, we calculate and store Daily Total as well. These are overwritten for the last five days, because Audiitors are making changes, and JournalEntries against financial transactions for the last 5 days are allowed.
      • non-banking systems generally do not need daily totals
        .
    • Summary tables are not a "denormalisation" (except in the eyes of those who have just learned about "normalisation" from their magical, ever-changing fluid "source"; or as non-practitioners, who apply simple black-or-white rules to everything). Again, the definition is not being argued here; it simply does not apply to Summary tables.

    • Summary tables do not affect data integrity (assuming of course that the data that they were sourced from was integral).

    • Summary tables are an addition to the database, which are not required to have the same constraints as the database. There are essentially reporting tables or data warehouse tables, as opposed to database tables.

    • There are no Update Anomalies (which is a strict definition) related to Summary tables. You cannot change or delete an invoice from last year. Update Anomalies apply to true Denormalised or Unnormalised current data.

What are the first issues to check while optimizing an existing database?

7 votes

What are the top issues and in which order of importance to look into while optimizing (performance tuning, troubleshooting) an existing (but unknown to you) database?
Which actions/measures in your previous optimizations gave the most effect (with possibly the minimum of work) ?

I'd like to partition this question into following categories (in order of interest to me):

  1. one needs to show the performance boost (improvements) in the shortest time. i.e. most cost-effective methods/actions;
  2. non-intrusive or least-troublesome most effective methods (without changing existing schemas, etc.)
  3. intrusive methods

Update:
Suppose I have a copy of a database on dev machine without access to production environment to observe stats, most used queries, performance counters, etc. in real use.
This is development-related but not DBA-related question.
Update2:
Suppose the database was developed by others and was given to me for optimization (review) before it was delivered to production.
It is quite usual to have outsourced development detached from end-users.

Besides, there is a database design paradigm that a database, in contrast to application data storage, should be a value in itself independently on specific applications that use it or on context of its use.

Update3: Thanks to all answerers! You all pushed me to open subquestion
How do you stress load dev database (server) locally?

If you're not interested in the runtime behavior of the database, e.g. what are the most frequently executed queries and those that consume the most time, you can only do a "static" analysis of the database structure itself. That has a lot less value, really, since you can only check for a number of key indicators of bad design - but you cannot really tell much about the "dynamics" of the system being used.

Things I would check for in a database that I get as a .bak file - without the ability to collect live and actual runtime performance statistics - would be:

  1. normalization - is the table structure normalized to third normal form? (at least most of the time - there might be some exceptions)

  2. do all tables have a primary key? ("if it doesn't have a primary key, it's not a table", after all)

  3. For SQL Server: do all the tables have a good clustering index? A unique, narrow, static, and preferably ever-increasing clustered key - ideally an INT IDENTITY, and most definitely not a large compound index of many fields, no GUID's and no large VARCHAR fields (see Kimberly Tripp's excellent blog posts on the topics for details)

  4. are there any check and default constraints on the database tables?

  5. are all the foreign key fields backed up by a non-clustered index to speed up JOIN queries?

  6. are there any other, obvious "deadly sins" in the database, e.g. overly complicated views, or really badly designed tables etc.

But again: without actual runtime statistics, you're quite limited in what you can do from a "static analysis" point of view. The real optimization can only really happen when you have a workload from a regular day of operation, to see what queries are used frequently and put the most stress on your database --> use Mitch's checklist to check those points.

How to present a database design?

6 votes

I am doing a project in the university and it includes a MySQL database. I have a design for the database in terms of a list of tables and their respective fields.

In what form should I present this design? Just the list of tables and content? In an ERD? How do you present your designs?

To clarify - whatever you answer, I expect not only specification of how you present your design, but also which tools do you use the create the diagrams/list/tables etc.

ERD is the only way to go. As they say, a picture is worth a thousand words.

But don't try to put the whole database on one diagram. It will, in all but the most trivial cases, be overwhelming to your audience to try to digest the entire database design in one go. Instead, break the diagrams into subject areas depicting only the most relevant tables in each diagram. For example, a point-of-sale system might have separate diagrams for Inventory, Sales, Accounting, Customer Management, Security, Auditing, and Reporting. Some tables will show up in more than one subject area -- this is to be expected.

As far as tooling, nothing beats ErWin, but it is really expensive and only available for Windows. Visio is ubiquitous in a corporate environment, but is only available on Windows and is not exactly cheap either. Macs offer some really nice diagramming tools; most of them are not free.

Dia is a decent, free, and cross-platform diagramming tool. It is a bit quirky, though; and I have not had much success making the diagrams look as nice I want them to look.

For MySQL, I have played with fabFORCE dbDesigner and it is not bad, but I did find its support for multiple subject areas to be a bit lacking at the time -- perhaps they've improved it since. But it is free and works on Windows and Linux.

For the actual presentation, I create images from these diagramming tools and pull them into presentation software (PowerPoint, KeyNote, or OpenOffice Impress). These presentations can be exported to PDF and distributed to the audience; they won't need anything more than a PDF viewer to review the information later.

Database Design for Multi-use Table

5 votes

Say you have multiple "things" which can each have one or more comments attached. Product and Order, for instance. How should the tables be structured....

  1. Product, Order, Comment, ProductComment { ProductID, CommentID }, OrderComment { OrderID, CommentID }
  2. Product, Order, ProductComment { ProductID, Text }, OrderComment { OrderID, Text }
  3. Product, Order, Comment { ProductID, OrderID, Text }

Using SQL Server 2008, by the way.

Thoughts, opinions?

Definitely only use one Comment table, so you don't have to duplicate Comment information (e.g. timestamp, flagged_for_moderation, etc). Having two fields in comment is nice because it makes it clear that it's a one-to-many link. I'd probably lean towards that over multiple linking tables, though I do appreciate that you only have rows in the linking table when there's a link, versus having half the values be NULL. Perhaps in a very large database with more things that can be commented, you might go for the linking tables.

Database Design: optional, but must be unique if provided a value

5 votes

I have a column in one of my tables. It's optional, so it can be left blank. However, if a value is provided for that column, it must be unique. Two questions:

  1. How do I implement this in my database design (I'm using MySQL Workbench, by the way)
  2. Is there a potential problem with my model?

Just use a UNIQUE index on the column. See:

http://dev.mysql.com/doc/refman/5.1/en/create-index.html

A UNIQUE index creates a constraint such that all values in the index must be distinct. An error occurs if you try to add a new row with a key value that matches an existing row. For all engines, a UNIQUE index permits multiple NULL values for columns that can contain NULL. If you specify a prefix value for a column in a UNIQUE index, the column values must be unique within the prefix.