Best concurrency questions in August 2010

Only inserting a row if it's not already there

16 votes

I had always used something similar to the following to achieve it:

INSERT INTO TheTable
SELECT
    @primaryKey,
    @value1,
    @value2
WHERE
    NOT EXISTS
    (SELECT
        NULL
    FROM
        TheTable
    WHERE
        PrimaryKey = @primaryKey)

...but once under load, a primary key violation occurred. This is the only statement which inserts into this table at all. So does this mean that the above statement is not atomic?

The problem is that this is almost impossible to recreate at will.

Perhaps I could change it to the something like the following:

INSERT INTO TheTable
WITH
    (HOLDLOCK,
    UPDLOCK,
    ROWLOCK)
SELECT
    @primaryKey,
    @value1,
    @value2
WHERE
    NOT EXISTS
    (SELECT
        NULL
    FROM
        TheTable
    WITH
        (HOLDLOCK,
        UPDLOCK,
        ROWLOCK)
    WHERE
        PrimaryKey = @primaryKey)

Although, maybe I'm using the wrong locks or using too much locking or something.

I have seen other questions on stackoverflow.com where answers are suggesting a "IF (SELECT COUNT(*) ... INSERT" etc., but I was always under the (perhaps incorrect) assumption that a single SQL statement would be atomic.

Does anyone have any ideas?

Thanks.

Adam

What about the "JFDI" pattern?

BEGIN TRY
   INSERT etc
END TRY
BEGIN CATCH
    IF ERROR_NUMBER() <> 2627
      RAISERROR etc
END CATCH

Seriously, this is quickest and the most concurrent without locks, especially at high volumes. What if the UPDLOCK is escalated and the whole table is locked?

Read lesson 4

How bad is using SELECT MAX(id) in MYSQL instead of mysql_insert_id() in PHP?

8 votes

Background: I'm working on a system where the developers seem to be using a function which executes a MYSQL query like "SELECT MAX(id) AS id FROM TABLE" whenever they need to get the id of the LAST inserted row (the table having an auto_increment column).

I know this is a horrible practice (because concurrent requests will mess the records), and I'm trying to communicate that to the non-tech / management team, to which their response is...

"Oh okay, we'll only face this problem when we have 
(a) a lot of users, or 
(b) it'll only happen when two people try doing something
    at _exactly_ the same time"

I don't disagree with either point, and think we'll run into this problem much sooner than we plan. However, I'm trying to calculate (or figure a mechanism) to calculate how many users should be using the system before we start seeing messed up links.

Any mathematical insights into that? Again, I KNOW its a horrible practice, I just want to understand the variables in this situation...


Update: Thanks for the comments folks - we're moving in the right direction and getting the code fixed!

The point is not if potential bad situations are likely. The point is if they are possible. As long as there's a non-trivial probability of the issue occurring, if it's known it should be avoided.

It's not like we're talking about changing a one line function call into a 5000 line monster to deal with a remotely possible edge case. We're talking about actually shortening the call to a more readable, and more correct usage.

I kind of agree with @Mark Baker that there is some performance consideration, but since id is a primary key, the MAX query will be very quick. Sure, the LAST_INSERT_ID() will be faster (since it's just reading from a session variable), but only by a trivial amount.

And you don't need a lot of users for this to occur. All you need is a lot of concurrent requests (not even that many). If the time between the start of the insert and the start of the select is 50 milliseconds (assuming a transaction safe DB engine), then you only need 20 requests per second to start hitting an issue with this consistently. The point is that the window for error is non-trivial. If you say 20 requests per second (which in reality is not a lot), and assuming that the average person visits one page per minute, you're only talking 1200 users. And that's for it to happen regularly. It could happen once with only 2 users.

And right from the MySQL documentation on the subject:

You can generate sequences without calling LAST_INSERT_ID(), but the utility of 
using the function this way is that the ID value is maintained in the server as 
the last automatically generated value. It is multi-user safe because multiple 
clients can issue the UPDATE statement and get their own sequence value with the
SELECT statement (or mysql_insert_id()), without affecting or being affected by 
other clients that generate their own sequence values.