Best database questions in March 2012

What pagination schemes can handle rapidly-changing content lists?

8 votes

Pagination is hard when your content rankings can change quickly, and even harder when those rankings differ per-user. (Let's treat infinite scroll as a type of pagination where the links are invisible.) There are two hard problems: newly-added content at the top, and reranked content.

Let's forget about newly-added content, and accept that you'll have to refresh page 1 to see it. Let's also pretend we're doing pure ORDER BY position; if you're ordering by something else, you might have to use window functions. Our page_size is 3; page isn't a real column, but is the calculated page_num based on page_size. Here are our posts:

+----+----------+-----------+------+
| id | position |  animal   | page |
+----+----------+-----------+------+
|  1 |        1 | Alpacas   |    1 |
|  2 |        2 | Bats      |    1 |
|  3 |        3 | Cows      |    1 |
|  4 |        4 | Dogs      |    2 |
|  5 |        5 | Elephants |    2 |
|  6 |        6 | Foxes     |    2 |
+----+----------+-----------+------+

After we fetch page 1, and before we fetch page 2, Cows gets demoted from #3 to #4, while Dogs get promoted from #4 to #3. How do we handle it?

Offset/limit approach

This is the typical naive approach; in Rails, it's how will_paginate and Kaminari work. If I want to fetch page 2, I'll do

SELECT * FROM posts
ORDER BY posts.position
OFFSET ((:page_num - 1) * :page_size) 
LIMIT :page_size;

which gets rows 4-6. I'll never see Dogs, and I'll see Cows twice.
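
The failure is easy to reproduce. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for PostgreSQL; the helper names make_db, fetch_page and swap_cows_and_dogs are made up for this illustration.

```python
import sqlite3

PAGE_SIZE = 3


def make_db():
    """Build the six-row posts table from the example (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, position INTEGER, animal TEXT)"
    )
    animals = ["Alpacas", "Bats", "Cows", "Dogs", "Elephants", "Foxes"]
    conn.executemany(
        "INSERT INTO posts (id, position, animal) VALUES (?, ?, ?)",
        [(i, i, name) for i, name in enumerate(animals, start=1)],
    )
    return conn


def fetch_page(conn, page_num):
    """Offset/limit pagination: skip the earlier pages, return the next PAGE_SIZE rows."""
    rows = conn.execute(
        "SELECT animal FROM posts ORDER BY position LIMIT ? OFFSET ?",
        (PAGE_SIZE, (page_num - 1) * PAGE_SIZE),
    )
    return [animal for (animal,) in rows]


def swap_cows_and_dogs(conn):
    """Rerank between page loads: Dogs moves up to #3, Cows drops to #4."""
    conn.execute("UPDATE posts SET position = 4 WHERE animal = 'Cows'")
    conn.execute("UPDATE posts SET position = 3 WHERE animal = 'Dogs'")


conn = make_db()
page1 = fetch_page(conn, 1)
swap_cows_and_dogs(conn)
page2 = fetch_page(conn, 2)
print(page1)
print(page2)
```

Cows appears on both pages and Dogs on neither, because the offset is counted against the new ranking rather than the one page 1 was drawn from.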

Last seen ID approach

Reddit improves on this. Instead of calculating the first row based on page size, the client tracks the ID of the last item you've seen, like a bookmark. When you hit "next", they start looking from that bookmark onward:

SELECT posts.* FROM posts
JOIN posts AS last
ON last.id = :last_seen_id
WHERE posts.position > last.position
ORDER BY posts.position
LIMIT :page_size;

On page 1 I saw Alpacas, Bats and Cows. Dogs then got promoted above Cows. When I fetch page 2, I fetch page_size rows starting with "the one after Cows", so I see Elephants and Foxes. There are no duplicates, but I still never got to see Dogs.
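
Here is a sketch of that behaviour, assuming the bookmark is resolved to the bookmarked row's current position at query time (one reading of the approach, not necessarily how Reddit implements it). SQLite stands in for PostgreSQL, and first_page and page_after are hypothetical helper names.

```python
import sqlite3

PAGE_SIZE = 3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, position INTEGER, animal TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        (i, i, name)
        for i, name in enumerate(
            ["Alpacas", "Bats", "Cows", "Dogs", "Elephants", "Foxes"], start=1
        )
    ],
)


def first_page(conn):
    """Page 1 needs no bookmark."""
    return conn.execute(
        "SELECT id, animal FROM posts ORDER BY position LIMIT ?", (PAGE_SIZE,)
    ).fetchall()


def page_after(conn, last_seen_id):
    """Return rows ranked below the bookmarked row's current position."""
    return conn.execute(
        """
        SELECT posts.id, posts.animal
        FROM posts
        JOIN posts AS last ON last.id = ?
        WHERE posts.position > last.position
        ORDER BY posts.position
        LIMIT ?
        """,
        (last_seen_id, PAGE_SIZE),
    ).fetchall()


page1 = first_page(conn)
last_seen_id = page1[-1][0]  # the client remembers Cows (id 3)

# Rerank between requests: Dogs is promoted above Cows.
conn.execute("UPDATE posts SET position = 4 WHERE animal = 'Cows'")
conn.execute("UPDATE posts SET position = 3 WHERE animal = 'Dogs'")

page2 = page_after(conn, last_seen_id)
print(page2)
```

Nothing is repeated, but Dogs now sits above the bookmark and is silently skipped.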

Server side state

Hacker News solves this with server-side continuations; they store the entire result set for you (or at least several pages in advance?), and the "More" link references that continuation. When I fetch page 2, I ask for "page 2 assuming that page 1 consisted of Alpacas, Bats and Cows". This means my page 2 is guaranteed to consist of Dogs, Elephants, and Foxes; there are no duplicates, there are no missing rows, and all I lose is the knowledge that Dogs and Cows switched places. The downside is that I have to store a lot of state on the server; on HN, that's stored in RAM, and in reality those continuations often expire before you can press the "More" button, forcing you to go all the way back to page 1 to find a valid link.
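
A rough sketch of the continuation idea, assuming the server freezes the ordered list of IDs on the first request and hands the client an opaque token. The in-memory continuations dict, the token format, and the function names are all assumptions made for illustration; this is not a description of how Hacker News actually stores its state.

```python
import sqlite3
import uuid

PAGE_SIZE = 3
continuations = {}  # token -> ordered list of post IDs frozen at first request

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, position INTEGER, animal TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        (i, i, name)
        for i, name in enumerate(
            ["Alpacas", "Bats", "Cows", "Dogs", "Elephants", "Foxes"], start=1
        )
    ],
)


def start_listing(conn):
    """Snapshot the current ranking and return a token that refers to it."""
    ids = [row[0] for row in conn.execute("SELECT id FROM posts ORDER BY position")]
    token = uuid.uuid4().hex
    continuations[token] = ids
    return token


def fetch_page(conn, token, page_num):
    """Serve a page from the frozen ID list, ignoring any later reranking."""
    ids = continuations[token]
    page_ids = ids[(page_num - 1) * PAGE_SIZE : page_num * PAGE_SIZE]
    if not page_ids:
        return []
    placeholders = ",".join("?" * len(page_ids))
    animals = dict(
        conn.execute(
            f"SELECT id, animal FROM posts WHERE id IN ({placeholders})", page_ids
        )
    )
    # Preserve the snapshot order; rows deleted since the snapshot are dropped.
    return [animals[i] for i in page_ids if i in animals]


token = start_listing(conn)
page1 = fetch_page(conn, token, 1)

conn.execute("UPDATE posts SET position = 4 WHERE animal = 'Cows'")
conn.execute("UPDATE posts SET position = 3 WHERE animal = 'Dogs'")

page2 = fetch_page(conn, token, 2)
print(page1, page2)
```

The cost is visible in the continuations dict: one full ID list per reader, which has to be expired at some point.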

Are these the only three possible approaches? If so, is there an intuitive proof that these ARE the only possibilities? If not, are there computer-science concepts that would give me Google juice to read about this? Are there ways to approximate the continuation approach without storing the entire result set? About all that comes to mind, long term, is event-streaming/point-in-time systems, where "the result set as of the moment I fetched page 1" is forever derivable. Short of that... suggestions? FWIW, we're on PostgreSQL at the moment.

We're going with the server-side state approach for now, caching the entire result on the first query so we always return a consistent list. This will work as long as our query already returns all rows; eventually we'll need to use a nearest-neighbor approach and that won't work.

But I think there's a fourth possibility, which scales very well, as long as:

  1. You don't need a guarantee of no duplicates, only a high likelihood
  2. You're okay with missing some content during scrolls, as long as you avoid duplicates

The solution is a variant of the "last seen ID" solution: Have the client keep not one, but 5 or 10 or 20 bookmarks - few enough that you can store them efficiently. The query ends up looking like:

SELECT * FROM posts
WHERE id > :bookmark_1
AND id > :bookmark_2
...
ORDER BY id

As the number of bookmarks grows, the odds rapidly diminish that you are (a) starting at some point past all n bookmarks but (b) seeing duplicate content anyway because they were all reranked.

If there are holes, or better answers in the future, I'll happily unaccept this answer.

When to use R, when to use SQL?

8 votes

I have a moderate sized database with many joins and lookup tables.

I am more familiar with R than with SQL, and I am using MySQL.

My Question:

At what point is it beneficial to stop increasing the complexity of an SQL statement in favor of the data subsetting functionality in R (e.g., merge, *apply, maply, dlply, etc.)?

On one hand, SQL's join is easier than selecting all contents of each table and using the R merge function to join them. Also, doing the conditional selects in SQL would reduce the amount of data that has to be imported to R; but the speed difference is not significant.

On the other hand, a big join with a complex where clause becomes less easy to understand than the R syntax.

Below I have some untested code for illustrative purposes: I am asking this question before having working code, and the answer to my question doesn't require working code (although this is always appreciated) - the "most elegant approach", "fewest lines", or "amazing implementation of X" are always appreciated, but what I am particularly interested in is the "most sensible / practical / canonical / based on first principles" rationale.

I am interested in the general answer of which steps should use a SQL where clause and which steps would be easier to accomplish using R.

Illustration:

Database description

There are three tables: a, ab, and b. Tables a and b each have a primary key id. They have a many-many relationship that is represented by a lookup table, ab, which contains fields ab.a_id and ab.b_id that join to a.id and b.id, respectively. Tables a and b both have a time field, and a has a group field.

Goal:

Here is a minimal example of the join and subsetting that I want to do;

(MySQL naming of elements, e.g. a.id is equivalent to a$id in R)

  1. Join tables a and b using ab, appending multiple values of b.time associated with each a.id as a new column;

    select a.time, b.time, a.id, b.id from
           a join ab on a.id = ab.a_id
           join b on b.id = ab.b_id;
    -- and then append b.time for distinct values of b.id
    
  2. I don't need repeated values of b.time, I only need a value of b.max: for repeated values of b.time joined to each a.id, b.max is the value of b.time closest to but not greater than a.time

    b.max <- max(b.time[b.time < a.time])
    
  3. Append the value dt <- a.time - b.max to the table, for example, in R:

    x.dt <- a.time - b.max

  4. For each distinct value in a.group, select the row with the smallest dt, i.e. which.min(x.dt).
    

I usually do the data manipulations in SQL until the data I want is in a single table, and then, I do the rest in R. Only when there is a performance issue do I start to move some of the computations to the database. This is already what you are doing.

Computations involving timestamps often become unreadable in SQL (the "analytic functions", similar to ddply, are supposed to simplify this, but I think they are not available in MySQL).

However, your example can probably be written entirely in SQL as follows (not tested).

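-- Join the tables and compute, for each row of a, the gap to the latest earlier b.time
CREATE VIEW t1 AS
SELECT a.id      AS a_id,
       a.`group` AS a_group,   -- group is a reserved word, so it must be quoted
       a.time    AS a_time,
       a.time - MAX(b.time) AS dt
FROM   a
JOIN   ab ON a.id = ab.a_id
JOIN   b  ON b.id = ab.b_id
WHERE  b.time < a.time
GROUP  BY a.id, a.`group`, a.time;

-- Smallest gap per group (a separate view, because older MySQL versions
-- do not allow a subquery in a view's FROM clause)
CREATE VIEW t1_min AS
SELECT a_group, MIN(dt) AS min_dt
FROM   t1
GROUP  BY a_group;

-- Extract the desired rows
CREATE VIEW t2 AS
SELECT t1.*
FROM   t1
JOIN   t1_min X ON t1.a_group = X.a_group AND t1.dt = X.min_dt;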

SQL: Feeding SELECT output to LIKE

7 votes

Problem:

select STR1 from T1 where STR2 = 'NAME1'

In the above query STR1 can be in the form {ABC, ABC_1, ABC_2,..., MNO, XYZ, XYZ_1...}. So suppose I have the following output:

ABC_1
MNO
XYZ

Now I want to extract all those matching STR1 that include the part before _#. For example the expected output for the example dataset above is:

ABC
ABC_1
ABC_2

MNO

XYZ
XYZ_1

Note that STR2 is always unique per STR1.

Code-wise, I imagine something like the following:

SELECT 
    STR1 
FROM 
    T1 
WHERE
    STR1 
LIKE '% (truncate_underscore_part(select STR1 from T1 where STR2 = 'NAME1')) %'

Any idea?


First solution:

select t1.str1
  from (
  select case when instr( str1, '_' ) > 0
                then substr( str1, 1, instr( str1, '_' ) - 1 )
              else str1
         end prefix
    from t1 where str2 = 'NAME1'
) prefix_list,
  t1
  where t1.str1 like prefix || '%'

Alternatively, using the regular expression functions:

with prefix_list as (
  select regexp_substr( str1, '^[A-Z]*' ) prefix from t1 where str2 = 'NAME1'
)
select t1.str1 from t1 join prefix_list
        on t1.str1 = prefix_list.prefix
           or regexp_like( t1.str1, prefix_list.prefix||'_[0-9]' )

To do it without the regexp functions (for older Oracle versions), it depends a bit on how much you want to validate the format of the strings.

select t1.str1
  from (
  select case when instr( str1, '_' ) > 0
                then substr( str1, 1, instr( str1, '_' ) - 1 )
              else str1
         end prefix
    from t1 where str2 = 'NAME1'
) prefix_list,
  t1
where t1.str1 = prefix
   or t1.str1 like prefix || '\__' escape '\'
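
To check the non-regexp variant against the sample data from the question, here is a small sketch run through SQLite, which also provides instr, substr, || and LIKE ... ESCAPE. The extra rows ABCD and PQR, the NAME0/NAME2... values, and the use of '!' instead of a backslash as the escape character (to avoid quoting trouble inside a Python string) are choices made for this illustration, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (str1 TEXT, str2 TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [
        ("ABC", "NAME0"),
        ("ABC_1", "NAME1"),
        ("ABC_2", "NAME2"),
        ("MNO", "NAME1"),
        ("XYZ", "NAME1"),
        ("XYZ_1", "NAME3"),
        ("ABCD", "NAME4"),  # shares the letters ABC but is not ABC or ABC_#
        ("PQR", "NAME5"),   # unrelated row
    ],
)

QUERY = """
SELECT t1.str1
FROM (
    -- strip everything from the first underscore onward
    SELECT CASE WHEN instr(str1, '_') > 0
                THEN substr(str1, 1, instr(str1, '_') - 1)
                ELSE str1
           END AS prefix
    FROM t1
    WHERE str2 = 'NAME1'
) AS prefix_list
JOIN t1
  ON t1.str1 = prefix_list.prefix
  -- '!_' is a literal underscore, the second '_' matches exactly one character
  OR t1.str1 LIKE prefix_list.prefix || '!__' ESCAPE '!'
ORDER BY t1.str1
"""

matches = [row[0] for row in conn.execute(QUERY)]
print(matches)
```

Note that the pattern only accepts a single character after the underscore, so a value such as ABC_10 would need a looser pattern.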

In Oracle SQL update statement, does row update occur concurrently?

6 votes

In an Oracle SQL update statement, assuming the update would affect 5 rows, does the update statement update all 5 rows concurrently or sequentially? E.g.

UPDATE table1 
set column2 = 'completed' WHERE
index between 1 AND 5

In the above statement, would index 1 to 5 be updated in sequence, i.e. 1, 2, 3, 4 then 5, or would it occur concurrently (1-5 all at once)?

I had referred to Oracle documentation but it seems that nothing is mentioned on this.

After the UPDATE statement has executed, the effects of the statement will become visible to the rest of the transaction (and if you commit, to other transactions). The order in which Oracle physically does it is an implementation detail (similar to how the order of a SELECT result is not guaranteed unless you specify ORDER BY).


In most cases, this order does not matter to the client. One case where it might is to avoid deadlocks with another transaction that is updating the overlapping set of rows. UPDATE will lock the row being updated until the end of the transaction, so if two transactions try to lock the same rows, but in different order, a deadlock may ensue.

The standard way of avoiding deadlocks is to always lock in a well-defined order. Unfortunately, UPDATE does not have the ORDER BY clause, but you can do this:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT ... WHERE condition ORDER BY ...  FOR UPDATE;
UPDATE ... WHERE condition;
COMMIT;

where the condition is the same for both statements. The serializable isolation level is necessary for WHERE to always see the same set of rows in both statements.

Or, in PL/SQL you could do something like this:

DECLARE
    CURSOR CUR IS SELECT * FROM YOUR_TABLE WHERE condition ORDER BY ... FOR UPDATE;
BEGIN
    FOR LOCKED_ROW IN CUR LOOP
        UPDATE YOUR_TABLE SET ... WHERE CURRENT OF CUR;
    END LOOP;
END;
/

Naming complicated methods

6 votes

I have a method for pulling data from a database, and I want it to get this:

Limit of Five entries, Item type is Newsletter, Needs to be active (PublishDate < DateTime.Now)

So I'm thinking of naming it GetFiveActiveNewslettersByCreatedDate()

This seems a little long to me. I looked on the site for a good way to name things like this, how would you handle it?

To avoid this specific naming I would think about making the method generic. Something like:

GetNewsLetters(int amount, bool onlyActive, SortOrder orderBy)
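
As a rough illustration of how the parameters replace the long name, here is a sketch over an in-memory list rather than a real database. The Newsletter fields, the newest-first sort direction, and the now parameter (added so the example is repeatable) are assumptions for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Newsletter:
    title: str
    created: datetime
    publish_date: datetime


def get_newsletters(items, amount, only_active=True, order_by="created", now=None):
    """Return up to `amount` newsletters, newest first by the `order_by` field."""
    now = now or datetime.now()
    # "Active" follows the question's rule: PublishDate < now.
    selected = [n for n in items if not only_active or n.publish_date < now]
    selected.sort(key=lambda n: getattr(n, order_by), reverse=True)
    return selected[:amount]


sample = [
    Newsletter("old", datetime(2012, 3, 1), datetime(2012, 3, 2)),
    Newsletter("future", datetime(2012, 3, 10), datetime(2012, 4, 1)),
    Newsletter("new", datetime(2012, 3, 15), datetime(2012, 3, 16)),
]

# The call that GetFiveActiveNewslettersByCreatedDate() would have hard-coded.
latest = get_newsletters(sample, 5, only_active=True, order_by="created",
                         now=datetime(2012, 3, 20))
print([n.title for n in latest])
```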

Retrieving a specific row depending on a date variable?

6 votes

I have 7 columns that contain the closing times, one for each day. (It goes like VENUE_CLOSE_T_MO, VENUE_CLOSE_T_TU... etc)

How would I, for example choose one of those columns depending on a date variable ($somevariable) which contains a specific date?

For example, if the date variable was Sunday, March 18 22:00, it would choose column VENUE_CLOSE_T_SU.

Thanks for the help everyone!

EDIT (Solution given by TEEZ that solved the issue)

My date variable is $start.

And this is the code:

$day_name=strtoupper(date('D',$start));
$day_name=substr($day_name,0,2);
$selectcolumn='VENUE_CLOSE_T_'.$day_name;

So in this case $selectcolumn = VENUE_CLOSE_T_SU

And the echo is then this:

$row[$selectcolumn]

Thanks for all your help again Teez!

First, get the day name from the variable ($somevariable):

$day_name=strtoupper(date('D',$somevariable));

Then build a query like the one below to get the column name for the day in $somevariable:

select concat('VENUE_CLOSE_T_',left('$day_name',2)) as datecolumnname from tablename

EDIT:

OR

You don't need to do this in the query if you are already selecting all the columns. Just add these lines to the PHP code where you print the data on the web page, under the date column:

$day_name=strtoupper(date('D',$somevariable));
$day_name=substr($day_name,0,2);
$selectcolumn='venues.VENUE_CLOSE_T_'.$day_name; 
echo $row[$selectcolumn];
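
The same lookup can be sketched without relying on locale-dependent day names. This assumes the columns follow the MO, TU, ..., SU pattern (only MO, TU and SU are spelled out in the question) and that the example date falls in 2012; the row dictionary is made-up sample data.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_SUFFIXES = ("MO", "TU", "WE", "TH", "FR", "SA", "SU")


def closing_time_column(when):
    """Return the name of the closing-time column for the given date."""
    return "VENUE_CLOSE_T_" + DAY_SUFFIXES[when.weekday()]


column = closing_time_column(datetime(2012, 3, 18, 22, 0))  # a Sunday

# Hypothetical fetched row, keyed by column name as in the PHP $row array.
row = {"VENUE_CLOSE_T_SA": "03:00", "VENUE_CLOSE_T_SU": "23:00"}
print(column, row[column])
```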

What is the best method to make sure two people don't edit the same row on my web app?

6 votes

I have a PHP/jQuery/AJAX/MySQL app built for managing databases. I want to implement the ability to prevent multiple users from editing the same database row at the same time.

  1. What is this called?
  2. Do I use a token system and who ever has the token can edit it until they release the token?
  3. Do I use a "last edit date/time", comparing the time you loaded the HTML form with the time in the database, and warn you if the database has the more recent edit?
  4. Do I lock the row using database functions?

I'm just not sure which is the best. Assume between 10 and 15 concurrent users.

There are two general approaches: optimistic and pessimistic locking.

Optimistic locking is generally much easier to implement in a web-based environment because it is fundamentally stateless. It scales much better as well. The downside is that it assumes that your users generally won't be trying to edit the same set of rows at the same time. For most applications, that's a very reasonable assumption but you'd have to verify that your application isn't one of the outliers where users would regularly be stepping on each other's toes. In optimistic locking, you would have some sort of last_modified_timestamp column that you would SELECT when a user fetched the data and then use in the WHERE clause when you go to update the data, i.e.

UPDATE table_name
   SET col1 = <<new value>>,
       col2 = <<new values>>,
       last_modified_timestamp = <<new timestamp>>
 WHERE primary_key = <<key column>>
   AND last_modified_timestamp = <<last modified timestamp you originally queried>>

If that updates 1 row, you know you were successful. Otherwise, if it updates 0 rows, you know that someone else has modified the data in the interim and you can take some action (generally showing the user the new data and asking them if they want to overwrite but you can adopt other conflict resolution approaches).
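
The row-count check can be sketched as follows, with SQLite standing in for MySQL. To keep the example deterministic it uses an integer version counter in place of a timestamp, which is a swapped-in simplification; the table, column and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, col1 TEXT, last_modified INTEGER)"
)
conn.execute("INSERT INTO items VALUES (1, 'original', 1)")


def load(conn, item_id):
    """Fetch the value together with the version the user is about to edit."""
    return conn.execute(
        "SELECT col1, last_modified FROM items WHERE id = ?", (item_id,)
    ).fetchone()


def save(conn, item_id, new_value, seen_last_modified):
    """Write only if nobody changed the row since it was loaded."""
    cur = conn.execute(
        "UPDATE items SET col1 = ?, last_modified = last_modified + 1 "
        "WHERE id = ? AND last_modified = ?",
        (new_value, item_id, seen_last_modified),
    )
    return cur.rowcount == 1  # 0 rows updated means someone else got there first


# Two users open the same edit form.
_, alice_seen = load(conn, 1)
_, bob_seen = load(conn, 1)

alice_ok = save(conn, 1, "alice's edit", alice_seen)
bob_ok = save(conn, 1, "bob's edit", bob_seen)  # stale version, so rejected

print(alice_ok, bob_ok, load(conn, 1))
```

When save returns False, the application decides what to do next, for example reloading the row and showing the user what changed.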

Pessimistic locking is more challenging to implement, particularly in a web-based application where users can close their browser without logging out or may start editing some data and go to lunch before hitting Submit. It makes it harder to scale and generally makes the application more difficult to administer. It's really only worth considering if users will regularly try to update the same rows or if updating a row takes a large amount of time for a user so it's worth letting them know up front that someone else has locked the row.

Can BerkeleyDB in perl handle a hash of hashes of hashes (up to n)?

6 votes

I have a script that utilizes a hash, which contains four strings as keys whose values are hashes. These hashes also contain four strings as keys which also have hashes as their values. This pattern continues up to n-1 levels, which is determined at run-time. The nth-level of hashes contain integer (as opposed to the usual hash-reference) values.

I installed the BerkeleyDB module for Perl so I can use disk space instead of RAM to store this hash. I assumed that I could simply tie the hash to a database, and it would work, so I added the following to my code:

my %tags = () ; 
my $file = "db_tags.db" ; 
unlink $file; 


tie %tags, "BerkeleyDB::Hash", 
        -Filename => $file, 
        -Flags => DB_CREATE
     or die "Cannot open $file\n" ;

However, I get the error:

Can't use string ("HASH(0x1a69ad8)") as a HASH ref while "strict refs" in use at getUniqSubTreeBDB.pl line 31, line 1.

To test, I created a new script, with the code (above) that tied the hash to a file. Then I added the following:

my $href = \%tags; 
$tags{'C'} = {} ;

And it ran fine. Then I added:

$tags{'C'}->{'G'} = {} ;

And it would give pretty much the same error. I am thinking that BerkeleyDB cannot handle the type of data structure I am creating. Maybe it was able to handle the first level (C->{}) in my test because it was just a regular key -> scalar?

Anyways, any suggestions or affirmations of my hypothesis would be appreciated.

Use DBM::Deep.

use DBM::Deep;

my $db = DBM::Deep->new( "foo.db" );

$db->{mykey} = "myvalue";
$db->{myhash} = {};
$db->{myhash}->{subkey} = "subvalue";

print $db->{myhash}->{subkey} . "\n";

The code I provided yesterday would work fine with this.

sub get_node {
   my $p = \shift;
   $p = \( ($$p)->{$_} ) for @_;
   return $p;
}

my @seqs = qw( CG CA TT CG );

my $tree = DBM::Deep->new("foo.db");
++${ get_node($tree, split //) } for @seqs;

How to find out the tables that take up maximum memory in database?

6 votes

Hi, I am new to databases. I am working on a huge database and trying to clear up the mess. I want to start by finding the ten tables that take up the most space in the whole database. I cannot check the size of each table individually since there are too many tables. I need the top 10 or 20 tables that take up the maximum space. Any help would be much appreciated. Thank you.

Maybe something like this:

SELECT CONCAT(table_schema, '.', table_name),
       CONCAT(ROUND(table_rows / 1000000, 2), 'M')                                    rows,
       CONCAT(ROUND(data_length / ( 1024 * 1024 * 1024 ), 2), 'G')                    DATA,
       CONCAT(ROUND(index_length / ( 1024 * 1024 * 1024 ), 2), 'G')                   idx,
       CONCAT(ROUND(( data_length + index_length ) / ( 1024 * 1024 * 1024 ), 2), 'G') total_size,
       ROUND(index_length / data_length, 2)                                           idxfrac
FROM   information_schema.TABLES
ORDER  BY data_length + index_length DESC
LIMIT  10;


Dynamically creating date periods using MySQL

5 votes

I'm trying to get the Grapes count for the dates March 1 - 3.

You will notice that on March 2 there are no grapes inserted.

Is it possible to write a query that covers March 1, 2 and 3 but shows a count of 0 for March 2?

My current result only shows dates where there are grapes.

Here is my MySQL query:

SELECT  `fruitDate` ,  `fruitName` , COUNT( * ) 
FROM  `tbl_fruits` 
WHERE  `fruitName` =  "Grapes"
GROUP BY  `fruitDate`

UPDATE 2:

Using this query:

SELECT f.fruitDate, f.fruitName, f1.count FROM tbl_fruits f
    LEFT JOIN (SELECT fruitDate, COUNT(*) as count from tbl_fruits d WHERE d.fruitName='Grapes' GROUP BY d.fruitDate) as f1 ON (f.fruitDate = f1.fruitDate) 
    GROUP BY f.fruitDate

I got a result, but it's displaying a different fruit. Is something wrong with my query?

Remember there is a dynamic (and slightly ugly) way to create a date range that does not require creating a table:

select aDate from (
  select @maxDate - interval (a.a+(10*b.a)+(100*c.a)+(1000*d.a)) day aDate from
  (select 0 as a union all select 1 union all select 2 union all select 3
   union all select 4 union all select 5 union all select 6 union all
   select 7 union all select 8 union all select 9) a, /*10 day range*/
  (select 0 as a union all select 1 union all select 2 union all select 3
   union all select 4 union all select 5 union all select 6 union all
   select 7 union all select 8 union all select 9) b, /*100 day range*/
  (select 0 as a union all select 1 union all select 2 union all select 3
   union all select 4 union all select 5 union all select 6 union all
   select 7 union all select 8 union all select 9) c, /*1000 day range*/
  (select 0 as a union all select 1 union all select 2 union all select 3
   union all select 4 union all select 5 union all select 6 union all
   select 7 union all select 8 union all select 9) d, /*10000 day range*/
  (select @minDate := '2001-01-01', @maxDate := '2002-02-02') e
) f
where aDate between @minDate and @maxDate

Depending on the length of the date range, you can reduce the number of dynamically generated rows (10000 days means over 27 years of records, each representing one day) by removing derived tables (d, c, b and a) and removing them from the formula at the top. Setting the @minDate and @maxDate variables lets you specify the dates between which you want to filter the results.

Edit:

I see you're still looking for a solution. Try this:

select c.date, f.fruitName, count(f.fruitName = 'Grapes')
from tbl_calendar c
left join tbl_fruits f
on c.date = f.fruitDate and f.fruitName = 'Grapes'
group by c.date, f.fruitName

If you also want to filter the extra dates from the created table, use this query:

select c.date, f.fruitName, count(f.fruitName = 'Grapes')
from tbl_calendar c
left join tbl_fruits f
on c.date = f.fruitDate and f.fruitName = 'Grapes'
group by c.date, f.fruitName
having c.date between
  (select min(fruitDate) from tbl_fruits) and
  (select max(fruitDate) from tbl_fruits)
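
The calendar-table join can be tried end to end with a small sketch in SQLite. The sample rows and the year 2012 are assumptions; the table and column names follow the queries above, and the fruit name is left out of the SELECT list so that each date yields exactly one row.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_calendar (date TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE tbl_fruits (fruitDate TEXT, fruitName TEXT)")

# One calendar row per day, March 1 to March 3.
start = date(2012, 3, 1)
conn.executemany(
    "INSERT INTO tbl_calendar VALUES (?)",
    [((start + timedelta(days=n)).isoformat(),) for n in range(3)],
)

conn.executemany(
    "INSERT INTO tbl_fruits VALUES (?, ?)",
    [
        ("2012-03-01", "Grapes"),
        ("2012-03-01", "Apples"),
        ("2012-03-02", "Apples"),  # no grapes on March 2
        ("2012-03-03", "Grapes"),
        ("2012-03-03", "Grapes"),
    ],
)

# COUNT(f.fruitName) skips the NULLs produced by the LEFT JOIN,
# so a date without grapes yields 0 instead of disappearing.
counts = conn.execute(
    """
    SELECT c.date, COUNT(f.fruitName) AS grapes
    FROM tbl_calendar c
    LEFT JOIN tbl_fruits f
           ON c.date = f.fruitDate AND f.fruitName = 'Grapes'
    GROUP BY c.date
    ORDER BY c.date
    """
).fetchall()
print(counts)
```

The fruit filter has to live in the ON clause; moving it to WHERE would discard the unmatched calendar rows and bring back the original problem.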

Breaking out of a loop when a condition occurs, and avoiding the usage of its preset db value

5 votes

Assume a student takes 6 courses in a semester. All those courses have course units (int), and depending on the score in each course there are points:

 a score >=70 will have a point of 5

 <70 and >=60 will have a point of 4

and so on. For each course, the unit and the point are multiplied together, and the products are summed down the column. When the score of a course is not found, the grade is 'AR'. What I want is for the loops to omit the occurrences of AR, i.e. not add the course unit of a course having a grade of 'AR'. But when I run the queries below, the units still add to the total course units.

Query4 is used to generate some rows of course_unit and Score

  $query4 = mysql_query("SELECT  c.course_unit, m.score
  FROM    maintable AS m
  INNER JOIN students AS s ON
  m.matric_no = s.matric_no
  INNER JOIN courses AS c ON
  m.course_code = c.course_code
  WHERE m.matric_no = '".$matric_no."'
  AND m.level = '".$level."'")
  or die (mysql_error());

Query3 is used for the summation of the course_units

 $query3 = mysql_query("SELECT  SUM(c.course_unit) AS 'TOTAL'
 FROM    maintable AS m
 INNER JOIN students AS s ON
 m.matric_no = s.matric_no
 INNER JOIN courses AS c ON
 m.course_code = c.course_code
 WHERE m.matric_no = '".$matric_no."'
 AND m.level = '".$level."'")
 or die (mysql_error());

Grades in Respect to Score

 while ($row8 = mysql_fetch_assoc($query8)) {
            if ($row8['score'] >= 70) {
              $grade = 'A';
            }
            elseif ($row8['score'] >= 60) {
               $grade = 'B';
            }elseif ($row8['score'] >= 50) {
               $grade = 'C';
            }elseif ($row8['score'] >= 45) {
               $grade = 'D';
            }elseif($row8['score'] >= 40) {
               $grade = 'E';
            }elseif($row8['score'] >= 0 && $row8['score'] < 40){
               $grade = 'F';
            }else{
               $grade = 'AR';
            }   
     }   

Calculation of the Grade Point

      $grade_point = 0;
      while ($row4 = mysql_fetch_assoc($query4)) {
         if ($row4['score'] >= 70) {
            $score = 5;
          }
          elseif ($row4['score'] >= 60) {
             $score = 4;
          }elseif ($row4['score'] >= 50) {
             $score = 3;
          }elseif ($row4['score'] >= 45) {
             $score = 2;
          }elseif($row4['score'] >= 40) {
             $score = 1;
          }elseif($row4['score'] >= 0 AND $row4['score'] < 40) {
             $score = 0;
          }else{
             $score = 0;
          } 

          $grade_point += $score * $row4['course_unit'];

      }

I have added

  if ( $grade == 'AR' )
  {
       continue;
  }

But the calculations are still the same. It adds the course_unit value of any course having

$grade == 'AR' .

I'll be most delighted with you answers. Thanks very much.

UPDATE

I have been able to solve the grade point part by adding

     elseif($row4['score'] >= 0 AND $row4['score'] < 40) {
             $score = 0;
          }else{
             $score = 0;
          }

This sets both the scores between 0 and 39 and the default score of <0 (i.e. AR) to zero. But the total still includes the course_unit values of the courses having a grade of AR and a score of -1.

I think this problem is caused by the fact that the course_unit values are preloaded from the database. Any help?

Courses Table Structure
=================

course_id
course_code
course_title
course_unit

I'll be most delighted with your answers. Thank you in anticipation.

Is it as simple as adding "AND NOT 'AR'" to your SELECT SUM statement?

Or... if your DB values are coming in as AR, why can't you use PHP is_int() in your loop? That would allow you to still assign 0 for F, and just skip over any non integer values being sent from your DB.
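
One way to picture the fix: the unit total comes from a separate SUM query that never looks at the score, so skipping AR rows in the grade loop cannot affect it. The sketch below computes both totals in the same loop instead. It assumes, as the question's update suggests, that an AR course is stored with a score of -1 (or no score at all); the function names and sample rows are made up.

```python
def grade_points(score):
    """Map a score to grade points, or None when there is no score (grade 'AR')."""
    if score is None or score < 0:
        return None
    for threshold, points in ((70, 5), (60, 4), (50, 3), (45, 2), (40, 1)):
        if score >= threshold:
            return points
    return 0  # 0-39 is an F: zero points, but the units still count


def totals(rows):
    """Return (total_units, total_grade_points), leaving AR courses out of both."""
    total_units = 0
    total_points = 0
    for course_unit, score in rows:
        points = grade_points(score)
        if points is None:
            continue  # AR: skip the course unit as well as the points
        total_units += course_unit
        total_points += points * course_unit
    return total_units, total_points


# (course_unit, score) pairs as the first query would return them.
rows = [(3, 72), (2, 65), (3, -1), (2, 38)]
total_units, total_points = totals(rows)
print(total_units, total_points)
```

The equivalent change on the SQL side would be to add the same score condition to the WHERE clause of the SUM query.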

SQL query to add values of two columns containing null values?

5 votes

Given table:

    ID   ONE   TWO
    X1   15    15
    X2   10    -
    X3   -     20

This query:

SELECT (ONE + TWO) FROM (TABLE)

Just returns the sum of X1's values but not the others since at least one column has a null value. How can I still add them even if there is a null? i.e. consider the null as a 0 maybe?

SELECT (COALESCE(ONE, 0) + COALESCE(TWO, 0)) FROM (TABLE) 

COALESCE will return the first non-null value found in the parameters from left to right. So, when the first field is null, it will take the 0.

That way, X2 will result in 10 + 0 = 10
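
A quick check of both queries against the table from the question, using SQLite; the table name t is a placeholder, and the dashes in the question are stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, one INTEGER, two INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("X1", 15, 15), ("X2", 10, None), ("X3", None, 20)],
)

# NULL + anything is NULL, so X2 and X3 come back empty.
plain = conn.execute("SELECT id, one + two FROM t ORDER BY id").fetchall()

# COALESCE substitutes 0 for each NULL before the addition.
fixed = conn.execute(
    "SELECT id, COALESCE(one, 0) + COALESCE(two, 0) FROM t ORDER BY id"
).fetchall()

print(plain)
print(fixed)
```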

Optimal solution for massive number of requests on one database table

5 votes

We have a system where customers are allocated a product on a first come first served basis.

Our products table contains an incrementing primary key that started at zero which we use to keep track of how many products have been allocated i.e. a user reserves a product and gets allocated 1, next user gets 2 etc.

The problem is that potentially hundreds of thousands of users will access the system in any given hour, all of whom will be hitting this one table.

Since we need to ensure that each customer is only allocated one product and keep track of how many products have been allocated, we use a row lock for each customer accessing the system to ensure they write to the table before the next customer hits the system - i.e. enforcing the first come first served rule.

We are concerned about the bottleneck that is the processing time of each request coming into SQL Server 2008 Enterprise Edition and the row lock.

We can't use multiple servers as we need to ensure the integrity of the primary key so anything that requires replication isn't going to work.

Does anyone know of any good solutions that are particularly efficient at handling a massive number of requests on one database table?

A bit more info: The table in question essentially contains two fields only - ID and CustomerID. The solution is for a free giveaway of a million products - hence the expectation of high demand and why using the incrementing primary key as a key makes sense for us - once the key hits a million, no more customers can register. Also, the products are all different so allocation of the correct key is important e.g. the first 100 customers entered receive a higher value product than the next 100 etc.

Thanks for any help.

First, to remove the issue of key generation, I would generate them all in advance. It's only 1m rows and it means you don't have to worry about managing the key generation process. It also means you don't have to worry about generating too many rows accidentally, because once you have the table filled, you will only do UPDATEs, not INSERTs.

One important question here is, are all 1m items identical or not? If they are, then it doesn't matter what order the keys are in (or even if they have an order), so as customers submit requests, you just 'try' to UPDATE the table something roughly like this:

UPDATE TOP(1) dbo.Giveaway -- you can use OUTPUT to return the key value here
SET CustomerID = @CurrentCustomerID
WHERE CustomerID IS NULL

IF @@ROWCOUNT = 0 -- no free items left
PRINT 'Bad luck'
ELSE
PRINT 'Winner'

If, on the other hand, the 1m items are different (e.g. item 1 is X, items 2-10 are Y, 11-50 are Z etc.), then you need another solution. In this case it's important to assign customers to keys in the order the requests are submitted, so you should probably look into a queuing system of some kind, perhaps using Service Broker. Each customer adds a request to the queue, then a stored procedure processes them one at a time and assigns each the lowest (MIN) free key, then returns the details of what they won.
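
The pre-generated-keys idea can be sketched with SQLite as a stand-in; it says nothing about SQL Server throughput or Service Broker. It assumes earlier requests should receive lower keys (as the question describes) and that each customer claims at most once. The table is shrunk to three rows, and the claim function is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE giveaway (id INTEGER PRIMARY KEY, customer_id INTEGER)")

# Pre-generate every key up front (3 here instead of 1m); later there are only UPDATEs.
conn.executemany(
    "INSERT INTO giveaway (id, customer_id) VALUES (?, NULL)",
    [(i,) for i in range(1, 4)],
)


def claim(conn, customer_id):
    """Give the customer the lowest free key, or None when everything is taken."""
    with conn:  # one transaction per claim
        cur = conn.execute(
            "UPDATE giveaway SET customer_id = ? "
            "WHERE id = (SELECT MIN(id) FROM giveaway WHERE customer_id IS NULL)",
            (customer_id,),
        )
        if cur.rowcount == 0:
            return None  # no free items left
        return conn.execute(
            "SELECT id FROM giveaway WHERE customer_id = ?", (customer_id,)
        ).fetchone()[0]


results = [claim(conn, customer) for customer in (101, 102, 103, 104)]
print(results)
```

When the table is full, the subquery yields NULL, the UPDATE matches no rows, and the zero row count plays the role of the @@ROWCOUNT check above.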