Best sql questions in June 2011

Integrating openID and oauth as website login, signin and authentication system

14 votes

First of all let me start by saying that this question is not about different openID and oAuth implementations. There are many classes about these.

My question is what to do after authenticating a user:

  • How to add this user to the user table in the database?
  • How to handle different logins for the same user? (Remy Sharp's example suggests something for openID)
  • How to combine oAuth and openID in the database?

Any ideas?

Your question has to main parts to it:

  1. Authentication
  2. Authorization

Usually the two are not treated differently if the identity provider (IP) is your own, which has been the most common setup in web apps until now.

When using an OpenId Provider such as Google, the authentication part is seperated from your control. You will get a token back telling you if the user is authenticated or not. The token will normally contain the following claims: Name, Email and Named Identity where the last is the unique id of the identity at the IP.

So far so good.

The trick is now as you ask, how do I authorize this user?

well, there are a couple of approaches to this.

First off, when you create a local user in your system, you can prepopulate the Name and Email values based off the claims you get from the IP. In this process, you can start and say that all users that have a profile stored in your system are authorized, or you can develop further processes that will add whatever details you need to know about the user.

Then, how do you avoid that the user is not re-registered if they switch from google to facebook as the IP?

This is where things get tricky. The most common claim that Google, Yahoo, Facebook will provide to you is the email address and Name. So what you can do, is try to match the incomming claim with existing customers in your app. This is not failsafe however, as people can have different emails in different systems.

The name value is also not safe.

In our setup, we start by matching emails, as we know that most IPs validate email addresses. This will reduce duplicates a lot. After that check, we start our own validation process where the goal is to see if the person is already registered. This process looks for the customers mobile number in our database, and if a match is found, we send a one-time-password to the customer to verify correct ownership of the phone number.

Since login is a time sensitive setup, we are created a simple SQL table that maps external identities to our customer numbers. This allows us to implement this kind of validation logic outside all our web apps (and thereby reduce code redundancy)

Is MySQL naturally slow at this kind of query, or do I have it misconfigured?

12 votes

The following query is intended to receive a list of unread messages by user. It involves 3 tables: recipients contains a relation of users to message IDs, messages contains the messages themselves, and message_readers contains a list of which users have read which messages.

The query reliably takes 4.9 seconds - this is seriously hurting our performance, and is especially worrisome since we hope the database will eventually be several orders of magnitude larger. Granted, it's an inherently heavy query, but the data set is tiny, and intuitively it seems that it should be much faster. The server has enough memory (32gb) that the entire database should be loaded in RAM at all times, and there's nothing else running on the box.

The tables are all tiny:

recipients: 23581
messages: 9679
message_readers: 2685

The query itself:

SELECT 
    m.*
FROM 
    messages m
INNER JOIN recipients r ON r.message_id = m.id
LEFT JOIN message_readers mr ON mr.message_id = m.id
WHERE
    r.id = $user_id
    AND (mr.read_by_id IS NULL OR mr.read_by_id <> $user_id)

The explain plan is pretty straightforward:

+----+-------------+-------+--------+-----------------------------------+-----------------------------------+---------+--------------------------------+-------+-------------+
| id | select_type | table | type   | possible_keys                     | key                               | key_len | ref                            | rows  | Extra       |
+----+-------------+-------+--------+-----------------------------------+-----------------------------------+---------+--------------------------------+-------+-------------+
|  1 | SIMPLE      | r     | ref    | index_recipients_on_id            | index_recipients_on_id            | 768     | const                          | 11908 | Using where |
|  1 | SIMPLE      | m     | eq_ref | PRIMARY                           | PRIMARY                           | 4       | db.r.message_id                |     1 | Using index |
|  1 | SIMPLE      | mr    | ALL    | NULL                              | NULL                              | NULL    | NULL                           |  2498 | Using where |
+----+-------------+-------+--------+-----------------------------------+-----------------------------------+---------+--------------------------------+-------+-------------+

There IS an index on message_readers.read_by_id, but I guess it can't really use it because of the IS NULL condition.

I'm using all default settings except for the following:

key_buffer=4G
query_cache_limit = 256M
query_cache_size = 1G
innodb_buffer_pool_size=12G

Thanks!

Assuming that message_readers is a subset of recipients, I recommend making the following changes:

  1. Get rid of the message_readers table and replace it with a flag on the recipients table. This will eliminiate the null check and remove a join.

  2. It probably already is, but make sure your clustered index for recipients is id, message_id rather than message_id, id, since nearly all searches for messages will be based on the recipients.

Here is the SELECT that results:

SELECT
    r.whatever,
    m.whatever,
    -- ...
FROM
    recipients r
    INNER JOIN messages m ON m.id = r.message_id
WHERE
    r.id = $user_id
    AND r.read_flag = 'N'

UPDATE

Here is the correct version of your query using the existing scheme:

SELECT
    r.whatever,
    m.whatever,
    -- ...
FROM
    recipients r
    INNER JOIN messages m ON r.message_id = m.id
    LEFT JOIN message_readers mr ON mr.read_by_id = r.id 
                                 AND mr.message_id = m.id
WHERE
    r.id = $user_id
    AND mr.read_by_id IS NULL

This assumes that your clustered indexes are what would be expected:

recipients: id, message_id
messages: id
message_readers: read_by_id, message_id

Complex time-series statistical aggregation involving polymorphic associations

11 votes

Ok. Bear with me, as I need to provide a lot of contextual detail before I can solicit a reasonable answer to my question.

I have a site that allows you to make daily stock picks. The way it works is that you're prompted to make picks between companies that are facing-off for the day. For example, GE vs. IBM. You can make two types of picks: Performance (which stock will perform better?) and Total Volume (will the combined stocks trade at volumes higher or lower than X?). You're given 100 virtual dollars each day to make picks.

Ultimately, our goal here is to track which user makes the most money per pick in various categories (explained below) over the following periods of time: 5 days, 15 days, 30 days, 90 days, 180 days, 1 year, all-time. It's very simple to calculate how much money is made per pick. It's the total money made (or lost) / number of picks.

Now, each company that the user makes a pick on falls under a categorical hierarchy. Generically, the categorical hierarchy looks like this:

Division --> Major Group --> Industry Group --> Classification --> Company

Here are some examples:

  • Mining --> Metal Mining --> Iron Ores --> Brown Ore Mining --> Company A
  • Mining --> Metal Mining --> Iron Ores --> Brown Ore Mining --> Company B
  • Mining --> Metal Mining --> Iron Ores --> Limonite Mining --> Company C
  • Mining --> Metal Mining --> Iron Ores --> Limonite Mining --> Company D
  • Manufacturing --> Tobacco Products --> Cigars --> Stogies --> Company E
  • Manufacturing --> Tobacco Products --> Cigars --> Stogies --> Company F
  • Manufacturing --> Tobacco Products --> Cigars --> Cigarillos --> Company G
  • Manufacturing --> Tobacco Products --> Cigars --> Cigarillos --> Company H
  • …and so on…

There's a model for each category (and corresponding table, of course), and they are associated (think foreign_key) just like you see above.

There is a model for Matchup, with each record representing which companies are facing-off for the day. Each record keeps track of the starting and final stock prices for each company, as well as the the total trade volume.

Each Matchup has one or more :pick_prices that can change throughout the day. Normally, each matchup has a Performance Pick Price and a Total Volume Pick Price. The price determines what the pick will cost you and how much you earn for a correct pick. (Now, this is all just background info. You don't need to worry about those particular price calculations.)

At the end of the trading day, the user's picks are resolved. Picks are represented in a Pick model, with the following attributes:

  • user_id
  • amount_spent (e.g., $10)
  • result (e.g., WON, LOST)
  • pick (e.g., company A)
  • matchup_id
  • pick_price_id
  • amount_won
  • resolved (true or false)
  • created_at
  • updated_at

Currently, when each pick is resolved, another table is updated called pick_records, which has the following attributes:

  • user_id
  • recordable_id
  • recordable_type (Division or Major Group or Industry Group or Classification or Company)
  • picks (total picks made, regardless of pick type)
  • won (total picks won, regardless of pick type)
  • lost (total picks lost, regardless of pick type)
  • money (total money won)
  • money_per_pick (money / picks)
  • performance_picks
  • performance_won
  • performance_lost
  • performance_money
  • performance_money_per_pick
  • volume_picks
  • volume_won
  • volume_lost
  • volume_money
  • volume_money_per_pick
  • created_at
  • updated_at

As you can tell, this is a polymorphic model. The table aggregates the all-time pick record statistics.

So now here's the challenge:

Given the existing design, what do I have to do so that I can capture the user's pick records over the following periods of time: 5 days, 15 days, 30 days, 90 days, 180 days, 1 year, all-time? It needs to be simple, efficient, and fast!

I'm currently running Rails 2.3.11 on MySQL DB.

I don't see the need for table pick_records.
You can do a query like this for any number of days:

SELECT 
   user_id
   ,sum(amount_spent) 
   ,sum(IF(result = 'WON',1,0)) as WON_count
   ,sum(IF(result = 'LOST',1,0)) as LOST_count
   ,pick 
   /*matchup_id*/
   ,sum(pc.price) as price
   ,sum(IF(result = 'WON'),amount_won,0)) as amount_won
   ,sum(IF(result = 'LOST'),amount_won,0)) as amount_lost
   ,sum(IF(result = 'WON'),amount_won,-amount_won)) as nett_amount
FROM picks
INNER JOIN pick_price pc ON (pc.id = user.pick_price_id)
WHERE created_at BETWEEN DATE_SUB(NOW(), INTERVAL 5 DAY) AND NOW()
  AND resolved = 'true'
GROUP BY user_id, pick

Why the Left Outer join?

10 votes

weird one. (Probably not weird, at all)

I have 3 objects, Employee, Rota and Department.

public class Employee
{
    public int Id { get; set; }
    public String Name { get; set; }
    public virtual Department Department { get; set; }
}

internal class EmployeeMapping : EntityTypeConfiguration<Employee>
{
    public EmployeeMapping()
    {
        HasKey(a => a.Id);
        Property(a => a.Id).HasColumnName("UserId");

        HasRequired<Department>(a => a.Department).WithOptional().Map(a => a.MapKey("DepartmentId"));
    }
}

public class Department
{
    public int Id { get; set; }
    public String Name { get; set; }
}

internal class DepartmentMapping : EntityTypeConfiguration<Department>
{
    public DepartmentMapping()
    {
        HasKey(a => a.Id);
        Property(a => a.Id).HasColumnName("DepartmentId");
    }
}

public class Rota
{
    public int Id { get; set; }
    public virtual Employee Employee { get; set; }
    public virtual Department Department { get; set; }
}

internal class RotaMapping : EntityTypeConfiguration<Rota>
{
    public RotaMapping()
    {
        HasKey(a => a.Id);
        Property(a => a.Id).HasColumnName("RotaId");

        HasOptional<Employee>(a => a.Employee).WithOptionalDependent().Map(a => a.MapKey("EmployeeId"));
        HasOptional<Department>(a => a.Department).WithOptionalDependent().Map(a => a.MapKey("DepartmentId"));
    }
}

Not complicated, at all really. Rota can have an Employee and/or a Department assigned to it, all of this is configured using Fluent. All of my associations are correct (the schema is perfect), however I have a weird oddity.

When I do a myContext.Departments.FirstOrDefault() and have a look at the SQL Generated, there is a LEFT OUTER JOIN on Employee & Rota. Why is this there? I don't want it to do this. Maybe my Fluent mappings are incorrect? I've tried all sorts, but can't seem to figure it out. I would understand it if i want a Rota object, that would join on the Department. But not the other way around!

If i do myContext.Departments.AsNoTracking().FirstOrDefault() it doesn't do the LEFT OUTER JOIN's.

Any ideas guys?

Cheers, D

The reason is incorrect mapping. It looks correct but it is not. Use these instead:

internal class EmployeeMapping : EntityTypeConfiguration<Employee>
{
    public EmployeeMapping()
    {
        HasKey(a => a.Id);
        Property(a => a.Id).HasColumnName("UserId");

        HasRequired<Department>(a => a.Department).WithMany()
                                                  .Map(a => a.MapKey("DepartmentId"));
    }
}

internal class RotaMapping : EntityTypeConfiguration<Rota>
{
    public RotaMapping()
    {
        HasKey(a => a.Id);
        Property(a => a.Id).HasColumnName("RotaId");

        HasOptional<Employee>(a => a.Employee).WithMany()
                                              .Map(a => a.MapKey("EmployeeId"));
        HasOptional<Department>(a => a.Department).WithMany()
                                                  .Map(a => a.MapKey("DepartmentId"));
    }
}

Your mapping is correctly interpreted when creating database and database looks correct but EF thinks that you map all relations as one-to-one. That confuse EF and it will generate queries used for one-to-one to create internal entity references. These left joins are necessary for one-to-one relation when you tell EF that dependent entities are optional - EF doesn't know if they exist unless it loads their keys.

How to delete a large record from SQL Server?

10 votes

In a database for a forum I mistakenly set the body to nvarchar(MAX). Well, someone posted the Encyclopedia Britanica, of course. So now there is a forum topic that won't load because of this one post. I have identified the post and ran a delete query on it but for some reason the query just sits and spins. I have let it go for a couple hours and it just sits there. Eventually it will time out.

I have tried editing the body of the post as well but that also sits and hangs. When I sit and let my query run the entire database hangs so I shut down the site in the mean time to prevent further requests while it does it's thinking. If I cancel my query then the site resumes as normal and all queries for records that don't involve the one in question work fantastically.

Has anyone else had this issue? Is there an easy way to smash this evil record to bits?

Update: Sorry, the version of SQL Server is 2008.

Here is the query I am running to delete the record:

DELETE FROM [u413].[replies] WHERE replyID=13461

I have also tried deleting the topic itself which has a relationship to replies and deletes on topics cascade to the related replies. This hangs as well.

Option 1. Depends on how big the table itself and how big are the rows.

  1. Copy data to a new table:

    SELECT *
    INTO tempTable
    FROM replies WITH (NOLOCK)
    WHERE replyID != 13461
    

    Although it will take time, table should not be locked during the copy process

  2. Drop old table

    DROP TABLE replies
    

    Before you drop:
    - script current indexes and triggers so you are able to recreate them later
    - script and drop all the foreign keys to the table

  3. Rename the new table

    sp_rename 'tempTable', 'replies'
    
  4. Recreate all the foreign keys, indexes and triggers.

Option 2. Partitioning.

  1. Add a new bit column, called let's say 'Partition', set to 0 for all rows except the bad one. Set it to 1 for bad one.

  2. Create partitioning function so there would be two partitions 0 and 1.

  3. Create a temp table with the same structure as the original table.

  4. Switch partition 1 from original table to the new temp table.

  5. Drop temp table.

  6. Remove partitioning from the source table and remove new column.

Partitioning topic is not simple. There are some examples in the internet, e.g. Partition switching in SQL Server 2005

Large SQL transaction: runs out of memory on PostgreSQL, yet works on SQL Server

9 votes

I have decided to move my C# daemon application (using dotConnect as ADO.NET provider) from SQL Server 2008 R2 to PostgreSQL 9.0.4 x64 (on Windows Server 2008 R2). Therefore I slightly modified all queries to match PostgreSQL syntax and... got stuck on behavior which never happened with the same queries on SQL Server (not even on lowly Express edition).

Let's say the database contains 2 very simple tables without any relation to each other. They look somewhat like this: ID, Name, Model, ScanDate, Notes. I have a transformation process which reads data over TCP/IP, processes it, starts a transaction and puts the results into aforementioned 2 tables using vanilla INSERTs. The tables are initially empty; no BLOB columns. There are about 500.000 INSERTs on a bad day, all wrapped in a single transaction (and cannot be split into multiple transactions, btw). No SELECTs, UPDATEs or DELETEs are ever made. An example of INSERT (ID is bigserial - autoincremented automatically):

INSERT INTO logs."Incoming" ("Name", "Model", "ScanDate", "Notes")
VALUES('Ford', 'Focus', '2011-06-01 14:12:32', NULL)

SQL Server calmly accepts the load while maintaining a reasonable Working Set of ~200 MB. PostgreSQL, however, takes up additional 30 MB each second the transaction runs (!) and quickly exhausts system RAM.

I've done my RTFM and tried fiddling with postgresql.conf: setting "work_mem" to a minimum 64 kB (this slightly slowed down the RAM hogging), reducing "shared_buffers" / "temp_buffers" to minimum (no difference), - but to no avail. Reducing transaction isolation level to Read Uncommitted didn't help. There are no indexes except the one on ID BIGSERIAL (PK). SqlCommand.Prepare() makes no difference. No concurrent connections ever are established: daemon uses the database exclusively.

It may seem PostgreSQL cannot cope with mind-numbingly simple INSERT-fest, while SQL Server can do that. Maybe it's a PostgreSQL snapshot-vs-SQL Server locks isolation difference? It's a fact for me: vanilla SQL Server works, while neither vanilla nor tweaked PostgreSQL does.

What can I do to make PostgreSQL memory consumption to remain flat (as is apparently the case with SQL Server) while INSERT-based transaction runs?

EDIT: I have created an artificial testcase:

DDL:

CREATE TABLE sometable
(
  "ID" bigserial NOT NULL,
  "Name" character varying(255) NOT NULL,
  "Model" character varying(255) NOT NULL,
  "ScanDate" date NOT NULL,
  CONSTRAINT "PK" PRIMARY KEY ("ID")
)
WITH (
  OIDS=FALSE
);

C# (requires Devart.Data.dll & Devart.Data.PostgreSql.dll)

PgSqlConnection conn = new PgSqlConnection("Host=localhost; Port=5432; Database=testdb; UserId=postgres; Password=###########");
conn.Open();
PgSqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted);

for (int ii = 0; ii < 300000; ii++)
{
    PgSqlCommand cmd = conn.CreateCommand();
    cmd.Transaction = tx;
    cmd.CommandType = CommandType.Text;
    cmd.CommandText = "INSERT INTO public.\"sometable\" (\"Name\", \"Model\", \"ScanDate\") VALUES(@name, @model, @scanDate) RETURNING \"ID\"";
    PgSqlParameter parm = cmd.CreateParameter();
    parm.ParameterName = "@name";
    parm.Value = "SomeName";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@model";
    parm.Value = "SomeModel";
    cmd.Parameters.Add(parm);

    parm = cmd.CreateParameter();
    parm.ParameterName = "@scanDate";
    parm.PgSqlType = PgSqlType.Date;
    parm.Value = new DateTime(2011, 6, 1, 14, 12, 13);
    cmd.Parameters.Add(parm);

    cmd.Prepare();

    long newID = (long)cmd.ExecuteScalar();
}

tx.Commit();

This recreates the memory hogging. HOWEVER: if the 'cmd' variable is created and .Prepare()d outside the FOR loop, the memory does not increase! Apparently, preparing multiple PgSqlCommands with IDENTICAL SQL but different parameter values does not result in a single query plan inside PostgreSQL, like it does in SQL Server.

The problem remains: if one uses Fowler's Active Record dp to insert multiple new objects, prepared PgSqlCommand instance sharing is not elegant.

Is there a way/option to facilitate query plan reuse with multiple queries having identical structure yet different argument values?

UPDATE

I've decided to look at the simplest possible case - where a SQL batch is run directly on DBMS, without ADO.NET (suggested by Jordani). Surprisingly, PostgreSQL does not compare incoming SQL queries and does not reuse internal compiled plans - even when incoming query has the same identical arguments! For instance, the following batch:

PostgreSQL (via pgAdmin -> Execute query) -- hogs memory

BEGIN TRANSACTION;

INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100.000 times

COMMIT;

SQL Server (via Management Studio -> Execute) -- keeps memory usage flat

BEGIN TRANSACTION;

INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO [dbo].sometable ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- the same INSERT is repeated 100.000 times

COMMIT;

and the PostgreSQL log file (thanks, Sayap!) contains:

2011-06-05 16:06:29 EEST LOG:  duration: 0.000 ms  statement: set client_encoding to 'UNICODE'
2011-06-05 16:06:43 EEST LOG:  duration: 15039.000 ms  statement: BEGIN TRANSACTION;

INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES('somename', 'somemodel', '2011-06-01 14:12:19');
-- 99998 lines of the same as above
COMMIT;

Apparently, even after transmitting the whole query to the server as-is, the server cannot optimize it.

ADO.NET driver alternative

As Jordani suggested, I've tried NpgSql driver instead of dotConnect - with the same (lack of) results. However, Npgsql source for .Prepare() method contains such enlightening lines:

planName = m_Connector.NextPlanName();
String portalName = m_Connector.NextPortalName();
parse = new NpgsqlParse(planName, GetParseCommandText(), new Int32[] { });
m_Connector.Parse(parse);

The new content in the log file:

2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  statement: BEGIN; SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
2011-06-05 15:25:26 EEST LOG:  duration: 1.000 ms  parse npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  bind npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 1.000 ms  execute npgsqlplan1: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  parse npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  bind npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  execute npgsqlplan2: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"
2011-06-05 15:25:26 EEST DETAIL:  parameters: $1 = 'SomeName', $2 = 'SomeModel', $3 = '2011-06-01'
2011-06-05 15:25:26 EEST LOG:  duration: 0.000 ms  parse npgsqlplan3: INSERT INTO public."sometable" ("Name", "Model", "ScanDate") VALUES($1::varchar(255), $2::varchar(255), $3::date) RETURNING "ID"

Inefficiency is quite obvious in this log excerpt...

Conclusions (such as they are)

Frank's note about WAL is another awakening: something else to configure that SQL Server hides away from a typical MS developer.

NHibernate (even in its simplest usage) reuses prepared SqlCommands properly...if only it was used from the start...

it is obvious that an architectural difference exists between SQL Server and PostgreSQL, and the code specifically built for SQL Server (and thus blissfully unaware of the 'unable-to-reuse-identical-sql' possibility) will not work efficiently on PostgreSQL without major refactoring. And refactoring 130+ legacy ActiveRecord classes to reuse prepared SqlCommand objects in a messy multithreaded middleware is not a 'just-replace-dbo-with-public'-type affair.

Unfortunately for my overtime, Eevar's answer is correct :)

Thanks to everyone who pitched in!

I suspect you figured it out yourself. You're probably creating 500k different prepared statements, query plans and all. Actually, it's worse than that; prepared statements live outside of transaction boundaries and persist until the connection is closed. Abusing them like this will drain plenty of memory.

If you want to execute a query several times but avoid the planning overhead for each execution, create a single prepared statement and reuse that with new parameters.

If your queries are unique and ad-hoc, just use postgres' normal support for bind variables; no need for the extra overhead from prepared statements.

Representing time periods in the UI and database

9 votes

I've recently adopted a project with an Employee model that needs to contain the person's available hours as an attribute.

The existing form uses 168 checkboxes to represent each hour in the week, and stores the information as seven 24 bit binary strings in the database, each bit acting as a boolean true or false for its corresponding hour in that day.

I'd really like to transition to something a little more elegant and manageable, but I haven't been able to come up with any simple solutions that match the existing implementation's flexibility.

Storing time periods as start and end times can be just as tedious to input when there can be multiple per day, and would likely make querying for availability at a particular time more complicated.

Is there a best practice for dealing with this type of information, both in the user interface and the database structure?

I would model the data in the database this way.

Employee/Day/Hour Relationship

There's a many to many relationship between the employees and hours for each day of the week.

On the UI side, you could use checkboxes for the days and multiselect list boxes to set the hours for the given day.

SQL Server aggregate performance

9 votes

I am wondering whether SQL Server knows to 'cache' if you like aggregates while in a query, if they are used again.

For example,

Select Sum(Field),
       Sum(Field) / 12
From   Table

Would SQL Server know that it has already calculated the Sum function on the first field and then just divide it by 12 for the second? Or would it run the Sum function again then divide it by 12?

Thanks

It calculates once

Select
   Sum(Price),
   Sum(Price) / 12
From
   MyTable

The plan gives:

|--Compute Scalar(DEFINE:([Expr1004]=[Expr1003]/(12.)))
  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
     |--Stream Aggregate(DEFINE:([Expr1010]=Count(*), [Expr1011]=SUM([myDB].[dbo].[MyTable].[Price])))
        |--Index Scan(OBJECT:([myDB].[dbo].[MyTable].[IX_SomeThing]))

This table has 1.35 million rows

SQL Select 'n' records without a Table

8 votes

Is there a way of selecting a specific number of rows without creating a table. e.g. if i use the following:

SELECT 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

It will give me 10 across, I want 10 New Rows.

Thanks

You can use a recursive CTE to generate an arbitrary sequence of numbers in T-SQL like so:

DECLARE @start INT = 1;
DECLARE @end INT = 10;

WITH numbers AS (
    SELECT @start AS number
    UNION ALL
    SELECT number + 1 
    FROM  numbers
    WHERE number < @end
)
SELECT *
FROM numbers
OPTION (MAXRECURSION 0);

Should I open() and close() my SQL database constantly or leave it open?

7 votes

I am creating an app that uses an SQL database for storing data. The way the app is designed it will be updated every 3 minutes or so with new data depending on user actions while the app is running.

In the tutorials I have seen, they recommend that you close the database after changing it (it is an "expensive" in terms of resources).

Is it better to leave it open for the duration of my App since it is being updated on a fairly frequent basis or should I run the close() method immediately after every change?

I guess my worry is that opening it and closing it constantly will draw more resources than it would to leave it open the entire time.

I tried to leave a connection open once -- I used it to stuff a repeater or something -- can't remember now.

Later in the program, I had another need to use the connection -- I think I had it so that when a user clicked a parent item in the repeater, a detail div would pop-up with more information for that item. This generated an error -- something to the effect of 'cannot open() on an open connection'.

I think the error might have been avoidable another way (like checking to see if the connection I was trying to open was already open), but as I thought about it, I realized I'd have to make that a standard practice throughout my app, and that seemed like too much work, so I just made it a standard practice to always close my connections after each use.

Connections stay in a connection pool -- I'm no whiz on that -- but if was curious about performance, I guess I'd keep that in mind, in terms of what it costs to open a connection multiple times -- whatever your situation is requiring anyway.

Another thought is that your DB admin may be able to force-close all open connections, or the db may close for some other reason. If you're not the dba, you might give a thought to risk/benefit of depending on something that you don't have long-term control over like keeping the connection open.

Java API for SQL Data Definition Language

7 votes

Before I write one, is there a Java API for manipulating a database. Like an object orientated wrapper around java.sql.DatabaseMetaData, with support for things like Schema.createTable(name, columns)?

Obviously correct SQL statements should be executed in the background based on the DB in use.

I'm specifically interested in an API for performing DDL statements.

As far as I can remember some tools such as NetBeans can create/ modify database schemas on-the-fly. You could have a look at the source code if you don't find a specific library.

SQL Server 2008 Query Editor changes the query logic

7 votes

I hand coded a simple SQL in SQL Server 2008 as below;

SELECT * FROM Tab1 WHERE A='1' AND (B='1' OR C='1');

Being lazy I opened this query in the Query Editor to validate the syntax and pressed OK on the dialog without making any changes.

I noticed that the Query Editor had changed my query to:

SELECT * FROM Tab1 WHERE A='1' AND (B='1') OR (C='1');

clearly this changes the logic of the SQL and returns different results depending on which one you execute.

I routinely use the Query Editor to validate my syntax on complex queries. So a little worried that the a subtle change like this would go unotice, but would change the outcome.

Is this a feature of the designer? Is there something I can do to change this behavior?

EDIT: Thanks for pointing out that the changes made by the editor is not quite the same as above, but still the query is modified although the results are the same.

Thanks

I tried to replicate this in the Query Designer and had a slightly different result. I typed the same as you:

SELECT * FROM Tab1 WHERE A='1' AND (B='1' OR C='1');

And got this:

SELECT     *
FROM         Tab1
WHERE     (A = '1') AND (B = '1') OR
                      (A = '1') AND (C = '1')

I have to say that the result is the same, but we can all see a dangerous road here. Also, I did not like the (A = '1') replication. Heck, I want the code how I coded it!

A word to the wise: I never format my queries in SQL Server Management Studio. Have you seen what it does to your view's code? I hate it. I just code somewhere else and paste in SMS when done.

Update with sub select - How to handle NULL values?

6 votes

I'm trying an update with a conditional sub-select which could return null...

UPDATE 
aTable SET 
aColumn = 
(   
    SELECT TOP 1    
        CASE 
            WHEN bTable.someColumn = 1 THEN someValue1 
            WHEN bTable.someColumn = 2 THEN someValue2 
            ELSE someValue3
        END  
    FROM         
        bTable
    WHERE
        bTable = @someCriteria
    ORDER BY
        someSortColumn
) WHERE 
aTable.id = @someId;

If the "bTable = @someCriteria" clause causes no results to be returned from the SELECT, it attempts to insert a NULL into "aColumn", which in this case is a NOT NULL column.

Question

How do I get it to simply leave "aColumn" alone in this circumstance?

Many thanks.

...
aColumn = 

    ISNULL(
        (   
            SELECT TOP 1    
                CASE 
                    WHEN bTable.someColumn = 1 THEN someValue1 
                    WHEN bTable.someColumn = 2 THEN someValue2 
                    ELSE someValue3
                END  
            FROM         
                bTable
            WHERE
                bTable = @someCriteria
            ORDER BY
                someSortColumn
        ), aColumn)
...

Re-define wait method in a Java interface

6 votes

I would like to use wait(int) as the signature of a method in a fluent API (used for http://www.jooq.org). The goal is to be able to construct SQL queries like this example:

SELECT * FROM T_AUTHOR
WHERE ROWNUM <= 1
FOR UPDATE OF FIRST_NAME, LAST_NAME
WAIT 5

The full FOR UPDATE clause syntax specification (at least for Oracle) can be seen here:

FOR UPDATE [ OF [ [ schema. ] { table | view } . ] column
             [, [ [ schema. ] { table | view } . ] column]...]
[ { NOWAIT | WAIT integer | SKIP LOCKED } ]

http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/img_text/for_update_clause.htm

With jOOQ, I really want to stay close to the SQL syntax. So I'd like to be able to model the above SQL clause with the jOOQ fluent API like this:

Result<Record> result = create.select()
                              .from(T_AUTHOR)
                              .limit(1)
                              .forUpdate()
                              .of(FIRST_NAME, LAST_NAME)
                              .wait(5) // Here's the issue
                              .fetch();

The fetch method is used to render the API's underlying object as SQL and run the SQL statement against an Oracle (or any other) database. The above can be legally specified in an interface:

/**
 * A type that models a "step" in the creation of a query using the fluent API
 */
public interface SelectForUpdateWaitStep extends SelectFinalStep {
    // [...]

    /**
     * Add a "FOR UPDATE .. WAIT n" clause to the query
     */
    SelectFinalStep wait(int seconds);

    // [...]
}

I have some doubts about this, though, because there is a risk of collision with another method:

public class Object {
    // [...]

    public final native void wait(long timeout) throws InterruptedException;

    // [...]
}

Thanks to method-overloading (int vs. long arguments), I can actually do this. But I'm afraid it might confuse my users and lead to mistakes. So this would be wrong:

                              .forUpdate()
                              .of(FIRST_NAME, LAST_NAME)
                              .wait((long) 5) // This doesn't make sense
                              .fetch();       // This doesn't compile

So my questions are:

  1. Can I somehow prevent calling/accessing Object.wait(long) altoghether? I don't think so because it's declared final but maybe someone knows a compiler-trick, or something else?
  2. Do you have a better idea for my API design apart from just renaming the method to something silly like doWait(int) or WAIT(int)?

What this requires is a way to disable an Object method. And main reason seems to be because it has a nice name that would fit the purposes of a proprietary API.

At first, this contradicts the entire idea of inheritance -- once you inherit from a class, all subclasses must expose the same non-private fields & method. You can always override a method, except when (1) it is marked as final and (2) it has an incompatible (non-covariant) return type, both of which are true with the void wait(long) method.

Furthermore, since every object is an Object in Java, everything must have a method void wait(long) and there should be no way to hide/delete/disable/forward/override it. Assuming it were possible to hide the void wait(long) method, how would you go about invoking it, should you wish to invoke it?

However, assuming you would never need to invoke void wait(long) for your particular classes, there is always the approach of source/byte-code weaving that AspectJ uses in order to make changes to the .class Java bytecode based on certain invocation rules. You could trap every call to wait(long) and declare an error/warning. See more here: http://www.eclipse.org/aspectj/doc/released/adk15notebook/annotations-decp.html

However, native method pointcuts are not possible even with AspectJ with byte-code weaving. Most likely, this is not possible even with source-code weaving -- but it might be worth a try.

Compatible SQL to test for not null and not empty strings

6 votes

I want to have compatible SQL for both Oracle database and Microsoft SQL server.

I want a compatible SQL expression that will return true for not null and not empty strings.

If I use:

column <> ''

it will work on Microsoft SQL server but not on Oracle database (as '' is null for Oracle)

If I use:

len(column) > 0

it will work on Microsoft SQL server but not on Oracle database (since it uses length() )

NULLIF is available on both Oracle (doc) and SQL Server (doc). This expression should work:

NULLIF(column, '') IS NOT NULL

In both servers, if column is NULL, then the output of NULLIF will just pass the NULL value through. On SQL Server, '' = '', so the output of NULLIF will be NULL. On Oracle, '' is already NULL, so it gets passed through.

This is my test on SQL Server 2008 R2 Express:

WITH SampleData AS
    (SELECT 1 AS col1, CAST(NULL AS varchar(10)) AS col2
     UNION ALL
     SELECT 2, ''
     UNION ALL
     SELECT 3, 'hello')
SELECT *
  FROM SampleData
 WHERE NULLIF(col2, '') IS NOT NULL;

And this is my test case on Oracle 10g XE:

WITH SampleData AS
    (SELECT 1 AS col1, NULL AS col2 FROM DUAL
     UNION ALL
     SELECT 2, '' FROM DUAL
     UNION ALL
     SELECT 3, 'hello' FROM DUAL)
SELECT *
  FROM SampleData
 WHERE NULLIF(col2, '') IS NOT NULL;

Both return 3 as expected.

How dangerous is it to provide a means for the public to run SELECT queries on a database?

5 votes

Suppose I do the following:

  • I create a MySQL database, and populate it with some data.
  • I create a MySQL user who has access only to that database, and who only has SELECT privileges.
  • I create a web page through which a user (any user, no password required) can enter arbitrary SQL, and on submitting the form, a script attempts to run the SQL as the MySQL user I created; any result set generated is displayed to the user; any error message generated is displayed to the user.
  • Assume that the database contains no stored procedures etc, just tables and views, and that I am happy for anybody to see any of the contents of that specific database.

We assume that the setup will be probed by a malicious user. What is the worst that could happen?

Some thoughts:

  • MySQL provides various statements like SHOW etc. that a user even having only SELECT privileges could use to gather information about the database server or about my databases. Other information could be obtained from error messages. While probably not sufficient to gain improper access, this information could surely help in doing so.
  • There might be flaws in the database software, or in my scripts, or in the scripting language itself, that could allow a visitor to do things they are not supposed to be able to do through this interface.
  • Doing this might violate a terms of service agreement, particularly if I am using shared hosting.

Hmmm. Clever users may attack via syntax like:

select some_function_that_updates() from some_table;

And there's a denial of service attack that could blow memory, like:

select * from some_massive_table cross join some_other_massive_table;

And frankly, it's hard enough for experienced programmers to write queries that behave well... what chance does a poor user have even if they try to write a good query

MySQL PRIMARY KEYs: UUID / GUID vs BIGINT (timestamp+random)

5 votes

tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?

Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.

My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.

My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.

Another possible annoyance of UUIDs stored in a binary column is the fact that rows IDs aren't obvious when looking at the data in PhpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.

As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So lets say my random number came about to be 123456, my generated ID today would come out as: 1308022796123456

A one in 10 million chance of a conflict for rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.

One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?

Pros of my BIGINT idea:

  • Gives me the multi-server/merging advantages of UUIDs
  • Requires very little change to my application code (everything is already programmed to handle integers for IDs)
  • Half the storage of a UUID (8 bytes vs 16 bytes)

Cons:

  • ??? - Please let me know if you can think of any.

Some follow up questions to go along with this:

  1. Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?

  2. Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.

Thanks for any tips. Sorry about the wall of text.

I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, and then change to 6, but you would be surprised how often that the errors still happen.

Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail to due inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.

Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer none-the-less). However MySQL supports a native UUID type (never use varchar as a primary key!!), and can handle indexing, searching,etc pretty damn efficiently even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being index (unless you want to index on a longtext or something ridiculous like that).

To answer you question: Bigint (with random numbers attached) will be ok if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite-the-bullet and go for the more substantial option.

You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)

SVN database versioning for multiple developers environment

5 votes

I'd like to know if there is a scenario for versioning database with SVN which will ensure no conflicts when few developers try to commit changes simultaneously.

Me and my team have been using changescripts with increasing schema version number (similar to this solution: http://odetocode.com/blogs/scott/archive/2008/02/02/versioning-databases-change-scripts.aspx ).

It's a pretty good solution, but its main flaw is that conflicts can occur when multiple developers try to commit change script with the same schema number - it's not only a simple SVN conflict, but also requires users with that conflict to manually change database table with schema versions, revert their db changes, change script files' numbers to have all the db updates. Is it possible to avoid this obstacles? I don't mean technical solutions only, but maybe there is a better way to organize this task? Any ideas?

Rails solved this exact problem by using a timestamp instead of an incrementing version number. The odds of two users creating new schema versions in the same second is pretty low.

Small footprint embedded Java SQL database

5 votes

I'm looking for an embedded SQL database for Java, with a small footprint. Something like SQLite, but it should be pure Java. So far I've seen HSQLDB, H2 and Apache Derby, but they're not what could be called small footprint. Ideally, the database I'm looking for would be embeddable in a J2ME application.

Is there something like this?

You may try http://www.jepstone.net/tinySQL/.

Pros:

  • It is small (93k!)
  • It is embeddable
  • It uses DBF or text files files to store data, so they are easy to read.

Cons:

  • It is an old unmaintained project
  • It is not designed to work in j2me, but since it can work in JDK 1.1.8 it won't be hard to make it work in j2me. Of course you will have to change some code from using RandomAccessFile to FileConnection and stuff like that, but at least you wont need to mess with generics related code.
  • It is not very fast, because it does not use indexes, so you need to try and see if it is fits yuor needs
  • It is not feature complete, just gives you a small subset of SQL

Data normalization and writing queries

5 votes

I'm a jr. developer (5 months on the job), and I have a question about data normalization. Now, as I understand it, the general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum. In my project, one of the DB people created a DB. We have 50+ tables, and the tables in the DB are usually very fragmented, ie. a table has two or three columns and that's it. Now, when it comes to writing sql queries, it has become something of a minor hassle since each query involves combing through several different tables and joining them together. I was wondering if this is a a side effect of data normalization? Or does this point to something else?

I know that the easiest thing to do, for me, would be to write tables based off the queries I have to write. This will create a DB with a lot of redundant data, but I was curious if there is a happy medium?

Just as a postscript, I don't want to come across like I'm whining about my work, but I'm genuinely curious to learn more about this. My work environment is not the most friendly so I don't feel comfortable posing this question with my colleagues. However, I would appreciate any thoughts, books, tutorials or opinions from more experienced people.

Thanks.

general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum.

Only partly true.

Normalization is not about "redundancy".

It's about "update anomalies".

1NF is the "don't use arrays" rules. Breaking 1NF means a row isn't atomic, but a collection and independent updates in the collection wouldn't work out well. There'd be locking and slowness.

2NF is the "one key" rule. Each row has exactly one key and everything in the row depends on the key. There are no dependencies on part of the key. Some folks like to talk about candidate keys and natural keys and foreign keys; they may exist or they may not. 2NF is satisfied when all attributes depend on one key. If the key is a single-column surrogate key, this normal form is trivially satisfied.

If 2NF is violated, you've got columns which depend on part of a key, but not the whole key. If you had a table with (Part Number, Revision Number) as a key, and attributes of color and weight, where weight depends on the whole key, but color only depends on the part number. You have a 2NF problem where you could update some part colors but not others, creating data anomalies.

3NF is the "only the key" rule. If you put derived data in a row, and change the derived result, it doesn't match the source columns. If you change a source column without updating the derived value, you have a problem, too. Yes, triggers are a bad hackaround to allow 3NF design violations. That's not the point. The point is merely to define 3NF and show that it prevents an update problem.

each query involves combing through several different tables and joining them together. I was wondering if this is a a side effect of data normalization?

It is.