Best database questions in September 2010

is email address a bad primary key

73 votes

Is email address a bad candidate for primary when compared to auto incrementing numbers. Our web application needs the email address to be unique in the system. So, I thought of using email address as primary key. But, my colleague suggests that string comparison will be slower to integer comparison. Is it a valid reason to not use email ids as primary key.

We are using postgres Thanks

String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.

If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.

Why is SELECT * considered harmful?

37 votes

Why is SELECT * bad practice? Wouldn't it mean less code to change if you added a new column you wanted?

I understand that SELECT COUNT(*) is a performance problem on some DBs, but what if you really wanted every column?

The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.

Performance

The * shorthand can be slower because:

  • Not all the fields are indexed, forcing a full table scan - less efficient
  • What you save to send SELECT * over the wire risks a full table scan
  • Returning more data than is needed
  • Returning trailing columns using variable length data type can result in search overhead

Maintenance

When using SELECT *:

  • Someone unfamiliar with the codebase would be forced to consult documentation to know what columns are being returned before being able to make competent changes. Making code more readable, minimizing the ambiguity and work necessary for people unfamiliar with the code saves more time and effort in the long run.
  • If code depends on column order, SELECT * will hide an error waiting to happen if a table had its column order changed.
  • Even if you need every column at the time the query is written, that might not be the case in the future
  • the usage complicates profiling

Design

SELECT * is an anti-pattern:

  • The purpose of the query is less obvious; the columns used by the application is opaque
  • It breaks the modularity rule about using strict typing whenever possible. Explicit is almost universally better.

When Should "SELECT *" Be Used?

It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.

Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.

Local Sql database support for Windows phone 7

24 votes

I would like to develop for Windows Phone 7 but the lack of database support still blocks me. Has anyone found a good library, stack to use for accessing local database from WinPhone7? Can you post links/examples of good practices to select/insert data in DB from a WinPhone7 application?

Thanks in advance ! c.

[edited] any updates available?

There are lots of options:

WinPhone7Db
Perst
SQLite (and here)
siaqodb
db4o
Ninja Database Pro
Rapid Repository
Sterling
Effiproz
SQL Compact Edition will be coming in the "Mango" update - due later this year

Entity Framework and Connection Pooling

16 votes

I've recently started to use the Entity Framework 4.0 in my .NET 4.0 application and am curious about a few things relating to pooling.

  1. Connection pooling as I know is managed by the ADO.NET data provider, in my case that of MS SQL server. Does this apply when you instantiate a new entities context (ObjectContext), i.e. the parameterless new MyDatabaseModelEntities()?

  2. What are the advantages and disadvantages of a) creating a global entities context for the application (i.e. one static instance) or b) creating and exposing an entities context for each given operation/method, with a using block.

  3. Any other recommendations, best practices, or common approaches for certain scenarios that I should know about?

  1. Connection pooling is handled as in any other ADO.NET application. Entity connection still uses traditional database connection with traditional connection string. I believe you can turn off connnection pooling in connection string if you don't want to use it.
  2. Never ever use global context. ObjectContext internally implements several patterns including Identity Map and Unit of Work. Impact of using global context is different per application type.
  3. For web applications use single context per request. For web services use single context per call. In WinForms or WPF application use single context per form or per presenter. There can be some special requirements which will not allow to use this approach but in most situation this is enough.

If you want to know what impact has single object context for WPF / WinForm application check this article. It is about NHibernate Session but the idea is same.

Edit:

When you use EF it by default loads each entity only once per context. The first query creates entity instace and stores it internally. Any subsequent query which requires entity with the same key returns this stored instance. If values in the data store changed you still receive the entity with values from the initial query. This is called Identity map pattern. You can force the object context to reload the entity but it will reload a single shared instance.

Any changes made to the entity are not persisted until you call SaveChanges on the context. You can do changes in multiple entities and store them at once. This is called Unit of Work pattern. You can't selectively say which modified attached entity you want to save.

Combine these two patterns and you will see some interesting effects. You have only one instance of entity for the whole application. Any changes to the entity affect the whole application even if changes are not yet persisted (commited). In the most times this is not what you want. Suppose that you have an edit form in WPF application. You are working with the entity and you decice to cancel complex editation (changing values, adding related entities, removing other related entities, etc.). But the entity is already modified in shared context. What will you do? Hint: I don't know about any CancelChanges or UndoChanges on ObjectContext.

I think we don't have to discuss server scenario. Simply sharing single entity among multiple HTTP requests or Web service calls makes your application useless. Any request can just trigger SaveChanges and save partial data from another request because you are sharing single unit of work among all of them.

Even for a readonly application a global context is not a good choice because you probably want fresh data each time you query the application.

proper hibernate annotation for byte[]

10 votes

I have an application using hibernate 3.1 and JPA annotations. It has a few objects with byte[] attributes (1k - 200k in size). It uses the JPA @Lob annotation, and hibernate 3.1 can read these just fine on all major databases -- it seems to hide the JDBC Blob vendor peculiarities (as it should do).

@Entity
public class ConfigAttribute {
  @Lob
  public byte[] getValueBuffer() {
    return m_valueBuffer;
  }
}

We had to upgrade to 3.5, when we discovered that hibernate 3.5 breaks (and won't fix) this annotation combination in postgresql (with no workaround). I have not found a clear fix so far, but I did notice that if I just remove the @Lob, it uses the postgresql type bytea (which works, but only on postgres).

annotation                   postgres     oracle      works on
-------------------------------------------------------------
byte[] + @Lob                oid          blob        oracle
byte[]                       bytea        raw(255)    postgresql
byte[] + @Type(PBA)          oid          blob        oracle
byte[] + @Type(BT)           bytea        blob        postgresql

once you use @Type, @Lob seems to not be relevant
note: oracle seems to have deprecated the "raw" type since 8i.

I am looking for a way to have a single annotated class (with a blob property) which is portable across major databases.

  • What is the portable way to annotate a byte[] property?
  • Is this fixed in some recent version of hibernate?

Update: After reading this blog I have finally figured out what the original workaround in the JIRA issue was: Apparently you are supposed to drop @Lob and annotate the property as:

@Type(type="org.hibernate.type.PrimitiveByteArrayBlobType") 
byte[] getValueBuffer() {...

However, this does not work for me -- I still get OIDs instead of bytea; it did however work for the author of the JIRA issue, who seemed to want oid.

After the answer from A. Garcia, I then tried this combo, which actually does work on postgresql, but not on oracle.

@Type(type="org.hibernate.type.BinaryType") 
byte[] getValueBuffer() {...

What I really need to do is control which @org.hibernate.annotations.Type the combination (@Lob + byte[] gets mapped) to (on postgresql).


Here is the snippet from 3.5.5.Final from MaterializedBlobType (sql type Blob). According to Steve's blog, postgresql wants you to use Streams for bytea (don't ask me why) and postgresql's custom Blob type for oids. Note also that using setBytes() on JDBC is also for bytea (from past experience). So this explains why use-streams has no affect they both assume 'bytea'.

public void set(PreparedStatement st, Object value, int index) {
 byte[] internalValue = toInternalFormat( value );
 if ( Environment.useStreamsForBinary() ) {
  // use streams = true
   st.setBinaryStream( index, 
    new ByteArrayInputStream( internalValue ), internalValue.length );
 }
 else {
  // use streams = false
  st.setBytes( index, internalValue );
 }
}

This results in:

ERROR: column "signature" is of type oid but expression is of type bytea

Update The next logical question is: "why not just change the table definitions manually to bytea" and keep the (@Lob + byte[])? This does work, UNTIL you try to store a null byte[]. Which the postgreSQL driver thinks is an OID type expression and the column type is bytea -- this is because hibernate (rightly) calls JDBC.setNull() instead of JDBC.setBytes(null) which PG driver expects.

ERROR: column "signature" is of type bytea but expression is of type oid

The type system in hibernate is currently a 'work in progress' (according to 3.5.5 deprecation comment). In fact so much of the 3.5.5 code is deprecated, it is hard to know what to look at when sub-classing the PostgreSQLDialect).

AFAKT, Types.BLOB/'oid' on postgresql should be mapped to some custom type which uses OID style JDBC access (i.e. PostgresqlBlobType object and NOT MaterializedBlobType). I've never actually successfully used Blobs with postgresql, but I do know that bytea just simply works as one / I would expect.

I am currently looking at the BatchUpdateException -- its possible that the driver doesn't support batching.


Great quote from 2004: "To sum up my ramblings, I'd say they we should wait for the JDBC driver to do LOBs properly before changing Hibernate."

References:

What is the portable way to annotate a byte[] property?

It depends on what you want. JPA can persist a non annotated byte[]. From the JPA 2.0 spec:

11.1.6 Basic Annotation

The Basic annotation is the simplest type of mapping to a database column. The Basic annotation can be applied to a persistent property or instance variable of any of the following types: Java primitive, types, wrappers of the primitive types, java.lang.String, java.math.BigInteger, java.math.BigDecimal, java.util.Date, java.util.Calendar, java.sql.Date, java.sql.Time, java.sql.Timestamp, byte[], Byte[], char[], Character[], enums, and any other type that implements Serializable. As described in Section 2.8, the use of the Basic annotation is optional for persistent fields and properties of these types. If the Basic annotation is not specified for such a field or property, the default values of the Basic annotation will apply.

And Hibernate will map a it "by default" to a SQL VARBINARY (or a SQL LONGVARBINARY depending on the Column size?) that PostgreSQL handles with a bytea.

But if you want the byte[] to be stored in a Large Object, you should use a @Lob. From the spec:

11.1.24 Lob Annotation

A Lob annotation specifies that a persistent property or field should be persisted as a large object to a database-supported large object type. Portable applications should use the Lob annotation when mapping to a database Lob type. The Lob annotation may be used in conjunction with the Basic annotation or with the ElementCollection annotation when the element collection value is of basic type. A Lob may be either a binary or character type. The Lob type is inferred from the type of the persistent field or property and, except for string and character types, defaults to Blob.

And Hibernate will map it to a SQL BLOB that PostgreSQL handles with a oid .

Is this fixed in some recent version of hibernate?

Well, the problem is that I don't know what the problem is exactly. But I can at least say that nothing has changed since 3.5.0-Beta-2 (which is where a changed has been introduced)in the 3.5.x branch.

But my understanding of issues like HHH-4876, HHH-4617 and of PostgreSQL and BLOBs (mentioned in the javadoc of the PostgreSQLDialect) is that you are supposed to set the following property

hibernate.jdbc.use_streams_for_binary=false

if you want to use oid i.e. byte[] with @Lob (which is my understanding since VARBINARY is not what you want with Oracle). Did you try this?

As an alternative, HHH-4876 suggests using the deprecated PrimitiveByteArrayBlobType to get the old behavior (pre Hibernate 3.5).

References

  • JPA 2.0 Specification
    • Section 2.8 "Mapping Defaults for Non-Relationship Fields or Properties"
    • Section 11.1.6 "Basic Annotation"
    • Section 11.1.24 "Lob Annotation"

Resources

Is storing a comma separated list in a database column really that bad?

9 votes

Imagine a web form with a set of checkboxes (any or all can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.

Now, I know that the correct solution would be to create a second table and properly normalize the database. One reason was laziness, my SQL knowledge is very limited and only seldomly used, so I would have had to look up quite some stuff to implement the more correct solution.

I thought the saved time and simpler code was worth it in my situation, it this a defensible design choice, or should I have normalized it from the start?

Edit:

Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the programm and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.

In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:

  • Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
  • Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
  • Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
  • Can't delete a value from the list without fetching the whole list.
  • Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
  • Hard to count elements in the list, or do other aggregate queries.
  • Hard to join the values to the lookup table they reference.
  • Hard to fetch the list in sorted order.

To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.

Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational "optimization" benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.

Is it better to create an index before filling a table with data, or after the data is in place?

9 votes

I have a table of about 100M rows that I am going to copy to alter, adding an index. I'm not so concerned with the time it takes to create the new table, but will the created index be more efficient if I alter the table before inserting any data or insert the data first and then add the index?

Creating index after data insert is more efficient way (it even often recomended to drop index before batch import and after import recreate it)

Need Advice: Is this a good use case for a 'NoSQL' Database? If so, which one?

9 votes

I have recently been researching NoSql options. My scenario is as follows:

We collect and store data from custom hardware at remote locations around the world. We record data from every site every 15 minutes. We would eventually like to move to every 1 minute. Each record has between 20 and 200 measurements. Once set up the hardware records and reports the same measurements every time.

The biggest issue we are facing is that we get a different set of measurements from every project. We measure about 50-100 different measurement types, however any project can have any number of each type of measurement. There is no preset set of columns that can accommodate the data. Because of this we create and build each projects data table with the exact columns it needs as we set up and configure the project on the system.

We provide tools to help analyze the data. This typically includes more calculations and data aggregation, some of which we also store.

We are currently using a mysql database with a table for each client. There are no relations between tables.

NoSql seems promising because we could store a project_id, timestamp then the rest would not be preset. This means one table, more relationships in the data, yet still handling the variety of measurements.

Is a 'NoSql' solution right for this job? If so which ones?

I have been investigation MongoDB and it seems promising...

Example for Clarification:

Project 1 has 5 data points recorded, the mysql table columns look like: timestamp, temp, wind speed, precipitation, irradiance, wind direction

Project 2 has 3 data points recorded mysql table columns: timestamp, temp, irradiance, temp2

The simple answer is that there is no simple answer to these sort of problems, the only way to find out what works for your scenario is to invest R&D time into it.

The question is hard to answer because the performance requirements aren't spelled out by the OP. It appears to be 75M/year records over a number of customers with a write rate of num_customers*1minute (which is low), but I don't have figures for the required read / query performance.

Effectively you have already a sharded database using horizontal partitioning because you're storing each customer in a seperate table. This is good and will increase performance. However you haven't yet established that you have a performance problem, so this needs to be measured and the problem size assessed before you can fix it.

A NoSQL database is indeed a good way of fixing performance problems with traditional RDBMS, but it will not provide automatic scalabity and is not a general solution. You need to find your performance problem fix and then design the (nosqL) data model to provide the solution.

Depending on what you're trying to achieve I'd look at MongoDB, Apache Cassandra, Apache HBase or Hibari.

Remember that NoSQL is a vague term typically encompassing

  • Applications that are either performance intensive in read or write. Often sacrificing read or write performance at the expense of the other.
  • Distribution and scalability
  • Different methods of persistency (RAM/Disk)
  • A more structured/defined access pattern making ad-hoc queries harder.

So, in the first instance I'd see if a traditional RDBMS can achieve the required performance, using all available techniques, get a copy of High Performance MySQL and read MySQL Performance Blog.

Rev1:

In light of your comments I think it is fair to say that you could achieve what you want with one of the above NOSQL engines.

My primary recommendation would be to get your data model designed and implemented, what you're using at the moment isn't really right.

So look at Entity-attribute-value model as I think it is exactly right for what you need.

You need to get your data model right before you can consider which technology to use, being honest modifying schemas dynamically isn't a datamodel.

I'd use a traditional SQL database to validate and test the new datamodel as the management tools are better and it's generally easier to work with the schemas as you refine the datamodel.

What is a good embedded database to use with C#?

8 votes

With the demise of VistaDB (vistadb.net), I am on the look out for a good embedded database to use for my C# programming.

Does anyone have any suggestions for something to use that it is simple to install along with my application? VistaDB was simple because I only had to copy 1 DLL file along with my app to get it to work.

I am looking for something like that but with good documentation/samples for C#.

Thanks.

SQLite with C# wrapper

Database constraints - keep or ignore?

8 votes

When I was learning in university, they taught us the database fundamentals, basics and rules, and one of the most important rules is the constraints (primary key, foreign key), and how to make 1-m, 1-1, m-n relationships.

Now when I move to real business environment they tell me: you should forget all you have been taught; no constraints, all those relationships are logical, no primary keys, no foreign keys, you can make your constraints through the code.

I don't know who is right: what I learned in my academic life or what I will learn in my new real business life. What do you think?

I think the constraints help you to have clean data. Performance is sometimes improved. In some cases, the performance can get affected by having the constraints. However, the answer to that is not removing the constraints. You have something called "denormalization" to help you deal with the performance issues (provided that your queries are already optimized). You can always create denormalized summary tables in such scenarios.

Did the guys who told you to "forget what you learnt" also tell you that they have forgotten the traffic rules they learnt at the driving classes?

How established are ORMs (object relational mapping) in the world of databases

8 votes

I'm not a database admin or architect, so I have to ask those who do it 24/7. How established is the concept of an ORM (object relational mapping) in the world of database administration and architecture? Is it still happening, widely approved but still in its early stages, or is generally disapproved? I'm learning this area and would like to get a feel for whether it's going to be knowledge appreciated by the wider segment of this field.

Widely used and definitely the present and near future. Database access through a handcoded layer of SQL generation was always fraught with drudgery and typos, and was unwieldy at best. ORMs let you use a persistence store in a programming way.

I thought this blog argued for it well: http://jonkruger.com/blog/category/fluent-nhibernate/ and SO posts like this (http://stackoverflow.com/questions/1114215/nhibernate-versus-llblgen-pro) show just how many people are using them.

How to change database design in a deployed application?

8 votes

Situation

I'm creating a C#/WPF 4 application using a SQL Compact Edition database as a backend with the Entity Framework and deploying with ClickOnce.

I'm fairly new to applications using databases, though I don't suspect I'll have much problem designing and building the original database. However, I'm worried that in the future I'll need to add or change some functionality which will require me to change the database design after the database is already deployed and the user has data in the database.

Questions

  1. Is it even possible to push an updated database design out to users via a clickonce update in the same way it is for code changes?

  2. If I did, how would the user's data be affected?

  3. How is this sort of thing done in real situations? What are some best-practices?

I figure that in the worst case, I'd need to build some kind of "version" number into the database or program settings and create some routine to migrate the user's current version of the database to the new one.

I appreciate any insight into my problem. Thanks a lot.

There are some 'tricks' that are employed when designing databases to allow for design changes.

Firstly, many database designers create views to code against, rather than coding directly to the tables. This allows tables to be altered (split or merged, etc) while only requiring that the views are updated. You may want to investigate database refactoring techniques for this.

Secondly, you can indeed add versioning information to the database (commonly done as a 'version' table with a single field). Updating the database can be done through code or through scripts. One system I worked on would automatically check the database version and then progressively update the schema through versions in code until it matched the required version for the runtime. This was quite an undertaking.

Is naming tables september_2010 acceptable and efficient for large data sets dependent on time?

8 votes

I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.

Some members of my team suggest creating tables using month's as the table name (september_2010), while others are suggesting having one table with lots of data in it...

Any suggestions on how to deal with this amount of data? Thanks.

========== Thank you to all the feedback.

I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:

  • You need to remember to create a new table every year or else your app breaks.
  • Querying aggregates against all rows regardless of year is harder.
  • Updating a date potentially means moving a row from one table to another.
  • It's harder to guarantee the uniqueness of pseudokeys across multiple tables.

My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).

Is it a bad idea to escape HTML before inserting into a database instead of upon output?

7 votes

I've been working on a forum-like system, which does not allow for HTML formatting. The method I currently use is to escape HTML entities before they get inserted into the database. I've been told (in relation to XSS vulnerabilities) that I should insert the raw comment into the database, and escape HTML entities upon output.

Other questions here I've seen on the matter seem to imply that the HTML would/could still be used for formatting, thus I'm asking for a case where the HTML would not be used at all.

you will also restrict yourself when performing the escaping before inserting into your db. let's say you decide to not use HTML as output, but JSON, plaintext, etc.

if you have stored escaped html in your db, you would first have to 'unescape' the value stored in the db, just to re-escape it again into a different format.

also see this perfect owasp article on xss prevention

Database design question

7 votes

I have a form where users submit different fields to create events. The number and type of fields requested are different on each form, depending on the category of event. What is the best way of going about designing this database - should the events contain every possible field and simply null the unused fields? Thanks!

If you begin to consider Joel's advice, please go to here.

or here

or here

And if you don't believe any of them, build the 4 tables he mentions. There's only 4, doesn't take long. Then load some data into them... then try to write the queries you want to write...

Changing Column meaning:

This can really screw with cardinality estimates. You dinner plates might be in the 4 - 20 range, the concert seats between 1000 - 2000. Some cardinality calculations look at the spread from min to max and assume and equal distribution (when lacking other statistics)...

From 4 to 2000 means that anywhere GENERIC_COLUMN = n, the % of rows you'll hit is 1/1996th of the total... but really, if you said where EVNT_TYPE = Dinner and GENERIC_COLUMN = n it would REALLY be between 4 and 20, or 1/16th of the total rows... so a huge swing in the card estimate. (This can be fixed with histograms, but the point of showing the automation issues is just to hightlight that if it's an issue to a machine, it's probably not as clean as it could be.)

So if you were to do this (MUCH BETTER than an EAV but...)

I would recommend creating a view for each object.

Table EVENT ( common fields, Generic_Count) View DINNER ( common fields, Generic_Count as Plates) WHERE type = Dinner View CONCERT ( common fields, Generic_Count as Seats) WHERE type = Concert

Then give NO ONE select against EVENT

But this is where you get into trouble by NOT starting with a conceptual data model first.

You'd have an ENTITY for EVENT and another for DINNER which inherits completely from EVENT and another for CONCERT which inherits completely from EVENT. Then you could set a differentiating column in the inheritance object which let's you set the "TYPE" column and then you could even decide how many tables to build with a flick of a switch. 1 table, 2 tables or 3 tables..

At least you can do that in powerDesigner.

Why is DDL considered so 'bad?

The creation of EAV models and questions like this are organized around the idea that DDL is to be avoided. Why ALTER TABLE when you can INSERT a new attribute row? People make poor data model design decisions based on the wrong Utility Function. These functions are things like 'no nullable columns', 'the fewer the tables the better', 'no ddl just to add a new attribute. Insert into Attribute table instead'.

Think of data modeling like this: sculptors will say that the wood or stone already has the figure inside of the block, they are just removing pieces of it to reveal it.

Your problem space already has a data model, it's just your job to discover it... it will have as many tables and columns as it needs. Trying to force it to conform to one of the above utility functions is where things go horribly wrong.

In your case, would you ever like to know all the events you've added in the past 2 weeks? Now think of the possible models. One table per event type would mean summing over n tables to find that answer and with each new event type a new table added and every "All event" query would be changing. You could build a UNION ALL view of those tables but you'd have to remember to add each new table to the view. Debugging through views like that is a pain.

Assuming that you might want a lot of metrics about ALL events, one table makes more sense (At least for some common portion of your event data - Like Event Name, Sponsor ID, Venue ID, event Start Time, event end time, venue available for setup time, etc.) Those field are (let's stipulate) are common to every event.

So now what to do with the other columns? Two options, nullable fields or vertically partition the table. The later is an optimization of the former. And if you read any database optimization books or blogs the major thing I take from them is that premature optimization kills. I see people implementing lots of strategies for problems before they even know if they will have that problem. A coworker had a slow query he wanted me to help with. It was loaded with optimizer hints. I removed them and the SQL screamed... I don't know WHY he hinted it but he was not doing it effectively and I'm pretty sure he never saw an issue so this was all just premature optimization.

Vertical partitioning is something you do when you have large data volumes and you have some frequently accessed data and other data that is not so useful. You can pack a table with a lot fewer blocks if you only pack some of the commons. More rows per block = faster tablescans... doesn't really affect the speed of finding a single row via an index. As you can see vertical partitioning has a specific problem it can solve (others too like row chaining) so if you're sure that's GOING to be an issue then by all means begin that way.

HTML5 localStorage security

7 votes

Would be a good or bad idea to use localStorage for sensitive data (assuming the current HTML5 implementations)?

What methods can I use to secure the data so that it cannot be read by a person that has access at the client computer?

Bad idea.

  1. Someone with access to the machine will always be able to read the localStorage, there is nothing much you can do to prevent it. Just type 'localStorage' in firebug console, and you get all the key/value pairs nicely listed.
  2. If you have an XSS vulnerability in your application, anything stored in localStorage is available to an attacker.
  3. You can try and encrypting it, but there is a catch. Encrypting it on the client is possible, but would mean the user has to provide a password and you have to depend on not-so-well-tested javascript implementations of cryptography.
  4. Encrypting on the server side is of course possible, but then the client code cannot read or update it, and so you have reduced localStorage to a glorified cookie.

If it needs to be secure, its best to not send it to the client. What is not in your control can never be secure.

Mysql slow query: JOIN + multiple WHERES + ORDER BY

7 votes

long time lurker, first question!

I am struggling to optimize this query, which selects the lowest priced items that match the chosen filters:

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link
FROM product_info
NATURAL JOIN (SELECT * FROM product_all WHERE product_all.date = '2010-09-30') as product_all
WHERE (product_info.category = 2  
AND product_info.gender = 'W' )
GROUP BY product_all.prod_id
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13

Its explain:

| id | select_type | table        | type   | possible_keys                                             | key     | key_len | ref                 | rows   | Extra                           |  
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+  
|  1 | PRIMARY     | <derived2>   | ALL    | NULL                                                     | NULL    | NULL    | NULL                | 89801  | Using temporary; Using filesort | 
|  1 | PRIMARY     | product_info | eq_ref | PRIMARY,category_prod_id_retail_price,category_ret...     | PRIMARY | 4       | product_all.prod_id | 1      | Using where                     | 
|  2 | DERIVED     | product_all  | ref    | date_2                                                    | date_2  | 3       |                     | 144107 |                                 | 

I've tried eliminating the subquery, which intuitively seems better but in practice takes even longer:

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link
FROM product_info
NATURAL JOIN product_all
WHERE (product_all.date = '2010-09-30'
AND product_info.category = 2 
AND product_info.gender = 'W' )
GROUP BY product_all.prod_id
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13

And its explain:

| id | select_type | table        | type | possible_keys                                             | key                      | key_len | ref                               | rows | Extra                                        |  
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+  
|  1 | SIMPLE      | product_info | ref  | PRIMARY,category_prod_id_retail_price,category_ret...     | category_retail_price    | 5       | const                             | 269  | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | product_all  | ref  | PRIMARY,prod_id,date_2                                    | prod_id                  | 4       | equipster_db.product_info.prod_id | 141  | Using where                                  | 

Here are the tables:

CREATE TABLE `product_all` (
`prod_id` INT( 10 ) NOT NULL PRIMARY KEY ,
`ref_id` INT( 10) NOT NULL PRIMARY KEY ,
`date` DATE NOT NULL ,
`buy_link` BLOB NOT NULL ,
`sale_price` FLOAT NOT NULL
) ENGINE = MYISAM ;


CREATE TABLE `product_info` (
`prod_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`prod_name` VARCHAR( 200 ) NOT NULL,
`brand` VARCHAR( 50 ) NOT NULL,
`retail_price` FLOAT NOT NULL
`category` INT( 3 ) NOT NULL,
`gender` VARCHAR( 1 ) NOT NULL,
`type` VARCHAR( 10 ) NOT NULL
) ENGINE = MYISAM ;

My Questions:
-which query structure seems optimal?
-what indices would optimize this query?
-less importantly: how does the indexing approach change when adding or removing WHERE clauses or using a different ORDER BY, such as sorting by % off:

ORDER BY (1-(MIN(product_all.sale_price)/product_info.retail_price)) DESC  

edit: both queries' natural join acts on prod_id (one record in product_info can have multiple instances in product_all, which is why they need to be grouped)

Indices make a massive difference in mysql, one query that took 15 minutes with a wrong set of indices took .2 seconds with the right ones, but its finding the right balance that is generally the issue. Naturally without some sample data its really hard to say if the below solution will save you any time, but in theory it should.

To answer your questions, I would redesign the tables like so:

CREATE TABLE `product_all` ( 
`prod_id` INT( 10 ) NOT NULL, 
`ref_id` INT( 10) NOT NULL, 
`date` DATE NOT NULL , 
`buy_link` BLOB NOT NULL , 
`sale_price` FLOAT NOT NULL,
PRIMARY KEY (prod_id, ref_id) ,
INDEX date_Index (`date` ASC),
UNIQUE INDEX prod_price_Index (prod_id ASC, sale_price ASC)
) ENGINE = MYISAM ; 


CREATE TABLE `product_info` ( 
`prod_id` INT( 10 ) NOT NULL AUTO_INCREMENT, 
`prod_name` VARCHAR( 200 ) NOT NULL, 
`brand` VARCHAR( 50 ) NOT NULL, 
`retail_price` FLOAT NOT NULL, 
`category` INT( 3 ) NOT NULL, 
`gender` VARCHAR( 1 ) NOT NULL, 
`type` VARCHAR( 10 ) NOT NULL,
PRIMARY KEY (prod_id) ,
UNIQUE INDEX prod_id_name_Index (prod_id ASC, prod_name ASC),
INDEX category_Index (category ASC),
INDEX gender_Index (gender ASC)
) ENGINE = MYISAM ;

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link         
FROM product_info         
NATURAL JOIN (SELECT * FROM product_all WHERE product_all.date = '2010-09-30') as product_all         
WHERE (product_info.category = 2           
AND product_info.gender = 'W' )         
GROUP BY product_all.prod_id         
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13        

The performance gain here is gained my indexing the main fields that are being joined upon and are featured in the where clause. Personally I would go with your first query as when you think about it that should perform better.

As far as I understand whats happening in the first and second query:

  • The first query is being filtered by a sub-query prior to doing the natural join, that means its only joining in the resulting data and not the whole table.
  • The second query is joining the entire second table and then filtering the resulting rows of the whole lot back to what you want.

As a rule of thumb normally you want to add indices on your major joining fields and also the fields that you use the most in where clauses. I've also put some unique indices on some of the fields that you will want to query regularly, such as prod_id_name_Index.

If this doesn't improve your performance if you could maybe post some dummy data to play with I might be able to get a faster solution that I can benchmark.

Here is an article that goes through indexing for performance in mysql, worth a read if you want to know more.

Good luck!

EDIT: Your final question I missed the first time, the answer is that if your indexing the main joining fields then changes to the where will only impact the overall performance slightly, but the unique indices I've put on the tables should account for the majority of things you'll want to base queries upon. The main thing to remember is if you query or join upon a field frequently then it should really be indexed, but minor queries and changes to the order by you should just not worry about in terms of realigning your indexing strategy.

Is MySQL still a good choice for a free/open source database?

6 votes

Until recently, I've used MySQL for all my database needs in all my server projects.

However, after Oracle came along and bought MySQL, there seems to be some uncertainty about the future of MySQL.

So I'm left wondering whether I should stick to MySQL (which I'm fairly familiar with), or should I switch to something like PostgreSQL. Any thoughts?

MySQL and PostgreSQL are both open-source DBMSs that will continue to be developed by independent parties should their current maintainers (corporate or otherwise) get bored. MySQL already has some interesting forks such as Drizzle which may rise to more prominence if Oracle choose to let the original project stagnate (well, more than Sun did...).

You should choose based on your projects' technical needs rather than nebulous arguments that one or the other is going to ‘win’.

(Personally I definitely think it's a good idea to have experience with both.)

Clustered indexes on non-identity columns to speed up bulk inserts?

6 votes

My two questions are:

  • Can I use clustered indexes to speed up bulk inserts in big tables?
  • Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?

To elaborate, I have a database with a couple of very big (between 100-1000 mln rows) tables containing company data. Typically there is data about 20-40 companies in such a table, each as their own "chunk" marked by "CompanyIdentifier" (INT). Also, every company has about 20 departments, each with their own "subchunk" marked by "DepartmentIdentifier" (INT).

It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.

I have been working to optimize these tables for the following functions:

  1. Queries that are run on subchunks
  2. "Benchmarking" queries that are run on the table as a whole
  3. Inserting/removing big chunks of data.

For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.

But for 3) I have struggled to find a good solution. My first strategy was to always disable indexes, bulk insert a big chunk and rebuild indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, it takes a very long time to rebuild the index each time.

At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.

I seem to have noticed that by adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier, the loading of new "chunks" into the table is faster. Before I had abandoned this strategy in favour of adding a clustered index on an IDENTITY column, as several articles pointed out to me that the clustered index is contained in all other indexes and so the clustered index should be as small as possible. But now I am thinking of reviving this old strategy to speed up the inserts. My question, would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts or is that just my imagination?

I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?

Thanks a lot for any suggestions or explanations.

Well, I've put it to the test, and putting a clustered index on the two "chunk-defining" columns increases the performance of my table.

Inserting a chunk is now relatively fast compared to the situation where I had a clustered IDENTITY key, and about as fast as when I did not have any clustered index. Deleting a chunk is faster than with or without clustered index.

I think the fact that all the records I want to delete or insert are guaranteed to be all together on a certain part of the harddisk makes the tables faster - it would seem logical to me.