Best database questions in June 2011

Should I open() and close() my SQL database constantly or leave it open?

7 votes

I am creating an app that uses an SQL database for storing data. The way the app is designed it will be updated every 3 minutes or so with new data depending on user actions while the app is running.

In the tutorials I have seen, they recommend that you close the database after changing it (leaving it open is "expensive" in terms of resources).

Is it better to leave it open for the duration of my App since it is being updated on a fairly frequent basis or should I run the close() method immediately after every change?

I guess my worry is that opening it and closing it constantly will draw more resources than it would to leave it open the entire time.

I tried to leave a connection open once -- I used it to stuff a repeater or something -- can't remember now.

Later in the program, I had another need to use the connection -- I think I had it so that when a user clicked a parent item in the repeater, a detail div would pop-up with more information for that item. This generated an error -- something to the effect of 'cannot open() on an open connection'.

I think the error might have been avoidable another way (like checking to see if the connection I was trying to open was already open), but as I thought about it, I realized I'd have to make that a standard practice throughout my app, and that seemed like too much work, so I just made it a standard practice to always close my connections after each use.

Connections stay in a connection pool -- I'm no whiz on that -- but if I were curious about performance, I'd keep that in mind when weighing what it costs to open a connection multiple times in your situation.

Another thought is that your DB admin may be able to force-close all open connections, or the db may close for some other reason. If you're not the dba, you might give a thought to risk/benefit of depending on something that you don't have long-term control over like keeping the connection open.
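To make the "close after each use" habit concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for whatever database the app really uses; the events table and the record_event/count_events helpers are made up for illustration. Each call opens the database, does its work and closes it again, so there is never an already-open connection to trip over.

```python
import sqlite3
from contextlib import closing


def record_event(db_path: str, message: str) -> None:
    """Open the database, write one row, and close it again."""
    # closing() guarantees conn.close() runs even if the write fails.
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "CREATE TABLE IF NOT EXISTS events "
                "(id INTEGER PRIMARY KEY, message TEXT)"
            )
            conn.execute("INSERT INTO events (message) VALUES (?)", (message,))


def count_events(db_path: str) -> int:
    """Open, read, close: no connection outlives the call."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

With a write every 3 minutes or so, the cost of reopening is unlikely to matter either way; if it ever does, a connection pool is the usual next step.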

Java API for SQL Data Definition Language

7 votes

Before I write one, is there a Java API for manipulating a database? Something like an object-oriented wrapper around java.sql.DatabaseMetaData, with support for things like Schema.createTable(name, columns)?

Obviously correct SQL statements should be executed in the background based on the DB in use.

I'm specifically interested in an API for performing DDL statements.

As far as I can remember, some tools such as NetBeans can create/modify database schemas on the fly. You could have a look at the source code if you don't find a specific library.

Disk storage in a jee application

5 votes

I have a Java EE app in which I want to save a small amount of data to disk, e.g. just usernames/passwords.

I don't want to go through the hassle of integrating with a full DB for this small amount of data.

Is there a standard way to access the file system and a standard folder where web applications can store their data on disk, other than using a database?

Note:

I am not using EJBs. It's a web application using servlets.

You could consider using the preferences API to store this data - it's available on Java EE as well.

How dangerous is it to provide a means for the public to run SELECT queries on a database?

5 votes

Suppose I do the following:

  • I create a MySQL database, and populate it with some data.
  • I create a MySQL user who has access only to that database, and who only has SELECT privileges.
  • I create a web page through which a user (any user, no password required) can enter arbitrary SQL, and on submitting the form, a script attempts to run the SQL as the MySQL user I created; any result set generated is displayed to the user; any error message generated is displayed to the user.
  • Assume that the database contains no stored procedures etc, just tables and views, and that I am happy for anybody to see any of the contents of that specific database.

We assume that the setup will be probed by a malicious user. What is the worst that could happen?

Some thoughts:

  • MySQL provides various statements like SHOW etc. that a user even having only SELECT privileges could use to gather information about the database server or about my databases. Other information could be obtained from error messages. While probably not sufficient to gain improper access, this information could surely help in doing so.
  • There might be flaws in the database software, or in my scripts, or in the scripting language itself, that could allow a visitor to do things they are not supposed to be able to do through this interface.
  • Doing this might violate a terms of service agreement, particularly if I am using shared hosting.

Hmmm. Clever users may attack via syntax like:

select some_function_that_updates() from some_table;

And there's a denial of service attack that could blow memory, like:

select * from some_massive_table cross join some_other_massive_table;

And frankly, it's hard enough for experienced programmers to write queries that behave well... what chance does a poor user have, even if they try to write a good query?
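To put a number on the cross join worry, here is a small sketch using Python's built-in sqlite3 module with an in-memory database (an assumption for illustration; the question is about MySQL, but the arithmetic is the same). The tables a and b and the helper name are made up. The result has rows_a * rows_b rows, so two tables of a million rows each would ask the server for a trillion.

```python
import sqlite3


def cross_join_row_count(rows_a: int, rows_b: int) -> int:
    """Count the rows produced by an unrestricted cross join of two tables."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE a (x INTEGER)")
        conn.execute("CREATE TABLE b (y INTEGER)")
        conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(rows_a)])
        conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(rows_b)])
        # Every row of a is paired with every row of b.
        return conn.execute("SELECT COUNT(*) FROM a CROSS JOIN b").fetchone()[0]
    finally:
        conn.close()
```

A read-only user needs nothing beyond SELECT to run this, which is why query timeouts and row limits matter for a public query form.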

What is the best way for (potentially) hundreds of mobile clients to access a MySQL database?

5 votes

So, here is the deal. I'm developing an Android application (although it could just as easily be any other mobile platform) that will occasionally be sending queries to a server (which is written in Java). This server will then search a MySQL database for the query, and send the results back to the Android. Although this sounds fairly generic, here are some specifics:

  1. The Android will make a new TCP connection to the server every time it queries. The server is geographically close, the Android could potentially be moving around a lot, and, since the Android app might run for hours while only sending a few queries, this seemed the best use of resources.

  2. The server could potentially have hundreds (or possibly even thousands) of these queries at once.

  3. Since each query runs in its own Thread, each query will at least need its own Statement (and could have its own Connection).

Right now, the server is set up to make one Connection to the database, and then create a new Statement for each query. My questions for those of you with some database experience (MySQL in particular, since it is a MySQL database) are:

a) Is it thread safe to create one Statement per Thread from a single Connection? From what I understand it is, just looking for confirmation.

b) Is there any thread safe way for multiple threads to use a single PreparedStatement? These queries will all be pretty much identical, and since each thread will execute only one query and then return, this would be ideal.

c) Should I be creating a new Connection for each Thread, or is it better to spawn new Statements from a single Connection? I think a single Connection would be better performance-wise, but I have no idea what the overhead for establishing a DB Connection is.

d) Is it best to use stored SQL procedures for all this?

Any hints / comments / suggestions from your experience in these matters are greatly appreciated.

EDIT:

Just to clarify, the android sends queries over the network to the server, which then queries the database. The android does not directly communicate with the database. I am mainly wondering about best practices for the server-database connection here.

Just because a Connection object is thread safe does not mean it's thread efficient. You should use a connection pool as a best practice to avoid potential blocking issues. But in answer to your question, yes, you can share a Connection object between multiple threads.

You do need to create new Statements/PreparedStatements in each thread that will be accessing the database; they are NOT thread safe. I would highly recommend using PreparedStatements, as you will gain efficiency and protection against SQL injection attacks.

Stored procedures can speed up your database queries where the server caches the compiled execution plan, and they cut network round trips - recommended if you can use them, though it is worth measuring the gain on your MySQL version.

Have you looked at caching your database data? Take a look at spymemcached if you can; it's a great product for reducing the number of calls to your data store.
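The pooling advice can be sketched roughly as follows. This is not JDBC: it assumes Python's built-in sqlite3 and queue modules as stand-ins, and ConnectionPool, lookup_all and the items table are made-up names. The shape is the same, though: a fixed set of connections, each borrowed by one thread at a time, with every thread creating its own cursor (the rough equivalent of a Statement) and using a parameterised query.

```python
import queue
import sqlite3
import threading
from contextlib import contextmanager


class ConnectionPool:
    """A fixed-size pool; each connection is used by one thread at a time."""

    def __init__(self, db_path, size):
        self._idle = queue.Queue()
        for _ in range(size):
            # check_same_thread=False lets a connection move between threads;
            # the pool ensures it is never used by two threads at once.
            self._idle.put(sqlite3.connect(db_path, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back

    def close(self):
        while not self._idle.empty():
            self._idle.get().close()


def lookup_all(pool, keys):
    """Run one parameterised query per thread, borrowing pooled connections."""
    results = {}
    lock = threading.Lock()

    def worker(key):
        with pool.connection() as conn:
            cur = conn.cursor()  # one cursor per thread, never shared
            cur.execute("SELECT value FROM items WHERE key = ?", (key,))
            row = cur.fetchone()
            cur.close()
        with lock:
            results[key] = row[0] if row else None

    threads = [threading.Thread(target=worker, args=(k,)) for k in keys]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In a real Java server you would normally reach for an existing pool library rather than writing one; the point here is only that the pool size, not the number of client threads, bounds the number of open database connections.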

Any good surname databases?

5 votes

I'm looking to generate some database test data, specifically table columns containing people's names. In order to get a good indication of how well indexing works with regard to name based searches I want to get as close as possible to real world names and their true frequency distribution, e.g. lots of different names with frequencies distributed over some power law distribution.

Ideally I'm looking for a freely available data file with names followed by a single frequency value (or equivalently a probability) per name.

Anglo-Saxon based names would be fine, although names from other cultures would be useful also.

I found some US census data which fits the requirement. The only caveat is that it lists only names that occur at least 100 times...

Found via this blog entry that also shows the power law distribution curve

Further to this you can sample from the list using Roulette Wheel Selection, e.g. (not tested)

struct NameEntry
{
    public string _name;
    public int _frequency;
}

int _frequencyTotal; // Precalculate this.


public string SampleName(NameEntry[] nameEntryArr, Random rng)
{
    // Throw the roulette ball: a uniform value in [0, _frequencyTotal).
    double throwValue = rng.NextDouble() * _frequencyTotal;
    double accumulator = 0.0;

    for(int i=0; i<nameEntryArr.Length; i++)
    {
        accumulator += nameEntryArr[i]._frequency;
        // Strict comparison, so entries with zero frequency are never picked.
        if(throwValue < accumulator) {
            return nameEntryArr[i]._name;
        }
    }

    // If we get here then we have an array of zero frequencies.
    throw new ApplicationException("Invalid operation. No non-zero frequencies to select.");
}
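For a large name list, scanning the array on every sample gets slow. As a variation on the same idea (not the code above, and only a sketch), this Python version precomputes the cumulative frequencies once and then finds each sample by binary search. build_sampler and the (name, frequency) tuple layout are assumptions made for illustration.

```python
import bisect
import itertools
import random


def build_sampler(entries):
    """entries: iterable of (name, frequency) pairs with integer frequencies.

    Returns a function that samples a name with probability proportional
    to its frequency.
    """
    kept = [(name, freq) for name, freq in entries if freq > 0]
    if not kept:
        raise ValueError("No non-zero frequencies to select.")
    names = [name for name, _ in kept]
    # cumulative[i] is the sum of the frequencies of names[0..i].
    cumulative = list(itertools.accumulate(freq for _, freq in kept))
    total = cumulative[-1]

    def sample(rng=random):
        throw = rng.randrange(total)  # uniform integer in [0, total)
        # First position whose cumulative frequency exceeds the throw.
        return names[bisect.bisect_right(cumulative, throw)]

    return sample
```

Building the sampler is linear in the number of names; each sample afterwards takes logarithmic time.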

MySQL PRIMARY KEYs: UUID / GUID vs BIGINT (timestamp+random)

5 votes

tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?

Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.

My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.

My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.

Another possible annoyance of UUIDs stored in a binary column is the fact that row IDs aren't obvious when looking at the data in PhpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.

As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So let's say my random number came out as 123456; my generated ID today would be: 1308022796123456

A one in a million chance of a conflict between two rows created within the same second is fine with me. I'm not doing any sort of mass row creation quickly.

One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?

Pros of my BIGINT idea:

  • Gives me the multi-server/merging advantages of UUIDs
  • Requires very little change to my application code (everything is already programmed to handle integers for IDs)
  • Half the storage of a UUID (8 bytes vs 16 bytes)

Cons:

  • ??? - Please let me know if you can think of any.

Some follow up questions to go along with this:

  1. Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?

  2. Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.

Thanks for any tips. Sorry about the wall of text.

I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, and then changed to 6, but you would be surprised how often the errors still happen.

Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail due to inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.

Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer nonetheless). MySQL has no dedicated UUID column type, but a UUID stored as BINARY(16) (never use varchar as a primary key!!) can be indexed, searched, etc. pretty damn efficiently, even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being indexed (unless you want to index on a longtext or something ridiculous like that).

To answer your question: Bigint (with random numbers attached) will be OK if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite the bullet and go for the more substantial option.

You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
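To see how quickly the duplicate risk grows, here is a back-of-the-envelope sketch in Python (the question uses PHP, but the arithmetic is the same). make_id and collision_probability are made-up helper names. It assumes the random suffix is uniform over all 10^6 six-digit values and that all servers write into the same key space.

```python
import random
import time

RANDOM_DIGITS = 6


def make_id(timestamp=None, rng=random):
    """Build {unix timestamp}{6 random digits} as one integer."""
    if timestamp is None:
        timestamp = int(time.time())
    suffix = rng.randrange(10 ** RANDOM_DIGITS)  # 0 .. 999999
    return timestamp * 10 ** RANDOM_DIGITS + suffix


def collision_probability(rows_per_second, digits=RANDOM_DIGITS):
    """Chance that at least two rows created in the same second share an ID."""
    space = 10 ** digits
    if rows_per_second > space:
        return 1.0
    p_all_unique = 1.0
    for i in range(rows_per_second):
        # The (i+1)-th row must avoid the i suffixes already taken.
        p_all_unique *= (space - i) / space
    return 1.0 - p_all_unique
```

Two rows in the same second collide with probability one in a million, but the risk grows roughly with the square of the row count: at 1,000 rows in the same second it is already about 39%. That matches the experience above that duplicates show up sooner than intuition suggests once traffic grows.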

SVN database versioning for multiple developers environment

5 votes

I'd like to know if there is a way of versioning a database with SVN that ensures no conflicts when several developers try to commit changes simultaneously.

My team and I have been using change scripts with an increasing schema version number (similar to this solution: http://odetocode.com/blogs/scott/archive/2008/02/02/versioning-databases-change-scripts.aspx ).

It's a pretty good solution, but its main flaw is that conflicts can occur when multiple developers try to commit a change script with the same schema number. It's not only a simple SVN conflict: the users involved also have to manually change the database table holding the schema versions, revert their DB changes, and renumber the script files so that all the DB updates apply. Is it possible to avoid these obstacles? I don't mean technical solutions only; maybe there is a better way to organize this task? Any ideas?

Rails solved this exact problem by using a timestamp instead of an incrementing version number. The odds of two users creating new schema versions in the same second are pretty low.
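As a rough illustration of the timestamp approach outside Rails, this Python sketch names a change script with a UTC timestamp in the YYYYMMDDHHMMSS form that Rails migrations use. The migration_filename helper and the .sql suffix are assumptions for illustration, not part of any tool.

```python
from datetime import datetime, timezone


def migration_filename(description, now=None):
    """Return a change-script name prefixed with a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d%H%M%S")
    # "Add users table" -> "add_users_table"
    slug = "_".join(description.lower().split())
    return f"{stamp}_{slug}.sql"
```

Because the file names no longer compete for the next integer, two developers committing at the same time add two different files and SVN has nothing to merge. The schema version table then needs to record which scripts have been applied, rather than a single highest number.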

Small footprint embedded Java SQL database

5 votes

I'm looking for an embedded SQL database for Java, with a small footprint. Something like SQLite, but it should be pure Java. So far I've seen HSQLDB, H2 and Apache Derby, but they're not what could be called small footprint. Ideally, the database I'm looking for would be embeddable in a J2ME application.

Is there something like this?

You may try http://www.jepstone.net/tinySQL/.

Pros:

  • It is small (93k!)
  • It is embeddable
  • It uses DBF or text files to store data, so they are easy to read.

Cons:

  • It is an old unmaintained project
  • It is not designed to work in J2ME, but since it can work in JDK 1.1.8 it won't be hard to make it work in J2ME. Of course you will have to change some code from using RandomAccessFile to FileConnection and stuff like that, but at least you won't need to mess with generics-related code.
  • It is not very fast, because it does not use indexes, so you need to try it and see if it fits your needs
  • It is not feature complete, just gives you a small subset of SQL

Data normalization and writing queries

5 votes

I'm a jr. developer (5 months on the job), and I have a question about data normalization. Now, as I understand it, the general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum. In my project, one of the DB people created a DB. We have 50+ tables, and the tables in the DB are usually very fragmented, i.e. a table has two or three columns and that's it. Now, when it comes to writing SQL queries, it has become something of a minor hassle since each query involves combing through several different tables and joining them together. I was wondering if this is a side effect of data normalization? Or does this point to something else?

I know that the easiest thing to do, for me, would be to write tables based off the queries I have to write. This will create a DB with a lot of redundant data, but I was curious if there is a happy medium?

Just as a postscript, I don't want to come across like I'm whining about my work, but I'm genuinely curious to learn more about this. My work environment is not the most friendly so I don't feel comfortable posing this question with my colleagues. However, I would appreciate any thoughts, books, tutorials or opinions from more experienced people.

Thanks.

general principle behind data normalization is to create a RDBMS where data redundancy is kept to a minimum.

Only partly true.

Normalization is not about "redundancy".

It's about "update anomalies".

1NF is the "don't use arrays" rule. Breaking 1NF means a row isn't atomic but a collection, and independent updates within the collection wouldn't work out well. There'd be locking and slowness.

2NF is the "one key" rule. Each row has exactly one key and everything in the row depends on the key. There are no dependencies on part of the key. Some folks like to talk about candidate keys and natural keys and foreign keys; they may exist or they may not. 2NF is satisfied when all attributes depend on one key. If the key is a single-column surrogate key, this normal form is trivially satisfied.

If 2NF is violated, you've got columns which depend on part of a key, but not the whole key. Suppose you had a table with (Part Number, Revision Number) as a key, and attributes of color and weight, where weight depends on the whole key but color depends only on the part number. You have a 2NF problem: you could update the color on some rows for a part but not others, creating data anomalies.
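The Part Number / Revision Number case can be tried out directly. The sketch below uses Python's built-in sqlite3 module with in-memory tables; the table names (part_flat, part, part_revision) and function names are made up for illustration. The first function shows the anomaly in the single-table design, the second shows the same update after color has been moved to a table keyed by part number alone.

```python
import sqlite3


def colors_after_partial_update():
    """Single table keyed by (part_number, revision): color is repeated per revision."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE part_flat (part_number TEXT, revision INTEGER, "
        "color TEXT, weight REAL, PRIMARY KEY (part_number, revision))"
    )
    conn.executemany(
        "INSERT INTO part_flat VALUES (?, ?, ?, ?)",
        [("P1", 1, "red", 1.0), ("P1", 2, "red", 1.2)],
    )
    # Updating one revision only leaves the part with two different colors.
    conn.execute(
        "UPDATE part_flat SET color = 'blue' WHERE part_number = 'P1' AND revision = 2"
    )
    rows = conn.execute(
        "SELECT DISTINCT color FROM part_flat WHERE part_number = 'P1' ORDER BY color"
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


def colors_after_normalized_update():
    """Color lives in a table keyed by part_number, so it is stored once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE part (part_number TEXT PRIMARY KEY, color TEXT)")
    conn.execute(
        "CREATE TABLE part_revision (part_number TEXT, revision INTEGER, "
        "weight REAL, PRIMARY KEY (part_number, revision))"
    )
    conn.execute("INSERT INTO part VALUES ('P1', 'red')")
    conn.executemany(
        "INSERT INTO part_revision VALUES (?, ?, ?)", [("P1", 1, 1.0), ("P1", 2, 1.2)]
    )
    conn.execute("UPDATE part SET color = 'blue' WHERE part_number = 'P1'")
    # Reading color per revision now needs a join, but it cannot disagree.
    rows = conn.execute(
        "SELECT DISTINCT p.color FROM part_revision r "
        "JOIN part p ON p.part_number = r.part_number ORDER BY p.color"
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]
```

The join in the second version is exactly the extra query work the question complains about; it is the price paid for making the inconsistent state impossible to store.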

3NF is the "only the key" rule. If you put derived data in a row, and change the derived result, it doesn't match the source columns. If you change a source column without updating the derived value, you have a problem, too. Yes, triggers are a bad hackaround to allow 3NF design violations. That's not the point. The point is merely to define 3NF and show that it prevents an update problem.

each query involves combing through several different tables and joining them together. I was wondering if this is a side effect of data normalization?

It is.

MySQL/InnoDB Plugin: Will NULL values for VARCHAR fields still take up storage spaces?

5 votes

I'm using InnoDB Plugin in the Barracuda file format with:

ROW_FORMAT=DYNAMIC

If I define a field as VARCHAR(255), and then insert a record that has a NULL value for that field, will that record still use 255 bytes in storage for the VARCHAR field? Or will there be no wasted storage space?

On a related note, if I define a field as INT then presumably every record will still use 32 bits for that field even if the value for it is NULL. Is that correct?

Thanks

I think this should answer your question -

An SQL NULL value reserves one or two bytes in the record directory. Besides that, an SQL NULL value reserves zero bytes in the data part of the record if stored in a variable length column. In a fixed-length column, it reserves the fixed length of the column in the data part of the record. Reserving the fixed space for NULL values enables an update of the column from NULL to a non-NULL value to be done in place without causing fragmentation of the index page.

Please check more on data-type storage requirements and InnoDB physical row structure

Which database out of CouchDB, MongoDB and Redis is good for starting out with Node JS?

5 votes

I'm getting more into Node JS and am enjoying it. I'm moving more into web application development.

I have wrapped my head around Node JS and am currently using Backbone JS for the front end. I'm making a few applications that use Backbone to communicate with the server using a RESTful API. In Node JS, I will be using the Express framework.

I'm reaching a point where I need a simple database on the server. I'm used to PostgreSQL and MySQL with Django, but what I'm needing here is some simple data storage etc. I know about CouchDB, MongoDB and Redis, but I'm just not sure which one to use?

Is any one of them better suited for Node JS? Is any one of them better for beginners, moving from relational databases? I'm just needing some guidance on which to choose, I've come this far, but when it's coming to these sort of databases, I'm just not sure...

Is any one of them better suited for Node JS?

Better suited specifically for node.js? Probably not, but each of them is better suited for certain scenarios based on your application needs or use cases.

Redis is an advanced key-value store and probably the fastest one among the three NoSQL solutions. Besides basic key data manipulation, it supports rich data structures such as lists, sets and hashes, as well as pub/sub functionality, which can be really handy, namely in statistics or other real-time madness. It does, however, lack any sort of query language.

CouchDB is a document-oriented store which is very durable and offers MVCC, a REST interface, a great replication system and map-reduce querying. It can be used for a wide range of scenarios and can substitute for your RDBMS; however, if you are used to ad hoc SQL queries then you may have certain problems with its map-reduce views.

MongoDB is also a document-oriented store like CouchDB, and it supports ad hoc querying besides map-reduce, which is probably one of the crucial features that lead people searching for an RDBMS substitute to choose MongoDB over the other NoSQL solutions.

Is any one of them better for beginners, moving from relational databases?

Since you are coming from the RDBMS world and are probably used to SQL, I think you should go with MongoDB because, unlike Redis or CouchDB, it supports ad hoc queries, and its querying mechanism will feel more familiar to someone used to SQL. However, there may be areas, depending on your application scenarios, where Redis or CouchDB is better suited to do the job.