Best database questions in April 2011

How to efficiently utilize 10+ computers to import data.

8 votes

We have flat files (CSV) with >200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the importing process on a single computer and it takes around 15 hours. As this is too long time, we want to utilize something like 40 computers to do the importing.

My question

How can we efficiently utilize the 40 computers to do the importing. The main worry is that there will be a lot of time spent replicating the dimension tables across all the nodes as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than utilize a single one, due to the extensive network communication and coordination between the servers.

Does anyone have suggestion?

EDIT:

The following is a simplification of the CSV files:

"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue" 
"bvalue";"evenanothervalue"
"avalue";"anothervalue"

After importing, the tables look like this:

dimension_table1

id  name
1   "avalue"
2   "bvalue"

dimension_table2

id  name
1   "anothervalue"
2   "evenanothervalue"

Fact table

  dimension_table1_ID       dimension_table2_ID
    1                      1
    2                      2
    1                       2
    1                       2              
    2                       2
    1                       1

Loading CSV data into a database is slow because it needs to read, split and validate the data.

So what you should try is this:

  1. Setup a local database on each computer. This will get rid of the network latency.

  2. Load a different part of the data on each computer. Try to give each computer the same chunk. If that isn't easy for some reason, give each computer, say, 10'000 rows. When they are done, give them the next chunk.

  3. Dump the data with the DB tools

  4. Load all dumps into a single DB

Make sure that your loader tool can import data into a table which already contains data. If you can't do this, check your DB documentation for "remote table". A lot of databases allow to make a table from another DB server visible locally.

That allows you to run commands like insert into TABLE (....) select .... from REMOTE_SERVER.TABLE

If you need primary keys (and you should), you will also have the problem to assign PKs during the import into the local DBs. I suggest to add the PKs to the CSV file.

[EDIT] After checking with your edits, here is what you should try:

  1. Write a small program which extract the unique values in the first and second column of the CSV file. That could be a simple script like:

     cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
    

    This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.

  2. Write a program which reads the new ID-value files, caches them in memory and then reads the huge CSV files and replaces the values with the IDs.

    If the ID-value files are too big, just do this step for the small files and load the huge ones into all 40 per-machine DBs.

  3. Split the huge file into 40 chunks and load each of them on each machine.

    If you had huge ID-value files, you can use the tables created on each machine to replace all the values that remained.

  4. Use backup/restore or remote tables to merge the results.

    Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.

See here for an introduction.

Which is more efficient for a one or two-character string: CHAR(2) or VARCHAR(2)?

8 votes

A really quick one - is it more efficient to store data which might be one or two characters as CHAR(2) or VARCHAR(2) in MySql?

Thanks!

In terms of storage space required, you're better off with CHAR(2) because the VARCHAR(2) type will require one extra byte to store the length:

Value   CHAR(2)  Storage Required  VARCHAR(2)  Storage Required
''      '  '     2 bytes           ''          1 byte          
'a'     'a '     2 bytes           'a'         2 bytes         
'ab'    'ab'     2 bytes           'ab'        3 bytes         

See 10.4.1. The CHAR and VARCHAR Types for more details.

More Information: What's the difference between VARCHAR and CHAR?

Database: How to distinguish "one-or-more" and "zero-or-more" relationship?

6 votes

Design tools allow distinguish between "one TO zero-or-more" and "one TO one-or-more" relationships. I could imagine how to implement "one TO zero-or-more" relation:


CHILD_TABLE

(pk) chid_id
(fk) parent_id (required)


PARENT_TABLE

(pk) parent_id


How to implement "one TO one-or-more" relationship? How to say that parent requires at least one child? Or is "one-or-more" usually implemented as "zero-or-more" ?

Implementing a 1 to 1-or-more relationship is done through the business logic, or at least using a transaction. You cannot insert into both tables simultaneously; you insert into one table at a time. Therefore, you need to insert the parent before you can insert the child and there is no built-in way for the database to enforce the required logic.

If you wrap the two INSERT statements inside a transaction, then you guarantee if the child insert fails, the parent insert will be rolled back. However, it is up to the business logic to make sure a parent is not inserted without a child.

What is the proper way to access BerkeleyDB with Perl?

6 votes

I've been having some problems with using BerkeleyDB. I have multiple instances of the same code pointed to a single repository of DB files, and everything runs fine for 5-32 hours, then suddenly there is a deadlock. The command prompts stop right before executing a db_get or db_put or cursor creation call. So I'm simply asking for the proper way to handle these calls. Here's my general layout:

This is how the environment and DBs are created:

my $env = new BerkeleyDB::Env ( 
   -Home   => "$dbFolder\\" , 
   -Flags  => DB_CREATE | DB_INIT_CDB | DB_INIT_MPOOL) 
   or die "cannot open environment: $BerkeleyDB::Error\n";

my $unsortedHash  = BerkeleyDB::Hash->new (
   -Filename => "$dbFolder/Unsorted.db", 
   -Flags => DB_CREATE,
   -Env  => $env
   ) or die "couldn't create: $!, $BerkeleyDB::Error.\n";

A single instance of this code runs, goes to a site and saves URLs to be parsed by another instance (I have the flag set so that every DB is locked when one is locked):

        $lk = $unsortedHash->cds_lock();
        while(@urlsToAdd){
            my $currUrl = shift @urlsToAdd;
            $unsortedHash->db_put($currUrl, '0');
        }
        $lk->cds_unlock();

It periodically checks if a certain number of items are in Unsorted:

$refer = $unsortedHash->db_stat();
$elements = $refer->{'hash_ndata'};

Before adding any element to any DB, it first checks all DBs to see if that element is already present:

if ($unsortedHash->db_get($search, $value) == 0){
    $value = "1:$value";
}elsif ($badHash->db_get($search, $value) == 0){
    $value =  "2:$value";
....

This next code comes after, and many instances of it are run in parallel. First, it gets the next item in unsorted (that does not have the busy value '1'), then sets the value to busy '1', then does something with it, then moves the DB entry completely to another DB (it is removed from unsorted and stored in another DB):

my $pageUrl = '';
my $busy = '1';
my $curs;
my $lk = $unsortedHash->cds_lock(); #lock, change status to 1, unlock
########## GET AN ELEMENT FROM THE UNSORTED HASH #######
while(1){
    $busy = '1';
    $curs = $unsortedHash->db_cursor();
    while ($busy){
        $curs->c_get($pageUrl, $busy, DB_NEXT);
        print "$pageUrl:$busy:\n";
        if ($pageUrl eq ''){
            $busy = 0;
        }
    }
    $curs->c_close();
    $curs = undef;

    if ($pageUrl eq ''){
        print "Database empty. Sleeping...\n";
        $lk->cds_unlock();
        sleep(30);
        $lk = $unsortedHash->cds_lock();
    }else{
        last;
    }
}

####### MAKE THE ELEMENT 'BUSY' AND DOWNLOAD IT 


$unsortedHash->db_put($pageUrl, '1');
$lk->cds_unlock();
$lk = undef;

And in every other place, if I call db_put or db_del on ANY DB, it is wrapped with a lock like so:

print "\n\nBad.\n\n";
        $lk = $badHash->cds_lock();
        $badHash->db_put($pageUrl, '0');
        $unsortedHash->db_del($pageUrl);
        $lk->cds_unlock();
        $lk = undef;

However, my db_get commands are free-floating with no lock, because I don't think reading needs a lock.

I have looked over this code a million times and the algorithm is airtight. So I am just wondering if I am implementing any part of this wrong, using the locks wrong, etc. Or if there is a better way to prevent deadlocking (or even diagnose deadlocking) with BerkeleyDB and Strawberry Perl?

UPDATE: To be more specific, the problem is occurring on a Windows 2003 server (1.5 GB RAM, not sure if that is important). I can run this whole setup fine on my Windows 7 machine (4GB RAM). I also started printing out the lock stats using the following:

Adding this flag to the environment creation:

-MsgFile => "$dbFolder/lockData.txt"

And then calling this every 60 seconds:

my $status = $env->lock_stat_print();
print "Status:$status:\n";

The status is always returned as 0, which is success. Here is the last stat report:

29  Last allocated locker ID
0x7fffffff  Current maximum unused locker ID
5   Number of lock modes
1000    Maximum number of locks possible
1000    Maximum number of lockers possible
1000    Maximum number of lock objects possible
40  Number of lock object partitions
24  Number of current locks
42  Maximum number of locks at any one time
5   Maximum number of locks in any one bucket
0   Maximum number of locks stolen by for an empty partition
0   Maximum number of locks stolen for any one partition
29  Number of current lockers
29  Maximum number of lockers at any one time
6   Number of current lock objects
13  Maximum number of lock objects at any one time
1   Maximum number of lock objects in any one bucket
0   Maximum number of objects stolen by for an empty partition
0   Maximum number of objects stolen for any one partition
3121958 Total number of locks requested
3121926 Total number of locks released
0   Total number of locks upgraded
24  Total number of locks downgraded
9310    Lock requests not available due to conflicts, for which we waited
0   Lock requests not available due to conflicts, for which we did not wait
8   Number of deadlocks
1000000 Lock timeout value
0   Number of locks that have timed out
1000000 Transaction timeout value
0   Number of transactions that have timed out
792KB   The size of the lock region
59  The number of partition locks that required waiting (0%)
46  The maximum number of times any partition lock was waited for (0%)
0   The number of object queue operations that required waiting (0%)
27  The number of locker allocations that required waiting (0%)
0   The number of region locks that required waiting (0%)
1   Maximum hash bucket length

Of which I am wary of this:

8   Number of deadlocks

How did these deadlocks occur, and how were they resolved? (all parts of the code are still running). What exactly is a deadlock, in this case?

In short, you need to do deadlock detection. I can see two possibilities to do that. First, you can use the db_deadlock utility. Second, and perhaps more conveniently, you can specify the -LockDetect flag when opening your environment, a flag that's not exactly explained in depth in the Perl docs for BerkeleyDB.pm.

Both ways appear to work fine for me in version 4.5.20. (What's your version, by the way?)

Now for the detail.

Specifying the -LockDetect flag is really just that. There are a couple of values to choose from. I chose DB_LOCK_DEFAULT and it appeared to work just fine. With more clues as to what's going on you could certainly get more fancy.

Running the db_deadlock utility could be done like this:

db_deadlock -h your/env/dir -v -t 3   # run as daemon, check every 3 seconds
db_deadlock -h your/env/dir -v        # run once

Here's a quote from the db_deadlock manual:

This utility should be run as a background daemon, or the underlying Berkeley DB deadlock detection interfaces should be called in some other way, whenever there are multiple threads or processes accessing a database and at least one of them is modifying it.

I arrived at the conclusion that both ways do work fine by repeatedly performing a test with two writers and one reader, which would deadlock a couple times while putting new entries in the database in rapid succession (100 per second), or going through a cursor of all keys in the database.

The flag method appears to deal with deadlocks very quickly, they didn't become noticeable in my tests.

On the other hand, running the db_deadlock utility with verbose output in paralles with the scripts is instructive in that you see how they block and then continue after lockers have been aborted, especially when combined with the db_stat utility:

db_stat -Cl # Locks grouped by lockers
db_stat -Co # Locks grouped by object
db_stat -Cp # need_dd = 1 ?
db_stat -CA # all of the above plus more

I lack the expertise to explain all the details, but you can see that in blocked situations there are certain entries there while in others there aren't. Also see the section entitled Berkeley DB Concurrent Data Store locking conventions(what is IWRITE?) in the Berkeley DB Programmer's Reference Guide.

You're asking how these deadlocks did occur. Can't say exactly, but I do see that they are occurring with concurrent access. You're also asking how they were resolved. I have no idea. In my test scenarios, blocked scripts will simply hang. Maybe in your scenario someone ran deadlock detection without you knowing about it?

For completeness, your application might simply hang because a thread has not closed resources before exiting. Might happen if you just Ctrl-C a process and there is no clean-up handler in place to close resources. But that doesn't appear to be your problem.

If it does become your problem, you should review the section on Handling failure in Data Store and Concurrent Data Store applications in the Reference Guide.

CDS and DS have no concept of recovery. Since CDS and DS don't support transactions and don't maintain a recovery log, they cannot run recovery. If the database gets corrupted in DS or CDS, you can only remove it and recreate it. (Taken moreless verbatim from the Berkeley DB Book by Himanshu Yadava.)

Finally, there are video tutorials on the Oracle site, including one on using CDS by Margo Seltzer.

Idiomatic haskell for database abstraction

6 votes

In OOP languages I might write a database wrapper which encapsulates database connection, manages schema and provides few core operations, such as exec, query, prepare_and_execute. I might even have a separate database helper class which would handle the database schema, leaving the database abstraction only to handle connections. This would then be used by model wrappers/factories which use the database abstraction class to create instances of model classes. Something along the line like this UML diagram:

What would be the preferred way to design such a system in idiomatic haskell?

The most used database abstraction library in Haskell is HDBC. It means that queries are simply represented as Strings with placeholders. Fewer people use HaskellDB which provides a type-safe way to build queries. Nothing forbids to have user data types to represent common queries and custom functions to build them.

Values in Haskell are immutable, that means that it is not useful to have a mutable object corresponding to a record in the database. Instead, I think it is more common to define user data types and functions that marshall and push/pull values of these types to/from the database.

Whenever database updates are necessary, they are likely to be run in some stateful monad under IO. This would allow to keep the connection open, for example, or do something between the requests.

Finally, functions are first class, so it is possible to construct all functions on the fly. So a function itself may encapsulate whatever information you want.

So, I think, the usual Haskell approach consists of

  • algebraic data types to represent actual data (as immutable values)
  • the rest of the application to transform these values
  • functions which generate queries (encapsulate schema details, marshal data to/from Haskell data types)
  • (optionally) a stateful monad to run queries (hide details of database access)
  • functions which run the queries (hide details of database access)

Is there any downside to giving extra size/length to your database columns?

6 votes

I usually set all my varchars to 255 to be safe. Does it make any difference in terms of the disk space or anything else? Is there any downside to having bigger varchars/ints/other fields than you would mostly need?

I takes more time and more disk transfers to load larger data items into memory. Defining large maximum sizes for columns increases the size of table rows. For many DBMS servers, table rows are the items transferred. So defining columns that are too fat does slow things down.

This effect is minimal for VARCHAR items. But VARCHAR is quite a bit slower than data types like integers. Eight byte integers take four times as much time to transfer as two byte integers. So, if a database is being designed for ultimate performance, limiting data columns to the range actually required will speed things up.The extent of this effect depends on whether the disk channel is a bottleneck or not.

Another possible bottleneck is the channel that links the server with the client, often a network channel. Bottlenecking in this channel can be reduced by queries that don't ask for data that will never be used, but there's a trade off here between asking for data only when you need it and making too many round trips.

There's also a trade off between designing for optimal performance and over designing in the anticipation of chagning requirements.

Get last record of each month in MySQL....?

5 votes

Hello All, I have a problem in writing a query for MySQL. I have following fields in the DB

id     created_on            status
1      2011-02-15 12:47:09    1 
2      2011-02-24 12:47:09    1
3      2011-02-29 12:47:09    1
4      2011-03-11 12:47:09    1
5      2011-03-15 12:47:09    1
6      2011-03-22 12:47:09    1
7      2011-04-10 12:47:09    1
8      2011-04-11 12:47:09    1

I need to select the last record of each month. That is for the month FEB record # 3 month MARCH record # 6 and for month APRIL record # 8

Please help me.....

Thanks in advance.....

SELECT * FROM table 
WHERE created_on in 
(select DISTINCT max(created_on) from table 
GROUP BY YEAR(created_on), MONTH(created_on))

Understanding COMPATIBILITY_LEVEL in SQL Server

5 votes

I understood that setting a database to a COMPATIBILITY_LEVEL prior to your native one prevented features from being used. However this doesn't seem to be the case. Witness the following SQL script:

CREATE DATABASE Foo
GO
USE Foo
GO
ALTER DATABASE Foo SET COMPATIBILITY_LEVEL = 80
GO

CREATE TABLE Bar
(
    Id UNIQUEIDENTIFIER NOT NULL,
    TestNvcMax NVARCHAR (MAX) NOT NULL, -- Arrived in SQL 2005
    TestDateTime2 DATETIME2 (7) NOT NULL -- Arrived in SQL 2008
)
GO

But this table creates perfectly - any ideas? I would have thought some kind of an error message or warning would have been appropriate

Here you can read about the differences between compatibility level 80, 90 and 100. ALTER DATABASE Compatibility Level

Apparently new data types is not affected. I think that compatibility level is there to make SQL Server "behave" like the older version, not prevent you from doing new fancy stuff.

Performance Testing Various Databases

5 votes

I am currently testing a few different relational database management systems. (MySQL, PostgreSQL, Oracle and Firebird SQL) using a Java application to do so.

I was debating what tests I should run to distinguish the performances of each.

The obvious ones would be insert, select, delete and drop.

I would love to hear your opinions and to make this apply to the question answer format I shall ask for the 5 most appropriate tests to indicate performace differences. In an ideal world I would like to mimic real word use.

Thanks to all who answer.

I think that any of them would probably be fine. However, your configuration of the different databases for what you are trying to do would be different based on your application.

Suggested place to start: look for apps similar to yours. See what they are using, if you can. Then start testing the different databases with similar configurations and see what works for you.

Personally I've used Oracle, MySql, and Postgres over the last 11 years and they've all worked well. It's really all in your configuration, which is where a good DBA comes in handy.

Techniques for Data Aging

5 votes

Hello, I'm looking for information about how to age data in a db, generally related to Oracle and Sql Server, but any database would be good. Any examples or books containing examples of how the best techniques would be cool.

Bob

In Oracle, partitioning is a very useful for implementing Information Life-cycle Management, this enables you to manage data partition wise and store recent, most accessed data on quicker storage and older data, most of the times less often accessed on cheaper storage. IF this is what you are trying to do, take a look at partitioning. In 11g: interval partitioning, takes out having to pre configure partitions; partitions are now created on as needed basis and reference partitioning. This is also a performance booster because it is now easier to do partition wise joins using PQ. It also saves space because the redundant key information is now in the partition definition.

Is there a mySQL frontend that has an interface for joins?

4 votes

Are there any mySQL frontends, like phpmyadmin, that has a graphical interface for joins?

I know you can run saved queries (which may include joins) in phpmyadmin, but I am looking for a user-friendly way of how other frontends are tackling the problem. I don't actually need a frontend, I just want to see how others are doing it.

If there are none are available, what would be a good way of approaching creating a join interface?

I am currently thinking, given a student and enrollment table (as a super simple example), such that

student table
+---------------------------+
| id   | name   | number    |
+------+--------+-----------+
| 2    | Joe    | 04567843  |
| 3    | Jim    | 43243254  |
| 4    | Jack   | 23145671  |
+------+--------+-----------+

and..

enrollment 
+---------------------+------------+-----------+
| id   | student_id   | course_id  | score     |
+------+--------------+------------+-----------+
| 1    | 2            | ma001      | 86%       |
| 2    | 2            | en001      | 46%       |
| 3    | 3            | ma001      | 78%       |
+------+--------------+------------+-----------+

The interface could allow you to select a primary table, and the fields you want, then a secondary table, and the fields you want. And finally, a JOIN fieldset, where you choose the join type and the fields connecting it (see image).

The image is a mockup using firebug manipulating phpmyadmin to show what I mean (hopefully)

enter image description here

I realise this is kind of 2 questions, but highly linked to each other, but to summarise, does a front end like this exist? And if not, would the above approach work?

If you want to look at how others do it, play around with Microsoft Access a bit. e.g. enter image description here

versioning each field vs history date field?

4 votes

Which do you recommend and why?

I have a few tables, when i make a change to the data... it should go to a history table (audit) with a effective date.

The other solution is versioning each field to insert a new row when making changes to the data?

Which is the best method for the invoice information? Item name and price is always change

These are slowly changing dimensions, type 2 and type 4, appropriately.

Both methods are valid and may be more appropriate for your needs, depending on your model and query requirements.

Basically, type 2 (versioning) is more appropriate when you need to query historical values as often as the current one, while type 4 (history table) is more suited when you are querying the current value more often and there are more queries (more queries to develop I mean) against the most recent value.

Is it ok to store presentation data in the database?

4 votes

I am to displaying rows from a database (SQL SERVER 2005) table in a webpage. These rows contain a statusID (foreign key) that is further defined by Status table (eg. id, name, modifiedDate).

The various statuses should display differently (lets say they simply have a different background color).

I am using php to query the database, and build the web page table. As I build the table I am going to apply a css class to an element based on the status of that row.

I have at least 2 options to do this:

  1. Define code logic in the php to handle it, and if the statuses are changed in the database, I will have to change the code.

  2. Store the "class" in the database and simply apply the class that has been stored.

The latter option seems better to me, but I am unsure if embedding presentational data into the database is a bad design choice. This will be the foundation on which I create several intranet utilities, and I'd like to start off on the right foot.

There's nothing wrong with storing any data in the database, including presentation data. If it helps you produce effective results, while writing less code, then it is a good practice. The thing you need to make sure of is that you don't mix your presentation logic with your database logic.

You can ensure that these concerns are separated by encapsulating the data for your presentation layer in the properties of an elementInfo object (for example).

Since it is a CSS class that you are talking about, this presentation data should be kept separate from the business data. So, while it is okay to store both presentation data and business data in the database, it is not acceptable to store them in the same table.

Update re: comment No, you should not add a PresentationClassRecord's Id as an FK on the business object. I made a sample of an approach to the db below. I called the DummyTable your business objects, and the rest follows specification. The most relevant part is the StatusPresentationAssignmentTable

 -----------------------------------------------
 DummyTable
 -----------------------------------------------
 Id      Name     SomeOtherDataField    StatusId
 PK int  varchar  int                   FK int

 -----------------------------------------------
 StatusTable
 -----------------------------------------------
 Id      Name     ModifiedDate
 PK int  varchar  datetime

 -----------------------------------------------
 PresentationTable
 -----------------------------------------------
 Id      PresentationType    Value 
 PK int  varchar            
 sample data:
 43      CssClass            prettyBackground

 -----------------------------------------------
 StatusPresentationAssignmentTable
 -----------------------------------------------
 StatusId    PresentationId 
 FK int      FK int

Now with two simple join clauses you can get the presentation data and it is completely decoupled from your business data. Your script could do something like check if the Status of the Dummy has any presentation assignments. If it does then it looks at the PresentationType, gets the appropriate function to apply the presentation-data to the presentation, and executes it. (You would need to have a function for each PresentationType that knows how to handle the value - something that could be encapsulated by something like function applyPresentationValue(presentationElement, presentationType, presentationValue) that calls a different function applyCssClass(presentationElement, value) if the presentationType == "CssClass" ).