Best database questions in June 2012

How to successfully rewrite old mysql-php code with deprecated mysql_* functions?

12 votes

I am still learning mostly from books I buy, but today I learned that my PHP book is outdated even though I bought it this year. Now I know that the mysql_* functions in PHP are deprecated and should be replaced with the more secure and stable prepared statements and PDO. So I set out to rewrite my whole website accordingly, and I may need some advice from more experienced people on how to do it properly :)

So I will start my rewrite with only the main part (connecting to the server and choosing a database) here; the rest I can do on my own with Google and the manuals. I will post my old script and ask you whether I am doing everything right and not missing anything, and I hope this can be a good manual/answer for other people as well. So let's start.

So in config I have something like this:

$db = new dbConn('127.0.0.1', 'root', 'pass', 'people', 'animals');

Which should be like this:

$db = new PDO('mysql:host=127.0.0.1;dbname=people;charset=UTF-8', 'root', 'pass');

Right? But if I need to choose the database later, should I leave out dbname=people;? And how do I choose the database later?

Here is the one script I need to rewrite. It is basic to most web projects, and I hope it will help not only me understand how the new PDO system really works:

class dbConn
{
  public function __construct($server, $user, $pass, $db_people, $db_animals)
  {    
    if (!empty($server) && !empty($user) && !empty($pass) && !empty($db_people) && !empty($db_animals))
    {
      $this->server = $server;
      $this->user =  $user;
      $this->pass = $pass;
      $this->db_people = $db_people;  
      $this->db_animals = $db_animals;  
      $this->connect(); 
    }  
    else
    {
      die("Set up connection to db");
    }
  }

  public function connect()
  {
    $this->conn = mysql_connect($this->server, $this->user, $this->pass) or die ('cannot connect to MySQL');
  }

  public function selectDb($database)
  {
    switch($database)
    {
      case 'people':
        mysql_select_db($this->db_people, $this->conn) or die ('cannot connect to database '.$this->db_people.'.');
        mysql_query("SET NAMES 'utf8'");
        break;

      case 'animals':
        mysql_select_db($this->db_animals, $this->conn) or die ('cannot connect to database '.$this->db_animals.'.');
        mysql_query("SET NAMES 'utf8'"); 
    }
  }

  public function __destruct() 
  {
    if (!empty($this->conn))
    {
      mysql_close($this->conn); 
    }
  }  
}

So from what I know from Google and Wiki, functions like public function __construct and public function __destruct() should not be needed anymore, right? The same goes for functions like public function connect(). So the only thing left is public function selectDb($database), but I have no idea how to do this correctly without damaging the whole connection to the database. In the rest of my code (not shown here) I can easily choose a database with this code: $this->db->selectDb("people"); But with prepared statements I do not know if this is even possible in an easy way. I hope some advice from you will help me and other users understand this new code better. Other parts you may have in your code are explained in this PDO Tutorial for MySQL Developers. Thank you.

Actually, a short and sweet answer: yes, they are not necessary any longer.

Let's review the code to make sure we have not lost anything:

  • __construct - The constructor merely contained all the configuration. PDO has a much easier concept here, a connection string containing most of the information:

     mysql:host=127.0.0.1;dbname=people;charset=utf8
    

    Also, PDO provides a ready-made constructor, so yours is doubly unnecessary.

  • connect - The connection function is not necessary any longer either. Instantiating PDO already does this. You can catch exceptions; the PHP manual has an example on its constructor page.

  • selectDb - This complicated function is not needed any longer either. Wow, the third function we can just drop because of the PDO connection string. So much power in so few characters. Cheers!

  • __destruct - The destructor. Let's be fair: MySQL did not need this either. However, with PDO we get it for free, without writing a single line of code.

Looks good! You managed to migrate from that obscure database class to PDO by removing outdated code! Congratulations:

$db = new PDO('mysql:host=127.0.0.1;dbname=people;charset=utf8', 'root', 'pass');

If you are now thinking, "What if I want a database class of my own?", well, you can do that, because you can extend PDO (yes, that works!):

class DB extends PDO
{
   ... my super-new-shiny-code
}

Why might you want to do that? No idea, but maybe it makes your code more fluent. If you're looking for a better code example, I have one at PHP/MySQL Table with Hyperlinks.

Re-indexing huge database (the English Wikipedia) efficiently

11 votes

THE GIST

Before performing a massive 40+ GB import of the English Wikipedia, I had to temporarily remove indexes and auto-increment fields from three tables ('page', 'revision', and 'text') to handle the load. Now I have finally successfully imported the English Wikipedia to my local machine and created a local mirror (MediaWiki API). Yay!

However, I now need to re-create the indexes and auto-increment fields in less than a decade. Luckily, (1) I took plenty of screen-shots of the relevant tables in phpmyadmin before I removed the indexes and fields; (2) I can explain with extreme precision the steps I took before the import; and (3) this shouldn't be too difficult for anyone fluent in MySQL. Unfortunately, I have no expertise in MySQL whatsoever, so "baby steps" explanations would be extremely helpful.

PRECISELY WHAT I DID (PREPARING FOR THE IMPORT):

Steps 1, 2, 3: This image depicts the table page before I modified the field page_id by clicking 'Change' and un-checking 'Auto-Increment' (in preparation for importing). I performed precisely the same modifications for the field rev_id in table revision and old_id in table text but omitted the screen-shots to avoid redundancy.

[Image: table 'page' before modification of 'page_id']

Step 4: This image depicts the indexes for the table page before I dropped all of them.

[Image: indexes for table 'page' before I dropped them]

Step 5: This image depicts the indexes for the table revision before I dropped all of them.

[Image: indexes for table 'revision' before I dropped them]

Step 6: This image depicts the indexes for the table text before I dropped all of them.

[Image: indexes for table 'text' before I dropped them]

WHAT I NEED NOW (RESTORING AFTER THE IMPORT):

I just need to restore the original indexes and auto-increment fields without waiting a hundred years.

Set-up details: PHP 5.3.8 (apache2handler), MySQL 5.5.16 (InnoDB), Apache 2.2.21, Ubuntu 12.04 LTS, MediaWiki 1.19.0 (private wiki)

I really like Wikipedia so I'll try to help.

You need to use a lot of

ALTER TABLE

Add primary keys

ALTER TABLE page ADD PRIMARY KEY (page_id);
ALTER TABLE revision ADD PRIMARY KEY (rev_id);
ALTER TABLE text ADD PRIMARY KEY (old_id);

Add auto increments back

ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

I need the table descriptions for all tables before continuing. If rev_id and old_id have the same definition as page_id, then:

ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

Add unique keys

ALTER TABLE page ADD UNIQUE name_title(page_namespace, page_title);
ALTER TABLE revision ADD UNIQUE rev_page_id(rev_page, rev_id);

Other indexes

ALTER TABLE page ADD INDEX page_random(page_random);
ALTER TABLE page ADD INDEX page_len(page_len);
ALTER TABLE page ADD INDEX page_redirect_namespace(page_is_redirect, page_namespace, page_len);
ALTER TABLE revision ADD INDEX rev_timestamp(rev_timestamp);
ALTER TABLE revision ADD INDEX page_timestamp(rev_page, rev_timestamp);
ALTER TABLE revision ADD INDEX user_timestamp(rev_user, rev_timestamp);
ALTER TABLE revision ADD INDEX user_text_timestamp(rev_user_text, rev_timestamp);

Again, there may be column definitions that change this stuff. You need to provide the CREATE TABLE info.
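
Since each ALTER TABLE on tables this size can take a long time, it may help to issue one statement per table instead of one per index. MySQL accepts several comma-separated alterations in a single ALTER TABLE, which may avoid reworking the table once per statement. The sketch below only builds the SQL text; the helper name combined_alter is made up for illustration, and the clauses are copied from the statements above (so they carry the same caveat about unconfirmed column definitions).

```python
# Sketch: build one ALTER TABLE statement per table so that MySQL can
# process all of a table's changes together.
# The clauses are copied from the answer above; the column definitions
# are still unconfirmed until the CREATE TABLE info is available.

def combined_alter(table, clauses):
    """Join several ALTER TABLE clauses into a single statement."""
    if not clauses:
        raise ValueError("no clauses given for table %r" % table)
    body = ",\n  ".join(clauses)
    return "ALTER TABLE {}\n  {};".format(table, body)

page_clauses = [
    "ADD PRIMARY KEY (page_id)",
    "MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT",
    "ADD UNIQUE name_title (page_namespace, page_title)",
    "ADD INDEX page_random (page_random)",
    "ADD INDEX page_len (page_len)",
    "ADD INDEX page_redirect_namespace (page_is_redirect, page_namespace, page_len)",
]

print(combined_alter("page", page_clauses))
```

The same helper could be called with the clauses for revision and text. Whether this is actually faster on a given server is worth checking on a small copy of the data first.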

What exactly is "pinning" in relation to indexes in DBMS?

7 votes

As part of a test question, students were asked to define and describe pinning in relation to indexes. When one talks about "pinning" in indexes, what exactly is this? Is there some other word or term I can search for? Google has not turned up anything useful.

When a table or index is pinned, it means it is kept in memory.

The database has a certain amount of memory allocated to work with. Typically, the database caches recently used data.

When an index is pinned, it means the index is kept in memory permanently, rather than being aged out of the cache.
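
As a rough illustration of the idea (not how any particular DBMS implements it), here is a toy least-recently-used cache in which pinned entries are never aged out. The class name BufferCache and the key names are invented for the example.

```python
from collections import OrderedDict

class BufferCache:
    """Toy LRU cache in which pinned keys are never evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> value, least recently used first
        self.pinned = set()

    def pin(self, key, value):
        """Load an entry and mark it as permanently resident."""
        self.pinned.add(key)
        self.put(key, value)

    def get(self, key):
        if key not in self.entries:
            return None                # a real DBMS would read from disk here
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        # Age out least recently used, unpinned entries while over capacity.
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                break                  # everything left is pinned
            del self.entries[victim]

cache = BufferCache(capacity=2)
cache.pin("customer_index", "index pages")
cache.put("orders_page_1", "data")
cache.put("orders_page_2", "data")     # evicts orders_page_1, not the pinned index
```

Even though customer_index is the oldest entry, it stays in the cache because it is pinned; the unpinned orders_page_1 is evicted instead.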

When is it appropriate to use a database, in Python

6 votes

I am making a little add-on for a game, and it needs to store information on a player's username, IP address, location in game, and a list of alternate usernames that have come from that IP or alternate IP addresses that come from that username. I read an article a while ago that said that unless I am storing a large amount of information that cannot be held in RAM, I should not use a database. So I tried using the shelve module in Python, but I'm not sure if that is a good idea. When do you guys think it is a good idea to use a database, and when is it better to store information in another way? Also, what are some other ways to store information besides databases and flat-file databases?

Most importantly, unless you specifically need performance or high reliability, do whatever will make your code simplest/easiest to write.


If your data is extremely structured (and you know SQL or are willing to learn) then using a database like sqlite3 might be appropriate. (You should ignore the comment about database size and RAM: there are times when databases are appropriate for even very small data sets, because of how the data is structured.)
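
To give a feel for the sqlite3 route, here is a minimal sketch for the username/IP data described in the question. The table name sightings, the column names, and the sample values are all made up for illustration; a real add-on would use a file path instead of the in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # use a file path to persist between runs
conn.executescript("""
    CREATE TABLE sightings (
        username TEXT NOT NULL,
        ip       TEXT NOT NULL,
        location TEXT,
        PRIMARY KEY (username, ip)
    );
""")

def record(username, ip, location):
    """Store or update the last known location for a username/IP pair."""
    with conn:   # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO sightings (username, ip, location) VALUES (?, ?, ?)",
            (username, ip, location),
        )

def alternate_names(username):
    """Other usernames seen from any IP this username has used."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.username
        FROM sightings AS mine
        JOIN sightings AS other ON other.ip = mine.ip
        WHERE mine.username = ? AND other.username <> ?
        ORDER BY other.username
        """,
        (username, username),
    )
    return [name for (name,) in rows]

record("alice", "10.0.0.1", "spawn")
record("alice_alt", "10.0.0.1", "mine")
record("bob", "10.0.0.2", "spawn")
print(alternate_names("alice"))
```

The "alternate usernames from the same IP" lookup is a single self-join here, which is the kind of structured query that tends to get awkward with shelve or plain dictionaries.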

If the data is relatively simple and you don't need the reliability that a database (normally) has then storing it in one of the builtin datatypes while the program is running is probably fine.

If you'd like the data stored on disk to be human readable (and editable, with a bit of effort), then a format like JSON (there is a builtin json module) is nice, since the basic Python objects serialise without any effort. If the data is not so simple, then YAML is essentially an extended version of JSON (PyYAML is very good). Similarly, you could use CSV files (the csv module), although this is not nearly as good as JSON or YAML, or just a custom text format (but it takes quite a lot of effort to get error handling and so on implemented neatly).

Finally, if your data contains more advanced objects (e.g. recursive dictionaries, or complicated custom datatypes) then using one of the builtin binary serialisation techniques (pickle, shelve etc.) might be appropriate. However, YAML can handle many of these things (including recursive data structures).
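
For the JSON option, a minimal sketch might look like this. The file name players.json and the layout of the players dictionary are assumptions made for the example, not anything the game or add-on prescribes.

```python
import json
import os
import tempfile

# Hypothetical in-memory structure: one entry per username.
players = {
    "alice": {"ips": ["10.0.0.1"], "location": "spawn", "alts": ["alice_alt"]},
}

def save(path, data):
    """Write the data as indented, human-editable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, sort_keys=True)

def load(path):
    """Read the data back into plain dicts, lists and strings."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "players.json")
save(path, players)
restored = load(path)
```

Dictionaries, lists, strings and numbers round-trip unchanged; sets and custom classes would need converting first.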

Some general points:

  • Plain text formats are nice, as they allow values to be tweaked easily and debugging/testing is easy
  • Binary formats are nice, as they mean that values can't be tweaked without a little bit of extra effort (this is not saying they can't be adjusted though), and the file size is smaller (probably not relevant)

How does one insert records simultaneously into two database tables?

6 votes

I'm a newbie to databases and db's in Delphi, learning about both via online instructional materials. I'm struggling with a difference between real-life and the examples I'm finding. To be concrete, consider the common Books and Authors many-to-many relationship. Assume you've got a Book table (book_id, book_title, etc.), an Author table (author_id, author_name, etc.), and an AuthorBook join table. All three tables would have unique ID's, auto-generated, as primary keys.

The examples always begin with Author and Book information already inserted in their respective tables. In real-life, though, I think you'd be trying to insert records into both tables simultaneously, i.e., users would see a form or grid with places to enter a book's title and its author(s). How would something like that be coded in Delphi, assuming data-aware controls, an underlying Access database (or something else alterable via SQL)?

If you started with tables like these . . .

create table books (
  book_id integer primary key,
  book_title varchar(15) not null
);

create table authors (
  author_id integer primary key,
  author_name varchar(15) not null
);

create table book_authors (
  book_id integer not null references books (book_id),
  author_id integer not null references authors (author_id),
  primary key (book_id, author_id)
);

. . . and if you need to insert a new book and a new author at the same time, you might execute a SQL transaction like this.

begin transaction;
insert into books values (1, 'First book');
insert into authors values (1, 'First author');
insert into book_authors (book_id, author_id) values (1, 1);
commit;

Using a single transaction guarantees that either all three inserts are written to the database, or that none of them are. Alternatives are

  • to build an updatable view in the database, joining all three tables, and inserting into the view,
  • to write a stored procedure in the database, and insert through the stored procedure, and
  • to insert into each table separately, which assumes that the existence of the book is important even if you don't know the author, and vice versa. (This is probably what I'd do for books and authors.)

If you were adding a new book for an existing author, you'd execute a slightly different transaction.

begin transaction;
insert into books values (2, 'Second book');
insert into book_authors (book_id, author_id) values (2, 1);
commit;

I imagine Delphi is like any other client-side language here. Instead of literal integers, you'd reference some property of the data-aware controls, perhaps a "value" or "text" property. And you'd execute the transaction in a button's "click" event.

If Delphi is sufficiently "data aware"--using controls that are bound to columns and rows in a database, like Access's native controls are--you might not need to execute any SQL or do anything special to save any automatic ID number the dbms generates; it will be accessible through one of the control's properties. (Access's forms and controls are highly data aware; that's how they work.) But if you have to, and you're using Microsoft's OLEDB provider for Access, you can use select @@identity to get the last id number used through your connection.
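
The same pattern with database-generated ids, sketched in Python with SQLite rather than Delphi and Access (an assumption made purely so the example is self-contained): insert the book and the author, read back each generated id, and insert the join row, all inside one transaction. The function name add_book_with_author is made up; lastrowid plays the role that select @@identity plays with Access.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, book_title VARCHAR(15) NOT NULL);
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, author_name VARCHAR(15) NOT NULL);
    CREATE TABLE book_authors (
        book_id   INTEGER NOT NULL REFERENCES books (book_id),
        author_id INTEGER NOT NULL REFERENCES authors (author_id),
        PRIMARY KEY (book_id, author_id)
    );
""")

def add_book_with_author(title, author_name):
    """Insert a book, an author, and their link in one transaction."""
    with conn:  # commit if all three inserts succeed, roll back otherwise
        book_id = conn.execute(
            "INSERT INTO books (book_title) VALUES (?)", (title,)
        ).lastrowid
        author_id = conn.execute(
            "INSERT INTO authors (author_name) VALUES (?)", (author_name,)
        ).lastrowid
        conn.execute(
            "INSERT INTO book_authors (book_id, author_id) VALUES (?, ?)",
            (book_id, author_id),
        )
    return book_id, author_id

ids = add_book_with_author("First book", "First author")
```

If any of the three inserts fails, the with block rolls the whole transaction back, so no orphaned book or author row is left behind.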

What's the most efficient way of creating dynamic page body?

6 votes

I've taken a look at the PHP script behind my father's website, which was built by a hired programmer. Now, I don't think I'm better than him, but I think his technique might not be the best.

The website has a dynamic page body, meaning that my dad can, via a specific admin page, modify the HTML content of most of the pages on the website. Right now this is done via a database: the pages are all stored in the database, and every request runs a query that fetches the page from the database and renders it.

Now, I think this approach is bad, mostly because it requires an additional query to the database (even if that is not very expensive when cached). Wouldn't it be more efficient to store the pages as HTML files and then just modify the file itself when required? That way, I think, editing the file is faster, and loading the content of an HTML file per request is a lot easier and faster than performing a query.

Is it? Is there any other (more efficient) way to handle this situation?

There are several good reasons why a CMS should use a Database to store/fetch the dynamic content. Just as there are several reasons why you might prefer not to rely on a DB.

  • Pro Db:

    • Security: It's an obvious, and slightly ambivalent, argument, but nonetheless. If you decide to store your content as separate files on your server, they'll need to be stored in a directory that doesn't allow public access. If not, users might be able to access the chunks of your site separately, which comes across as unprofessional.
      People with ignoble intentions will have an easy time altering your site's content, too. Of course, there are many ways to prevent this, and increase overall security. Database systems, when left to their own devices, aren't exactly safe either, but provide an extra obstacle to hackers with minimal effort.
      Note: the security argument stands or falls with how well your script filters out injection, and how securely you set up your server.

    • Disk usage. When using separate files to compose each requested page, the server has to access its hard disk on each request. Again, caching solves this issue to some extent, but it's easier and (in general) better performance-wise to cache DB query results, either on your database server, in PHP, or, better still, both.

    • Logging. By this I mean: when you alter the content, a database driven CMS is a lot easier to manage. If you altered the content, and want to undo/rollback the changes, a DB is the easiest way to implement such a feature. Using HTML, you'll soon find yourself wading through tons of files called site_menu_block_YYYY-mm-dd.html.backup. Even if this is done by a script, it'll almost certainly be slower than using a DB.

    • Translation: as vlzvl pointed out, if you're using static pages, you'll end up with each page N times, once for each language. When altering the stylesheets, you'll then have to alter N files, too, which is resource expensive. Alternatively, your scripts will parse an HTML template file for each request, plus an XML file with the actual contents. This way you lose the SEO benefit of the HTML files, cause extra server load, and slow down your site.

  • Pro HTML:

    • I can only give one solid pro argument here: it's a lot easier to get an SEO-friendly site this way. Just allow search engines to index the separate files. This does decrease the overall security of your CMS drastically.

That said, I think I'm right in saying that all major CMS's use both methods, depending on what type of data they're dealing with. HTML headers, for example, are often partially stored as separate files, just like JS files and style-sheets.
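
To illustrate the "cache DB query results" point, here is a minimal sketch, written in Python with SQLite rather than PHP so that it is self-contained. The table name pages, the slug column, and the function names are assumptions for the example. Page bodies are fetched from the database once, served from memory afterwards, and the cached copy is dropped whenever the admin page saves an edit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (slug TEXT PRIMARY KEY, body TEXT NOT NULL)")
conn.execute("INSERT INTO pages VALUES ('home', '<h1>Welcome</h1>')")
conn.commit()

cache = {}                      # slug -> body, kept in process memory
stats = {"db_reads": 0}         # counts real queries, for illustration only

def get_page(slug):
    """Return the page body, querying the database only on a cache miss."""
    if slug not in cache:
        stats["db_reads"] += 1
        row = conn.execute(
            "SELECT body FROM pages WHERE slug = ?", (slug,)
        ).fetchone()
        if row is None:
            return None
        cache[slug] = row[0]
    return cache[slug]

def update_page(slug, body):
    """Save an edit from the admin page and drop the stale cache entry."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO pages (slug, body) VALUES (?, ?)", (slug, body)
        )
    cache.pop(slug, None)
```

With this arrangement the extra query the question worries about only happens on the first request after each edit.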

How can I execute code stored in a database?

5 votes

How can I store, for example, the body of a method in a database and later run it? (I'm using Delphi XE2; maybe RTTI would help.)

RTTI is not a full language interpreter. Delphi is a compiled language. You write it, compile it, and distribute only your binaries. Unless you're Embarcadero, you don't have rights to distribute DCC32 (the command line compiler).

However, the JVCL includes a Delphi-like language subset wrapped up in a very easy-to-use component called "JvInterpreter". You could write some code (as Pascal) and place it in a database. You could then "run that code" (interpreted, not compiled) that you pull from the database. Typically these should be procedures that call methods in your code. You have to write some "wrappers" that expose the compiled APIs that you wish to make available to the interpreter (providing access to live data, or database connection objects, or table/query objects). You're thinking that this sounds perfect, right? Well, it's a trap.

Beware of something called "the configuration complexity clock". You've just reached 9 o'clock, and that's where a lot of pain and suffering begins. Just like when you have a problem, and you solve it with regular expressions, and "now you have two problems", adding scripting and DSLs to your app has a way of solving one problem and creating several others.

While I think the "DLL stored in a database blob field" idea is evil, and absurd, I think that wanton addition of scripting and domain-specific languages to applications is also asking for a lot of pain. Ask yourself first if some other simpler solution could work. Then apply the YAGNI principle (You Ain't Gonna Need It) and KISS (keep-it-simple-smartguy).

Think twice before you implement anything like what you're asking about doing in your question.

How to get Max id from table of database in java code

5 votes

I want to write code which gives the max id from the table, but it is throwing an error.

code:

Class.forName("oracle.jdbc.driver.OracleDriver");
Connection con = DriverManager.getConnection("XXXXX", "XXXX", "XXX");
Statement st2 = con.createStatement();
ResultSet idMax = st2.executeQuery("select nvl(max(work_id),0) from workdetails");
int id2 = idMax.getInt(0);  // throw error: Invalid column index

System.out.println(id2);

// ****************************
int id2 = idMax.getInt("work_id");
System.out.println(id2);   // throw error: ResultSet.next was not called

A result set starts positioned before the first record and must be advanced to the first real record by calling the next method:

ResultSet idMax = st2.executeQuery("select nvl(max(work_id),0) max_id from workdetails");
int id2 = -1;
if (idMax.next()) {
   id2 = idMax.getInt("max_id");  
}

Main table with hundreds vs few smaller

5 votes

I was wondering which approach is better for designing databases?

I have currently one big table (97 columns per row) with references to lookup tables where I could.

Wouldn't it be better for performance to group some columns into smaller tables and give them key columns for referencing one whole row?

If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.

97 columns isn't much, really - I've seen way beyond 100.

It all depends on how your data is being used - if your row just has 97 columns, all the time, and needs all 97 columns - then it hardly ever makes sense to split those up into various tables.

It might make sense if:

  • you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need those all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra-large columns

  • you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for e.g. 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.

How to write a MYSQL query that will return children nested under parents?

5 votes

I don't know if what I'm asking is even possible, but here's my situation. I have a table structured somewhat like this:

+--------------------------------------------------+
|   id   |   parent_id   |   name   |   category   | ....
+--------------------------------------------------+
|    0   |       -1      |   item0  |      1       |
|    1   |        0      |   item1  |      1       |
|    2   |        0      |   item2  |      1       |
|    3   |        2      |   item3  |      1       | 
|    4   |        2      |   item4  |      1       | 
|    5   |       -1      |   item5  |      1       | 
+--------------------------------------------------+

A parent_id of -1 will mean it is a "base" item with no parent. Each item will have more columns of information. I need to somehow output all items in a category nested like the following:

item0 => item1    
      => item2
            => item3
            => item4  
item5  

I don't know if that makes sense or not, but hopefully it does!

The only way I can think of doing this is making a query to get all of the "base" items (querying for rows with parent_id = -1) then iterate through every resulting row, querying for rows that have their parent_id equal to the current row's id, then repeating the process going deeper and deeper until there aren't any more children for a base item.

Is there a better way?

Thanks!!

It is not possible in pure SQL with MySQL, which (as of 5.5) has no recursive queries.

SQL is intended to work with relational data, not trees (hierarchical data).

You can represent a tree in an SQL schema; however, you won't be able to get the result back as a tree, as you intend to.

The only way to get a usable result is to make as many joins as the number of levels you are storing.

Your current schema may support multiple levels; however, it will be very difficult to manage more than one or two levels.

You may be interested in the Nested Set Model or Managing Hierarchical Data in MySQL.

There are some implementations of the Nested Set, like this one, which works with Doctrine 2.
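
One alternative that this answer does not cover, offered here only as a sketch: fetch every row of the category with a single flat query and do the nesting in application code. The example is in Python for brevity (the same approach carries over to PHP), and the table name items in the comment is an assumption, since the question does not name the table.

```python
from collections import defaultdict

# Rows as they might come back from a single query such as
# "SELECT id, parent_id, name FROM items WHERE category = 1"
# (the table name "items" is hypothetical).
rows = [
    (0, -1, "item0"),
    (1, 0, "item1"),
    (2, 0, "item2"),
    (3, 2, "item3"),
    (4, 2, "item4"),
    (5, -1, "item5"),
]

def build_tree(rows, root_parent=-1):
    """Group rows by parent_id, then attach children recursively."""
    children = defaultdict(list)
    for item_id, parent_id, name in rows:
        children[parent_id].append((item_id, name))

    def attach(parent_id):
        return [
            {"id": item_id, "name": name, "children": attach(item_id)}
            for item_id, name in children.get(parent_id, [])
        ]

    return attach(root_parent)

def render(nodes, depth=0):
    """Return one indented line per item."""
    lines = []
    for node in nodes:
        lines.append("    " * depth + node["name"])
        lines.extend(render(node["children"], depth + 1))
    return lines

print("\n".join(render(build_tree(rows))))
```

This uses one query regardless of depth, at the cost of loading the whole category into memory, and it assumes the data contains no cycles.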

Database search to return results ordered by size of intersection between two sets

4 votes

I would like to search my database for sets that intersect with my search set. I would like for the results to be returned to me in order of the size of the intersection.

Each set stored in a database row will contain on the order of 10,000 ids. The search sets contain on the order of 500 ids. The number of rows in the database is about 1,000,000.

EXAMPLE QUERY:

search_set = [ This set has 500 id's ]

SELECT rows WHERE "find_set" INTERSECTS "search_set" 
    ORDER BY "size of the intersection"

EXAMPLE DATABASE:

index         find_set
1             [set with 10,000 ids]
2             [set with  5,000 ids]
...
1,000,000     [set with 15,000 ids]

  • How long can I expect this query to take?
  • Is there a particular database or database library that I should be using?
  • Do I need to do some pre-processing?
  • How do databases implement this type of query? Do they do one search for each of the 500 ids in "search_set"?
  • What other things do I need to know about this type of problem and how it is solved?

Thanks so much!

The performance of this query depends strongly on the database optimization engine and the way you perform the query.

First of all, databases don't generally have tables with 15,000 ids in a column. Instead you'll need something like this pair of tables:

set
---
id

set_entry
-----------
id
set_id
entry

The first table will have a million rows. The second more like 10 billion. Put an index on set_entry.entry.

The best way generally to arrange your query is to have a temporary table of some sort whose rows are the values of your query set. Then execute a query like this:

SELECT set_entry.set_id, COUNT(*)
FROM set_entry
  JOIN query_entry
    ON set_entry.entry = query_entry.entry
GROUP BY set_entry.set_id
ORDER BY COUNT(*) DESC

The query plan that you want is that for each of your elements it should do a lookup on the index, pull back all matching rows, then proceed to do a grouping operation to figure out how many there are for each set you intersect. On the first step you'd do 500 lookups, then pull back somewhere between 0 and 500 million rows. Let's say you're pulling back 5 million. The grouping operation will be done either by building a hash or sorting the data (databases can do it either way), both of which should be plenty fast.

There are a lot of unknowns, but this plan is likely to take a few seconds.
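
A scaled-down sketch of this approach, using Python and SQLite purely for illustration (the data is tiny and invented, so it says nothing about timing). The table and column names follow the layout above; the index name set_entry_entry is made up. Note that it groups by set_id, so that each count is the size of the intersection with one stored set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE set_entry (
        id     INTEGER PRIMARY KEY,
        set_id INTEGER NOT NULL,
        entry  INTEGER NOT NULL
    );
    CREATE INDEX set_entry_entry ON set_entry (entry);
    CREATE TEMP TABLE query_entry (entry INTEGER PRIMARY KEY);
""")

# Invented sample data: three stored sets and one search set.
stored_sets = {1: [10, 20, 30], 2: [20, 30, 40, 50], 3: [60, 70]}
search_set = [20, 30, 40]

with conn:
    conn.executemany(
        "INSERT INTO set_entry (set_id, entry) VALUES (?, ?)",
        [(set_id, e) for set_id, entries in stored_sets.items() for e in entries],
    )
    conn.executemany(
        "INSERT INTO query_entry (entry) VALUES (?)", [(e,) for e in search_set]
    )

# One row per stored set that intersects the search set, largest overlap first.
ranked = conn.execute("""
    SELECT set_entry.set_id, COUNT(*) AS overlap
    FROM set_entry
      JOIN query_entry ON set_entry.entry = query_entry.entry
    GROUP BY set_entry.set_id
    ORDER BY overlap DESC, set_entry.set_id
""").fetchall()
```

Sets with no common element (set 3 here) simply do not appear in the result.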

What you want to be careful about is a query like this:

SELECT set_entry.set_id, COUNT(*)
FROM set_entry
WHERE entry IN (id1, id2, ....)
GROUP BY set_entry.set_id
ORDER BY COUNT(*) DESC

In my experience, most database engines look at this and decide that they cannot use the index. Instead they will scan through all of set_entry (which has 10 billion rows), and for each row scan through that set of 500 elements, doing pairwise comparisons. This means an initial step of about 5 trillion pairwise comparisons. This plan will easily keep your CPU busy for hours.