Best mysql questions in September 2010

Why does this simple mysql insert query take occasionally so long?

15 votes

Ok, I've got a real head scratcher... I'm going bald!

This is a pretty simple problem. Inserting data into the table normally works fine, but once in a while the insert query takes a few seconds. This isn't very good, so I set up a simulation of the insert process. I am NOT trying to bulk insert data. I am trying to find out why the insert query occasionally takes more than 2 seconds to run. Joshua suggested that the index file may be being adjusted; I have removed the id (primary key field), but the delay still happens.

I have a MyISAM table: daniel_test_insert (this table starts COMPLETELY empty):

create table if not exists daniel_test_insert ( 
    id int unsigned auto_increment not null, 
    value_str varchar(255) not null default '', 
    value_int int unsigned default 0 not null, 
    primary key (id) 
)

I insert data into it, and sometimes an insert query takes > 2 seconds to run. THERE ARE NO READS on this table. All writes are made in serial by a single-threaded program.

I insert this same row 100,000 times, using the exact same query each time, because once in a while the query takes a long time and I'm trying to find out why. So far it appears to be a random occurrence.
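A minimal sketch of such a timing harness, for readers who want to reproduce the measurement. It is not the original test program: it uses Python's built-in sqlite3 in memory purely as a stand-in so that it runs anywhere, and the function name `time_inserts` and the threshold parameter are assumptions made for illustration. Against a real server you would swap in your MySQL driver's connection.

```python
import sqlite3
import time

def time_inserts(conn, n, threshold_s):
    """Run the same INSERT n times; return (index, seconds) for each slow run."""
    slow = []
    sql = "INSERT INTO daniel_test_insert (value_int, value_str) VALUES (?, ?)"
    row = (12345, "afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f")
    for i in range(n):
        start = time.perf_counter()
        conn.execute(sql, row)
        conn.commit()  # one commit per insert, i.e. no bulk loading
        elapsed = time.perf_counter() - start
        if elapsed > threshold_s:
            slow.append((i, elapsed))
    return slow

# Stand-in database (assumption): in-memory SQLite instead of the MySQL master.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daniel_test_insert ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " value_str TEXT NOT NULL DEFAULT '',"
    " value_int INTEGER NOT NULL DEFAULT 0)"
)
slow = time_inserts(conn, 1000, threshold_s=2.0)
print(f"{len(slow)} of 1000 inserts exceeded 2 seconds")
```

Recording the loop index alongside the duration makes it easier to check later whether the slow inserts line up with anything periodic on the server.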

This query, for example, took 4.194 seconds (a very long time for an insert):

Query: INSERT INTO daniel_test_insert SET value_int=12345, value_str='afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f' - ran for 4.194 seconds
status               | duration | cpu_user  | cpu_system | context_voluntary | context_involuntary | page_faults_minor
starting             | 0.000042 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
checking permissions | 0.000024 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Opening tables       | 0.000024 | 0.001000  | 0.000000   | 0                 | 0                   | 0                
System lock          | 0.000022 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Table lock           | 0.000020 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
init                 | 0.000029 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
update               | 4.067331 | 12.151152 | 5.298194   | 204894            | 18806               | 477995           
end                  | 0.000094 | 0.000000  | 0.000000   | 8                 | 0                   | 0                
query end            | 0.000033 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
freeing items        | 0.000030 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
closing tables       | 0.125736 | 0.278958  | 0.072989   | 4294              | 604                 | 2301             
logging slow query   | 0.000099 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
logging slow query   | 0.000102 | 0.000000  | 0.000000   | 7                 | 0                   | 0                
cleaning up          | 0.000035 | 0.000000  | 0.000000   | 7                 | 0                   | 0

This is an abbreviated version of the SHOW PROFILE output; I threw out the columns that were all zero.

Now the update has an incredible number of context switches and minor page faults.

Opened_tables increases by about 1 every 10 seconds on this database (so it is not running out of table_cache space).

Stats:

MySQL 5.0.89

Hardware: 32 GB of RAM / 8 cores @ 2.66GHz; RAID 10 SCSI hard disks (SCSI II???). I have had the hard drives and RAID controller queried: no errors are being reported. The CPUs are about 50% idle.

iostat -x 5 reports less than 10% utilization for the hard disks. top reports a 1-minute load average of about 10 (normal for our db machine).

Swap space has 156k used (32 gigs of ram :)

I'm at a loss to find out what is causing this performance lag! Does anyone have any suggestions?

This does NOT happen on our low-load slaves, only on our high load master. This also happens with memory and innodb tables.

Warning: This is a production system, so nothing exotic!

-daniel (I'm going to have to use my dog's hair for a toupee!!!)

Updated: Sept 20th, 2010: I'm going bald!

I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single table INSERT/UPDATE/REPLACE statements --- not on any SELECTs. No load, locking, or thread build up is evident.

I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.

Also Ruled Out

  • Server load -- no correlation with high load
  • Engine -- happens with InnoDB/MyISAM/Memory
  • MySQL Query Cache -- happens whether it's on or off
  • Log rotations -- no correlation in events

The only other observation I have at this point is derived from the fact I'm running the same db on multiple machines. I have a heavy read application so I'm using an environment with replication -- most of the load is on the slaves. I've noticed that even though there is minimal load on the master, the phenomenon occurs more there. Even though I see no locking issues, maybe it's Innodb/Mysql having trouble with (thread) concurrency? Recall that the updates on the slave will be single threaded.

MySQL Version 5.1.48

Update

I think I have a lead on the problem in my case. On some of my servers, I noticed this phenomenon more than on the others. Seeing what was different between the servers, and tweaking things around, I was led to the MySQL InnoDB system variable innodb_flush_log_at_trx_commit.

I found the doc a bit awkward to read, but innodb_flush_log_at_trx_commit can take the values of 1,2,0:

  • For 1, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk for every commit.
  • For 2, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk approximately every 1-2 seconds.
  • For 0, the log buffer is flushed to the log file every second, and the log file is flushed to disk every second.

In effect, going in the order (1, 2, 0), as reported and documented, you're supposed to get increasing performance in exchange for increased risk.

Having said that, I found that the servers with innodb_flush_log_at_trx_commit=0 were performing worse (i.e. having 10-100 times more "long updates") than the servers with innodb_flush_log_at_trx_commit=2. Moreover, things immediately improved on the bad instances when I switched it to 2 (note you can change it on the fly).

So, my question is, what is yours set to? Note that I'm not blaming this parameter, but rather highlighting that its context is related to this issue.

Is storing a comma separated list in a database column really that bad?

9 votes

Imagine a web form with a set of checkboxes (any or all can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.

Now, I know that the correct solution would be to create a second table and properly normalize the database. One reason I didn't was laziness: my SQL knowledge is very limited and only seldom used, so I would have had to look up quite a lot to implement the more correct solution.

I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?

Edit:

Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with; one of them is the topic of this question.

In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:

  • Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
  • Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
  • Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
  • Can't delete a value from the list without fetching the whole list.
  • Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
  • Hard to count elements in the list, or do other aggregate queries.
  • Hard to join the values to the lookup table they reference.
  • Hard to fetch the list in sorted order.

To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
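As a rough illustration of what the normalized alternative buys you, here is a small sketch using Python's built-in sqlite3 as a stand-in for MySQL. The table and column names (`options`, `submissions`, `submission_options`) are invented for the example and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE options (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE submissions (
    id INTEGER PRIMARY KEY
);
-- One row per ticked checkbox instead of one comma-separated string.
CREATE TABLE submission_options (
    submission_id INTEGER NOT NULL REFERENCES submissions(id),
    option_id     INTEGER NOT NULL REFERENCES options(id),
    PRIMARY KEY (submission_id, option_id)
);
""")
conn.executemany("INSERT INTO options (id, name) VALUES (?, ?)",
                 [(1, "email"), (2, "phone"), (3, "post")])
conn.execute("INSERT INTO submissions (id) VALUES (1)")
conn.executemany("INSERT INTO submission_options VALUES (?, ?)",
                 [(1, 1), (1, 3)])

def try_insert(submission_id, option_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO submission_options VALUES (?, ?)",
                     (submission_id, option_id))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert(1, 3)   # same value twice: blocked by the primary key
unknown_ok = try_insert(1, 99)    # value not in the lookup table: blocked by the FK

# Join to the lookup table and fetch the values in sorted order.
names = [r[0] for r in conn.execute(
    "SELECT o.name FROM submission_options so"
    " JOIN options o ON o.id = so.option_id"
    " WHERE so.submission_id = 1 ORDER BY o.name")]

# Remove a single value without fetching the whole list.
conn.execute("DELETE FROM submission_options"
             " WHERE submission_id = 1 AND option_id = 3")
remaining = conn.execute(
    "SELECT COUNT(*) FROM submission_options WHERE submission_id = 1"
).fetchone()[0]
```

Each bullet above maps onto a constraint or a one-line query here, which is the functionality the application would otherwise have to reimplement by hand.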

Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational "optimization" benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.

Is it better to create an index before filling a table with data, or after the data is in place?

9 votes

I have a table of about 100M rows that I am going to copy to alter, adding an index. I'm not so concerned with the time it takes to create the new table, but will the created index be more efficient if I alter the table before inserting any data or insert the data first and then add the index?

Creating the index after inserting the data is the more efficient way. (It is even often recommended to drop indexes before a batch import and recreate them after the import.)
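The order of operations can be sketched as follows. This uses Python's built-in sqlite3 as a stand-in; the `readings` table, its columns, and the helper name `bulk_load_then_index` are hypothetical, and the sketch only shows the sequence, not a benchmark.

```python
import sqlite3

def bulk_load_then_index(conn, rows):
    """Create the table, load all rows, and only then build the index."""
    conn.execute(
        "CREATE TABLE readings ("
        " id INTEGER PRIMARY KEY, sensor INTEGER, value REAL)"
    )
    with conn:  # one transaction for the whole load
        conn.executemany(
            "INSERT INTO readings (sensor, value) VALUES (?, ?)", rows
        )
    # The index is built once over the finished data set instead of being
    # updated on every single insert.
    conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")

conn = sqlite3.connect(":memory:")
bulk_load_then_index(conn, [(i % 10, float(i)) for i in range(1000)])
```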

Insert/update helper function using PDO

9 votes

I have a very simple helper function to produce a SET clause for traditional plain mysql driver usage:

function dbSet($fields) {
  $set='';
  foreach ($fields as $field) {
    if (isset($_POST[$field])) {
      $set.="`$field`='".mysql_real_escape_string($_POST[$field])."', ";
    }
  }
  return substr($set, 0, -2); 
}

used like this

$id = intval($_POST['id']);
$fields = explode(" ","name surname lastname address zip fax phone date");
$_POST['date'] = $_POST['y']."-".$_POST['m']."-".$_POST['d'];
$query  = "UPDATE $table SET ".dbSet($fields).", stamp=NOW() WHERE id=$id";

It makes the code quite DRY and easy, yet flexible at the same time.

I'd like to ask whether anyone is willing to share a similar function that uses PDO's prepared statements.

I am still in doubt about how to accomplish this.
Is there a straightforward way to use PDO prepared statements to insert data? What form should it take? A query builder helper? Or an insert query helper? What parameters should it take?

I hope it can be simple enough to serve as an answer here on SO, because in every topic we see prepared statements recommended, but there is not a single good example. A real-life example, I mean. Typing bind_param() 20 times is not good programming style, I believe. Nor is typing 20 question marks.

I usually have a class extending PDO, but my class is pretty custom. If I get it cleaned up and tested I will post it at a later time. Here is a solution to your system, however.

function dbSet($fields, &$values) {
    $set = '';
    $values = array();

    foreach ($fields as $field) {
        if (isset($_POST[$field])) {
            $set .= "`$field` = ?,";
            $values[] = $_POST[$field];
        }
    }

    return rtrim($set, ',');
}

$id = intval($_POST['id']);
$fields = explode(" ","name surname lastname address zip fax phone date");
$_POST['date'] = $_POST['y']."-".$_POST['m']."-".$_POST['d'];

$query  = "UPDATE $table SET ".dbSet($fields, $values).", stamp=NOW() WHERE id=?";
$values[] = $id;

// prepare() returns a PDOStatement; execute() is called on that statement
$stmt = $dbh->prepare($query);
$stmt->execute($values);

This may not be perfect and could use tweaking. It assumes that $dbh is set up as a PDO connection. Barring any minor syntax issues on my part, it should work.

EDIT

Really though, I think I would go for Doctrine ORM (or another ORM). Once you set up the model and add all the validation there, it is as simple as:

$table = new Table();
$table->fromArray($_POST);
$table->save();

That should populate the contents easily. This is of course with an ORM, like Doctrine.

UPDATED

Did some minor tweaks to the first code, such as putting isset back and using rtrim over substr. I'm going to work on providing a mock-up of a PDO extension class; I just need to lay out the way to do it and do some unit tests to make sure it works.

localhost vs. 127.0.0.1

8 votes

  1. Does using localhost in mysql_connect() make the connection faster than using 127.0.0.1?
  2. What is the connection type between the PHP script and MySQL (when using the mysql_connect() function)? Is it TCP/IP?

1) It differs between Windows and Linux. If you use a Unix domain socket, it will be slightly faster than TCP/IP (because it has less overhead).

2) Windows uses TCP/IP by default, whereas Linux tries to use a Unix domain socket if you choose localhost and TCP/IP if you use 127.0.0.1.
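The rule described in this answer can be written down as a tiny decision function. This is only a restatement of the answer in code, not the client library's actual logic; the function name and the return labels are made up for illustration.

```python
def connection_transport(host, on_windows):
    """Return the transport this answer describes for a given host string."""
    if on_windows:
        # Per the answer: Windows defaults to TCP/IP.
        return "tcp"
    # Per the answer: on Linux the name "localhost" selects the Unix domain
    # socket, while an IP address such as 127.0.0.1 goes over TCP/IP.
    return "unix_socket" if host == "localhost" else "tcp"

for host in ("localhost", "127.0.0.1"):
    print(host, "->", connection_transport(host, on_windows=False))
```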

How established are ORMs (object relational mapping) in the world of databases

8 votes

I'm not a database admin or architect, so I have to ask those who do it 24/7. How established is the concept of an ORM (object relational mapping) in the world of database administration and architecture? Is it still happening, widely approved but still in its early stages, or is it generally disapproved of? I'm learning this area and would like to get a feel for whether it's going to be knowledge appreciated by the wider segment of this field.

Widely used and definitely the present and near future. Database access through a hand-coded layer of SQL generation was always fraught with drudgery and typos, and was unwieldy at best. ORMs let you use a persistence store in a programmatic way.

I thought this blog argued for it well: http://jonkruger.com/blog/category/fluent-nhibernate/ and SO posts like this (http://stackoverflow.com/questions/1114215/nhibernate-versus-llblgen-pro) show just how many people are using them.

how to get mysql command line client not to print blob fields in select *

8 votes

I'm exploring some tables which have blob fields. How can I do a select * with the command line client and have it suppress the printing of the blob fields (or truncate them to a standard field width), rather than scrolling a bunch of binary junk on the screen? This is with the mysql 5.1 client. For development, I just want to do a select * without having to list all of the non-blob fields individually.

This can be performed natively in MySQL, but it's quite unwieldy:

SET @sql=CONCAT('SELECT ', (SELECT GROUP_CONCAT(COLUMN_NAME) FROM 
    INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='test' AND TABLE_NAME='test' 
    AND DATA_TYPE NOT LIKE '%blob'), ' FROM test.test');
PREPARE preparedsql FROM @sql;
EXECUTE preparedsql;
DEALLOCATE PREPARE preparedsql;

I generally prefer Bash aliases/functions to MySQL procedures, as they're more portable between systems:

function blobless()
{
  cols=''
  _ifs=$IFS
  IFS=$(echo -en "\n\b")
  for col in $(mysql --skip-column-names -e "SELECT COLUMN_NAME FROM 
      INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='$1' AND TABLE_NAME='$2' 
      AND DATA_TYPE NOT LIKE '%blob'"); do 
    cols="$cols,$col"
  done
  IFS=$_ifs
  mysql -e "SELECT $(echo $cols | cut -c2-) FROM $1.$2 $3"
}

Invoke like so:

[andy ~]# blobless test test "where id>0"
+----+--------+
| id | t      |
+----+--------+
|  1 | 123    |
|  2 | 124213 |
+----+--------+

If you are in the MySQL client console use Ctrl-Z to suspend the program and drop to the shell. Then use blobless db table to check that blobless data. fg will restore the suspended job (i.e. MySQL client) to the foreground.

You can set default MySQL connection details in ~/.my.cnf to save you having to supply host/user/pass on the command line; this will also be used by Bash functions.

Is naming tables september_2010 acceptable and efficient for large data sets dependent on time?

8 votes

I need to store about 73,200 records per day consisting of 3 points of data: id, date, and integer.

Some members of my team suggest creating tables using the month as the table name (september_2010), while others suggest having one table with lots of data in it...

Any suggestions on how to deal with this amount of data? Thanks.

Thank you for all the feedback.

I recommend against that. I call this antipattern Metadata Tribbles. It creates multiple problems:

  • You need to remember to create a new table every year or else your app breaks.
  • Querying aggregates against all rows regardless of year is harder.
  • Updating a date potentially means moving a row from one table to another.
  • It's harder to guarantee the uniqueness of pseudokeys across multiple tables.

My recommendation is to keep it in one table until and unless you've demonstrated that the size of the table is becoming a genuine problem, and you can't solve it any other way (e.g. caching, indexing, partitioning).
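To make the single-table option concrete, here is a small sketch with Python's built-in sqlite3 standing in for MySQL. The table name `measurements`, its columns, and the helper `month_total` are assumptions for the example. The point is that a month becomes a date-range predicate on an indexed column rather than a table name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    " id INTEGER NOT NULL, day TEXT NOT NULL, value INTEGER NOT NULL)"
)
# An index on the date column serves the per-month range queries.
conn.execute("CREATE INDEX idx_measurements_day ON measurements (day)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(1, "2010-08-31", 5), (1, "2010-09-01", 7),
     (2, "2010-09-30", 9), (1, "2010-10-01", 11)],
)

def month_total(conn, first_day, first_day_of_next_month):
    """Row count and sum for one month, using a half-open date range."""
    return conn.execute(
        "SELECT COUNT(*), SUM(value) FROM measurements"
        " WHERE day >= ? AND day < ?",
        (first_day, first_day_of_next_month),
    ).fetchone()

september = month_total(conn, "2010-09-01", "2010-10-01")
# An aggregate across every month needs no UNION over per-month tables.
all_time = conn.execute("SELECT SUM(value) FROM measurements").fetchone()[0]
```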

When to add what indexes in a table in Rails

7 votes

I have a question about Rails database.

  • Should I add "index" to all the foreign keys like "xxx_id"?
  • Should I add "index" to the automatically created "id" column?
  • Should I add "index(unique)" to the automatically created "id" column?

  • If I add index to two foreign keys at once (add_index(:users, [:category, :state_id])), what happens? How is this different from adding the index for each key?

    class CreateUsers < ActiveRecord::Migration
      def self.up
        create_table :users do |t|
          t.string :name
          t.integer :category_id 
          t.integer :state_id
          t.string :email
          t.boolean :activated
          t.timestamps
        end
        # Do I need this? Is it meaningless to add the index to the primary key?
        # If so, do I need :unique => true ?
        add_index :users, :id
        # I don't think I need ":unique => true" here, right?
        add_index :users, :category_id # Do I need this?
        add_index :users, :state_id # Do I need this?
        # Are the above the same as the following?
        add_index(:users, [:category, :state_id])
      end
    end
    

Great answer so far. Additional question.

  • I should add "index with unique" for xxx_id, right?

Should I add "index" to all the foreign keys like "xxx_id"?

It would be better, because it speeds up searching and sorting on this column. And foreign keys are something searched for a lot.

Should I add "index" to the automatically created "id" column?

No, this is already done by Rails.

Should I add "index(unique)" to the automatically created "id" column?

No, same as above.

If I add index to two foreign keys at once (add_index(:users, [:category_id, :state_id])), what happens? How is this different from adding the index for each key?

Then the index is a combined index of the two columns. That doesn't make much sense unless you want all entries for one category_id AND one state_id at the same time. (Note that it should be category_id, not category.)

An index like this would speed up the following request:

# rails 2
User.find(:all, :conditions => { :state_id => some_id, :category_id => some_other_id })

# rails 3
User.where(:state_id => some_id, :category_id => some_other_id)

Whereas

add_index :users, :category_id
add_index :users, :state_id

will speed up these requests:

# rails 2+3
User.find_by_category_id(some_id)
User.find_by_state_id(some_other_id)

# or
# rails 2
User.find(:all, :conditions => {:category_id => some_id})
User.find(:all, :conditions => {:state_id => some_other_id})

# rails 3
User.where(:category_id => some_id)
User.where(:state_id => some_other_id)
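One way to see the difference between a combined index and single-column indexes is to ask the database for its query plan. The sketch below does that with Python's built-in sqlite3 as a stand-in; MySQL's EXPLAIN output looks different and planners vary between engines and versions, so treat it as an illustration of the leftmost-prefix idea rather than a statement about any particular MySQL release. The index name and the helper `plan` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    " id INTEGER PRIMARY KEY, name TEXT,"
    " category_id INTEGER, state_id INTEGER)"
)
# Only the combined index exists; there are no single-column indexes.
conn.execute(
    "CREATE INDEX idx_users_category_state ON users (category_id, state_id)"
)

def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Both columns constrained: the combined index applies.
both = plan("SELECT * FROM users WHERE category_id = 1 AND state_id = 2")
# Only the first (leftmost) column constrained: the index can still be used.
leftmost = plan("SELECT * FROM users WHERE category_id = 1")
# Only the second column constrained: the combined index does not help here.
second_only = plan("SELECT * FROM users WHERE state_id = 2")

for label, text in (("both", both), ("leftmost", leftmost),
                    ("second only", second_only)):
    print(f"{label}: {text}")
```

So a combined index on (category_id, state_id) can stand in for a separate index on category_id, but lookups by state_id alone still want their own index.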

I should add "index with unique" for xxx_id, right?

No, because if you do this, only one user can be in each category, but the point of a category is that you can put many users into one category. In your User model you have something like belongs_to :category, and in your Category model something like has_many :users. If you have a has_many relationship, the foreign key field must not be unique!

For more detailed information on this you should take a look at tadman's great answer.

Database design question

7 votes

I have a form where users submit different fields to create events. The number and type of fields requested are different on each form, depending on the category of event. What is the best way of going about designing this database - should the events contain every possible field and simply null the unused fields? Thanks!

If you begin to consider Joel's advice, please go here, or here, or here.

And if you don't believe any of them, build the 4 tables he mentions. There's only 4, doesn't take long. Then load some data into them... then try to write the queries you want to write...

Changing Column meaning:

This can really screw with cardinality estimates. Your dinner plates might be in the 4 - 20 range, the concert seats between 1000 - 2000. Some cardinality calculations look at the spread from min to max and assume an equal distribution (when lacking other statistics)...

From 4 to 2000 means that anywhere GENERIC_COLUMN = n, the % of rows you'll hit is 1/1996th of the total... but really, if you said where EVNT_TYPE = Dinner and GENERIC_COLUMN = n it would REALLY be between 4 and 20, or 1/16th of the total rows... so a huge swing in the cardinality estimate. (This can be fixed with histograms, but the point of showing the automation issues is just to highlight that if it's an issue for a machine, it's probably not as clean as it could be.)
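The arithmetic behind that swing can be spelled out in a few lines. This follows the simplified model used above (values assumed evenly spread between min and max, selectivity of an equality match taken as 1 / (max - min)); real optimizers use more elaborate statistics, and the function name is made up.

```python
from fractions import Fraction

def uniform_selectivity(min_value, max_value):
    """Estimated fraction of rows matching `column = n` under the
    even-spread assumption used in the text."""
    return Fraction(1, max_value - min_value)

mixed = uniform_selectivity(4, 2000)  # one column shared by plates and seats
dinner = uniform_selectivity(4, 20)   # the same column, dinners only
swing = dinner / mixed                # how far off the mixed estimate is

print(mixed, dinner, float(swing))
```

Under these assumptions the estimate for a dinner query is off by a factor of roughly 125, which is the kind of error that pushes a planner toward the wrong access path.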

So if you were to do this (MUCH BETTER than an EAV but...)

I would recommend creating a view for each object.

Table EVENT (common fields, Generic_Count)
View DINNER (common fields, Generic_Count as Plates) WHERE type = Dinner
View CONCERT (common fields, Generic_Count as Seats) WHERE type = Concert

Then give NO ONE select against EVENT
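A minimal sketch of the view-per-type layout, using Python's built-in sqlite3 as a stand-in. The column names are placeholders for the "common fields", and the permission step is not shown because SQLite has no GRANT; on a server you would grant SELECT on the views only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,   -- stands in for the common fields
    type          TEXT NOT NULL,
    generic_count INTEGER
);
-- Each view gives the generic column a name that matches its meaning.
CREATE VIEW dinner AS
    SELECT id, name, generic_count AS plates
    FROM event WHERE type = 'Dinner';
CREATE VIEW concert AS
    SELECT id, name, generic_count AS seats
    FROM event WHERE type = 'Concert';
""")
conn.executemany(
    "INSERT INTO event (name, type, generic_count) VALUES (?, ?, ?)",
    [("Gala", "Dinner", 12), ("Rock night", "Concert", 1500)],
)

dinners = conn.execute("SELECT name, plates FROM dinner").fetchall()
concerts = conn.execute("SELECT name, seats FROM concert").fetchall()
```

Queries written against the views always carry the type filter, so nobody ends up comparing plates with seats by accident.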

But this is where you get into trouble by NOT starting with a conceptual data model first.

You'd have an ENTITY for EVENT and another for DINNER which inherits completely from EVENT and another for CONCERT which inherits completely from EVENT. Then you could set a differentiating column in the inheritance object which lets you set the "TYPE" column, and then you could even decide how many tables to build with a flick of a switch: 1 table, 2 tables or 3 tables.

At least you can do that in PowerDesigner.

Why is DDL considered so 'bad'?

The creation of EAV models and questions like this are organized around the idea that DDL is to be avoided. Why ALTER TABLE when you can INSERT a new attribute row? People make poor data model design decisions based on the wrong Utility Function. These functions are things like 'no nullable columns', 'the fewer the tables the better', 'no ddl just to add a new attribute. Insert into Attribute table instead'.

Think of data modeling like this: sculptors will say that the wood or stone already has the figure inside of the block, they are just removing pieces of it to reveal it.

Your problem space already has a data model, it's just your job to discover it... it will have as many tables and columns as it needs. Trying to force it to conform to one of the above utility functions is where things go horribly wrong.

In your case, would you ever like to know all the events you've added in the past 2 weeks? Now think of the possible models. One table per event type would mean summing over n tables to find that answer and with each new event type a new table added and every "All event" query would be changing. You could build a UNION ALL view of those tables but you'd have to remember to add each new table to the view. Debugging through views like that is a pain.

Assuming that you might want a lot of metrics about ALL events, one table makes more sense (at least for some common portion of your event data, like event name, sponsor ID, venue ID, event start time, event end time, venue available for setup time, etc.). Those fields are (let's stipulate) common to every event.

So now what to do with the other columns? Two options: nullable fields, or vertically partitioning the table. The latter is an optimization of the former. And if you read any database optimization books or blogs, the major thing I take from them is that premature optimization kills. I see people implementing lots of strategies for problems before they even know if they will have that problem. A coworker had a slow query he wanted me to help with. It was loaded with optimizer hints. I removed them and the SQL screamed... I don't know WHY he hinted it, but he was not doing it effectively, and I'm pretty sure he never saw an issue, so this was all just premature optimization.

Vertical partitioning is something you do when you have large data volumes and you have some frequently accessed data and other data that is not so useful. You can pack a table with a lot fewer blocks if you only pack some of the commons. More rows per block = faster tablescans... doesn't really affect the speed of finding a single row via an index. As you can see vertical partitioning has a specific problem it can solve (others too like row chaining) so if you're sure that's GOING to be an issue then by all means begin that way.

GIS: PostGIS/PostgreSQL vs. MySql vs. SQL Server?

7 votes

EDIT: I have been using Postgres with PostGIS for a few months now, and I am satisfied.

I need to analyze a few million geocoded records, each of which will have latitude and longitude. These records include data of at least three different types, and I will be trying to see if each set influences the other.

What database is best for the underlying data store for all this data? Here are my requirements:

  • I'm familiar with the DBMS. I'm weakest with PostgreSQL, but I am willing to learn if everything else checks out.
  • It does well with GIS queries. Google searches suggest that PostgreSQL + PostGIS may be the strongest? At least a lot of products seem to use it. MySql's Spatial Extensions seem comparatively minimal?
  • Low cost. Despite the 10GB DB limit in SQL Server Express 2008 R2, I'm not sure I want to live with this and other limitations of the free version.
  • Not antagonistic with Microsoft .NET Framework. Thanks to Connector/Net 6.3.4, MySql works well with C# and .NET Framework 4 programs. It fully supports .NET 4's Entity Framework. I cannot find any noncommercial PostgreSQL equivalent, although I'm not opposed to paying $180 for Devart's dotConnect for PostgreSQL Professional Edition.
  • Compatible with R. It appears all 3 of these can talk with R using ODBC, so may not be an issue.

I've already done some development using MySql, but I can change if necessary.

If you are interested in a thorough comparison, I recommend "Cross Compare SQL Server 2008 Spatial, PostgreSQL/PostGIS 1.3-1.4, MySQL 5-6" and/or "Compare SQL Server 2008 R2, Oracle 11G R2, PostgreSQL/PostGIS 1.5 Spatial Features" by Boston GIS.

Considering your points:

  • I'm familiar with the DBMS: setting up a PostGIS database on Windows is easy, and management with PgAdmin3 is straightforward too
  • It does well with GIS queries: PostGIS is definitely the strongest of the three; only Oracle Spatial would be comparable, but it is disqualified if you consider its cost
  • Low cost: +1 for PostGIS for sure
  • Not antagonistic with Microsoft .NET Framework: You should at least be able to connect via ODBC (see Postgres wiki)
  • Compatible with R: shouldn't be a problem with any of the three

Mysql slow query: JOIN + multiple WHERES + ORDER BY

7 votes

long time lurker, first question!

I am struggling to optimize this query, which selects the lowest priced items that match the chosen filters:

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link
FROM product_info
NATURAL JOIN (SELECT * FROM product_all WHERE product_all.date = '2010-09-30') as product_all
WHERE (product_info.category = 2  
AND product_info.gender = 'W' )
GROUP BY product_all.prod_id
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13

Its explain:

| id | select_type | table        | type   | possible_keys                                             | key     | key_len | ref                 | rows   | Extra                           |  
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+  
|  1 | PRIMARY     | <derived2>   | ALL    | NULL                                                     | NULL    | NULL    | NULL                | 89801  | Using temporary; Using filesort | 
|  1 | PRIMARY     | product_info | eq_ref | PRIMARY,category_prod_id_retail_price,category_ret...     | PRIMARY | 4       | product_all.prod_id | 1      | Using where                     | 
|  2 | DERIVED     | product_all  | ref    | date_2                                                    | date_2  | 3       |                     | 144107 |                                 | 

I've tried eliminating the subquery, which intuitively seems better but in practice takes even longer:

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link
FROM product_info
NATURAL JOIN product_all
WHERE (product_all.date = '2010-09-30'
AND product_info.category = 2 
AND product_info.gender = 'W' )
GROUP BY product_all.prod_id
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13

And its explain:

| id | select_type | table        | type | possible_keys                                             | key                      | key_len | ref                               | rows | Extra                                        |  
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+  
|  1 | SIMPLE      | product_info | ref  | PRIMARY,category_prod_id_retail_price,category_ret...     | category_retail_price    | 5       | const                             | 269  | Using where; Using temporary; Using filesort | 
|  1 | SIMPLE      | product_all  | ref  | PRIMARY,prod_id,date_2                                    | prod_id                  | 4       | equipster_db.product_info.prod_id | 141  | Using where                                  | 

Here are the tables:

CREATE TABLE `product_all` (
`prod_id` INT( 10 ) NOT NULL ,
`ref_id` INT( 10) NOT NULL ,
`date` DATE NOT NULL ,
`buy_link` BLOB NOT NULL ,
`sale_price` FLOAT NOT NULL ,
PRIMARY KEY ( `prod_id` , `ref_id` )
) ENGINE = MYISAM ;


CREATE TABLE `product_info` (
`prod_id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`prod_name` VARCHAR( 200 ) NOT NULL,
`brand` VARCHAR( 50 ) NOT NULL,
`retail_price` FLOAT NOT NULL,
`category` INT( 3 ) NOT NULL,
`gender` VARCHAR( 1 ) NOT NULL,
`type` VARCHAR( 10 ) NOT NULL
) ENGINE = MYISAM ;

My Questions:
-which query structure seems optimal?
-what indices would optimize this query?
-less importantly: how does the indexing approach change when adding or removing WHERE clauses or using a different ORDER BY, such as sorting by % off:

ORDER BY (1-(MIN(product_all.sale_price)/product_info.retail_price)) DESC  

edit: both queries' natural join acts on prod_id (one record in product_info can have multiple instances in product_all, which is why they need to be grouped)

Indices make a massive difference in MySQL. One query that took 15 minutes with the wrong set of indices took 0.2 seconds with the right ones, but finding the right balance is generally the issue. Naturally, without some sample data it's really hard to say whether the solution below will save you any time, but in theory it should.

To answer your questions, I would redesign the tables like so:

CREATE TABLE `product_all` ( 
`prod_id` INT( 10 ) NOT NULL, 
`ref_id` INT( 10) NOT NULL, 
`date` DATE NOT NULL , 
`buy_link` BLOB NOT NULL , 
`sale_price` FLOAT NOT NULL,
PRIMARY KEY (prod_id, ref_id) ,
INDEX date_Index (`date` ASC),
UNIQUE INDEX prod_price_Index (prod_id ASC, sale_price ASC)
) ENGINE = MYISAM ; 


CREATE TABLE `product_info` ( 
`prod_id` INT( 10 ) NOT NULL AUTO_INCREMENT, 
`prod_name` VARCHAR( 200 ) NOT NULL, 
`brand` VARCHAR( 50 ) NOT NULL, 
`retail_price` FLOAT NOT NULL, 
`category` INT( 3 ) NOT NULL, 
`gender` VARCHAR( 1 ) NOT NULL, 
`type` VARCHAR( 10 ) NOT NULL,
PRIMARY KEY (prod_id) ,
UNIQUE INDEX prod_id_name_Index (prod_id ASC, prod_name ASC),
INDEX category_Index (category ASC),
INDEX gender_Index (gender ASC)
) ENGINE = MYISAM ;

SELECT product_info.*, MIN(product_all.sale_price) as sale_price, product_all.buy_link         
FROM product_info         
NATURAL JOIN (SELECT * FROM product_all WHERE product_all.date = '2010-09-30') as product_all         
WHERE (product_info.category = 2           
AND product_info.gender = 'W' )         
GROUP BY product_all.prod_id         
ORDER BY MIN(product_all.sale_price) ASC LIMIT 13        

The performance gain here comes from indexing the main fields that are joined on and that appear in the WHERE clause. Personally, I would go with your first query, since on reflection it should perform better.

As far as I understand, this is what's happening in the first and second queries:

  • The first query is filtered by a sub-query before the natural join is done, which means it's only joining the filtered rows and not the whole table.
  • The second query joins the entire second table and then filters the combined rows back down to what you want.

As a rule of thumb normally you want to add indices on your major joining fields and also the fields that you use the most in where clauses. I've also put some unique indices on some of the fields that you will want to query regularly, such as prod_id_name_Index.

If this doesn't improve your performance, post some dummy data to play with and I might be able to find a faster solution that I can benchmark.

Here is an article that goes through indexing for performance in mysql, worth a read if you want to know more.

Good luck!

EDIT: I missed your final question the first time. The answer is that if you're indexing the main joining fields, then changes to the WHERE clause will only slightly affect overall performance, and the unique indices I've put on the tables should cover the majority of things you'll want to base queries upon. The main thing to remember is that if you query or join on a field frequently, it should really be indexed; for minor queries and changes to the ORDER BY, don't worry about realigning your indexing strategy.

Is MySQL still a good choice for a free/open source database?

6 votes

Until recently, I've used MySQL for all my database needs in all my server projects.

However, after Oracle came along and bought MySQL, there seems to be some uncertainty about the future of MySQL.

So I'm left wondering whether I should stick to MySQL (which I'm fairly familiar with), or should I switch to something like PostgreSQL. Any thoughts?

MySQL and PostgreSQL are both open-source DBMSs that will continue to be developed by independent parties should their current maintainers (corporate or otherwise) get bored. MySQL already has some interesting forks such as Drizzle which may rise to more prominence if Oracle choose to let the original project stagnate (well, more than Sun did...).

You should choose based on your projects' technical needs rather than nebulous arguments that one or the other is going to ‘win’.

(Personally I definitely think it's a good idea to have experience with both.)

How do I set a full date & time sql using java, and not just the date?

6 votes

I am trying to set a timestamp in my database using java, however in my table all I get is the date, and no time (i.e., looks like "2010-09-09 00:00:00").

I am using a datetime field on my mysql database (because it appears that datetime is more common than timestamp). My code to set the date looks like this:

PreparedStatement ps = conn.prepareStatement("INSERT INTO mytable (datetime_field) VALUES (?)");
java.util.Date today = new java.util.Date();
java.sql.Date timestamp = new java.sql.Date(today.getTime());
ps.setDate(1, timestamp);
ps.executeUpdate();

How do I set the date to include the time?

Edit: I changed the code as per below, and it now sets both the date and the time.

PreparedStatement ps = conn.prepareStatement("INSERT INTO mytable (datetime_field) VALUES (?)");
java.util.Date today = new java.util.Date();
java.sql.Timestamp timestamp = new java.sql.Timestamp(today.getTime());
ps.setTimestamp(1, timestamp);
ps.executeUpdate();

Use java.sql.Timestamp and setTimestamp(int, Timestamp). java.sql.Date is date-only, regardless of the type of the column it's being stored in.
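The difference between the two types can be seen without a database. The following is a minimal sketch, not the asker's code: the class and method names are made up for illustration, and it only prints the text form of each type for the same instant. It does not show what any particular JDBC driver sends to the server.

```java
import java.sql.Date;
import java.sql.Timestamp;

// Illustration only: compares the text form of the two JDBC types for one instant.
// The class and method names are hypothetical.
class DateVsTimestampDemo {

    // java.sql.Date prints as yyyy-mm-dd; the time of day is not part of its text form.
    static String asSqlDate(long epochMillis) {
        return new Date(epochMillis).toString();
    }

    // java.sql.Timestamp prints as yyyy-mm-dd hh:mm:ss.f..., so the time of day is kept.
    static String asSqlTimestamp(long epochMillis) {
        return new Timestamp(epochMillis).toString();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("java.sql.Date:      " + asSqlDate(now));
        System.out.println("java.sql.Timestamp: " + asSqlTimestamp(now));
    }
}
```

Both values are formatted in the JVM's default time zone, so the date part is the same in each line; only the Timestamp line carries hours, minutes and seconds.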

Best way to connect to mysql with php securely

6 votes

I want some input on what you guys think is the most secure way to connect to a mysql database using php. Currently the way I'm doing it is with a utility php file that I include at the top of all my other php files. The utility php file is this:

<?php
if(!defined('IN_PHP')){
    die("hackerssss");
}
$mysql_host = "localhost";
$mysql_user = "root"; 
$mysql_pass = "root"; 
$mysql_db = "cokertrading";
?>

Any suggestions?

First of all, as Alex said, you should be using a special account for that application with limited privileges.

After that, take a look at this How To secure passwords in PHP where you will find answers like:

User11318

Several people misread this as a question about how to store passwords in a database. That is wrong. It is about how to store the password that lets you get to the database.

The usual solution is to move the password out of source-code into a configuration file. Then leave administration and securing that configuration file up to your system administrators. That way developers do not need to know anything about the production passwords, and there is no record of the password in your source-control.
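The thread is about PHP, but the idea is language-independent. As a rough sketch in Java, assuming a properties file that lives outside the source tree and web root: the class name DbConfig and the key names db.host, db.user, db.pass and db.name are invented for this example and do not come from the answers quoted here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper: keeps database credentials out of the source code by
// reading them from a separate properties file. All key names are assumptions.
class DbConfig {
    final String host;
    final String user;
    final String pass;
    final String name;

    private DbConfig(Properties p) {
        this.host = require(p, "db.host");
        this.user = require(p, "db.user");
        this.pass = require(p, "db.pass");
        this.name = require(p, "db.name");
    }

    // Fail early with a clear message if a key is missing or empty.
    private static String require(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing config key: " + key);
        }
        return value;
    }

    // Parse any character stream in java.util.Properties format.
    static DbConfig parse(Reader reader) {
        Properties p = new Properties();
        try {
            p.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new DbConfig(p);
    }

    // Convenience overload for text already held in memory.
    static DbConfig fromText(String text) {
        return parse(new StringReader(text));
    }

    // Load from a file; file permissions are left to the system administrator.
    static DbConfig load(Path file) {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java DbConfig <path-to-properties-file>");
            return;
        }
        DbConfig cfg = load(Paths.get(args[0]));
        // Deliberately never print the password.
        System.out.println("Database " + cfg.name + " on " + cfg.host + " as " + cfg.user);
    }
}
```

With this split the password never appears in source control, and developers only need to know the key names, not the production values.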

da5id:

Store them in a file outside web root.

Sockleg:

If you're hosting on someone else's server and don't have access outside your webroot, you can always put your password and/or database connection in a file and then lock the file using a .htaccess:

<files mypasswdfile>
order allow,deny
deny from all
</files>

pdavis:

For extremely secure systems we encrypt the database password in a configuration file (which itself is secured by the system administrator). On application/server startup the application then prompts the system administrator for the decryption key. The database password is then read from the config file, decrypted, and stored in memory for future use. Still not 100% secure since it is stored in memory decrypted, but you have to call it 'secure enough' at some point!

Vagner:

Your choices are kind of limited since, as you say, you need the password to access the database. One general approach is to store the username and password in a separate configuration file rather than in the main script. Then be sure to store that file outside the main web tree. That way, if a web configuration problem leaves your php files being displayed as text rather than executed, you haven't exposed the password.

Other than that you are on the right lines with minimal access for the account being used. Add to that:

* Don't use the combination of username/password for anything else.
* Configure the database server to only accept connections from the web host for that user (localhost is even better if the DB is on the same machine). That way, even if the credentials are exposed, they are no use to anyone unless they have other access to the machine.
* Obfuscate the password (even ROT13 will do). It won't put up much defense if someone does get access to the file, but at least it will prevent casual viewing of it.

Peter

References: